<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en_US"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.fratepietro.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.fratepietro.com/" rel="alternate" type="text/html" hreflang="en_US" /><updated>2026-05-08T12:05:37+02:00</updated><id>https://www.fratepietro.com/feed.xml</id><title type="html">Antonello Fratepietro</title><subtitle>Cloud engineer and developer passionate about building scalable Cloud and AI infrastructure. Writings on software architecture, distributed systems, and engineering leadership.</subtitle><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><entry><title type="html">Cognitora: A Datacenter-Scale, Open-Source LLM Inference Orchestrator (NVIDIA Dynamo Alternative)</title><link href="https://www.fratepietro.com/2026/cognitora-inference-llm-orchestration/" rel="alternate" type="text/html" title="Cognitora: A Datacenter-Scale, Open-Source LLM Inference Orchestrator (NVIDIA Dynamo Alternative)" /><published>2026-05-03T00:00:00+02:00</published><updated>2026-05-03T00:00:00+02:00</updated><id>https://www.fratepietro.com/2026/cognitora-inference-llm-orchestration</id><content type="html" xml:base="https://www.fratepietro.com/2026/cognitora-inference-llm-orchestration/"><![CDATA[<p>If you have spent any time pushing an LLM into production, the shape of the problem is familiar. A single H100 with <a href="https://github.com/vllm-project/vllm">vLLM</a> serves Llama 3 8B at impressive throughput. A single node with <a href="https://github.com/sgl-project/sglang">SGLang</a> handles structured generation beautifully. <a href="https://github.com/NVIDIA/TensorRT-LLM">TensorRT-LLM</a> wrings every last token-per-second out of an NVL72. None of those are a <em>cluster</em>. 
The moment the workload outgrows one box — or the moment the prefill phase wants different hardware than the decode phase, or the moment a hot prefix shows up on the wrong replica — you are back to writing a routing layer, a KV-cache layer, and a deployment layer yourself.</p>

<p>The two reference points the rest of the industry agrees on are <a href="https://github.com/ai-dynamo/dynamo"><strong>NVIDIA Dynamo</strong></a> (Rust core, Python frontend, Kubernetes-first) and a long tail of generic ML serving stacks: <a href="https://docs.ray.io/en/latest/serve/"><strong>Ray Serve</strong></a>, <a href="https://kserve.github.io/website/"><strong>KServe</strong></a>, <a href="https://github.com/triton-inference-server/server"><strong>NVIDIA Triton Inference Server</strong></a>, <a href="https://github.com/bentoml/BentoML"><strong>BentoML</strong></a>, and the <a href="https://github.com/vllm-project/production-stack"><strong>vLLM Production Stack</strong></a>. Each makes a different tradeoff between “specialized for LLM inference” and “general purpose,” between “one click on a managed cloud” and “I can run it on bare metal in a datacenter I own.”</p>

<p><a href="https://github.com/antonellof/cognitora-inference"><strong>Cognitora</strong></a> is an open-source LLM inference orchestration layer that lands in a deliberately specific spot in that design space: <strong>bare-metal-first, Rust-only, engine-agnostic, KV-cache-aware</strong>. It does not replace vLLM or SGLang; it coordinates them into a cluster. It is distributed as six statically linked binaries with no Python control plane, no JVM operator, and no hard Kubernetes dependency. The same artifacts run as systemd units on a rack of servers, as recipes or <code class="language-plaintext highlighter-rouge">docker compose</code> on a single host, via a Helm chart on Kubernetes, or (once the Terraform modules, which are still stubs today, mature) as Terraform-provisioned VMs across AWS, GCP, Azure, and Hetzner.</p>

<p>As of <strong><a href="https://github.com/antonellof/cognitora-inference/releases/tag/v0.3.0">v0.3.0</a></strong> (May 2026), the OpenAI-compatible surface includes <strong><code class="language-plaintext highlighter-rouge">/v1/chat/completions</code></strong>, <strong><code class="language-plaintext highlighter-rouge">/v1/completions</code></strong>, <strong><code class="language-plaintext highlighter-rouge">/v1/embeddings</code></strong> (real round-trip to the engine, not synthetic vectors), and <strong><code class="language-plaintext highlighter-rouge">/v1/models</code></strong>. The admin CLI <strong><code class="language-plaintext highlighter-rouge">cgn-ctl</code></strong> reads and writes <strong>etcd</strong> for cluster state (<code class="language-plaintext highlighter-rouge">cluster nodes</code>, cordon/drain, <code class="language-plaintext highlighter-rouge">model load/unload</code>), and <strong><code class="language-plaintext highlighter-rouge">cgn-ctl install --target single-node --apply</code></strong> can render <code class="language-plaintext highlighter-rouge">cognitora.toml</code> plus <code class="language-plaintext highlighter-rouge">compose.yaml</code> and bring the stack up with one command. <strong><code class="language-plaintext highlighter-rouge">cgn-metrics</code></strong> exposes Prometheus federation at <strong><code class="language-plaintext highlighter-rouge">/federate</code></strong> so upstream Prometheus scrapes a single endpoint with per-component labels. There is also a <strong>single-manifest Kubernetes quickstart</strong> (etcd + llama.cpp engine + router + agent + metrics in one Pod, LoadBalancer on port 80) that has been exercised end-to-end on <strong>GKE Autopilot</strong>—details in the “Try it” section below.</p>

<p>This post is the long-form version of <em>why that combination of choices</em>, and what falls out of them.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># One-line install — six static binaries, no runtime deps (pin a release for reproducibility)</span>
curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | <span class="nv">CGN_VERSION</span><span class="o">=</span>v0.3.0 sh

<span class="c"># Bring up Llama-3.1 8B on a single GPU with vLLM</span>
bash recipes/llama3-8b/vllm/agg/up.sh
<span class="c"># Equivalent:</span>
<span class="c"># cgn-ctl recipe up llama3-8b/vllm/agg</span>

<span class="c"># Same model, prefill/decode disaggregated across two GPUs</span>
bash recipes/llama3-8b/vllm/disagg-single-node/up.sh

<span class="c"># Single-node Docker install (writes cognitora.toml + compose.yaml, optional --apply)</span>
cgn-ctl <span class="nb">install</span> <span class="nt">--target</span> single-node <span class="nt">--model</span> llama3-8b <span class="nt">--engine</span> vllm <span class="nt">--apply</span>
</code></pre></div></div>

<h2 id="the-thing-inference-engines-dont-do">The thing inference engines don’t do</h2>

<p>A modern inference engine is a token factory. Hand it a request, get tokens back. The contract is “one process, one model, one node.” Everything outside that contract — which replica should serve this request, which replica already has the system prompt cached on-GPU, when to spill cold KV blocks to RAM or SSD, when to migrate a long-context request from a prefill-optimized box to a decode-optimized one, how to weight a thermally-throttled GPU in the routing decision — is, from the engine’s point of view, somebody else’s problem.</p>

<p>Historically <em>somebody else</em> was a stack of glue: an Nginx in front, a Redis for KV metadata, a Python scheduler reading Prometheus, a Kubernetes operator reconciling deployments, a custom autoscaler. That stack works. It is also five processes in five languages with five failure modes, and it is the part of the system that has the worst observability story precisely when an SRE needs it most — at 03:00 on the first day a new model is in production.</p>

<p>The Cognitora bet is that the orchestration layer should be <strong>one runtime, in Rust, with a small surface area</strong>. Six binaries, each one statically linked, each one with a well-defined responsibility:</p>

<table>
  <thead>
    <tr>
      <th>Binary</th>
      <th>Job</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-router</code></td>
      <td>OpenAI-compatible HTTP gateway + KV-aware routing</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-agent</code></td>
      <td>Per-node engine supervisor + NVML telemetry</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-kvcached</code></td>
      <td>Tiered KV cache daemon (GPU / RAM / SSD) + QUIC/RDMA peer fetch</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-metrics</code></td>
      <td>Prometheus aggregator + federation <code class="language-plaintext highlighter-rouge">/federate</code>; Redfish/IPMI/DCGM where available</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-ctl</code></td>
      <td>Admin CLI — etcd-backed <code class="language-plaintext highlighter-rouge">cluster</code>/<code class="language-plaintext highlighter-rouge">model</code>, single-node <code class="language-plaintext highlighter-rouge">install</code>, PKI, bench</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-operator</code></td>
      <td>Optional Kubernetes operator (kube-rs)</td>
    </tr>
  </tbody>
</table>

<p>The <code class="language-plaintext highlighter-rouge">cgn-operator</code> is optional on purpose. If you run on bare metal, the systemd path is a first-class citizen rather than the “deprecated, please use the Helm chart” path that most cloud-native projects eventually push you toward.</p>

<h2 id="kv-aware-routing-the-routing-decision-is-the-product">KV-aware routing: the routing decision is the product</h2>

<p>The single biggest lever in multi-node LLM inference is <em>not</em> token throughput. It is <strong>KV cache reuse</strong>. If a request shares a 4,000-token system prompt with a request that finished 200 ms ago on replica B, sending the new request to replica B saves 4,000 tokens of prefill compute. Sending it to replica A — because round-robin said so — burns a GPU-second to recompute something that already exists in HBM somewhere else in the cluster. At fleet scale that decision dominates everything else.</p>

<p>There are two common ways to encode “which replica has which prefix”:</p>

<ol>
  <li><strong>Radix trees over chained block hashes.</strong> This is what Dynamo’s KV-aware router uses. Each block of KV is hashed; the cluster maintains a radix tree keyed on those hashes; the router descends the tree to find the deepest match. Fast, memory-efficient, the canonical structure.</li>
  <li><strong>Sequence-chained BLAKE3 digests with longest-prefix overlap.</strong> This is Cognitora’s choice. Each block’s digest is chained from the previous block’s digest, so the digest at position <code class="language-plaintext highlighter-rouge">i</code> summarizes the entire prefix <code class="language-plaintext highlighter-rouge">[0..i]</code>. Routing becomes “which replica reports the deepest prefix match against this digest sequence?”</li>
</ol>
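
<p>The second scheme can be sketched in a few lines. This is a minimal illustration, not Cognitora’s routing code: the block size, the function names, and the replica map are all hypothetical, and <code class="language-plaintext highlighter-rouge">hashlib.blake2b</code> stands in for BLAKE3 because BLAKE3 is not in the Python standard library.</p>

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for illustration


def chained_digests(tokens: list[int]) -> list[bytes]:
    """Digest i covers every token in blocks 0..i, because each digest is
    computed over the previous digest plus the current block."""
    digests = []
    prev = b""
    full = len(tokens) - len(tokens) % BLOCK_TOKENS  # skip a trailing partial block
    for start in range(0, full, BLOCK_TOKENS):
        block = tokens[start:start + BLOCK_TOKENS]
        h = hashlib.blake2b(digest_size=16)  # stand-in for BLAKE3
        h.update(prev)
        h.update(b"".join(t.to_bytes(4, "little") for t in block))
        prev = h.digest()
        digests.append(prev)
    return digests


def prefix_depth(request: list[bytes], cached: set[bytes]) -> int:
    """Count leading digests of the request that the replica reports as cached."""
    depth = 0
    for digest in request:
        if digest not in cached:
            break
        depth += 1
    return depth


def pick_replica(request: list[bytes], replicas: dict[str, set[bytes]]) -> str:
    """Return the replica with the deepest prefix match (first one wins ties)."""
    return max(replicas, key=lambda name: prefix_depth(request, replicas[name]))
```

<p>Two requests that share a system prompt produce identical leading digests and diverge at the first differing block. The same blocks in a different order produce a different chain from the first block onward, which is the positional property the chained scheme is after.</p>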

<p>The two approaches are close cousins. The motivating difference for Cognitora is <strong>positional correctness on interleaved requests</strong>. With a chained digest, two requests that share tokens out of order (same content, different positions) produce different chains, so the router does not falsely claim a cache hit that would force the engine to recompute or, worse, return positionally incorrect KV. For traces with heavy system-prompt sharing the project targets a hit ratio of <strong>≥ 0.55</strong>, with a sub-millisecond routing decision:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Target</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-router</code> routing decision p99</td>
      <td>&lt; 500 µs / vCPU</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-router</code> HTTP overhead vs direct engine</td>
      <td>&lt; 3 ms p99</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-kvcached</code> warm tier hit</td>
      <td>&lt; 200 µs</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cgn-kvcached</code> cold tier hit (SSD)</td>
      <td>&lt; 5 ms</td>
    </tr>
    <tr>
      <td>Cross-node QUIC fetch (1 MiB block, 10 GbE)</td>
      <td>&lt; 12 ms</td>
    </tr>
    <tr>
      <td>Representative cache hit ratio</td>
      <td>≥ 0.55</td>
    </tr>
    <tr>
      <td>Energy efficiency vs round-robin baseline</td>
      <td>≥ 1.4×</td>
    </tr>
  </tbody>
</table>

<p>Worth flagging the obvious caveat: those are the project’s stated targets, not numbers measured on your traffic. The shape of the metrics matters more than the absolute values — sub-millisecond routing, single-digit-millisecond HTTP overhead, sub-200-µs warm hits. If any of those numbers grew by an order of magnitude the architecture would fall apart, so they are useful as a sanity envelope.</p>

<h2 id="disaggregation-prefill-and-decode-want-different-hardware">Disaggregation: prefill and decode want different hardware</h2>

<p>Prefill is compute-bound. Decode is memory-bandwidth-bound. Running both on the same SKU is a compromise — you either over-provision compute for the decode phase or starve memory bandwidth for the prefill phase. The fix, popularized by <a href="https://arxiv.org/abs/2401.09670">DistServe</a> and now standard in production stacks, is <strong>disaggregated inference</strong>: prefill on one pool of GPUs, decode on another, with the KV blocks streamed between them.</p>

<p>Cognitora handles disaggregation through the <a href="https://github.com/ai-dynamo/nixl">NIXL</a> connector — the same NVIDIA-developed transport library Dynamo uses — and exposes the choice as a single TOML knob:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">[</span><span class="n">engine</span><span class="k">]</span>
<span class="n">kv_offload</span> <span class="o">=</span><span class="w"> </span><span class="s">"nixl"</span>   <span class="c"># one of: none | nixl | lmcache | hicache | kvbm</span>
</code></pre></div></div>

<p>That single knob renders the right engine argv for vLLM, SGLang, or TensorRT-LLM. The recipe folders ship the topologies most people actually want — <code class="language-plaintext highlighter-rouge">recipes/llama3-8b/vllm/agg/</code>, <code class="language-plaintext highlighter-rouge">recipes/llama3-8b/vllm/disagg-single-node/</code>, <code class="language-plaintext highlighter-rouge">recipes/llama3-70b/vllm/agg/</code> — so you do not have to translate “I want 70B FP8 on 4×H100 with TP=4 and disaggregation off” into engine-specific flags.</p>

<p>Engine support matrix:</p>

<table>
  <thead>
    <tr>
      <th>Engine</th>
      <th>KV routing</th>
      <th>Disaggregation</th>
      <th>KV offload backends</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>vLLM</td>
      <td>yes</td>
      <td>NIXL</td>
      <td>LMCache, KVBM, multi-tier</td>
    </tr>
    <tr>
      <td>SGLang</td>
      <td>yes</td>
      <td>NIXL</td>
      <td>HiCache, multi-tier</td>
    </tr>
    <tr>
      <td>TensorRT-LLM</td>
      <td>yes</td>
      <td>NIXL</td>
      <td>KVBM (WIP), multi-tier</td>
    </tr>
    <tr>
      <td>llama.cpp</td>
      <td>yes</td>
      <td>n/a</td>
      <td>multi-tier</td>
    </tr>
    <tr>
      <td>OpenAI-compat</td>
      <td>yes</td>
      <td>n/a</td>
      <td>n/a (Ollama, hosted APIs)</td>
    </tr>
  </tbody>
</table>

<p>Treating llama.cpp and OpenAI-compatible servers as <strong>first-class engines</strong>, rather than “we tolerate them, here is a config flag,” is a genuine differentiator. It means the same router that fronts your H100 fleet can also fan requests out to a developer’s Ollama on a Mac mini, or to a hosted Anthropic / OpenAI / Together endpoint when on-prem capacity is saturated. That changes what cluster topologies are reasonable to consider.</p>

<h2 id="tiered-kv-cache--cross-cluster-federation">Tiered KV cache + cross-cluster federation</h2>

<p>The KV cache is a hierarchy in any non-trivial deployment: GPU HBM is hot and small, system RAM is warm and bigger, SSD is cold and effectively unbounded. <code class="language-plaintext highlighter-rouge">cgn-kvcached</code> materializes that hierarchy as one daemon with explicit tier latency targets (sub-200 µs warm, sub-5 ms cold) and a <strong>QUIC peer-fetch path</strong> between nodes. A cache miss on node A for a block that lives in node B’s RAM then becomes a cross-node fetch (target: under 12 ms for a 1 MiB block on 10 GbE) rather than a recompute.</p>
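
<p>The lookup order can be sketched as a toy model. Everything here is an assumption made for illustration: the <code class="language-plaintext highlighter-rouge">TieredKV</code> class, its dictionaries standing in for RAM and SSD, and the <code class="language-plaintext highlighter-rouge">peer_fetch</code> callback standing in for the QUIC path. The GPU tier is left out because it lives inside the engine, and the promote-on-hit policy is my guess at a reasonable default, not documented behavior.</p>

```python
from typing import Callable, Optional


class TieredKV:
    """Toy tiered lookup: warm RAM first, then cold SSD, then a peer node."""

    def __init__(self, peer_fetch: Optional[Callable[[bytes], Optional[bytes]]] = None):
        self.ram: dict[bytes, bytes] = {}   # warm tier
        self.ssd: dict[bytes, bytes] = {}   # cold tier (a dict stands in for disk)
        self.peer_fetch = peer_fetch        # hypothetical stand-in for the QUIC fetch

    def get(self, digest: bytes) -> tuple[Optional[bytes], str]:
        """Return (block, tier) where tier names the place the block was found."""
        if digest in self.ram:
            return self.ram[digest], "ram"
        if digest in self.ssd:
            block = self.ssd[digest]
            self.ram[digest] = block        # promote so the next hit is warm
            return block, "ssd"
        if self.peer_fetch is not None:
            block = self.peer_fetch(digest)
            if block is not None:
                self.ram[digest] = block    # keep a local warm copy
                return block, "peer"
        return None, "miss"                 # caller falls back to recomputing prefill
```

<p>Wiring two instances together, with node A’s <code class="language-plaintext highlighter-rouge">peer_fetch</code> calling node B’s <code class="language-plaintext highlighter-rouge">get</code>, reproduces the case above: the first lookup on A reports <code class="language-plaintext highlighter-rouge">peer</code>, the second reports <code class="language-plaintext highlighter-rouge">ram</code>.</p>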

<p>The federation piece is the part that surprised me on first read. Cognitora’s router can form a <strong>federation across clusters</strong>, not just nodes, meaning a hot prefix that exists in your Frankfurt region can serve a request that landed on your Virginia router, if fetching the KV blocks across the network is cheaper than recomputing the prefill locally. Most production stacks do not even attempt this; they treat each cluster as an island. Whether that capability is <em>worth the operational complexity</em> depends entirely on your traffic shape, and Cognitora makes the right call by leaving it off by default.</p>

<p>Separately, <strong><code class="language-plaintext highlighter-rouge">cgn-metrics</code></strong> solves a smaller but universal ops problem: <strong>in-cluster Prometheus federation</strong>. Configure scrape targets in TOML and the daemon unions every target’s <code class="language-plaintext highlighter-rouge">/metrics</code> text, injects a <code class="language-plaintext highlighter-rouge">cgn_target="&lt;name&gt;"</code> label on each line, and serves the combined exposition at <strong><code class="language-plaintext highlighter-rouge">/federate</code></strong> — one scrape for your central Prometheus, without parsing the full metric stream twice. That is observability plumbing, not cross-region routing; both use the word “federation” but they are different mechanisms.</p>
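
<p>The label injection is simple enough to sketch line by line. This is my own approximation of the idea, not the daemon’s code: the function names are hypothetical, target names are assumed to need no escaping, and the sketch makes no attempt to deduplicate repeated <code class="language-plaintext highlighter-rouge"># HELP</code> / <code class="language-plaintext highlighter-rouge"># TYPE</code> lines across targets.</p>

```python
def inject_label(line: str, target: str) -> str:
    """Add cgn_target="..." to one exposition line; comments and blanks pass through."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line
    label = f'cgn_target="{target}"'
    brace = stripped.find("{")
    if brace != -1:
        # Sample already has labels: append ours inside the existing braces.
        close = stripped.rfind("}")
        existing = stripped[brace + 1:close].rstrip(",")
        inner = f"{existing},{label}" if existing else label
        return f"{stripped[:brace]}{{{inner}}}{stripped[close + 1:]}"
    # Sample has no labels: insert a label set between the name and the value.
    name, _, rest = stripped.partition(" ")
    return f"{name}{{{label}}} {rest}"


def federate(scrapes: dict[str, str]) -> str:
    """Union several /metrics payloads into one exposition, tagged per target."""
    out = []
    for target, text in scrapes.items():
        for line in text.splitlines():
            out.append(inject_label(line, target))
    return "\n".join(out) + "\n"
```

<p>The point of working on text lines rather than parsed samples is that the aggregator never has to build a metric model of its own; it only needs to find where the label set starts and ends.</p>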

<h2 id="energy-aware-scheduling">Energy-aware scheduling</h2>

<p>The bit of the design I like most aesthetically is also the one with the least proven impact: routing decisions can incorporate <strong>power telemetry from Redfish, IPMI, and DCGM</strong>. A GPU that is thermally throttled, or a node whose PSU is drawing closer to its budget than its neighbors, gets weighted down in admission control. The stated efficiency target — <strong>≥ 1.4× over a round-robin baseline</strong> — is plausible on workloads where the cluster is power-limited rather than compute-limited, which is increasingly the situation in modern racks where power per rack-U is the binding constraint.</p>
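
<p>To make “weighted down” concrete, here is one way such a weight could be computed. The formula, the field names, and the 0.25 throttle penalty are all my assumptions; the post’s source material does not specify how Cognitora combines power telemetry with the routing score.</p>

```python
from dataclasses import dataclass


@dataclass
class NodePower:
    name: str
    watts: float         # current draw, e.g. as reported by Redfish or IPMI
    budget_watts: float  # power budget assigned to the node
    throttled: bool      # thermal-throttle flag, e.g. as reported by DCGM


def power_weight(node: NodePower, throttle_penalty: float = 0.25) -> float:
    """Hypothetical weight in [0, 1]: remaining power headroom, cut when throttled."""
    headroom = max(0.0, 1.0 - node.watts / node.budget_watts)
    return headroom * (throttle_penalty if node.throttled else 1.0)


def pick_node(nodes: list[NodePower]) -> NodePower:
    """Prefer the node with the highest weight (first one wins ties)."""
    return max(nodes, key=power_weight)
```

<p>In a real router this weight would be one factor alongside prefix depth and queue length rather than the sole criterion; the sketch isolates it only to show the direction of the effect.</p>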

<p>I would not buy a system <em>because</em> of energy-aware scheduling alone. I would treat it as a strong tiebreaker if it is otherwise the right shape — and it is one of the explicit gaps in Dynamo today, which doesn’t surface power telemetry into routing.</p>

<h2 id="cognitora-vs-nvidia-dynamo">Cognitora vs NVIDIA Dynamo</h2>

<p>Dynamo is the obvious comparison and the most capable alternative. The two projects share a lot of DNA — Rust core, KV-aware routing, NIXL-based disaggregation, Prometheus telemetry — and disagree on a small number of important things.</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Cognitora</th>
      <th>NVIDIA Dynamo</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Runtime artifact</td>
      <td>Six static Rust binaries (no Python control plane)</td>
      <td>Rust core + Python frontend</td>
    </tr>
    <tr>
      <td>First-class engines</td>
      <td>vLLM, SGLang, TRT-LLM, <strong>llama.cpp</strong>, <strong>OpenAI-compat</strong></td>
      <td>vLLM, SGLang, TRT-LLM</td>
    </tr>
    <tr>
      <td>KV routing signal</td>
      <td>Sequence-chained BLAKE3 + longest-prefix overlap</td>
      <td>Radix tree on chained block hashes</td>
    </tr>
    <tr>
      <td>KV offload selection</td>
      <td>Single TOML knob (<code class="language-plaintext highlighter-rouge">none/nixl/lmcache/hicache/kvbm</code>)</td>
      <td>KVBM + LMCache + FlexKV (separate scripts)</td>
    </tr>
    <tr>
      <td>Multi-tier KV</td>
      <td>RAM + SSD + cross-cluster QUIC peer fetch</td>
      <td>Full G1–G4 (KVBM owns GPU/Host/SSD/remote)</td>
    </tr>
    <tr>
      <td>Cross-cluster federation</td>
      <td>yes — QUIC peer fetch + router federation</td>
      <td>single cluster only</td>
    </tr>
    <tr>
      <td>Multi-model cascade</td>
      <td>yes — SLM→LLM logprob gating</td>
      <td>partial</td>
    </tr>
    <tr>
      <td>Energy / power telemetry</td>
      <td>yes — Redfish + IPMI + DCGM</td>
      <td>not yet</td>
    </tr>
    <tr>
      <td>Deployment surfaces</td>
      <td>Bare metal, Docker Compose, K8s manifest or Helm (chart local path today), Terraform (modules WIP)</td>
      <td>Kubernetes-first (operator + CRDs)</td>
    </tr>
    <tr>
      <td>Multimodal / video pipelines</td>
      <td>not yet</td>
      <td>yes — Image E/P/D, FastVideo, SGLang Diffusion</td>
    </tr>
    <tr>
      <td>Gang scheduling</td>
      <td>basic (node selectors)</td>
      <td>Grove (NVL72-aware)</td>
    </tr>
    <tr>
      <td>Install surface</td>
      <td>one curl line, six static binaries</td>
      <td>pip, container, or operator</td>
    </tr>
  </tbody>
</table>

<p>Reading that table honestly: <strong>if your workload is multimodal, video, or NVL72-shaped, pick Dynamo today.</strong> That is where NVIDIA’s investment is showing through. If your workload is text-only LLM serving on heterogeneous hardware (mix of H100 / L40S / older Ampere / on-prem llama.cpp / hosted API fallback), if you do not want a Python control plane in the hot path, if you care about cross-cluster federation, or if you operate a power-constrained rack and want telemetry to feed the scheduler — Cognitora is the closer fit.</p>

<p>The <em>llama.cpp + OpenAI-compat</em> line is the one I would emphasize most to anyone considering this for a real deployment. It changes what “the cluster” can include. A company-internal cluster that is allowed to burst to a hosted API during a traffic spike has a very different cost curve than a cluster that has to provision for peak.</p>

<h2 id="cognitora-vs-the-rest-of-the-field">Cognitora vs the rest of the field</h2>

<p>Dynamo is the closest comparison; the broader field is worth a paragraph each because the alternatives genuinely have different jobs.</p>

<p><strong><a href="https://github.com/vllm-project/production-stack">vLLM Production Stack</a></strong> is the most natural alternative if you are vLLM-only and Kubernetes-native. It ships a router, autoscaler, and observability stack tuned for vLLM. Cognitora is the right pick if “vLLM-only” is not a constraint you want to commit to — most production fleets end up running at least two engines (vLLM + SGLang for structured output, or vLLM + TRT-LLM for the largest models) and the multi-engine story is easier on Cognitora’s side.</p>

<p><strong><a href="https://docs.ray.io/en/latest/serve/">Ray Serve</a></strong> is the right answer if you already run Ray, or if your inference workload is genuinely heterogeneous (LLM + classical ML + Python preprocessing + tool calls all in one DAG). Ray’s strength is composability across arbitrary Python workloads. Cognitora’s strength is being narrowly excellent at the LLM-serving slice — no Python in the data path, no Ray cluster to operate, no actor model to reason about.</p>

<p><strong><a href="https://kserve.github.io/website/">KServe</a></strong> is the Kubernetes-native, model-serving-CRD answer. It is the right fit when “this is one of forty model deployments my platform team manages, and they all need to look the same in the cluster.” If LLM inference is the workload your platform team primarily exists to serve, the abstraction layer KServe imposes starts costing more than it saves.</p>

<p><strong><a href="https://github.com/triton-inference-server/server">NVIDIA Triton Inference Server</a></strong> is still excellent for <em>non-LLM</em> inference (vision, audio, classical models) and increasingly for LLMs via the TensorRT-LLM backend. If your fleet is mostly non-LLM with LLM as a side workload, Triton is the centerpiece. If LLM is the workload, the LLM-specific systems (Cognitora, Dynamo, vLLM Production Stack) are a better starting point because the things they specialize in (KV routing, prefill/decode disaggregation, prefix sharing) are not things Triton optimizes for at the platform level.</p>

<p><strong><a href="https://github.com/bentoml/BentoML">BentoML</a></strong> lives at a different altitude. It is excellent at “package this Python model + preprocessing + business logic into a deployable artifact.” It is not, and does not try to be, a multi-node KV-aware orchestrator. The two compose: BentoML for service packaging, Cognitora (or Dynamo) for cluster-level orchestration of LLM-specific concerns.</p>

<p>The honest summary is that LLM serving has bifurcated into two layers that used to be one. The lower layer is “given a request and a replica, generate tokens efficiently” — vLLM, SGLang, TRT-LLM, llama.cpp own this. The upper layer is “given a fleet, route requests so KV is reused and disaggregation pays off” — Dynamo and Cognitora are the two open-source projects that take this layer seriously as a standalone product. Generic ML serving stacks (KServe, Triton, Ray Serve, BentoML) cover the upper layer for general workloads but do not optimize for the LLM-specific signals that turn out to dominate cost.</p>

<h2 id="multi-model-cascades">Multi-model cascades</h2>

<p>One smaller capability worth calling out because it has outsized cost impact: <strong>multi-model cascades with logprob gating</strong>. The idea is old — route easy queries to a small model, fall back to a large model only when the small one is uncertain — but the orchestrator has to support it natively or it becomes a Python-in-the-hot-path workaround. Cognitora exposes it as a routing policy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>small_model = "qwen3-7b"
large_model = "llama3-70b"
gate        = "logprob"     # escalate when small-model logprob &lt; threshold
</code></pre></div></div>

<p>For workloads where ~70% of requests are genuinely simple (classification-shaped, lookup-shaped, short-answer chat), this cuts cost dramatically without touching tail quality. Dynamo has partial support; in Cognitora it is a first-class router policy.</p>
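
<p>The gating logic itself is small. The sketch below assumes the gate compares the mean per-token logprob of the small model’s answer against a threshold; that aggregation, the threshold value, and the callable-based model interface are my assumptions, not the router’s documented behavior.</p>

```python
from typing import Callable

# A model is any callable returning (text, per-token logprobs).
Model = Callable[[str], tuple[str, list[float]]]


def mean_logprob(logprobs: list[float]) -> float:
    """Average token logprob; an empty answer counts as maximally uncertain."""
    return sum(logprobs) / len(logprobs) if logprobs else float("-inf")


def cascade(prompt: str, small: Model, large: Model,
            threshold: float = -0.5) -> tuple[str, str]:
    """Answer with the small model unless its confidence falls below threshold."""
    text, logprobs = small(prompt)
    if mean_logprob(logprobs) >= threshold:
        return text, "small"
    text, _ = large(prompt)  # escalate: pay for the large model only when needed
    return text, "large"
```

<p>The cost of a wrong escalation is one wasted small-model call, so the threshold mostly trades large-model spend against the risk of shipping an uncertain small-model answer.</p>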

<h2 id="honest-limits">Honest limits</h2>

<p>A few things I would want to know before betting a production deployment on this:</p>

<ul>
  <li><strong>Pre-1.0.</strong> The OpenAI-compatible HTTP surface is stable; internal gRPC APIs and the TOML config surface may still shift in minor releases. Pin <strong><code class="language-plaintext highlighter-rouge">CGN_VERSION=v0.3.0</code></strong> (or newer) on the install script and read the <a href="https://github.com/antonellof/cognitora-inference/blob/main/CHANGELOG.md">changelog</a>.</li>
  <li><strong>Helm chart maturity.</strong> The chart under <code class="language-plaintext highlighter-rouge">deploy/kubernetes/helm/cognitora/</code> exists and <code class="language-plaintext highlighter-rouge">helm lint</code> passes in CI, but <strong>there is no published OCI chart at <code class="language-plaintext highlighter-rouge">oci://ghcr.io/…</code> yet</strong>—you install from a <strong>local chart path</strong> or use the <strong>quickstart manifest</strong> below until the chart ships optional engine sidecars and a simpler dev-default TLS story. Terraform modules for cloud VMs are still thin stubs; the credible cloud path today is <strong>bring your own cluster</strong> + quickstart or Helm from a git checkout.</li>
  <li><strong>No multimodal/video.</strong> If your roadmap includes image generation or video diffusion serving, Dynamo is ahead. The Cognitora architecture has no in-principle obstacle here, but the engine integrations are not shipped today.</li>
  <li><strong>Gang scheduling is basic.</strong> Node selectors, not <a href="https://github.com/NVIDIA/grove">Grove</a>-style NVL72-aware co-scheduling. If you operate NVL72 racks and need topology-aware placement, Dynamo is the better fit until this lands.</li>
  <li><strong>The performance numbers are targets, not benchmarks on your traffic.</strong> The architecture supports them; whether your specific workload realizes them depends on prefix sharing, request shape, and hardware mix. The right move on a new deployment is to A/B against round-robin on a slice of real traffic and measure. CI runs a <strong>soft</strong> perf gate (<code class="language-plaintext highlighter-rouge">cargo bench</code> on routing/prefix paths, non-blocking on PRs); hard regression gating is planned once baselines stabilize.</li>
  <li><strong>The cross-cluster federation story is powerful and operationally heavy.</strong> Turn it on only when you actually have multi-region traffic that benefits from it. The defaults are sensibly conservative.</li>
</ul>

<h2 id="try-it">Try it</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install — six static binaries, no runtime deps (pin the release)</span>
curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/installer/install.sh | <span class="nv">CGN_VERSION</span><span class="o">=</span>v0.3.0 sh

<span class="c"># Bring up Llama-3.1 8B on a single GPU with vLLM</span>
bash recipes/llama3-8b/vllm/agg/up.sh

<span class="c"># Disaggregated prefill/decode on two GPUs in one node</span>
bash recipes/llama3-8b/vllm/disagg-single-node/up.sh

<span class="c"># Llama-3.3 70B FP8 on 4×H100 with TP=4</span>
<span class="nv">HF_TOKEN</span><span class="o">=</span>… bash recipes/llama3-70b/vllm/agg/up.sh

<span class="c"># Admin: inspect nodes / desired models in etcd (needs etcd endpoints in cognitora.toml)</span>
cgn-ctl cluster nodes
cgn-ctl model <span class="nb">ls</span>
</code></pre></div></div>

<p><strong>Kubernetes — fastest path to a public URL</strong> (CPU demo: TinyLlama via llama.cpp in-cluster; no GPU quota required). Validated on GKE Autopilot; same manifest works on other clouds or local clusters (use port-forward if LoadBalancer stays pending):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply <span class="nt">-f</span> https://raw.githubusercontent.com/antonellof/cognitora-inference/main/deploy/kubernetes/quickstart/cognitora-cpu.yaml
kubectl <span class="nt">-n</span> cognitora <span class="nb">wait</span> <span class="nt">--for</span><span class="o">=</span><span class="nv">condition</span><span class="o">=</span>ready pod <span class="nt">-l</span> <span class="nv">app</span><span class="o">=</span>cognitora <span class="nt">--timeout</span><span class="o">=</span>10m
<span class="nv">IP</span><span class="o">=</span><span class="si">$(</span>kubectl <span class="nt">-n</span> cognitora get svc cognitora-router <span class="nt">-o</span> <span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{.status.loadBalancer.ingress[0].ip}'</span><span class="si">)</span>
curl <span class="nt">-sS</span> <span class="s2">"http://</span><span class="nv">$IP</span><span class="s2">/v1/chat/completions"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"model":"tinyllama","messages":[{"role":"user","content":"What is 2+2?"}]}'</span>
</code></pre></div></div>

<p><strong>Kubernetes — Helm from a git checkout</strong> (production-shaped chart; wire your own engine / GPU pool—the chart assumes mTLS material unless you adjust values):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/antonellof/cognitora-inference.git <span class="o">&amp;&amp;</span> <span class="nb">cd </span>cognitora-inference
helm <span class="nb">install </span>cognitora ./deploy/kubernetes/helm/cognitora <span class="se">\</span>
  <span class="nt">--namespace</span> cognitora <span class="nt">--create-namespace</span> <span class="se">\</span>
  <span class="nt">--set</span> router.replicas<span class="o">=</span>2 <span class="se">\</span>
  <span class="nt">--set</span> models.llama3-70b.tp<span class="o">=</span>4
</code></pre></div></div>

<p>An <strong><code class="language-plaintext highlighter-rouge">oci://ghcr.io/antonellof/charts/cognitora</code></strong> one-liner is <strong>not</strong> published yet; track it in the repo’s <code class="language-plaintext highlighter-rouge">plan.md</code>. Until then, use the local chart path or the quickstart manifest above.</p>

<p>From source (if you want to read the routing code, which I recommend — it is the most interesting part):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/antonellof/cognitora-inference.git
<span class="nb">cd </span>cognitora-inference
cargo build <span class="nt">--release</span> <span class="nt">--no-default-features</span> <span class="se">\</span>
  <span class="nt">-p</span> cgn-router <span class="nt">-p</span> cgn-agent <span class="nt">-p</span> cgn-kvcached <span class="se">\</span>
  <span class="nt">-p</span> cgn-metrics <span class="nt">-p</span> cgn-ctl <span class="nt">-p</span> cgn-operator
</code></pre></div></div>

<p>Once a router is up, point any OpenAI-compatible client at it. The wire protocol is the lingua franca, so existing application code does not change. Use <strong><code class="language-plaintext highlighter-rouge">/v1/embeddings</code></strong> only when the loaded model is an embedding model; a chat checkpoint correctly returns an error through the stack rather than fake vectors.</p>
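<p>For readers without an OpenAI SDK to hand, here is a minimal standard-library sketch of the same chat request the quickstart sends with <code class="language-plaintext highlighter-rouge">curl</code>. The endpoint path and JSON body are taken from that command; the helper names <code class="language-plaintext highlighter-rouge">build_chat_request</code> and <code class="language-plaintext highlighter-rouge">send</code> and the address <code class="language-plaintext highlighter-rouge">203.0.113.10</code> are placeholders of mine, not part of Cognitora.</p>

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat endpoint without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON reply (needs a reachable router)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# 203.0.113.10 is a documentation-only placeholder; substitute the router address.
request = build_chat_request("http://203.0.113.10", "tinyllama", "What is 2+2?")
```

<p>Building the request separately from sending it keeps the payload easy to inspect; call <code class="language-plaintext highlighter-rouge">send(request)</code> once the router address is real.</p>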

<h2 id="why-this-shape-of-system-now">Why this shape of system, now</h2>

<p>The closing observation is meta. For two years the LLM-inference field has tolerated a stack where Python is in the request path, Kubernetes is the only first-class deployment target, and “the orchestrator” is whatever combination of Nginx, Redis, and homegrown schedulers a given team has glued together. That stack works at startup scale. It does not work at datacenter scale, where a 1% efficiency gain is worth more than a feature, where the difference between sub-500-µs and 5-ms routing decisions shows up as a line item, and where the operations team would prefer a single binary they can <code class="language-plaintext highlighter-rouge">strace</code>.</p>

<p>Cognitora is one answer to <em>“what would the orchestrator look like if it were designed today, for that scale, in one language, with KV cache reuse as the centerpiece rather than an afterthought?”</em> NVIDIA Dynamo is another. Both are credible; they make different bets on the runtime shape (six small Rust binaries vs Rust+Python), the deployment surface (bare-metal-first vs Kubernetes-first), and the engine ecosystem (broad including llama.cpp/OpenAI-compat vs the three industrial engines). Which one fits depends on what your fleet actually looks like — and the fact that there are two well-engineered open-source choices in this layer at all is a meaningful change from where the field was twelve months ago.</p>

<p><strong>Links:</strong></p>

<ul>
  <li><a href="https://github.com/antonellof/cognitora-inference">Cognitora repository</a> — Apache-2.0, Rust 1.89+</li>
  <li><a href="https://github.com/antonellof/cognitora-inference/blob/main/CHANGELOG.md">CHANGELOG</a> · <a href="https://github.com/antonellof/cognitora-inference/releases/tag/v0.3.0">v0.3.0 release</a></li>
  <li><a href="https://github.com/ai-dynamo/dynamo">NVIDIA Dynamo</a> — the closest comparable system</li>
  <li><a href="https://github.com/vllm-project/production-stack">vLLM Production Stack</a> — vLLM-only alternative</li>
  <li><a href="https://github.com/ai-dynamo/nixl">NIXL</a> — the disaggregation transport both projects build on</li>
  <li><a href="https://arxiv.org/abs/2401.09670">DistServe</a> — the original prefill/decode disaggregation paper</li>
  <li><a href="https://docs.ray.io/en/latest/serve/">Ray Serve</a> · <a href="https://kserve.github.io/website/">KServe</a> · <a href="https://github.com/triton-inference-server/server">Triton Inference Server</a> · <a href="https://github.com/bentoml/BentoML">BentoML</a> — generic ML-serving alternatives</li>
</ul>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Systems" /><category term="LLM Inference" /><category term="vLLM" /><category term="SGLang" /><category term="TensorRT-LLM" /><category term="NVIDIA Dynamo" /><category term="KV Cache" /><category term="Disaggregated Inference" /><category term="Rust" /><category term="Kubernetes" /><category term="GPU Orchestration" /><summary type="html"><![CDATA[Inference engines like vLLM, SGLang, and TensorRT-LLM are excellent at saturating one node. They are not, by themselves, a multi-node serving system. Cognitora is a Rust-only orchestration layer shipped as six static binaries—no Python control plane, no Kubernetes-only runtime—that turns those engines into a KV-aware, disaggregated, energy-conscious cluster. This post walks through the architecture, the routing model, and how it stacks up against NVIDIA Dynamo, Ray Serve, KServe, Triton, and the vLLM Production Stack (updated for the v0.3.0 release).]]></summary></entry><entry><title type="html">s0-cli: A Self-Optimizing Security Scanner via Meta-Harness</title><link href="https://www.fratepietro.com/2026/s0-cli-meta-harness-security-scanner/" rel="alternate" type="text/html" title="s0-cli: A Self-Optimizing Security Scanner via Meta-Harness" /><published>2026-04-19T00:00:00+02:00</published><updated>2026-04-20T00:00:00+02:00</updated><id>https://www.fratepietro.com/2026/s0-cli-meta-harness-security-scanner</id><content type="html" xml:base="https://www.fratepietro.com/2026/s0-cli-meta-harness-security-scanner/"><![CDATA[<blockquote>
  <p><strong>Update — 2026-04-20 (v0.3.1).</strong> Since this post first went up:</p>
  <ul>
    <li><strong><code class="language-plaintext highlighter-rouge">vulnhunter_v0</code> harness</strong> — LLM-driven agent that hunts the eight bug classes pattern matchers can’t see (SSRF, IDOR, indirect RCE, auth/session bypass, race conditions, mass assignment, subtle crypto, path traversal). No scanner seeds; pure novelty detection. Found all 3 seeded novel vulns in a Flask test app with concrete attack payloads + fix hints.</li>
    <li><strong><code class="language-plaintext highlighter-rouge">supply_chain</code> composite scanner</strong> — OSV-Scanner (CVEs across all OSS lockfiles) + OpenSSF Scorecard (repo trust signals) + guarddog (malicious-package heuristics for PyPI/npm) in one rule. Found <strong>37 real CVEs</strong> on a vulnerable test target.</li>
    <li><strong>Standalone binaries</strong> for macOS (arm64/x86_64), Linux (x86_64/arm64), Windows. One-liner install via <code class="language-plaintext highlighter-rouge">curl … | bash</code> — no Python required.</li>
    <li><strong>MCP server</strong> (<code class="language-plaintext highlighter-rouge">s0-mcp</code>) + Claude Code skill + Cursor rule, so the AI assistants you already use can call <code class="language-plaintext highlighter-rouge">s0</code> directly.</li>
    <li><strong>8 LLM providers</strong> (added OpenRouter / Ollama local+cloud / self-hosted OpenAI-compatible / Groq / Mistral / DeepSeek / Azure) and <strong>7 output formats</strong> (added Rich terminal default / CSV / GitLab Code Quality / JUnit XML).</li>
    <li><strong>Real-world run on OWASP PyGoat</strong>: 252 raw scanner findings → 14 LLM-triaged real bugs (<a href="https://github.com/antonellof/s0-cli/blob/main/docs/results/REAL_WORLD_RESULTS.md">94% noise reduction, results doc</a>).</li>
  </ul>

  <p>The narrative below is unchanged — the Meta-Harness approach is the point — but the “Try it” section at the end and the architecture/install snippets are updated to match the current state. Updates marked <strong>[v0.3]</strong> inline.</p>
</blockquote>

<p>Static security scanners give you a wall of JSON. Semgrep finds a <code class="language-plaintext highlighter-rouge">subprocess.run(..., shell=True)</code>; bandit flags an <code class="language-plaintext highlighter-rouge">md5</code> call; gitleaks shouts about a token-shaped string in a test fixture. You — the engineer — read every alert, decide which are real, trace data flow by hand, and then close the ones that don’t matter. The scanner doesn’t help with any of that. It can’t, because it doesn’t read source.</p>

<p>The natural next move is to wedge an LLM into the triage step: run the scanners, hand the findings to a model, ask it to mark false positives, assign severities, and write fix hints. That’s the easy part. The hard part is the second-order question: <em>how do you know the LLM triage is good?</em> The standard answer in 2026 is “feels better on my test repo,” which is the same answer 2018 had for hand-tuned semgrep rules and it didn’t age well.</p>

<p><a href="https://github.com/antonellof/s0-cli"><strong>s0-cli</strong></a> is an LLM-driven CLI agent that finds security vulnerabilities and “vibe-code” problems (AI-slop patterns: stub authentication, hallucinated imports, dummy crypto, prompt-injection sinks) in any repository, diff, or single file. The thing I want to talk about in this post isn’t the scanner itself — it’s the loop <em>around</em> the scanner. The whole scanning agent is a single Python file that gets <strong>automatically rewritten by an outer optimization loop</strong>, scored against a labeled benchmark with a held-out test set. This is the <a href="https://yoonholee.com/meta-harness/">Meta-Harness</a> approach (Lee et al., 2026) applied to security triage.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ uv run s0 scan ./my-app

  hallucinated import           src/email.py:8       critical   CWE-829
    `import emailclient` — no such package on PyPI; nearest match is
    `emailclient-aws` (likely typosquat). Suggest pinning `email-validator`.

  SQL injection (f-string)      src/api/users.py:42  critical   CWE-89
    `cur.execute(f"SELECT … {user_id}")`. Use `cur.execute("… ?", (user_id,))`.

  weak password hashing         src/auth/hash.py:7   high       CWE-327
    `hashlib.md5(...)` for password storage. Use `argon2-cffi` or `bcrypt`.

3 findings (1 critical hidden as triage filtered out 6 false positives)
</code></pre></div></div>

<h2 id="the-hybrid-classic-scanners--llm-triage">The hybrid: classic scanners + LLM triage</h2>

<p>The architecture isn’t novel — <code class="language-plaintext highlighter-rouge">s0 scan</code> runs five classic scanners (<code class="language-plaintext highlighter-rouge">semgrep</code>, <code class="language-plaintext highlighter-rouge">bandit</code>, <code class="language-plaintext highlighter-rouge">ruff</code>, <code class="language-plaintext highlighter-rouge">gitleaks</code>, <code class="language-plaintext highlighter-rouge">trivy</code>) plus two AI-slop detectors (<code class="language-plaintext highlighter-rouge">hallucinated_import</code> AST-based, <code class="language-plaintext highlighter-rouge">vibe</code> LLM-based) <strong>[v0.3: + a <code class="language-plaintext highlighter-rouge">supply_chain</code> composite scanner that wraps OSV-Scanner + OpenSSF Scorecard + guarddog]</strong> on the target in parallel, deduplicates by <code class="language-plaintext highlighter-rouge">(path, line, rule_id)</code>, and hands the merged list to a multi-turn LLM agent with a tightly scoped tool surface (read source, grep for taint, blame git history, re-run scanners with tighter rules). For each finding the agent either accepts it (assigning a severity and a <code class="language-plaintext highlighter-rouge">fix_hint</code>) or marks it as a false positive.</p>
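<p>The dedup step can be sketched in a few lines. This is an illustration under assumptions, not the s0-cli implementation: the <code class="language-plaintext highlighter-rouge">Finding</code> fields other than the three key components, the rule ids, and the keep-first policy are all hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    rule_id: str
    scanner: str  # which tool reported it; not part of the dedup key


def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding per (path, line, rule_id), preserving input order."""
    seen: set[tuple[str, int, str]] = set()
    unique: list[Finding] = []
    for finding in findings:
        key = (finding.path, finding.line, finding.rule_id)
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique


# Hypothetical findings; the rule ids are made up for the example.
merged = dedupe([
    Finding("src/auth/hash.py", 7, "weak-hash", "semgrep"),
    Finding("src/auth/hash.py", 7, "weak-hash", "bandit"),  # same key: dropped
    Finding("src/api/users.py", 42, "sql-injection", "semgrep"),
])
```

<p>Because <code class="language-plaintext highlighter-rouge">rule_id</code> is part of the key, two different rules firing on the same line stay separate and both reach the triage agent.</p>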

<p>The scanners do detection; the LLM does triage. That split matters because of how the numbers come out, which I’ll get to in a moment.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Traditional SAST</th>
      <th>s0-cli</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Detection</td>
      <td>one scanner</td>
      <td>5 classic + 1 supply-chain composite + 2 AI-slop detectors, deduped</td>
    </tr>
    <tr>
      <td>Triage</td>
      <td>manual (engineer reads each alert)</td>
      <td>LLM agent reads source, traces taint, marks FPs</td>
    </tr>
    <tr>
      <td>Output</td>
      <td>rule_id + line</td>
      <td>severity + <code class="language-plaintext highlighter-rouge">why_real</code> + <code class="language-plaintext highlighter-rouge">fix_hint</code>, in 7 formats (markdown / JSON / SARIF / Rich terminal / CSV / GitLab CodeQuality / JUnit XML)</td>
    </tr>
    <tr>
      <td>Audit trail</td>
      <td>none</td>
      <td>full prompt + every tool call recorded under <code class="language-plaintext highlighter-rouge">runs/</code></td>
    </tr>
    <tr>
      <td>Reproducibility</td>
      <td>re-run and hope</td>
      <td>replay any past scan from <code class="language-plaintext highlighter-rouge">runs/&lt;id&gt;/</code></td>
    </tr>
  </tbody>
</table>

<p>Everything the agent does — every prompt, every tool call, every LLM response — is recorded under <code class="language-plaintext highlighter-rouge">runs/&lt;timestamp&gt;__&lt;harness&gt;__&lt;id&gt;/</code>. That recording is not just for debugging; it’s also the input to the optimization loop.</p>

<blockquote>
  <p><strong>[v0.3 — two harnesses, two jobs]</strong> The default agent (<code class="language-plaintext highlighter-rouge">baseline_v0_agentic</code>) does the <strong>triage</strong> job described above: scanner seeds in, calibrated findings out. A second agent (<code class="language-plaintext highlighter-rouge">vulnhunter_v0</code>) does <strong>novelty detection</strong> instead: no scanner seeds, just an LLM with the same tool surface and a system prompt that targets the eight classes pattern matchers structurally can’t see (SSRF, IDOR, indirect RCE, auth bypass, race conditions, mass assignment, crypto mistakes, path traversal). The two agents share findings via the same <code class="language-plaintext highlighter-rouge">(path, line, rule_id)</code> fingerprint, so you can run both — <code class="language-plaintext highlighter-rouge">s0 scan ./repo &amp;&amp; s0 scan ./repo --harness vulnhunter_v0</code> — and downstream tools dedup automatically. Calibration of known classes is one problem; finding unknown ones is a different one, with a different optimal harness.</p>
</blockquote>

<h2 id="benchmark-11-labeled-tasks-traintest-split">Benchmark: 11 labeled tasks, train/test split</h2>

<p>Before getting to the loop, the harder problem: what does “good triage” even mean numerically?</p>

<p>The repo ships with 11 labeled tasks under <code class="language-plaintext highlighter-rouge">bench/</code>. Each task is a tiny self-contained target with a <code class="language-plaintext highlighter-rouge">ground_truth.json</code> listing the real vulnerabilities. The scorer matches predictions by <code class="language-plaintext highlighter-rouge">(path, line ± 5)</code>. The split is deliberate:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">bench/tasks_train/</code></strong> — 7 tasks, visible to the optimizer. SQL injection, XSS, hallucinated imports, command injection, weak crypto, unsafe yaml load, path traversal.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">bench/tasks_test/</code></strong> — 4 tasks, held out. Hardcoded secrets, vibe stub auth, pickle deserialization, JWT no-verify.</li>
</ul>

<p>The proposer cannot see <code class="language-plaintext highlighter-rouge">tasks_test/</code> and the loop refuses to start if the train and test paths resolve to the same directory. That last sentence sounds defensive because it is: the temptation to peek at the test set is enormous in any optimization loop, and the easiest way to remove the temptation is to make peeking impossible.</p>
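<p>A guard of that kind is cheap to write. The sketch below shows one way to do it with <code class="language-plaintext highlighter-rouge">pathlib</code>; the function name <code class="language-plaintext highlighter-rouge">check_split</code> is hypothetical, and the nesting check goes beyond what the post describes.</p>

```python
import tempfile
from pathlib import Path


def check_split(train_dir: str, test_dir: str) -> None:
    """Raise if the train and test paths resolve to the same directory."""
    train = Path(train_dir).resolve()
    test = Path(test_dir).resolve()
    if train == test:
        raise ValueError(f"train and test splits both resolve to {train}")
    # Extra assumption beyond the post: also reject nesting, which would leak tasks.
    if train in test.parents or test in train.parents:
        raise ValueError("one split directory is nested inside the other")


with tempfile.TemporaryDirectory() as root:
    train = Path(root, "tasks_train")
    test = Path(root, "tasks_test")
    train.mkdir()
    test.mkdir()
    check_split(str(train), str(test))  # disjoint directories: no exception
    try:
        # A differently spelled path that resolves back to the train directory.
        check_split(str(train), str(test / ".." / "tasks_train"))
        same_dir_rejected = False
    except ValueError:
        same_dir_rejected = True
```

<p>Resolving both paths first is what defeats symlinks and <code class="language-plaintext highlighter-rouge">..</code> segments; comparing the raw strings would not.</p>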

<p>Two configurations on <code class="language-plaintext highlighter-rouge">openai/gpt-4o-mini</code>:</p>

<table>
  <thead>
    <tr>
      <th>Configuration</th>
      <th>Split</th>
      <th>TP</th>
      <th>FP</th>
      <th>FN</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1</th>
      <th>Cost (in/out tokens)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--no-llm</code> (raw scanners only)</td>
      <td>train</td>
      <td>8</td>
      <td>25</td>
      <td>0</td>
      <td>0.24</td>
      <td><strong>1.00</strong></td>
      <td>0.39</td>
      <td>0 / 0</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">--no-llm</code> (raw scanners only)</td>
      <td>test</td>
      <td>5</td>
      <td>10</td>
      <td>0</td>
      <td>0.33</td>
      <td><strong>1.00</strong></td>
      <td>0.50</td>
      <td>0 / 0</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">baseline_v0_agentic</code> (LLM)</td>
      <td>train</td>
      <td>8</td>
      <td>23</td>
      <td>0</td>
      <td>0.26</td>
      <td><strong>1.00</strong></td>
      <td>0.41</td>
      <td>97k / 6k</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">baseline_v0_agentic</code> (LLM)</td>
      <td>test</td>
      <td>5</td>
      <td><strong>7</strong></td>
      <td>0</td>
      <td><strong>0.42</strong></td>
      <td><strong>1.00</strong></td>
      <td><strong>0.59</strong></td>
      <td>60k / 2k</td>
    </tr>
  </tbody>
</table>

<p>What this shows:</p>

<ul>
  <li><strong>Recall = 1.00 in every configuration.</strong> Across all 13 ground-truth vulnerabilities (train + test) — SQL injection, command injection, hallucinated imports, path traversal, weak crypto, hardcoded secrets, JWT no-verify, pickle deserialization, stub auth, … — the deterministic scanner pipeline alone catches every one. The LLM never has to <em>find</em> anything; its job is purely to triage what was already found.</li>
  <li><strong>LLM triage cuts false positives by 30% on the held-out set</strong> (10 → 7) without dropping a single true positive. Held-out F1 climbs from 0.50 to <strong>0.59</strong> (+18% relative).</li>
  <li><strong>Every scan finishes within a fixed turn budget</strong> (median 5 turns, max 11 in this run) and a fixed token budget. No runaway costs.</li>
  <li><strong>The held-out test split was never seen by the LLM during any optimization run</strong> — generalization is measured, not assumed.</li>
</ul>

<p>The train F1 only moves from 0.39 → 0.41. The test F1 moves from 0.50 → 0.59. That asymmetry is the most interesting line in the table: the LLM’s triage <em>generalizes</em>, and on the held-out tasks it removes 30% of false positives without losing recall. The <code class="language-plaintext highlighter-rouge">--no-llm</code> mode stays useful as a free anchor — you keep 100% recall at zero LLM cost, at the price of more false positives to skim through. Most CI pipelines will want the LLM mode on PR diffs (small target, low token cost, accurate triage) and the no-LLM mode on full-repo nightly scans.</p>
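<p>The table's metrics follow directly from the TP/FP/FN counts, and the arithmetic is worth spelling out. In this sketch, the ± 5 line tolerance comes from the scorer description above; the greedy one-to-one matching and the function names are my assumptions about how such a scorer could work, not a copy of the repo's code.</p>

```python
def count_matches(preds, truths, tolerance=5):
    """Greedy one-to-one matching on (path, line); returns (tp, fp, fn)."""
    unmatched = list(truths)
    tp = 0
    for path, line in preds:
        hit = next(
            (t for t in unmatched if t[0] == path and abs(t[1] - line) <= tolerance),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each ground-truth label is matched at most once
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def precision_recall_f1(tp, fp, fn):
    """Standard definitions, with 0.0 when a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical predictions: one near-miss within tolerance, one stray, one exact.
counts = count_matches(
    preds=[("src/api/users.py", 44), ("src/api/users.py", 120), ("src/auth/hash.py", 7)],
    truths=[("src/api/users.py", 42), ("src/auth/hash.py", 7)],
)

# Counts from the held-out rows of the table above.
raw_test = tuple(round(x, 2) for x in precision_recall_f1(tp=5, fp=10, fn=0))
llm_test = tuple(round(x, 2) for x in precision_recall_f1(tp=5, fp=7, fn=0))
```

<p>Plugging in the held-out rows gives <code class="language-plaintext highlighter-rouge">(0.33, 1.0, 0.5)</code> for raw scanners and <code class="language-plaintext highlighter-rouge">(0.42, 1.0, 0.59)</code> for the LLM harness, matching the table. With recall pinned at 1.00, all of the F1 movement comes from removing false positives.</p>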

<p>A statistical-honesty note that I’ll repeat throughout: 11 tasks is a small bench. The +0.09 test-F1 delta is one model on one bench on one run; it would be premature to claim the same delta will hold for <code class="language-plaintext highlighter-rouge">claude-sonnet-4-5</code> on a 200-task bench. What I can claim is that the <em>measurement infrastructure</em> exists, the test split is honest, and the loop is set up to keep producing those numbers as the bench grows.</p>

<h2 id="the-meta-harness-loop">The Meta-Harness loop</h2>

<p>So now the second-order question. The scanner achieves train F1=0.41 / test F1=0.59. How do you make those numbers go up <em>without</em> hand-tuning?</p>

<p>The standard answer is “iterate on the prompt.” That’s fine, but it has obvious limits:</p>

<ul>
  <li>The thing that changes is a string in a config file, but a real triage decision involves prompts <em>and</em> tool selection <em>and</em> dedup heuristics <em>and</em> severity calibration <em>and</em> when to give up.</li>
  <li>“Better” is measured by vibes, not by F1.</li>
  <li>There’s no guard against overfitting your dev repo.</li>
  <li>There’s no audit trail of what you tried and why each variant was rejected.</li>
</ul>

<p>The Meta-Harness paper (<a href="https://arxiv.org/abs/2603.28052">Lee et al., 2026</a>) generalizes the loop. The unit of mutation isn’t a string — it’s the entire single-file agent (prompts + tools + scanner-selection + dedup logic, ~300–500 lines of Python). The unit of progress is a labeled bench scored by F1, precision, recall, tokens, turns. The guard against overfitting is a held-out test set the proposer literally cannot read. And the history is a directory full of every attempt, every score, every trace.</p>

<p><code class="language-plaintext highlighter-rouge">s0 optimize</code> runs that loop:</p>

<ol>
  <li>A coding-agent <strong>proposer</strong> reads <code class="language-plaintext highlighter-rouge">runs/</code> (every prior agent, every score, every tool trace), forms a hypothesis about the worst current failure mode, and writes a new harness file under <code class="language-plaintext highlighter-rouge">src/s0_cli/harnesses/</code>.</li>
  <li>The <strong>runner</strong> validates and re-scores it on <code class="language-plaintext highlighter-rouge">bench/tasks_train/</code>.</li>
  <li>After all training iterations finish, the <strong>best-train-F1 candidate is scored once on the disjoint <code class="language-plaintext highlighter-rouge">bench/tasks_test/</code></strong> to measure generalization.</li>
</ol>

<p>The proposer’s contract is in <a href="https://github.com/antonellof/s0-cli/blob/main/SKILL.md"><code class="language-plaintext highlighter-rouge">SKILL.md</code></a>, which is read by the outer loop. It pins the interface (must subclass <code class="language-plaintext highlighter-rouge">Harness</code>, must implement <code class="language-plaintext highlighter-rouge">async def scan(self, target: Target) -&gt; ScanResult</code>, must run within budgets) and forbids the obvious cheats — touching <code class="language-plaintext highlighter-rouge">bench/tasks_test/</code> is automatic disqualification, hardcoding bench task names is instant disqualification on held-out, and so on. The contract is short on purpose; the proposer needs room to be creative on what it changes.</p>
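<p>To make the shape of that contract concrete, here is a toy candidate. Only the class name <code class="language-plaintext highlighter-rouge">Harness</code> and the <code class="language-plaintext highlighter-rouge">scan</code> signature come from the contract as quoted; the fields on <code class="language-plaintext highlighter-rouge">Target</code> and <code class="language-plaintext highlighter-rouge">ScanResult</code>, the <code class="language-plaintext highlighter-rouge">max_turns</code> attribute, and <code class="language-plaintext highlighter-rouge">PassthroughHarness</code> are stand-ins invented for this sketch, so the real classes in s0-cli may look different.</p>

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Stand-ins for illustration only; the real Target, ScanResult, and Harness
# in s0-cli may have different fields and constructors.
@dataclass
class Target:
    path: str


@dataclass
class ScanResult:
    findings: list[dict] = field(default_factory=list)
    turns_used: int = 0


class Harness(ABC):
    max_turns: int = 11  # hypothetical budget attribute

    @abstractmethod
    async def scan(self, target: Target) -> ScanResult:
        """Return triaged findings for the target."""


class PassthroughHarness(Harness):
    """Trivial candidate: accepts every seed finding under the target, no LLM calls."""

    def __init__(self, seeds: list[dict]):
        self.seeds = seeds

    async def scan(self, target: Target) -> ScanResult:
        kept = [f for f in self.seeds if f["path"].startswith(target.path)]
        return ScanResult(findings=kept, turns_used=0)


seeds = [
    {"path": "src/api/users.py", "line": 42, "rule_id": "sql-injection"},
    {"path": "tests/test_users.py", "line": 10, "rule_id": "sql-injection"},
]
result = asyncio.run(PassthroughHarness(seeds).scan(Target("src/")))
```

<p>A real candidate would replace the list comprehension with the multi-turn triage logic; the point is that everything the proposer changes has to fit behind one async method.</p>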

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Hand-tuning prompts/rules</th>
      <th>Meta-Harness loop</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>What changes</strong></td>
      <td>a string in a config file</td>
      <td>a whole single-file Python agent (prompts + tools + scanner-selection + dedup logic)</td>
    </tr>
    <tr>
      <td><strong>What measures progress</strong></td>
      <td>“feels better on my test repo”</td>
      <td>a labeled bench scored by F1, precision, recall, tokens, turns</td>
    </tr>
    <tr>
      <td><strong>What guards overfitting</strong></td>
      <td>nothing</td>
      <td>held-out <code class="language-plaintext highlighter-rouge">bench/tasks_test/</code> the proposer never sees</td>
    </tr>
    <tr>
      <td><strong>History</strong></td>
      <td><code class="language-plaintext highlighter-rouge">git log</code> of edits, no scores attached</td>
      <td>every attempt + score + full trace lives forever in <code class="language-plaintext highlighter-rouge">runs/&lt;id&gt;/</code></td>
    </tr>
    <tr>
      <td><strong>Cost vs. accuracy</strong></td>
      <td>implicit; you pick one config</td>
      <td>explicit Pareto frontier (F1 ↑ vs. tokens ↓) snapshotted to <code class="language-plaintext highlighter-rouge">runs/_frontier.json</code></td>
    </tr>
    <tr>
      <td><strong>Reproducibility</strong></td>
      <td>rerun and hope</td>
      <td><code class="language-plaintext highlighter-rouge">s0 runs show &lt;id&gt;</code> replays the exact harness file, prompts, and tool calls</td>
    </tr>
    <tr>
      <td><strong>Rollback</strong></td>
      <td>manual revert</td>
      <td>the prior harness file is still on disk; just point <code class="language-plaintext highlighter-rouge">S0_DEFAULT_HARNESS</code> at it</td>
    </tr>
  </tbody>
</table>

<h2 id="why-this-matters-more-than-iterating-on-the-prompt">Why this matters more than “iterating on the prompt”</h2>

<p>A handful of properties fall out of the loop that are not available in any “edit the prompt and rerun” workflow:</p>

<p><strong>Search beats intuition.</strong> The proposer can try ideas a human wouldn’t bother with — “lower confidence on bandit B608 inside <code class="language-plaintext highlighter-rouge">tests/</code> directories”, “escalate to critical when <code class="language-plaintext highlighter-rouge">pickle.loads</code> is reachable from a Flask handler”, “skip semgrep’s <code class="language-plaintext highlighter-rouge">python.lang.security.audit.dangerous-subprocess-use</code> for <code class="language-plaintext highlighter-rouge">subprocess.run</code> calls whose first argument is a list literal” — and <em>measure</em> whether each one helps. Most will not. That’s fine; the ones that do compound.</p>

<p><strong>Pareto, not point estimates.</strong> Real choice in CI isn’t “best F1”, it’s “best F1 at the token budget I can afford on a PR”. After every iteration the Pareto frontier (F1 vs. tokens) is snapshotted to <code class="language-plaintext highlighter-rouge">runs/_frontier.json</code>. You get a menu: “harness A is the best F1 at any cost; harness B is the best F1 below 50k tokens; harness C is the best F1 below 10k tokens.” You pick whichever fits the deadline.</p>
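<p>The frontier itself is a small computation. The sketch below keeps every candidate that no other candidate beats on both axes, then picks the best F1 under a token budget. The candidate names and numbers are invented for the example and are not measurements from the bench, and the layout of <code class="language-plaintext highlighter-rouge">runs/_frontier.json</code> is not assumed here.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    f1: float
    tokens: int


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.f1 >= b.f1 and a.tokens <= b.tokens
    strictly_better = a.f1 > b.f1 or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Non-dominated candidates, sorted from cheapest to most expensive."""
    kept = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(kept, key=lambda c: c.tokens)


def best_under(frontier: list[Candidate], token_budget: int):
    """Highest-F1 frontier entry within the budget, or None if nothing fits."""
    affordable = [c for c in frontier if c.tokens <= token_budget]
    return max(affordable, key=lambda c: c.f1, default=None)


# Invented numbers: D costs more than B and scores lower, so it drops out.
frontier = pareto_frontier([
    Candidate("A", f1=0.70, tokens=120_000),
    Candidate("B", f1=0.62, tokens=45_000),
    Candidate("C", f1=0.50, tokens=8_000),
    Candidate("D", f1=0.55, tokens=60_000),
])
pick = best_under(frontier, token_budget=50_000)
```

<p>With these inputs the frontier is C, B, A and the pick under 50k tokens is B, which is the "menu" described above in code form.</p>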

<p><strong>Generalization is enforced, not assumed.</strong> The proposer can’t see <code class="language-plaintext highlighter-rouge">tasks_test/</code>. The loop refuses to start if the train and test paths resolve to the same directory. So a +0.1 F1 on train that comes with a -0.05 test gap shows up in the summary table — you can’t cheat your own benchmark, even by accident.</p>

<p><strong>Every iteration is auditable.</strong> Each attempt is one new file plus a <code class="language-plaintext highlighter-rouge">runs/&lt;id&gt;/</code> directory containing <code class="language-plaintext highlighter-rouge">harness.py</code>, <code class="language-plaintext highlighter-rouge">score.json</code>, <code class="language-plaintext highlighter-rouge">summary.md</code>, and per-task traces with the full prompt and every tool call. Disk-as-database; no schema migrations, just <code class="language-plaintext highlighter-rouge">grep</code>. When the team six months from now asks “why does <code class="language-plaintext highlighter-rouge">baseline_v3_taint</code> skip semgrep on test files?”, the answer is in <code class="language-plaintext highlighter-rouge">runs/2026-04-12_…/score.json</code> next to the diff that introduced it.</p>

<h2 id="the-outer-loop-reads-the-inner-loop-recursion">The “outer loop reads the inner loop” recursion</h2>

<p>The bit that makes me happiest about this design is also the bit that took me the longest to internalize. The proposer doesn’t optimize against scores. It optimizes against <strong>traces</strong>.</p>

<p>When the proposer wakes up at the start of an iteration, the first thing it does is <code class="language-plaintext highlighter-rouge">s0 runs list --frontier</code> to find the current best harnesses. The second thing it does is <code class="language-plaintext highlighter-rouge">s0 runs tail-traces &lt;run_id&gt; &lt;task_id&gt;</code> for each failure mode it suspects. It reads the actual prompt and the actual LLM response and the actual tool call sequence that produced the false positive. <em>Then</em> it forms a hypothesis. The Meta-Harness paper’s §A.1 reports a median of 82 file reads per iteration in the tbench2 setting; the SKILL.md instructs the proposer to read at least 3-5 prior trace files for each suspected failure mode, because “optimize from scores alone” is the ablation that loses 15 points on the original bench.</p>

<p>This matters because the failure mode of a triage agent is rarely “the F1 is low.” It’s usually “on this specific path-traversal task, the LLM read the wrong file first, ran out of turn budget on a tangent, and gave up before ever looking at <code class="language-plaintext highlighter-rouge">routes.py</code>.” That diagnosis is in the trace. It is not in the score. A proposer that only sees scores will rewrite the prompt; a proposer that reads traces will increase the turn cap, or change the scanner ordering, or add a <code class="language-plaintext highlighter-rouge">git_blame</code> step before <code class="language-plaintext highlighter-rouge">read_file</code>.</p>

<h2 id="multi-candidate-proposals">Multi-candidate proposals</h2>

<p>There’s one more knob worth highlighting because it changes the cost model. Pass <code class="language-plaintext highlighter-rouge">-k N</code> (or <code class="language-plaintext highlighter-rouge">--candidates N</code>) to fan out <strong>N parallel proposals per iteration</strong>, each with a different temperature, seed harness, and focus directive. The runner evaluates them concurrently and keeps the highest-F1 winner; losers are still recorded under <code class="language-plaintext highlighter-rouge">runs/</code> so you can see what each design slot tried.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 2 parallel proposals per iteration; pick the better one each time</span>
uv run s0 optimize <span class="nt">-n</span> 5 <span class="nt">-k</span> 2 <span class="nt">--run-name</span> exp_multicand <span class="nt">--fresh</span>
</code></pre></div></div>

<p>Token cost scales linearly with <code class="language-plaintext highlighter-rouge">k</code>, but wall-clock time stays roughly constant (the proposers run concurrently). The strategy ladder lives in <a href="https://github.com/antonellof/s0-cli/blob/main/src/s0_cli/optimizer/strategies.py"><code class="language-plaintext highlighter-rouge">src/s0_cli/optimizer/strategies.py</code></a> and is deterministic: <code class="language-plaintext highlighter-rouge">k=2</code> always means slot 0 (greedy, exploit) plus slot 1 (warmer, “shrink token cost”), so reruns hit the same regions of design space.</p>

<p>This is the part of the loop that feels most like classical optimization: you’re not just gradient-descending one harness, you’re doing beam search over a population of harnesses with different temperatures and different objectives, and the disk-resident <code class="language-plaintext highlighter-rouge">runs/</code> directory is the population history.</p>

<h2 id="honest-limits">Honest limits</h2>

<p>I don’t want to oversell the size of what’s been measured here. A few caveats worth keeping in mind if you’re considering using this in production or running the loop yourself:</p>

<ul>
  <li><strong>11 tasks is small.</strong> The +0.09 test-F1 delta from <code class="language-plaintext highlighter-rouge">--no-llm</code> to <code class="language-plaintext highlighter-rouge">baseline_v0_agentic</code> is one model on one bench. The bench needs to grow before any of these absolute numbers should be quoted as evidence about real CI cost vs accuracy tradeoffs. Adding tasks is documented in <a href="https://github.com/antonellof/s0-cli/blob/main/bench/README.md"><code class="language-plaintext highlighter-rouge">bench/README.md</code></a> and is the most useful contribution someone could make right now.</li>
  <li><strong>Recall = 1.00 is partly a property of the bench.</strong> Every ground-truth label in the train and test set is something one of the five classic scanners catches, by construction. A bench item like “side-channel timing leak in a custom JWT verifier” would not be caught by any current scanner, and the LLM-only <code class="language-plaintext highlighter-rouge">vibe</code> detector would have to find it from scratch. Adding tasks that <em>only</em> the vibe detector catches is the next thing that needs to happen to stress the LLM-as-detector path rather than the LLM-as-triage path. <strong>[v0.3 update]</strong> The <code class="language-plaintext highlighter-rouge">vulnhunter_v0</code> harness is the first concrete attempt at this — it found all 3 seeded SSRF / IDOR / RCE-via-indirection bugs in a custom Flask test app <em>without</em> any scanner seeds, but those numbers aren’t yet in the train/test bench above. Adding novelty-class tasks to <code class="language-plaintext highlighter-rouge">tasks_test/</code> is the obvious next benchmark contribution.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">gpt-4o-mini</code> is the cheap baseline.</strong> The numbers above are for the model that maximizes “interesting per dollar.” <code class="language-plaintext highlighter-rouge">claude-sonnet-4-5</code> is the default in <code class="language-plaintext highlighter-rouge">.env.example</code> and likely produces sharper triage; I haven’t run the full optimize loop on it because each iteration is non-trivially expensive and I want the bench to grow first.</li>
  <li><strong>The optimize loop is a research artifact, not a CI tool.</strong> <code class="language-plaintext highlighter-rouge">s0 scan</code> is the production path; <code class="language-plaintext highlighter-rouge">s0 optimize</code> is what produces <em>better</em> <code class="language-plaintext highlighter-rouge">s0 scan</code> configurations. Running optimize on every PR would be cost-prohibitive and beside the point.</li>
</ul>

<h2 id="try-it">Try it</h2>

<p><strong>Install (no Python required, v0.3.1+):</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Standalone binary — autodetects OS/arch, verifies SHA-256, installs to /usr/local/bin</span>
curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/antonellof/s0-cli/main/install.sh | bash

<span class="c"># Or pin a version + install into ~/.local without sudo</span>
curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/antonellof/s0-cli/main/install.sh <span class="se">\</span>
  | bash <span class="nt">-s</span> <span class="nt">--</span> <span class="nt">--version</span> v0.3.1 <span class="nt">--prefix</span> <span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.local"</span>
</code></pre></div></div>

<p>The bundle ships every LLM provider plugin (Anthropic / OpenAI / Gemini / OpenRouter / Ollama / Groq / Mistral / DeepSeek / Azure) and every harness; you only install the SAST scanners you want. <code class="language-plaintext highlighter-rouge">s0 doctor</code> reports which are present.</p>

<p><strong>Or from source (recommended for development):</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/antonellof/s0-cli.git
<span class="nb">cd </span>s0-cli
uv <span class="nb">sync</span>                    <span class="c"># Python 3.12+, uv &gt;= 0.5</span>

<span class="nb">cp</span> .env.example .env       <span class="c"># then fill in one provider key</span>
</code></pre></div></div>

<p><strong>Use it:</strong></p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Default agent — triage classic + AI-slop scanner findings</span>
s0 scan ./your/repo

<span class="c"># v0.3: hunt UNKNOWN vulnerability classes (SSRF, IDOR, indirect RCE, ...)</span>
<span class="c"># — LLM-driven, no scanner seeds</span>
s0 scan ./your/repo <span class="nt">--harness</span> vulnhunter_v0

<span class="c"># v0.3: just the supply-chain layer (CVEs + repo trust + malicious-pkg heuristics)</span>
s0 scan ./your/repo <span class="nt">--no-llm</span> <span class="nt">--scanner</span> supply_chain

<span class="c"># Score the default harness on the training bench</span>
s0 <span class="nb">eval</span>

<span class="c"># Score on the held-out test set</span>
s0 <span class="nb">eval</span> <span class="nt">--split</span> <span class="nb">test</span>

<span class="c"># Run the optimize loop (5 iterations, then a held-out pass)</span>
s0 optimize <span class="nt">-n</span> 5 <span class="nt">--run-name</span> exp1 <span class="nt">--fresh</span>
</code></pre></div></div>

<p><strong>Combine the agents for full coverage:</strong> <code class="language-plaintext highlighter-rouge">s0 scan ./repo &amp;&amp; s0 scan ./repo --harness vulnhunter_v0</code>. The first calibrates known-class findings; the second hunts what pattern matchers can’t see. Findings from both runs share the same fingerprint, so SARIF / GitLab CodeQuality / JUnit reports dedup automatically downstream.</p>
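<p>A rough sketch of what fingerprint-based dedup across the two runs can look like. The fingerprint recipe here (a hash of rule id, path, and whitespace-normalized snippet) and the finding fields are my assumptions for illustration; the actual s0-cli fingerprint may be computed differently.</p>

```python
import hashlib


def fingerprint(rule_id: str, path: str, snippet: str) -> str:
    """Hypothetical fingerprint: stable hash of rule, file, and code snippet."""
    normalized = " ".join(snippet.split())  # ignore whitespace-only changes
    payload = "\x1f".join((rule_id, path, normalized))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def merge_findings(*runs):
    """Merge finding lists, keeping the first finding seen per fingerprint."""
    merged = {}
    for run in runs:
        for finding in run:
            key = fingerprint(finding["rule_id"], finding["path"], finding["snippet"])
            merged.setdefault(key, finding)
    return list(merged.values())


# Made-up findings: the second run repeats one finding with different spacing.
default_run = [
    {"rule_id": "sql-injection", "path": "app/db.py", "snippet": "cur.execute(q % name)"},
]
vulnhunter_run = [
    {"rule_id": "sql-injection", "path": "app/db.py", "snippet": "cur.execute(q  %  name)"},
    {"rule_id": "ssrf", "path": "app/fetch.py", "snippet": "requests.get(url)"},
]

combined = merge_findings(default_run, vulnhunter_run)
print(len(combined))
```

<p>The point of the design is that dedup needs no coordination between runs: any consumer that keys on the fingerprint collapses repeats on its own.</p>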

<p><strong>Real-world numbers — OWASP PyGoat [v0.3]:</strong> running the default agent on the <a href="https://github.com/adeyosemanputra/pygoat">PyGoat</a> intentionally-vulnerable Django app reduced <strong>252 raw scanner findings to 14 LLM-triaged real bugs</strong> — a 94% noise reduction without dropping a single ground-truth vuln. Full session including the Pareto frontier after one <code class="language-plaintext highlighter-rouge">s0 optimize</code> iteration is reproducible from <a href="https://github.com/antonellof/s0-cli/blob/main/docs/results/REAL_WORLD_RESULTS.md"><code class="language-plaintext highlighter-rouge">docs/results/REAL_WORLD_RESULTS.md</code></a>.</p>

<p><strong>Drop-in CI integrations ship with the repo:</strong> a <a href="https://github.com/antonellof/s0-cli/blob/main/action.yml">GitHub Action</a>, a <a href="https://github.com/antonellof/s0-cli/blob/main/Dockerfile">multi-arch Docker image</a> with every scanner pre-installed, and a <a href="https://github.com/antonellof/s0-cli/blob/main/.pre-commit-hooks.yaml">pre-commit hook pair</a>. Seven output formats: <code class="language-plaintext highlighter-rouge">terminal</code> (Rich-based, default in TTY), <code class="language-plaintext highlighter-rouge">markdown</code>, <code class="language-plaintext highlighter-rouge">json</code>, <code class="language-plaintext highlighter-rouge">sarif</code> (GitHub code-scanning / GitLab SAST), <code class="language-plaintext highlighter-rouge">csv</code>, <code class="language-plaintext highlighter-rouge">gitlab</code> (Code Quality JSON for MR widgets), <code class="language-plaintext highlighter-rouge">junit</code> (XML for any CI test reporter).</p>

<p><strong>[v0.3] Use it from your AI assistant.</strong> A built-in MCP server (<code class="language-plaintext highlighter-rouge">s0-mcp</code>) plus a <a href="https://github.com/antonellof/s0-cli/blob/main/.claude/skills/s0-cli/SKILL.md">Claude Code skill</a> and a <a href="https://github.com/antonellof/s0-cli/blob/main/.cursor/rules/s0-cli.mdc">Cursor rule</a> let Claude Code, Cursor, or any MCP-aware client invoke <code class="language-plaintext highlighter-rouge">s0 scan</code> / <code class="language-plaintext highlighter-rouge">s0 scan --diff</code> / <code class="language-plaintext highlighter-rouge">s0 list_scanners</code> / <code class="language-plaintext highlighter-rouge">s0 list_harnesses</code> directly. Install guide: <a href="https://github.com/antonellof/s0-cli/blob/main/docs/integrations/INSTALL.md"><code class="language-plaintext highlighter-rouge">docs/integrations/INSTALL.md</code></a>.</p>

<p>I’m particularly interested in feedback from anyone running an LLM-augmented SAST in CI today. The hypothesis I’m trying to falsify — that <em>the agent itself should be the optimization variable, not the prompt</em> — needs validation from people who’ve actually paid the LLM bills on real PR traffic. The 11-task bench is a starting point, not an answer.</p>

<p><strong>Links:</strong></p>
<ul>
  <li><a href="https://github.com/antonellof/s0-cli">GitHub repository</a> · <a href="https://github.com/antonellof/s0-cli/releases/latest">latest release</a></li>
  <li><a href="https://github.com/antonellof/s0-cli/blob/main/README.md">README</a> — top-level overview, quickstart, CI integrations</li>
  <li><a href="https://github.com/antonellof/s0-cli/blob/main/SKILL.md"><code class="language-plaintext highlighter-rouge">SKILL.md</code></a> — proposer contract read by the outer loop</li>
  <li><a href="https://github.com/antonellof/s0-cli/blob/main/bench/README.md"><code class="language-plaintext highlighter-rouge">bench/README.md</code></a> — task layout and how to add new ones</li>
  <li><a href="https://github.com/antonellof/s0-cli/blob/main/docs/results/REAL_WORLD_RESULTS.md"><code class="language-plaintext highlighter-rouge">docs/results/REAL_WORLD_RESULTS.md</code></a> — PyGoat case study, 94% noise reduction</li>
  <li><a href="https://github.com/antonellof/s0-cli/blob/main/docs/integrations/INSTALL.md"><code class="language-plaintext highlighter-rouge">docs/integrations/INSTALL.md</code></a> — Claude Code / Cursor / generic MCP integration guide</li>
  <li>Lee et al. <strong>Meta-Harness: End-to-End Optimization of Model Harnesses.</strong> arXiv:2603.28052 (2026). <a href="https://arxiv.org/abs/2603.28052">paper</a> · <a href="https://github.com/stanford-iris-lab/meta-harness">code</a></li>
  <li>KRAFTON AI &amp; Ludo Robotics. <strong>Terminus-KIRA.</strong> <a href="https://github.com/krafton-ai/KIRA">github.com/krafton-ai/KIRA</a></li>
</ul>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Security" /><category term="LLM Agents" /><category term="Static Analysis" /><category term="SAST" /><category term="Meta-Harness" /><category term="AI Security" /><category term="Optimization" /><category term="Supply Chain" /><category term="MCP" /><summary type="html"><![CDATA[Most security tools encode their heuristics in scattered config files or undocumented engineer intuition. s0-cli encodes them in a versioned single-file Python agent that gets automatically rewritten by an outer optimization loop, scored against a labeled benchmark with a held-out test split. On openai/gpt-4o-mini the LLM triage layer cuts false positives by 30% on held-out tasks (10 → 7) without dropping a single true positive — held-out F1 climbs from 0.50 to 0.59 while keeping recall at 1.00. v0.3.x adds a novelty-hunting harness for unknown vuln classes, a supply-chain composite scanner, MCP integration for Claude/Cursor, and standalone binaries.]]></summary></entry><entry><title type="html">MARS: Episode-Scoped GPU Retrieval for Real-Time Embodied AI</title><link href="https://www.fratepietro.com/2026/mars-gpu-memory-realtime-ai/" rel="alternate" type="text/html" title="MARS: Episode-Scoped GPU Retrieval for Real-Time Embodied AI" /><published>2026-04-10T00:00:00+02:00</published><updated>2026-04-10T00:00:00+02:00</updated><id>https://www.fratepietro.com/2026/mars-gpu-memory-realtime-ai</id><content type="html" xml:base="https://www.fratepietro.com/2026/mars-gpu-memory-realtime-ai/"><![CDATA[<p>A child’s ball rolls into the road from behind a parked van. The vehicle’s camera sees only the ball. But 600 ms earlier, the microphones captured children’s voices from that same direction — a memory that, if retrievable now, raises the prior that a child may follow the ball into the street.</p>

<p>The useful memory is 600 ms old, from a different modality than the query, and low in cosine similarity compared to countless irrelevant alternatives. No ranking by similarity alone can surface it.</p>

<p>The interesting observation is that an embodied perception stack already knows more than “find me the nearest vector” at query time. It carries an active <strong>track id</strong>, a <strong>dialogue session</strong>, an <strong>AR room</strong>, or a <strong>robot sub-task</strong> — the right answer almost always lives inside that <em>episode</em>. The whole content of <a href="https://github.com/antonellof/MARS">MARS</a> — <em>Memory for Autonomous Real-time Systems</em> — is what happens when you take that observation seriously and push it into a GPU kernel.</p>

<p><strong><a href="https://www.fratepietro.com/papers/MARS/main.pdf">Read the full paper (PDF, 1.97 MB)</a></strong></p>

<p>MARS treats episode handles as a kernel parameter. When the application can supply the active track / session / room / sub-task id, retrieval becomes a <strong>197 µs</strong> operation at N=1M — 33× faster than the only GPU baseline that retains perfect cross-modal recall on this contract, and the only configuration that fits inside the 1 ms autonomous-vehicle deadline at that corpus size. The rest of this post is what falls out of taking that one design move seriously: how the kernel pipeline changes, how the multimodal graph is structured, what happens to the recall axis at scale, and what the contract is <em>not</em> good for.</p>

<h2 id="what-existing-libraries-do-well--and-where-they-stop">What existing libraries do well — and where they stop</h2>

<p>FAISS GPU and cuVS CAGRA are excellent at finding the K most similar vectors in a static corpus. The contract they expose is: <em>over the entire indexed corpus, return the K vectors with highest cosine similarity</em>. For document retrieval, recommendation, image search — that’s the right contract.</p>

<p>Embodied workloads typically know more than that at query time. The question isn’t “find me the closest vector globally” — it’s “find any recent sensor evidence, across any modality, relevant to <em>this current track</em>.” Encoding that knowledge as host-side post-filtering on the output of a global ANN sweep wastes GPU work and, as the head-to-head benchmark below shows, breaks recall at scale on tiny multimodal clusters.</p>

<p>A first concrete instance of the same idea is temporal decay. Recency is a first-class ranking signal in a sensor stack — but cosine ANN libraries don’t fold it into the kernel:</p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th>Temporal Precision@10</th>
      <th>p99 latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FAISS Flat (cosine only)</td>
      <td>0.218</td>
      <td>0.13 ms</td>
    </tr>
    <tr>
      <td>FAISS + post-hoc temporal filter</td>
      <td>0.910</td>
      <td>0.25 ms</td>
    </tr>
    <tr>
      <td><strong>MARS</strong> (native temporal decay)</td>
      <td><strong>0.910</strong></td>
      <td><strong>0.26 ms</strong></td>
    </tr>
    <tr>
      <td>Ring buffer + cuBLAS SGEMV (N=2,400)</td>
      <td>—</td>
      <td>0.12 ms</td>
    </tr>
  </tbody>
</table>

<p>MARS matches FAISS+filter at identical TP@10 (0.910) and a comparable per-query latency (0.26 ms vs 0.25 ms p99). The 0.01 ms gap is within run-to-run noise on a non-locked-clock A100, so the contribution here is <strong>API consolidation</strong> (the temporal filter becomes a kernel parameter rather than a second pipeline stage), not a raw speedup. A raw cuBLAS-only ring buffer is roughly 2.2× faster (0.12 ms vs 0.26 ms p99, at N=2,400) but provides no temporal decay, no cross-modal retrieval, and no streaming insertion.</p>
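<p>For readers who want the scoring rule spelled out, here is a CPU-side sketch of cosine similarity multiplied by an exponential recency term. It is plain Python rather than the CUDA kernel, and the function names, the decay rate, and the toy vectors are assumptions of mine.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decayed_top_k(query, memories, now, decay_rate, k):
    """Rank memories by cosine(query, embedding) * exp(-decay_rate * age)."""
    scored = []
    for memory_id, embedding, timestamp in memories:
        age = max(0.0, now - timestamp)
        score = cosine(query, embedding) * math.exp(-decay_rate * age)
        scored.append((score, memory_id))
    scored.sort(reverse=True)
    return [(memory_id, score) for score, memory_id in scored[:k]]


# Toy data: (id, embedding, timestamp in seconds).
memories = [
    ("old_exact_match", [1.0, 0.0], 0.0),   # same direction as the query, 10 s old
    ("fresh_near_match", [0.9, 0.1], 9.5),  # slightly off, 0.5 s old
]
ranked = decayed_top_k([1.0, 0.0], memories, now=10.0, decay_rate=0.5, k=2)
print(ranked[0][0])
```

<p>With cosine alone the old exact match would rank first; once recency is folded into the score, the fresh near match wins, which is the behaviour the Temporal Precision@10 column rewards.</p>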

<p>That’s the warm-up; the larger contribution is what happens when an episode handle is also pushed into the kernel.</p>

<h2 id="head-to-head-against-modern-gpu-ann-libraries">Head-to-head against modern GPU ANN libraries</h2>

<p>The contract is the embodied multimodal one: the query is an IMAGE node and we measure <code class="language-plaintext highlighter-rouge">hit@15</code> of the same-episode TEXT <em>and</em> AUDIO neighbors (kids-ball corpus, paired RNG seed=2026, A100 SXM4 40 GB).</p>

<p><img src="/papers/MARS/figures/fig_competitors.png" alt="MARS vs FAISS-GPU 1.14.1 vs cuVS CAGRA 26.04: paired bar charts of per-query p99 latency (left, log scale) and same-episode TEXT+AUDIO hit@15 (right) on the kids-ball multimodal contract at N=10K, 100K and 1M, A100 SXM4 40 GB" />
<em>Figure 7 from the paper: head-to-head wall-clock p99 (left, log scale) and <code class="language-plaintext highlighter-rouge">hit@15</code> (right, linear) on the kids-ball multimodal contract. MARS Episode-scoped (rightmost green bar in each group) is the only system that is simultaneously below the 1 ms AV deadline and at perfect recall across all three corpus sizes.</em></p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th style="text-align: right">N=10K p99 / hit@15</th>
      <th style="text-align: right">N=100K p99 / hit@15</th>
      <th style="text-align: right">N=1M p99 / hit@15</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FAISS Flat-GPU (exhaustive)</td>
      <td style="text-align: right">0.18 ms / <strong>1.00</strong></td>
      <td style="text-align: right">0.78 ms / <strong>1.00</strong></td>
      <td style="text-align: right">6.64 ms / <strong>1.00</strong></td>
    </tr>
    <tr>
      <td>FAISS IVF-GPU (nprobe=64)</td>
      <td style="text-align: right">0.45 ms / 0.52</td>
      <td style="text-align: right">0.60 ms / 0.37</td>
      <td style="text-align: right">1.42 ms / <strong>0.00</strong></td>
    </tr>
    <tr>
      <td>cuVS CAGRA (graph_degree=64)</td>
      <td style="text-align: right">3.32 ms / <strong>1.00</strong></td>
      <td style="text-align: right">3.23 ms / 0.01</td>
      <td style="text-align: right">3.25 ms / <strong>0.00</strong></td>
    </tr>
    <tr>
      <td>MARS Global (FP32 cuBLAS)</td>
      <td style="text-align: right">0.47 ms / 0.83</td>
      <td style="text-align: right">0.67 ms / 0.79</td>
      <td style="text-align: right"><strong>2.51 ms</strong> / 0.84</td>
    </tr>
    <tr>
      <td>MARS Global (FP16 fused)</td>
      <td style="text-align: right"><strong>0.28 ms</strong> / 0.83</td>
      <td style="text-align: right">0.66 ms / 0.79</td>
      <td style="text-align: right">3.73 ms / 0.84</td>
    </tr>
    <tr>
      <td><strong>MARS Episode-scoped</strong></td>
      <td style="text-align: right"><strong>0.19 ms / 1.00</strong></td>
      <td style="text-align: right"><strong>0.20 ms / 1.00</strong></td>
      <td style="text-align: right"><strong>0.20 ms / 1.00</strong></td>
    </tr>
  </tbody>
</table>

<p>Two findings drove the design:</p>

<ul>
  <li><strong>MARS Episode-scoped is Pareto-optimal on the latency–recall frontier</strong> at every N: at 1M it is 33× faster than FAISS Flat and 16× faster than CAGRA while holding <code class="language-plaintext highlighter-rouge">hit@15</code> at 1.00 (FAISS Flat also reaches 1.00 there; CAGRA drops to 0.00). (“Pareto-optimal” here means: no measured baseline is simultaneously faster <em>and</em> better-recall on this contract; I make no claim about other latency–recall metrics or about other corpora.) When the application can supply an episode handle, restricting the kernel to episode members converts a <code class="language-plaintext highlighter-rouge">Θ(N · D)</code> cosine sweep into a <code class="language-plaintext highlighter-rouge">Θ(|episode| · D)</code> kernel that the A100 finishes in about 0.2 ms regardless of N (197 µs at N=1M).</li>
  <li><strong>Cosine ANN baselines collapse on this metric at scale.</strong> The kids-ball corpus has tiny clusters (10 nodes per episode, 100 K episodes at 1M), so per-episode TEXT/AUDIO neighbors are buried among ~999 990 distractors. CAGRA’s graph traversal (even at <code class="language-plaintext highlighter-rouge">search_k</code>=512) and IVF cells (any <code class="language-plaintext highlighter-rouge">nprobe</code>) miss them. MARS keeps episode membership in the graph topology and recovers them in O(member_count). FAISS Flat alone keeps recall at N=10⁶ because it is exhaustive — at 6.64 ms p99 it sits 6.6× past the 1 ms AV deadline.</li>
</ul>
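<p>The two findings above can be reproduced in miniature. The sketch below is a toy CPU version under my own assumptions (class name, 2-D vectors, a six-node corpus); it only illustrates how a CSR member list lets the search touch <code class="language-plaintext highlighter-rouge">|episode|</code> vectors instead of <code class="language-plaintext highlighter-rouge">N</code>, and why a global cosine ranking can miss same-episode neighbors from other modalities.</p>

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EpisodeIndex:
    """Toy CSR layout: episode e owns member_ids[offsets[e] : offsets[e + 1]]."""

    def __init__(self, embeddings, episode_of):
        self.embeddings = embeddings
        buckets = [[] for _ in range(max(episode_of) + 1)]
        for node_id, episode_id in enumerate(episode_of):
            buckets[episode_id].append(node_id)
        self.offsets = [0]
        self.member_ids = []
        for bucket in buckets:
            self.member_ids.extend(bucket)
            self.offsets.append(len(self.member_ids))

    def members(self, episode_id):
        start, end = self.offsets[episode_id], self.offsets[episode_id + 1]
        return self.member_ids[start:end]

    def search(self, query, k, episode_id=None):
        """Score only the episode's members when an id is given, else every node."""
        if episode_id is None:
            candidates = range(len(self.embeddings))
        else:
            candidates = self.members(episode_id)
        scored = sorted(
            ((cosine(query, self.embeddings[i]), i) for i in candidates),
            reverse=True,
        )
        return [node_id for _, node_id in scored[:k]]


# Episode 0: image (node 0), text (node 1), audio (node 2), far apart in cosine.
# Episode 1: three distractors that sit close to the image query.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8],
              [0.9, 0.1], [0.95, 0.05], [0.8, 0.2]]
index = EpisodeIndex(embeddings, episode_of=[0, 0, 0, 1, 1, 1])

image_query = [1.0, 0.0]
print(index.search(image_query, k=3))                # global ranking
print(index.search(image_query, k=3, episode_id=0))  # episode-scoped ranking
```

<p>In this toy corpus the global ranking fills its top 3 with the query itself plus two distractors, while the scoped search returns the text and audio members of the query's episode after scoring three vectors instead of six.</p>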

<blockquote>
  <p><strong>Synthetic-corpus disclaimer.</strong> The kids-ball benchmark uses
Gaussian-perturbed cluster centroids in 768-D with a known, dense
small-cluster structure. Real-encoder embeddings (CLIP, CLAP, E5)
have different distance distributions and broader clusters; the
exact crossover N at which cosine ANN loses recall on real data
may shift. The qualitative point — that exhaustive cosine + episode
CSR is the right primitive when episodes are known and small —
should generalise; the absolute hit@15 numbers should not be quoted
without re-measurement on the target encoder. A real-encoder
validation run is the first item in §10.1 of the paper.</p>
</blockquote>

<blockquote>
  <p><strong>What would be a fairer FAISS baseline.</strong> A FAISS-Flat-GPU sweep
with <code class="language-plaintext highlighter-rouge">IDSelectorBatch</code> or <code class="language-plaintext highlighter-rouge">IDSelectorRange</code> set to the episode
member ids would do roughly the same work as MARS Episode-scoped
and is the next baseline to add. It is queued in §10.1 as the
first item of <em>Evaluation Hardening</em>. I expect it to be the same
order of magnitude as MARS-Episode — the win MARS keeps would
then be that the episode CSR is built into the same data
structure as the cross-modal NSN, not that the cosine kernel is
somehow faster.</p>
</blockquote>

<h2 id="episode-scoped-retrieval-is-near-flat">Episode-scoped retrieval is near-flat</h2>

<p><img src="/papers/MARS/figures/fig_episode_scoped_scaling.png" alt="Wall-clock p99 vs corpus size N (log–log) on the kids-ball contract: FAISS Flat-GPU rises sharply and crosses 1 ms by N=10^5, cuVS CAGRA stays at ~3.25 ms but loses recall, MARS Global rises moderately, and MARS Episode-scoped is near-flat at 192 / 203 / 199 µs from N=10K to N=1M with hit@15=1.00" />
<em>Figure 6 from the paper: episode-scoped MARS (green diamonds) is near-flat at ~200 µs across three decades of N while keeping perfect cross-modal recall; the only baseline that also achieves <code class="language-plaintext highlighter-rouge">hit@15</code>=1.00 (FAISS Flat) crosses the 1 ms AV deadline already at N=10^5 and is 33× slower at N=1M.</em></p>

<p>The episode-scoped curve grows by only a few percent as N goes from 10⁴ to 10⁶ (192 µs at N=10K vs 199 µs at N=1M in the figure above), because the GPU work is bounded by the episode size (~10 members), not by N. At N=1M the path delivers <strong>197 µs p99</strong> wall-clock, below every per-frame deadline including the 1 ms AV budget, while returning a perfect-recall multimodal answer.</p>

<p>When does the contract apply? Episode scope is correct only when the right episode is known <em>before</em> the query. It is the natural contract for AV per-track re-identification (track id is the episode), voice-agent turn taking (session id is the episode), AR/VR per-room recall (room id is the episode), and embodied task loops (current sub-task id is the episode). It is <strong>not</strong> correct for open-ended global semantic search — but that workload has the wider deadline (10–100 ms) where the global path or a cosine ANN baseline already fits.</p>

<p>The paper’s <strong>Episode-Scope Crossover proposition</strong> (§6.3) makes this concrete: episode-scoped beats global on a bandwidth-bound model when <code class="language-plaintext highlighter-rouge">|episode| / N + ε_overhead &lt; 1</code>, which on the kids-ball corpus is satisfied for any N ≥ ~1K. It also derives an upper bound on the recall advantage as a function of episode density vs distractor density — useful for predicting whether the contract is worth wiring into an application before measuring.</p>

<h2 id="two-contracts-one-gpu-resident-substrate">Two contracts, one GPU-resident substrate</h2>

<p><img src="/papers/MARS/figures/diag_pipeline.png" alt="MARS retrieval pipeline diagram: text/audio/image encoders feed a unified 768-D embedding space; three GPU stages (cuBLAS SGEMV with temporal decay, CUB radix sort top-K, warp-cooperative BFS expansion) operate over a CSR-format memory graph holding row_offsets, col_indices, embeddings, modalities, timestamps and an episode_csr member list; a green dashed arc shows the episode-scoped fast path that bypasses Stage 3 BFS when query_episode_id is supplied" />
<em>Figure 1 from the paper: MARS retrieval pipeline. Sensor encoders (left) feed a shared 768-D embedding space; four GPU kernels orchestrate retrieval over a CSR-format memory graph resident in device memory. The green dashed arc is the episode-scoped fast path.</em></p>

<p>MARS stores text, audio, image, and sensor embeddings in a shared 768-D space as nodes in a Neural Shortcut Network (NSN) with cross-modal bridges. Two contracts share the same GPU-resident data:</p>

<p><strong>Global path</strong> (when no episode handle is available): four kernels, sub-millisecond at N ≤ 50K, zero per-query allocation.</p>

<ol>
  <li><strong>Stage 1 — Cosine + temporal decay.</strong> Default: cuBLAS <code class="language-plaintext highlighter-rouge">Sgemv</code> (FP32) followed by <code class="language-plaintext highlighter-rouge">score × exp(-λ·age)</code>. Opt-in <code class="language-plaintext highlighter-rouge">--use-fp16</code> switches to the hand-fused FP16 cosine kernel — wins by 41 % at N=10K but loses by 49 % at N=1M.</li>
  <li><strong>Stage 2 — CUB radix sort top-K</strong> in O(N).</li>
  <li><strong>Stage 3 — Warp-cooperative BFS</strong> through NSN bridges with <code class="language-plaintext highlighter-rouge">atomicCAS</code> race-free neighbor claiming. Score propagation rule (Algorithm 1 in the paper): <code class="language-plaintext highlighter-rouge">score[u] ← max(score[parent] · δ_prop, sim[u] · α_bfs)</code> with defaults <code class="language-plaintext highlighter-rouge">δ_prop = 0.85</code> (per-hop attenuation) and <code class="language-plaintext highlighter-rouge">α_bfs = 0.8</code> (cap on raw similarity for purely-graph-discovered nodes). Temporal decay is <em>not</em> re-applied during BFS — it has already been folded into <code class="language-plaintext highlighter-rouge">sim[u]</code> at Stage 1.</li>
</ol>
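<p>The Stage 3 propagation rule is easier to follow with numbers. Below is a sequential Python sketch of the rule quoted above with the same default constants; it is not Algorithm 1 itself. The "first parent to reach a node claims it" check stands in for the <code class="language-plaintext highlighter-rouge">atomicCAS</code> claim, and the graph, node names, and scores are made up.</p>

```python
from collections import deque

DELTA_PROP = 0.85  # per-hop attenuation (default quoted in the post)
ALPHA_BFS = 0.8    # cap on raw similarity for graph-discovered nodes


def expand(seeds, neighbors, sim, max_hops):
    """Propagate scores from top-K seeds through bridge edges.

    seeds: {node: score}, neighbors: {node: [node, ...]},
    sim: {node: similarity with temporal decay already applied}.
    """
    score = dict(seeds)
    frontier = deque((node, 0) for node in seeds)
    while frontier:
        parent, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for u in neighbors.get(parent, ()):
            if u in score:  # already claimed; sequential stand-in for atomicCAS
                continue
            score[u] = max(score[parent] * DELTA_PROP, sim[u] * ALPHA_BFS)
            frontier.append((u, hops + 1))
    return score


# Toy graph: an image seed bridges to a text node and an audio node.
neighbors = {"image": ["text", "audio"], "text": ["far_text"]}
sim = {"text": 0.2, "audio": 0.99, "far_text": 0.1}
scores = expand({"image": 0.9}, neighbors, sim, max_hops=2)
print(scores)
```

<p>Here the text node inherits its score from the parent (0.9 × 0.85), while the audio node keeps its own capped similarity (0.99 × 0.8) because that is larger, and the two-hop node is attenuated twice.</p>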

<p><strong>Episode-scoped fast path</strong> (when <code class="language-plaintext highlighter-rouge">query_episode_id</code> is supplied): Stage 1 is restricted to the episode’s CSR member list and Stage 3 BFS is skipped entirely. The result is the green dashed arc in the diagram above and the near-flat scaling curve in the previous section.</p>

<h3 id="when-fp16-fused-beats-cublas-sgemv-and-when-it-doesnt">When FP16 fused beats cuBLAS Sgemv (and when it doesn’t)</h3>

<p><img src="/papers/MARS/figures/fig_fp16_crossover.png" alt="Bar chart of FP32 cuBLAS Sgemv (default) vs hand-fused FP16 wall-clock p99 at N=10K, 100K and 1M on A100: FP16 wins by 41% at 10K (0.47 → 0.28 ms), is essentially tied at 100K (0.67 → 0.65 ms), and loses by 49% at 1M (2.51 → 3.73 ms)" />
<em>Figure 8 from the paper: the hand-fused FP16 cosine kernel wins at small N where the working set fits in L2, but cuBLAS <code class="language-plaintext highlighter-rouge">Sgemv</code> engages Tensor-Core paths that the fused kernel cannot match at large N. The default deployment uses cuBLAS; <code class="language-plaintext highlighter-rouge">--use-fp16</code> is documented as a small-N opt-in.</em></p>

<p>The hand-fused FP16 cosine kernel is bandwidth-optimal at small N because the entire embedding tile fits in L2 and the dot product becomes memory-bound. cuBLAS <code class="language-plaintext highlighter-rouge">Sgemv</code> cannot beat that at N=10K. But once N exceeds the L2 working set, cuBLAS’s Tensor-Core paths and per-arch tile heuristics dominate — FP16 fused regresses by 49 % at N=1M.</p>

<p>This is one of the more honest findings of the paper: bring-your-own-kernel is not always faster than vendor BLAS, and the crossover point is hardware- and corpus-size dependent. The build defaults to cuBLAS; <code class="language-plaintext highlighter-rouge">--use-fp16</code> is documented as a small-N opt-in.</p>

<h2 id="scaling-and-deadline-compliance">Scaling and deadline compliance</h2>

<p>Measured on A100 SXM4 40GB (D=768, K=10, cuBLAS+CUB):</p>

<table>
  <thead>
    <tr>
      <th>Corpus</th>
      <th style="text-align: right">Global path p99</th>
      <th style="text-align: right"><strong>Episode-scoped p99</strong></th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1K</td>
      <td style="text-align: right">0.31 ms</td>
      <td style="text-align: right">—</td>
      <td>Sub-ms</td>
    </tr>
    <tr>
      <td>10K</td>
      <td style="text-align: right">0.44 ms</td>
      <td style="text-align: right"><strong>0.19 ms</strong></td>
      <td>Sub-ms</td>
    </tr>
    <tr>
      <td>50K</td>
      <td style="text-align: right">0.56 ms</td>
      <td style="text-align: right"><strong>0.17 ms</strong></td>
      <td>Sub-ms</td>
    </tr>
    <tr>
      <td>100K</td>
      <td style="text-align: right">0.74 ms</td>
      <td style="text-align: right"><strong>0.20 ms</strong></td>
      <td>Sub-ms</td>
    </tr>
    <tr>
      <td>1M</td>
      <td style="text-align: right">2.67 ms</td>
      <td style="text-align: right"><strong>0.20 ms</strong></td>
      <td>Real-time only via Episode-scoped</td>
    </tr>
    <tr>
      <td>10M</td>
      <td style="text-align: right">22.3 ms</td>
      <td style="text-align: right">(mem-bound)</td>
      <td>Batch</td>
    </tr>
    <tr>
      <td>13M</td>
      <td style="text-align: right">29.1 ms</td>
      <td style="text-align: right">(mem-bound)</td>
      <td>VRAM limit</td>
    </tr>
  </tbody>
</table>

<p>Five workloads pass empirical p99 deadlines on A100:</p>

<p><img src="/papers/MARS/figures/fig_deadline.png" alt="Horizontal bar chart of measured p99 vs deadline budgets on A100 SXM4 40 GB: MARS Episode-scoped at N=1M (0.20 ms / 80% headroom on the 1 ms AV deadline), AV perception 60 Hz N=10K (0.87 ms / 13%), humanoid robot 1 kHz N=10K (0.76 ms / 24%), AR/VR spatial 90 Hz N=10K (1.56 ms / 69% on the 5 ms budget), voice agent 30 Hz N=10K (0.88 ms / 96% on the 20 ms budget); all five PASS" />
<em>Figure 4 from the paper: deadline compliance for all four demonstrators on A100 SXM4 plus the episode-scoped path at N=1M. All pass; episode-scoped delivers a perfect-recall multimodal answer with 80% headroom on the 1 ms AV deadline at N=1M.</em></p>

<table>
  <thead>
    <tr>
      <th>Workload</th>
      <th>Rate</th>
      <th>Budget</th>
      <th style="text-align: right">Measured p99</th>
      <th>Headroom</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AV perception (N=10K)</td>
      <td>60 Hz</td>
      <td>1 ms</td>
      <td style="text-align: right">0.87 ms</td>
      <td>13 %</td>
    </tr>
    <tr>
      <td>Humanoid robot (N=10K)</td>
      <td>1 kHz</td>
      <td>1 ms</td>
      <td style="text-align: right">0.76 ms</td>
      <td>24 %</td>
    </tr>
    <tr>
      <td>AR/VR spatial (N=10K)</td>
      <td>90 Hz</td>
      <td>5 ms</td>
      <td style="text-align: right">1.56 ms</td>
      <td>69 %</td>
    </tr>
    <tr>
      <td>Voice agent (N=10K)</td>
      <td>30 Hz</td>
      <td>20 ms</td>
      <td style="text-align: right">0.88 ms</td>
      <td>96 %</td>
    </tr>
    <tr>
      <td><strong>MARS Episode-scoped (N=1M)</strong></td>
      <td>any</td>
      <td>1 ms</td>
      <td style="text-align: right"><strong>0.20 ms</strong></td>
      <td><strong>80 %</strong></td>
    </tr>
  </tbody>
</table>

<p>The last row is the one I’d flag for embodied roboticists: a perfect-recall multimodal answer at sensor rate against a million-memory corpus, with 80 % headroom on the AV deadline.</p>
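
<p>The Headroom column is the unused share of each budget, that is, headroom = 1 − p99 / budget. As a quick sanity check, here is a minimal sketch that recomputes it from the table; the helper name <code class="language-plaintext highlighter-rouge">headroom_percent</code> is illustrative and not taken from the MARS repository.</p>

```python
# Budgets and measured p99 values copied from the table above (milliseconds).
WORKLOADS = [
    ("AV perception (N=10K)", 1.0, 0.87),
    ("Humanoid robot (N=10K)", 1.0, 0.76),
    ("AR/VR spatial (N=10K)", 5.0, 1.56),
    ("Voice agent (N=10K)", 20.0, 0.88),
    ("MARS Episode-scoped (N=1M)", 1.0, 0.20),
]


def headroom_percent(budget_ms, p99_ms):
    """Share of the deadline budget left unused by the measured p99."""
    return 100.0 * (1.0 - p99_ms / budget_ms)


for name, budget_ms, p99_ms in WORKLOADS:
    status = "PASS" if p99_ms <= budget_ms else "MISS"
    print(f"{name:28s} {headroom_percent(budget_ms, p99_ms):5.1f} %  {status}")
```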

<h2 id="a-statistical-honesty-note">A statistical-honesty note</h2>

<p>All the p99s above are over 128–256 paired probes on <strong>non-locked-clock</strong> vast.ai instances running on shared multi-tenant hosts. At sub-millisecond scale, run-to-run jitter is on the order of ±10 %, and with 128 samples the p99 is effectively the 2nd-largest observation, so its confidence interval is wide. Concretely:</p>

<ul>
  <li>The 0.26 ms vs 0.25 ms gap on the temporal-decay experiment is well within noise; quoting either as “winning” would be misleading.</li>
  <li>The flip in the FP16-vs-cuBLAS large-corpus table where RTX 5060 Ti looks faster than A100 SXM4 at N=500K but slower at N=1M is reported faithfully but should not be over-interpreted.</li>
  <li>The episode-scoped advantage at N=1M is large enough (13× over MARS-Global, 33× over FAISS-Flat) that no plausible jitter explains it away — this is the one number I’d defend even after locked-clock re-measurement.</li>
</ul>
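
<p>To put a rough number on how thin the tail is at these sample sizes: if the probes were independent (an assumption that shared-host jitter may well violate), the count of samples landing above the true p99 would follow a Binomial(n, 0.01) distribution. The sketch below, with an illustrative helper name, shows that at n=128 roughly a quarter of runs would contain no sample above the true p99 at all.</p>

```python
import math


def exceedance_pmf(n, k, tail=0.01):
    """P(exactly k of n independent samples land above the true p99)."""
    return math.comb(n, k) * tail**k * (1.0 - tail) ** (n - k)


for n in (128, 256):
    p_none = exceedance_pmf(n, 0)
    p_at_most_one = p_none + exceedance_pmf(n, 1)
    print(
        f"n={n}: expected exceedances={n * 0.01:.2f}, "
        f"P(none)={p_none:.3f}, P(at most one)={p_at_most_one:.3f}"
    )
```

<p>With so few tail samples per run, an empirical p99 is driven by one or two observations, which is why small gaps between p99s should not be read as wins.</p>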

<p>The paper carries an explicit <strong>Measurement Methodology and Statistical Caveats</strong> subsection (§7.2) and an <strong>Evaluation Hardening</strong> track (§10.1) that lists the locked-clock re-runs, the real-encoder benchmark, the FAISS+IDSelector head-to-head, the standard temporal-IR metrics (TS-Recall@10, time-NDCG@10), and per-stage Nsight Compute breakdowns. The whole track is estimated at about $15 and two days of work, and is queued for the next iteration.</p>

<h2 id="the-neural-shortcut-network">The Neural Shortcut Network</h2>

<p>What makes MARS more than “cuBLAS with a timestamp column” is the graph structure. Memories are nodes in a CSR-format graph built in five phases:</p>

<ol>
  <li>Ring lattice (k=6 local neighbors)</li>
  <li>Hierarchical skip connections (powers of 2)</li>
  <li>Hub supernodes at √N intervals</li>
  <li>Small-world rewiring (Watts–Strogatz, p=0.15)</li>
  <li><strong>Cross-modal bridges</strong> — every node gets one edge to each other modality</li>
</ol>

<p>Phase 5 is critical: a query starting with an audio embedding reaches visual and text memories through graph traversal, without separate per-modality indices. The warp-cooperative BFS kernel explores these bridges in &lt;0.04 ms.</p>

<p>The construction is <strong>deterministic</strong> — Watts–Strogatz small-world plus deterministic cross-modal bridges, with no learned weights, no gradient-tuned objectives, no learned edge selection. The “Neural” in <em>Neural Shortcut Network</em> refers only to its role as a neural-embedding index. Phase 5’s guarantee is correspondingly modest: it guarantees <em>structural</em> reachability — by construction, every node has at least one edge to every other modality, so a 1-hop BFS sees all modalities — and makes <strong>no claim</strong> that the reached neighbors are semantically relevant. Whether the BFS-discovered cross-modal memory is the <em>right</em> one depends on the quality of the shared embedding space, not on the graph topology.</p>
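
<p>For readers who think better in code, the five phases can be sketched on the host in plain Python. This is not the MARS implementation. The skip-stride schedule, the way nodes attach to hubs, which ring edge gets rewired, the fixed seed, and the "nearest node by index" choice of bridge target are all assumptions made for illustration. Only the parameters k=6, p=0.15, the √N hub spacing, and the one-bridge-per-other-modality rule come from the description above.</p>

```python
import bisect
import math
import random


def build_shortcut_graph(modality, k=6, p=0.15, seed=0):
    """Return (offsets, neighbors) in CSR form; modality[u] labels node u."""
    n = len(modality)
    rng = random.Random(seed)  # fixed seed keeps the build reproducible
    adj = [set() for _ in range(n)]

    def link(u, v):
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Phase 1: ring lattice, k/2 neighbors on each side.
    for u in range(n):
        for d in range(1, k // 2 + 1):
            link(u, (u + d) % n)

    # Phase 2: skip connections at power-of-two strides beyond the ring (assumed schedule).
    first_stride = 1
    while first_stride <= k // 2:
        first_stride *= 2
    for u in range(n):
        stride = first_stride
        while stride < n:
            link(u, (u + stride) % n)
            stride *= 2

    # Phase 3: a hub every ~sqrt(N) nodes; each node links to its hub (assumed attachment).
    hub_stride = max(1, math.isqrt(n))
    for u in range(n):
        link(u, (u // hub_stride) * hub_stride)

    # Phase 4: Watts-Strogatz style rewiring of the (u, u+1) ring edge with probability p.
    for u in range(n):
        v = (u + 1) % n
        if v in adj[u] and rng.random() < p:
            w = rng.randrange(n)
            if w != u and w not in adj[u]:
                adj[u].discard(v)
                adj[v].discard(u)
                link(u, w)

    # Phase 5: one bridge from every node to the nearest-by-index node of each other modality.
    by_modality = {}
    for u, m in enumerate(modality):
        by_modality.setdefault(m, []).append(u)  # lists stay sorted by index
    for u, m in enumerate(modality):
        for other, nodes in by_modality.items():
            if other == m:
                continue
            i = bisect.bisect_left(nodes, u)
            candidates = nodes[max(0, i - 1):i + 1]
            link(u, min(candidates, key=lambda v: abs(v - u)))

    # Flatten the adjacency sets into CSR arrays.
    offsets, neighbors = [0], []
    for u in range(n):
        neighbors.extend(sorted(adj[u]))
        offsets.append(len(neighbors))
    return offsets, neighbors


# Made-up layout: 0 = vision, 1 = audio, 2 = text.
modality = [0] * 40 + [1] * 40 + [2] * 40
offsets, neighbors = build_shortcut_graph(modality)
u = 5
reached = {modality[v] for v in neighbors[offsets[u]:offsets[u + 1]]}
print(f"node {u} reaches modalities {sorted(reached)} in one hop")
```

<p>Because phase 5 runs last, rewiring can never remove a bridge, which is what makes the 1-hop structural guarantee hold by construction in this sketch.</p>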

<p>The graph store also carries a sixth component beyond the five edge-construction phases: an <code class="language-plaintext highlighter-rouge">episode_csr</code> member list that drives the episode-scoped fast path. It is a per-episode index (uint32 offsets into the embedding array), built once at corpus load time, queried by the <code class="language-plaintext highlighter-rouge">RetrievalScope::EpisodeScoped</code> contract.</p>
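
<p>A per-episode member list in CSR form is simple enough to sketch. The version below is a host-side Python illustration, not the MARS kernel: the names <code class="language-plaintext highlighter-rouge">build_episode_csr</code> and <code class="language-plaintext highlighter-rouge">episode_scoped_top_k</code> are hypothetical, <code class="language-plaintext highlighter-rouge">array("I")</code> stands in for a uint32 buffer, and scoring is a brute-force dot product over the episode's rows only.</p>

```python
from array import array


def build_episode_csr(episode_ids, num_episodes):
    """Group embedding rows by episode: members[offsets[e]:offsets[e + 1]] are episode e's rows."""
    counts = [0] * (num_episodes + 1)
    for e in episode_ids:
        counts[e + 1] += 1
    for e in range(num_episodes):  # prefix sum turns counts into offsets
        counts[e + 1] += counts[e]
    offsets = array("I", counts)
    members = array("I", [0] * len(episode_ids))
    cursor = counts[:-1]  # next free slot per episode (slice makes a copy)
    for row, e in enumerate(episode_ids):
        members[cursor[e]] = row
        cursor[e] += 1
    return offsets, members


def episode_scoped_top_k(query, embeddings, offsets, members, episode, k):
    """Score only the rows that belong to one episode and return the best k row ids."""
    rows = members[offsets[episode]:offsets[episode + 1]]
    scored = [(sum(q * x for q, x in zip(query, embeddings[r])), r) for r in rows]
    scored.sort(reverse=True)
    return [r for _, r in scored[:k]]


# Toy corpus: four 2-d embeddings split across two episodes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
episode_ids = [0, 1, 0, 1]
offsets, members = build_episode_csr(episode_ids, num_episodes=2)
print(list(offsets), list(members))
print(episode_scoped_top_k([1.0, 0.0], embeddings, offsets, members, episode=1, k=1))
```

<p>The point of the layout is that the work scales with the size of the episode rather than with N: the query never touches rows outside the selected slice.</p>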

<h2 id="what-its-not">What it’s not</h2>

<p>MARS is not a vector database. Same conceptual layer — indexing, similarity, retrieval — but different latency envelope, different durability model, different deployment target. Think cuBLAS vs LAPACK: same operations, different hardware. The working set is seconds to hours of recent sensor data, bounded to fit in GPU VRAM.</p>

<p>This is also soft real-time, not hard real-time. The evaluation shows empirical p99 compliance with zero deadline misses over 30-second runs. True hard real-time (for example ISO 26262 ASIL-D) would require provable worst-case bounds, which MARS does not provide.</p>

<p><strong>One known issue.</strong> The CUDA Graph capture path (<code class="language-plaintext highlighter-rouge">--use-cuda-graph</code>) currently corrupts results because counters and the episode-scoped reset are not re-initialised between graph replays — captured-graph replays return <code class="language-plaintext highlighter-rouge">hit@15</code>=0.004 and the next direct launch hits <code class="language-plaintext highlighter-rouge">memory_cuda.cu:1197 — invalid argument</code>. Tracked in <a href="https://github.com/antonellof/MARS/blob/main/docs/ARCHITECTURE.md"><code class="language-plaintext highlighter-rouge">docs/ARCHITECTURE.md</code> §7.4</a> and in the paper’s Future Work section. The hand-fused FP16 path and episode-scoped fast path land cleanly without <code class="language-plaintext highlighter-rouge">--use-cuda-graph</code>.</p>

<h2 id="try-it">Try it</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/antonellof/MARS.git
<span class="nb">cd </span>MARS
make tests          <span class="c"># host-only unit tests (17/17, no GPU needed)</span>
make <span class="o">&amp;&amp;</span> make check  <span class="c"># full build + hardware validation</span>
make demo-av        <span class="c"># 60 Hz AV perception demo</span>

<span class="c"># Episode-scoped fast path on the kids-ball corpus</span>
./demos/embodied_scene/demo <span class="nt">--scope</span><span class="o">=</span>episode

<span class="c"># Reproduce the head-to-head competitor benchmarks</span>
pip <span class="nb">install</span> <span class="nt">--extra-index-url</span><span class="o">=</span>https://pypi.nvidia.com <span class="se">\</span>
  <span class="s1">'cuvs-cu12==26.4.*'</span> faiss-gpu-cu12 cupy-cuda12x
python3 scripts/bench_kids_ball_faiss.py <span class="se">\</span>
  <span class="nt">--corpus</span> results/competitors_20260417/corpus/kids_1m.bin
python3 scripts/bench_kids_ball_cuvs_cagra.py <span class="se">\</span>
  <span class="nt">--corpus</span> results/competitors_20260417/corpus/kids_1m.bin
</code></pre></div></div>

<p>The code is MIT licensed. The <a href="https://www.fratepietro.com/papers/MARS/main.pdf">paper</a> has the full methodology, kernel pseudocode, ablation studies, the §7.11 (episode-scoped retrieval) and §7.12 (head-to-head against FAISS / cuVS CAGRA) sections, the statistical caveats subsection (§7.2), and the §10.1 evaluation hardening track.</p>

<p>I’m particularly interested in feedback from anyone building real-time perception pipelines. The hypothesis — that an embodied loop’s existing notion of <em>track / room / session / sub-task id</em> is worth pushing all the way down into a GPU kernel parameter — needs validation from people who’ve actually shipped those loops. The data on the kids-ball corpus is dramatic, but I’d much rather know that someone with a real CLIP/CLAP corpus tried it and either confirmed the win or measured a counter-example.</p>

<p><strong>Links:</strong></p>
<ul>
  <li><a href="https://www.fratepietro.com/papers/MARS/main.pdf">Paper (PDF, 1.97 MB)</a></li>
  <li><a href="https://github.com/antonellof/MARS">GitHub repository</a></li>
  <li><a href="https://github.com/antonellof/MARS/blob/main/README.md">README</a> — top-level overview</li>
  <li><a href="https://github.com/antonellof/MARS/blob/main/docs/ARCHITECTURE.md">Architecture deep dive</a> — episode-scoped, head-to-head, FP16, known issues</li>
  <li><a href="https://github.com/antonellof/MARS/blob/main/docs/BENCHMARKS.md">Benchmark results</a></li>
  <li><a href="https://github.com/antonellof/MARS/blob/main/results/competitors_20260417/SUMMARY.md">Head-to-head competitor SUMMARY</a> — full FAISS / cuVS CAGRA / MARS run logs</li>
  <li><a href="https://github.com/antonellof/MARS/blob/main/NEXT_STEPS.md">Next steps</a> — Evaluation Hardening track ($15, ~2 days)</li>
  <li><a href="https://github.com/antonellof/MARS/blob/main/scripts/generate_competitor_figures.py">Figure-generation script</a> — re-renders the paper figures from the JSON artefacts</li>
</ul>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Systems" /><category term="CUDA" /><category term="GPU" /><category term="Real-Time Systems" /><category term="Autonomous Vehicles" /><category term="Robotics" /><category term="Memory" /><category term="Vector Search" /><summary type="html"><![CDATA[A child's ball rolls into the road. Vision sees only the ball — but 600 ms ago, microphones captured children's voices from that direction. MARS (Memory for Autonomous Real-time Systems) treats scope, time, and cross-modal connectivity as kernel-level primitives and delivers a perfect-recall multimodal answer in 197 µs at N=1M, 33× faster than FAISS-Flat-GPU on the same hardware — when the application can supply an episode handle.]]></summary></entry><entry><title type="html">Agentic AI Systems: Multi-Agent Architectures</title><link href="https://www.fratepietro.com/2025/agentic-ai-multi-agent/" rel="alternate" type="text/html" title="Agentic AI Systems: Multi-Agent Architectures" /><published>2025-11-22T00:00:00+01:00</published><updated>2025-11-22T00:00:00+01:00</updated><id>https://www.fratepietro.com/2025/agentic-ai-multi-agent</id><content type="html" xml:base="https://www.fratepietro.com/2025/agentic-ai-multi-agent/"><![CDATA[<p>Single AI agents struggle with complex tasks. They exceed context limits, conflate responsibilities, and become brittle monoliths. Multi-agent systems decompose complexity: specialized agents handle distinct concerns, coordinate through messages, and compose into robust systems.</p>

<p>I built a multi-agent code analysis system: one agent parsed code structure, another reasoned about architecture, a third suggested refactorings. Each was smaller, testable, and replaceable. The coordinator orchestrated their interaction. The result was more maintainable than a single “do everything” agent.</p>

<p>Multi-agent systems aren’t new—<a href="https://en.wikipedia.org/wiki/Multi-agent_system">distributed AI</a> has decades of research. But LLMs make them practical: agents can understand natural language instructions, reason about tasks, and collaborate without rigid protocols.</p>

<h2 id="why-multiple-agents">Why Multiple Agents?</h2>

<p><strong>Separation of concerns</strong> - Parsing, reasoning, and execution are distinct skills. Separate agents, separate prompts, separate tests.</p>

<p><strong>Context management</strong> - LLMs have finite context. Multiple focused agents stay within limits.</p>

<p><strong>Specialization</strong> - Train/tune agents for specific domains (legal analysis, code review, data extraction).</p>

<p><strong>Fault isolation</strong> - If the code execution agent fails, the reasoning agent continues.</p>

<p><strong>Testability</strong> - Test each agent independently with unit tests.</p>

<p><strong>Scalability</strong> - Scale expensive agents (GPT-4) separately from cheap ones (Claude Haiku).</p>

<p>See the <a href="https://microsoft.github.io/autogen/">AutoGen</a> and <a href="https://www.langchain.com/langgraph">LangGraph</a> documentation for framework-level approaches.</p>

<h2 id="agent-roles-and-patterns">Agent Roles and Patterns</h2>

<h3 id="1-coordinator-orchestrator">1. Coordinator (Orchestrator)</h3>

<p>Decomposes high-level goals into subtasks, assigns to specialists, aggregates results.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">anthropic</span> <span class="kn">import</span> <span class="n">Anthropic</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span>

<span class="k">class</span> <span class="nc">Coordinator</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Orchestrate multi-agent workflow.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">specialists</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="sh">'</span><span class="s">Agent</span><span class="sh">'</span><span class="p">]):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Anthropic</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">specialists</span> <span class="o">=</span> <span class="n">specialists</span>
        <span class="n">self</span><span class="p">.</span><span class="n">history</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Break down task and coordinate execution.</span><span class="sh">"""</span>
        
        <span class="c1"># Decompose task
</span>        <span class="n">subtasks</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">decompose</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
        
        <span class="n">results</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="k">for</span> <span class="n">subtask</span> <span class="ow">in</span> <span class="n">subtasks</span><span class="p">:</span>
            <span class="n">agent_type</span> <span class="o">=</span> <span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">agent</span><span class="sh">'</span><span class="p">]</span>
            <span class="n">agent</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">specialists</span><span class="p">[</span><span class="n">agent_type</span><span class="p">]</span>
            
            <span class="c1"># Execute subtask
</span>            <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">agent</span><span class="p">.</span><span class="nf">execute</span><span class="p">(</span><span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">task</span><span class="sh">'</span><span class="p">])</span>
            <span class="n">results</span><span class="p">[</span><span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">result</span>
        
        <span class="c1"># Synthesize results
</span>        <span class="n">final_answer</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">synthesize</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">final_answer</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">decompose</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">]:</span>
        <span class="sh">"""</span><span class="s">Decompose task into subtasks.</span><span class="sh">"""</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">max_tokens</span><span class="o">=</span><span class="mi">2048</span><span class="p">,</span>
            <span class="n">system</span><span class="o">=</span><span class="sh">"""</span><span class="s">You are a task coordinator. Break down complex tasks into subtasks.

Output JSON array:
[
  {</span><span class="sh">"</span><span class="s">id</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">1</span><span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="s">agent</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">search</span><span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="s">task</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">Find relevant documentation</span><span class="sh">"</span><span class="s">},
  {</span><span class="sh">"</span><span class="s">id</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">2</span><span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="s">agent</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">code</span><span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="s">task</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">Analyze code structure</span><span class="sh">"</span><span class="s">}
]</span><span class="sh">"""</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span>
                <span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Break down this task:</span><span class="se">\n\n</span><span class="si">{</span><span class="n">task</span><span class="si">}</span><span class="sh">"</span>
            <span class="p">}]</span>
        <span class="p">)</span>
        
        <span class="kn">import</span> <span class="n">json</span>
        <span class="k">return</span> <span class="n">json</span><span class="p">.</span><span class="nf">loads</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">synthesize</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">results</span><span class="p">:</span> <span class="n">Dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Combine results into final answer.</span><span class="sh">"""</span>
        <span class="n">context</span> <span class="o">=</span> <span class="sh">"</span><span class="se">\n\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">([</span>
            <span class="sa">f</span><span class="sh">"</span><span class="s">Subtask </span><span class="si">{</span><span class="n">k</span><span class="si">}</span><span class="s">:</span><span class="se">\n</span><span class="si">{</span><span class="n">v</span><span class="si">}</span><span class="sh">"</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">results</span><span class="p">.</span><span class="nf">items</span><span class="p">()</span>
        <span class="p">])</span>
        
        <span class="n">response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">max_tokens</span><span class="o">=</span><span class="mi">4096</span><span class="p">,</span>
            <span class="n">system</span><span class="o">=</span><span class="sh">"</span><span class="s">You are a synthesizer. Combine subtask results into a coherent answer.</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span>
                <span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Original task: </span><span class="si">{</span><span class="n">task</span><span class="si">}</span><span class="s">

Subtask results:
</span><span class="si">{</span><span class="n">context</span><span class="si">}</span><span class="s">

Provide a comprehensive answer to the original task.</span><span class="sh">"""</span>
            <span class="p">}]</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span>
</code></pre></div></div>
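
<p>One fragile spot in the coordinator is that <code class="language-plaintext highlighter-rouge">decompose</code> hands the model's reply straight to <code class="language-plaintext highlighter-rouge">json.loads</code>, which raises as soon as the reply carries any prose around the array. A defensive parser could look roughly like this. It is a sketch: <code class="language-plaintext highlighter-rouge">parse_subtasks</code> and <code class="language-plaintext highlighter-rouge">REQUIRED_KEYS</code> are hypothetical names, and the bracket search is a simple heuristic that would be fooled by square brackets in the surrounding prose.</p>

```python
import json

# Keys the coordinator loop reads from every subtask.
REQUIRED_KEYS = {"id", "agent", "task"}


def parse_subtasks(reply, known_agents):
    """Pull the JSON array out of a model reply and validate each subtask."""
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    subtasks = json.loads(reply[start:end + 1])
    if not isinstance(subtasks, list):
        raise ValueError("expected a JSON array of subtasks")
    for subtask in subtasks:
        if not isinstance(subtask, dict):
            raise ValueError(f"subtask is not an object: {subtask!r}")
        missing = REQUIRED_KEYS - set(subtask)
        if missing:
            raise ValueError(f"subtask is missing keys: {sorted(missing)}")
        if subtask["agent"] not in known_agents:
            raise ValueError(f"unknown agent: {subtask['agent']!r}")
    return subtasks


reply = 'Here is the plan:\n[{"id": "1", "agent": "search", "task": "Find docs"}]\nDone.'
print(parse_subtasks(reply, known_agents={"search", "code"}))
```

<p>Validating the <code class="language-plaintext highlighter-rouge">agent</code> field up front turns a late <code class="language-plaintext highlighter-rouge">KeyError</code> inside the execution loop into an early, readable error.</p>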

<h3 id="2-specialist-agents">2. Specialist Agents</h3>

<p>Domain-specific agents with focused expertise:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SearchAgent</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Specialist for web/documentation search.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">search_api</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Anthropic</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">search_api</span> <span class="o">=</span> <span class="n">search_api</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Execute search task.</span><span class="sh">"""</span>
        <span class="c1"># Extract search query
</span>        <span class="n">query</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">extract_query</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
        
        <span class="c1"># Perform search
</span>        <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="n">search_api</span><span class="p">.</span><span class="nf">search</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
        
        <span class="c1"># Synthesize results
</span>        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">synthesize_results</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">extract_query</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Extract search query from task description.</span><span class="sh">"""</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">claude-3-haiku-20240307</span><span class="sh">"</span><span class="p">,</span>  <span class="c1"># Cheap model for extraction
</span>            <span class="n">max_tokens</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
            <span class="n">system</span><span class="o">=</span><span class="sh">"</span><span class="s">Extract the search query from the task. Return only the query text.</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">}]</span>
        <span class="p">)</span>
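        <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="p">.</span><span class="nf">strip</span><span class="p">()</span>
    
    <span class="k">def</span> <span class="nf">synthesize_results</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Join raw search hits into one text block.</span><span class="sh">"""</span>
        <span class="c1"># Illustrative helper (assumption): treats `results` as an iterable of printable hits
</span>        <span class="k">return</span> <span class="sh">"</span><span class="se">\n\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">hit</span><span class="p">)</span> <span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">results</span><span class="p">)</span>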


<span class="k">class</span> <span class="nc">CodeAgent</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Specialist for code analysis.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Anthropic</span><span class="p">()</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Execute code analysis task.</span><span class="sh">"""</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">"</span><span class="p">,</span>
            <span class="n">max_tokens</span><span class="o">=</span><span class="mi">4096</span><span class="p">,</span>
            <span class="n">system</span><span class="o">=</span><span class="sh">"""</span><span class="s">You are a code analysis expert. Analyze code for:
- Structure and architecture
- Potential bugs
- Performance issues
- Security vulnerabilities

Provide clear, actionable feedback.</span><span class="sh">"""</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">}]</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span>


<span class="k">class</span> <span class="nc">ExecutionAgent</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Specialist for executing code/commands safely.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">sandbox</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Anthropic</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">sandbox</span> <span class="o">=</span> <span class="n">sandbox</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Execute code in sandbox.</span><span class="sh">"""</span>
        <span class="c1"># Parse code from task
</span>        <span class="n">code</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">extract_code</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
        
        <span class="c1"># Execute in sandbox
</span>        <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="n">sandbox</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="n">code</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Execution result:</span><span class="se">\n</span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="si">}</span><span class="se">\n\n</span><span class="s">Errors:</span><span class="se">\n</span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">stderr</span><span class="si">}</span><span class="sh">"</span>
</code></pre></div></div>

<h3 id="3-message-bus-pattern">3. Message Bus Pattern</h3>

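<p>For loose coupling and extensibility, agents can publish messages to named topics on a shared bus instead of calling each other directly:</p>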

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">asyncio</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Callable</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span>
<span class="kn">import</span> <span class="n">time</span>

<span class="k">class</span> <span class="nc">MessageBus</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Pub/sub message bus for agent communication.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">subscribers</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">List</span><span class="p">[</span><span class="n">Callable</span><span class="p">]]</span> <span class="o">=</span> <span class="p">{}</span>
    
    <span class="k">def</span> <span class="nf">subscribe</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">topic</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">handler</span><span class="p">:</span> <span class="n">Callable</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Subscribe to topic.</span><span class="sh">"""</span>
        <span class="k">if</span> <span class="n">topic</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">subscribers</span><span class="p">:</span>
            <span class="n">self</span><span class="p">.</span><span class="n">subscribers</span><span class="p">[</span><span class="n">topic</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">self</span><span class="p">.</span><span class="n">subscribers</span><span class="p">[</span><span class="n">topic</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span><span class="n">handler</span><span class="p">)</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">publish</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">topic</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">message</span><span class="p">:</span> <span class="n">Dict</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Publish message to topic.</span><span class="sh">"""</span>
        <span class="k">if</span> <span class="n">topic</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">subscribers</span><span class="p">:</span>
            <span class="k">return</span>
        
        <span class="c1"># Add metadata
</span>        <span class="n">message</span><span class="p">[</span><span class="sh">'</span><span class="s">topic</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">topic</span>
        <span class="n">message</span><span class="p">[</span><span class="sh">'</span><span class="s">timestamp</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>
        
        <span class="c1"># Notify all subscribers
</span>        <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span>
            <span class="n">asyncio</span><span class="p">.</span><span class="nf">create_task</span><span class="p">(</span><span class="nf">handler</span><span class="p">(</span><span class="n">message</span><span class="p">))</span>
            <span class="k">for</span> <span class="n">handler</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">subscribers</span><span class="p">[</span><span class="n">topic</span><span class="p">]</span>
        <span class="p">]</span>
        
        <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">,</span> <span class="n">return_exceptions</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>


<span class="c1"># Usage
</span><span class="n">bus</span> <span class="o">=</span> <span class="nc">MessageBus</span><span class="p">()</span>

<span class="c1"># Subscribe agents
</span><span class="k">async</span> <span class="k">def</span> <span class="nf">search_handler</span><span class="p">(</span><span class="n">message</span><span class="p">):</span>
    <span class="n">query</span> <span class="o">=</span> <span class="n">message</span><span class="p">[</span><span class="sh">'</span><span class="s">query</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">search_api</span><span class="p">.</span><span class="nf">search</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
    <span class="k">await</span> <span class="n">bus</span><span class="p">.</span><span class="nf">publish</span><span class="p">(</span><span class="sh">'</span><span class="s">search_results</span><span class="sh">'</span><span class="p">,</span> <span class="p">{</span><span class="sh">'</span><span class="s">results</span><span class="sh">'</span><span class="p">:</span> <span class="n">results</span><span class="p">})</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">code_handler</span><span class="p">(</span><span class="n">message</span><span class="p">):</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">message</span><span class="p">[</span><span class="sh">'</span><span class="s">results</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">analysis</span> <span class="o">=</span> <span class="k">await</span> <span class="n">code_agent</span><span class="p">.</span><span class="nf">analyze</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
    <span class="k">await</span> <span class="n">bus</span><span class="p">.</span><span class="nf">publish</span><span class="p">(</span><span class="sh">'</span><span class="s">analysis_complete</span><span class="sh">'</span><span class="p">,</span> <span class="p">{</span><span class="sh">'</span><span class="s">analysis</span><span class="sh">'</span><span class="p">:</span> <span class="n">analysis</span><span class="p">})</span>

<span class="n">bus</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="sh">'</span><span class="s">search_request</span><span class="sh">'</span><span class="p">,</span> <span class="n">search_handler</span><span class="p">)</span>
<span class="n">bus</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="sh">'</span><span class="s">search_results</span><span class="sh">'</span><span class="p">,</span> <span class="n">code_handler</span><span class="p">)</span>

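<span class="c1"># Trigger workflow (await is only valid inside a coroutine, so wrap it in main)
</span><span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="k">await</span> <span class="n">bus</span><span class="p">.</span><span class="nf">publish</span><span class="p">(</span><span class="sh">'</span><span class="s">search_request</span><span class="sh">'</span><span class="p">,</span> <span class="p">{</span><span class="sh">'</span><span class="s">query</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Flask security best practices</span><span class="sh">'</span><span class="p">})</span>

<span class="n">asyncio</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="nf">main</span><span class="p">())</span>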
</code></pre></div></div>
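
<p>One consequence of <code>return_exceptions=True</code> in <code>publish</code> is that a failing handler does not raise. Its exception comes back as a value in the <code>gather</code> result, which <code>publish</code> then discards. Below is a minimal sketch of how those failures could be separated from successful results. The helper <code>publish_and_collect</code> and the two demo handlers are hypothetical names for illustration, not part of the <code>MessageBus</code> class above.</p>

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

# A handler is any coroutine function that accepts the message dict.
Handler = Callable[[Dict[str, Any]], Awaitable[Any]]


async def publish_and_collect(
    handlers: List[Handler], message: Dict[str, Any]
) -> Tuple[List[Any], List[BaseException]]:
    """Run every handler concurrently and split results from exceptions."""
    outcomes = await asyncio.gather(
        *(handler(message) for handler in handlers),
        return_exceptions=True,  # exceptions are returned as values, not raised
    )
    results = [o for o in outcomes if not isinstance(o, BaseException)]
    errors = [o for o in outcomes if isinstance(o, BaseException)]
    return results, errors


async def ok_handler(message: Dict[str, Any]) -> str:
    """Hypothetical handler that succeeds."""
    return f"handled {message['query']}"


async def broken_handler(message: Dict[str, Any]) -> str:
    """Hypothetical handler that always fails."""
    raise ValueError("handler failed")


def demo() -> Tuple[List[Any], List[BaseException]]:
    """Dispatch one message to a working and a failing handler."""
    return asyncio.run(
        publish_and_collect([ok_handler, broken_handler], {"query": "flask"})
    )


if __name__ == "__main__":
    results, errors = demo()
    print("results:", results)
    print("errors:", errors)
```

<p>In a bus like the one above, the <code>errors</code> list is a natural place to log or republish failures on a dedicated topic, so that one broken subscriber does not fail silently.</p>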

<h2 id="production-architecture">Production Architecture</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
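<span class="kn">from</span> <span class="n">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Dict</span>
<span class="kn">import</span> <span class="n">asyncio</span>
<span class="kn">import</span> <span class="n">logging</span>
<span class="kn">import</span> <span class="n">time</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="nf">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>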

<span class="k">class</span> <span class="nc">AgentStatus</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="n">IDLE</span> <span class="o">=</span> <span class="sh">"</span><span class="s">idle</span><span class="sh">"</span>
    <span class="n">WORKING</span> <span class="o">=</span> <span class="sh">"</span><span class="s">working</span><span class="sh">"</span>
    <span class="n">FAILED</span> <span class="o">=</span> <span class="sh">"</span><span class="s">failed</span><span class="sh">"</span>

<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">AgentMetrics</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Track agent performance.</span><span class="sh">"""</span>
    <span class="n">total_tasks</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">successful_tasks</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">failed_tasks</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">total_latency</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span>
    <span class="n">total_cost</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span>

<span class="k">class</span> <span class="nc">ProductionAgent</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Production-ready agent with monitoring.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">client</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="n">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="n">client</span>
        <span class="n">self</span><span class="p">.</span><span class="n">status</span> <span class="o">=</span> <span class="n">AgentStatus</span><span class="p">.</span><span class="n">IDLE</span>
        <span class="n">self</span><span class="p">.</span><span class="n">metrics</span> <span class="o">=</span> <span class="nc">AgentMetrics</span><span class="p">()</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Execute with monitoring and error handling.</span><span class="sh">"""</span>
        <span class="n">self</span><span class="p">.</span><span class="n">status</span> <span class="o">=</span> <span class="n">AgentStatus</span><span class="p">.</span><span class="n">WORKING</span>
        <span class="n">start_time</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span>
        
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Execute with retries
</span>            <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">_execute_with_retry</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
            
            <span class="c1"># Update metrics
</span>            <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">successful_tasks</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_latency</span> <span class="o">+=</span> <span class="n">time</span><span class="p">.</span><span class="nf">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start_time</span>
            
            <span class="n">self</span><span class="p">.</span><span class="n">status</span> <span class="o">=</span> <span class="n">AgentStatus</span><span class="p">.</span><span class="n">IDLE</span>
            <span class="k">return</span> <span class="n">result</span>
            
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="c1"># Handle failure
</span>            <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">failed_tasks</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="n">self</span><span class="p">.</span><span class="n">status</span> <span class="o">=</span> <span class="n">AgentStatus</span><span class="p">.</span><span class="n">FAILED</span>
            
            <span class="c1"># Log error
</span>            <span class="n">logger</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Agent </span><span class="si">{</span><span class="n">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s"> failed: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            
            <span class="c1"># Raise for coordinator to handle
</span>            <span class="k">raise</span>
        
        <span class="k">finally</span><span class="p">:</span>
            <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_tasks</span> <span class="o">+=</span> <span class="mi">1</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">_execute_with_retry</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">max_retries</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Execute with exponential backoff.</span><span class="sh">"""</span>
        <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">max_retries</span><span class="p">):</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># Assumes a synchronous client: run the blocking call in a worker thread
</span>                <span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">to_thread</span><span class="p">(</span>
                    <span class="n">self</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="n">create</span><span class="p">,</span>
                    <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">"</span><span class="p">,</span>
                    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">2048</span><span class="p">,</span>
                    <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">task</span><span class="p">}]</span>
                <span class="p">)</span>
                
                <span class="c1"># Track cost
</span>                <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_cost</span> <span class="o">+=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_calculate_cost</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
                
                <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span>
                
            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">attempt</span> <span class="o">==</span> <span class="n">max_retries</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
                    <span class="k">raise</span>
                
                <span class="c1"># Exponential backoff
</span>                <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">sleep</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_calculate_cost</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Calculate API cost.</span><span class="sh">"""</span>
        <span class="n">input_tokens</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">usage</span><span class="p">.</span><span class="n">input_tokens</span>
        <span class="n">output_tokens</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">usage</span><span class="p">.</span><span class="n">output_tokens</span>
        
        <span class="c1"># Claude Sonnet pricing (example)
</span>        <span class="n">input_cost</span> <span class="o">=</span> <span class="n">input_tokens</span> <span class="o">*</span> <span class="mf">0.003</span> <span class="o">/</span> <span class="mi">1000</span>
        <span class="n">output_cost</span> <span class="o">=</span> <span class="n">output_tokens</span> <span class="o">*</span> <span class="mf">0.015</span> <span class="o">/</span> <span class="mi">1000</span>
        
        <span class="k">return</span> <span class="n">input_cost</span> <span class="o">+</span> <span class="n">output_cost</span>
    
    <span class="k">def</span> <span class="nf">get_metrics</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Export metrics.</span><span class="sh">"""</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">agent</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">status</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="n">status</span><span class="p">.</span><span class="n">value</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">total_tasks</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_tasks</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">success_rate</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">successful_tasks</span> <span class="o">/</span> <span class="nf">max</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_tasks</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">avg_latency</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_latency</span> <span class="o">/</span> <span class="nf">max</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">successful_tasks</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
            <span class="sh">'</span><span class="s">total_cost</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">total_cost</span>
        <span class="p">}</span>
</code></pre></div></div>
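
<p>The arithmetic inside <code>ProductionAgent</code> is easy to check by hand. With <code>max_retries=3</code> the loop sleeps only after the first two failures, because the third failure re-raises instead of sleeping. The sketch below reproduces the backoff schedule and the cost formula as standalone functions. The names <code>backoff_delays</code> and <code>estimate_cost</code> are illustrative and not part of the class above, and the prices are the example values from the snippet, not authoritative pricing.</p>

```python
from typing import List

# Example per-1K-token prices copied from the snippet above; real pricing may differ.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015


def backoff_delays(max_retries: int, base: float = 2.0) -> List[float]:
    """Seconds slept after each failed attempt.

    The final failed attempt re-raises instead of sleeping, so there is
    one fewer delay than there are attempts.
    """
    return [base ** attempt for attempt in range(max(max_retries - 1, 0))]


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Same formula as _calculate_cost, taking token counts directly."""
    return (
        input_tokens * INPUT_PRICE_PER_1K / 1000
        + output_tokens * OUTPUT_PRICE_PER_1K / 1000
    )


if __name__ == "__main__":
    print(backoff_delays(3))
    print(round(estimate_cost(1000, 500), 4))
```

<p>For three attempts the schedule is a 1 second sleep followed by a 2 second sleep, so a task that fails every time takes at least 3 seconds longer than a single call. A request with 1,000 input tokens and 500 output tokens comes to roughly $0.0105 at the example prices.</p>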

<h2 id="observability-and-monitoring">Observability and Monitoring</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">structlog</span>
<span class="kn">from</span> <span class="n">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
<span class="kn">from</span> <span class="n">opentelemetry.sdk.trace</span> <span class="kn">import</span> <span class="n">TracerProvider</span>
<span class="kn">from</span> <span class="n">opentelemetry.sdk.trace.export</span> <span class="kn">import</span> <span class="n">BatchSpanProcessor</span>
<span class="kn">from</span> <span class="n">opentelemetry.exporter.otlp.proto.grpc.trace_exporter</span> <span class="kn">import</span> <span class="n">OTLPSpanExporter</span>

<span class="c1"># Set up tracing
</span><span class="n">trace</span><span class="p">.</span><span class="nf">set_tracer_provider</span><span class="p">(</span><span class="nc">TracerProvider</span><span class="p">())</span>
<span class="n">tracer</span> <span class="o">=</span> <span class="n">trace</span><span class="p">.</span><span class="nf">get_tracer</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>

<span class="c1"># Add OTLP exporter
</span><span class="n">otlp_exporter</span> <span class="o">=</span> <span class="nc">OTLPSpanExporter</span><span class="p">(</span><span class="n">endpoint</span><span class="o">=</span><span class="sh">"</span><span class="s">http://localhost:4317</span><span class="sh">"</span><span class="p">)</span>
<span class="n">span_processor</span> <span class="o">=</span> <span class="nc">BatchSpanProcessor</span><span class="p">(</span><span class="n">otlp_exporter</span><span class="p">)</span>
<span class="n">trace</span><span class="p">.</span><span class="nf">get_tracer_provider</span><span class="p">().</span><span class="nf">add_span_processor</span><span class="p">(</span><span class="n">span_processor</span><span class="p">)</span>

<span class="c1"># Structured logging
</span><span class="n">logger</span> <span class="o">=</span> <span class="n">structlog</span><span class="p">.</span><span class="nf">get_logger</span><span class="p">()</span>

<span class="k">class</span> <span class="nc">ObservableCoordinator</span><span class="p">(</span><span class="n">Coordinator</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Coordinator with full observability.</span><span class="sh">"""</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Process with tracing and logging.</span><span class="sh">"""</span>
        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_as_current_span</span><span class="p">(</span><span class="sh">"</span><span class="s">coordinator.process</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">task</span><span class="sh">"</span><span class="p">,</span> <span class="n">task</span><span class="p">)</span>
            
            <span class="n">logger</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sh">"</span><span class="s">processing_task</span><span class="sh">"</span><span class="p">,</span> <span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">)</span>
            
            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># Decompose
</span>                <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_as_current_span</span><span class="p">(</span><span class="sh">"</span><span class="s">coordinator.decompose</span><span class="sh">"</span><span class="p">):</span>
                    <span class="n">subtasks</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">decompose</span><span class="p">(</span><span class="n">task</span><span class="p">)</span>
                    <span class="n">span</span><span class="p">.</span><span class="nf">set_attribute</span><span class="p">(</span><span class="sh">"</span><span class="s">subtask_count</span><span class="sh">"</span><span class="p">,</span> <span class="nf">len</span><span class="p">(</span><span class="n">subtasks</span><span class="p">))</span>
                    <span class="n">logger</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sh">"</span><span class="s">decomposed_task</span><span class="sh">"</span><span class="p">,</span> <span class="n">subtasks</span><span class="o">=</span><span class="nf">len</span><span class="p">(</span><span class="n">subtasks</span><span class="p">))</span>
                
                <span class="c1"># Execute
</span>                <span class="n">results</span> <span class="o">=</span> <span class="p">{}</span>
                <span class="k">for</span> <span class="n">subtask</span> <span class="ow">in</span> <span class="n">subtasks</span><span class="p">:</span>
                    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_as_current_span</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">agent.</span><span class="si">{</span><span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">agent</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">):</span>
                        <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="n">specialists</span><span class="p">[</span><span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">agent</span><span class="sh">'</span><span class="p">]].</span><span class="nf">execute</span><span class="p">(</span><span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">task</span><span class="sh">'</span><span class="p">])</span>
                        <span class="n">results</span><span class="p">[</span><span class="n">subtask</span><span class="p">[</span><span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">result</span>
                
                <span class="c1"># Synthesize
</span>                <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="nf">start_as_current_span</span><span class="p">(</span><span class="sh">"</span><span class="s">coordinator.synthesize</span><span class="sh">"</span><span class="p">):</span>
                    <span class="n">answer</span> <span class="o">=</span> <span class="k">await</span> <span class="n">self</span><span class="p">.</span><span class="nf">synthesize</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span>
                
                <span class="n">logger</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sh">"</span><span class="s">task_completed</span><span class="sh">"</span><span class="p">,</span> <span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">)</span>
                <span class="k">return</span> <span class="n">answer</span>
                
            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="n">logger</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="sh">"</span><span class="s">task_failed</span><span class="sh">"</span><span class="p">,</span> <span class="n">task</span><span class="o">=</span><span class="n">task</span><span class="p">,</span> <span class="n">error</span><span class="o">=</span><span class="nf">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span>
                <span class="n">span</span><span class="p">.</span><span class="nf">record_exception</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
                <span class="n">span</span><span class="p">.</span><span class="nf">set_status</span><span class="p">(</span><span class="n">trace</span><span class="p">.</span><span class="nc">Status</span><span class="p">(</span><span class="n">trace</span><span class="p">.</span><span class="n">StatusCode</span><span class="p">.</span><span class="n">ERROR</span><span class="p">))</span>
                <span class="k">raise</span>
</code></pre></div></div>

<h2 id="testing-multi-agent-systems">Testing Multi-Agent Systems</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">asyncio</span>
<span class="kn">import</span> <span class="n">pytest</span>
<span class="kn">from</span> <span class="n">unittest.mock</span> <span class="kn">import</span> <span class="n">Mock</span><span class="p">,</span> <span class="n">AsyncMock</span>

<span class="nd">@pytest.mark.asyncio</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">test_coordinator_decomposition</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Test task decomposition.</span><span class="sh">"""</span>
    <span class="n">coordinator</span> <span class="o">=</span> <span class="nc">Coordinator</span><span class="p">({})</span>
    <span class="n">coordinator</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Mock</span><span class="p">()</span>
    <span class="n">coordinator</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="n">create</span> <span class="o">=</span> <span class="nc">AsyncMock</span><span class="p">(</span><span class="n">return_value</span><span class="o">=</span><span class="nc">Mock</span><span class="p">(</span>
        <span class="n">content</span><span class="o">=</span><span class="p">[</span><span class="nc">Mock</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="sh">'</span><span class="s">[{</span><span class="sh">"</span><span class="s">id</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">1</span><span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="s">agent</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">search</span><span class="sh">"</span><span class="s">, </span><span class="sh">"</span><span class="s">task</span><span class="sh">"</span><span class="s">: </span><span class="sh">"</span><span class="s">Search docs</span><span class="sh">"</span><span class="s">}]</span><span class="sh">'</span><span class="p">)]</span>
    <span class="p">))</span>
    
    <span class="n">subtasks</span> <span class="o">=</span> <span class="k">await</span> <span class="n">coordinator</span><span class="p">.</span><span class="nf">decompose</span><span class="p">(</span><span class="sh">"</span><span class="s">Find Flask security info</span><span class="sh">"</span><span class="p">)</span>
    
    <span class="k">assert</span> <span class="nf">len</span><span class="p">(</span><span class="n">subtasks</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">subtasks</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="sh">'</span><span class="s">agent</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">search</span><span class="sh">'</span>

<span class="nd">@pytest.mark.asyncio</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">test_agent_execution</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Test agent execution with mock API.</span><span class="sh">"""</span>
    <span class="n">agent</span> <span class="o">=</span> <span class="nc">SearchAgent</span><span class="p">(</span><span class="nc">Mock</span><span class="p">())</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Mock</span><span class="p">()</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="n">create</span> <span class="o">=</span> <span class="nc">AsyncMock</span><span class="p">(</span><span class="n">return_value</span><span class="o">=</span><span class="nc">Mock</span><span class="p">(</span>
        <span class="n">content</span><span class="o">=</span><span class="p">[</span><span class="nc">Mock</span><span class="p">(</span><span class="n">text</span><span class="o">=</span><span class="sh">'</span><span class="s">Flask security</span><span class="sh">'</span><span class="p">)]</span>
    <span class="p">))</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">search_api</span><span class="p">.</span><span class="n">search</span> <span class="o">=</span> <span class="nc">AsyncMock</span><span class="p">(</span><span class="n">return_value</span><span class="o">=</span><span class="p">[</span><span class="sh">'</span><span class="s">result1</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">result2</span><span class="sh">'</span><span class="p">])</span>
    
    <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">agent</span><span class="p">.</span><span class="nf">execute</span><span class="p">(</span><span class="sh">"</span><span class="s">Search for Flask security</span><span class="sh">"</span><span class="p">)</span>
    
    <span class="k">assert</span> <span class="n">result</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span>
    <span class="n">agent</span><span class="p">.</span><span class="n">search_api</span><span class="p">.</span><span class="n">search</span><span class="p">.</span><span class="nf">assert_called_once</span><span class="p">()</span>

<span class="nd">@pytest.mark.asyncio</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">test_message_bus</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Test pub/sub message bus.</span><span class="sh">"""</span>
    <span class="n">bus</span> <span class="o">=</span> <span class="nc">MessageBus</span><span class="p">()</span>
    <span class="n">received</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="k">async</span> <span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">message</span><span class="p">):</span>
        <span class="n">received</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
    
    <span class="n">bus</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="sh">'</span><span class="s">test</span><span class="sh">'</span><span class="p">,</span> <span class="n">handler</span><span class="p">)</span>
    <span class="k">await</span> <span class="n">bus</span><span class="p">.</span><span class="nf">publish</span><span class="p">(</span><span class="sh">'</span><span class="s">test</span><span class="sh">'</span><span class="p">,</span> <span class="p">{</span><span class="sh">'</span><span class="s">data</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">test</span><span class="sh">'</span><span class="p">})</span>
    
    <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">sleep</span><span class="p">(</span><span class="mf">0.1</span><span class="p">)</span>  <span class="c1"># Wait for async handlers
</span>    <span class="k">assert</span> <span class="nf">len</span><span class="p">(</span><span class="n">received</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">received</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="sh">'</span><span class="s">data</span><span class="sh">'</span><span class="p">]</span> <span class="o">==</span> <span class="sh">'</span><span class="s">test</span><span class="sh">'</span>
</code></pre></div></div>

<h2 id="best-practices">Best Practices</h2>

<ol>
  <li>
    <p><strong>Design for failure</strong> - Agents will fail. Implement retries, circuit breakers, and fallbacks.</p>
  </li>
  <li>
    <p><strong>Keep agents focused</strong> - One agent, one responsibility. Don’t build god agents.</p>
  </li>
  <li>
    <p><strong>Use structured outputs</strong> - Define JSON schemas or Pydantic models so the coordinator can parse agent output reliably.</p>
  </li>
  <li>
    <p><strong>Monitor everything</strong> - Track latency, cost, and success rate per agent.</p>
  </li>
  <li>
    <p><strong>Test independently</strong> - Unit test each agent with mocked dependencies.</p>
  </li>
  <li>
    <p><strong>Version agents</strong> - Deploy different agent versions independently.</p>
  </li>
  <li>
    <p><strong>Implement timeouts</strong> - Agents can hang. Set aggressive timeouts.</p>
  </li>
  <li>
    <p><strong>Cache expensive operations</strong> - Search results, embeddings, analysis.</p>
  </li>
  <li>
    <p><strong>Manage cost</strong> - Track per-agent costs. Use cheaper models where possible.</p>
  </li>
  <li>
    <p><strong>Security boundaries</strong> - Agents may have different trust levels. Enforce permissions.</p>
  </li>
</ol>
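<p>The first and seventh practices (design for failure, implement timeouts) can be combined in one small helper. The following is a minimal sketch, not the implementation used elsewhere in this post: the name <strong>call_with_retry</strong> and all of its parameters are illustrative assumptions, and it relies only on the standard library's asyncio.wait_for for the deadline.</p>

```python
import asyncio
import random


async def call_with_retry(make_call, *, timeout_s=10.0, max_attempts=3, base_delay_s=0.5):
    """Run make_call() with a per-attempt timeout and exponential backoff.

    make_call is a zero-argument callable that returns a fresh awaitable,
    for example: lambda: agent.execute(task). All names here are hypothetical.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Cancel the attempt if the agent hangs past the deadline
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff (0.5s, 1s, 2s, ...) plus up to 10% jitter
            delay = base_delay_s * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"agent call failed after {max_attempts} attempts") from last_error
```

<p>A coordinator could wrap each specialist call this way, passing a lambda so that every attempt creates a new coroutine. Only timeouts and connection errors are retried here; which exceptions are safe to retry depends on the agent and is an assumption you should revisit.</p>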

<h2 id="conclusion">Conclusion</h2>

<p>Multi-agent systems transform complex AI tasks into manageable, composable components. By decomposing responsibilities, you gain testability, fault isolation, and scalability—at the cost of coordination complexity.</p>

<p>The patterns are well-established: coordinators orchestrate, specialists execute, message buses decouple. The tooling is maturing: LangGraph, AutoGen, and CrewAI provide frameworks. The economics work: scaling cheap and expensive agents independently keeps cost under control.</p>

<p>Start simple: coordinator + 2-3 specialists. Add observability early. Measure everything. Iterate based on bottlenecks.</p>

<p>Multi-agent systems aren’t always the answer—sometimes a well-prompted single agent suffices. But for complex, multi-step tasks requiring different expertise, they’re the right architecture.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://microsoft.github.io/autogen/">AutoGen Framework</a> - Microsoft’s multi-agent framework</li>
  <li><a href="https://www.langchain.com/langgraph">LangGraph</a> - LangChain’s graph-based agents</li>
  <li><a href="https://www.crewai.com/">CrewAI</a> - Role-based multi-agent system</li>
  <li><a href="https://www.multiagent.com/">Multi-Agent Systems Book</a> - Academic foundation</li>
  <li><a href="https://opentelemetry.io/">OpenTelemetry</a> - Observability standard</li>
  <li><a href="https://docs.anthropic.com/claude/docs/agent-patterns">Anthropic Agent Patterns</a> - Claude agent guidance</li>
  <li><a href="https://agentprotocol.ai/">Agent Protocol</a> - Standardized agent communication</li>
</ul>

<hr />

<p><em>Agentic AI systems from November 2025, covering multi-agent architectures and production patterns.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Architecture" /><category term="AI Agents" /><category term="Multi-Agent" /><category term="Architecture" /><category term="LLM" /><summary type="html"><![CDATA[Design multi-agent AI systems: agent coordination, communication patterns, task decomposition, and architectural patterns for complex agent systems.]]></summary></entry><entry><title type="html">WebSocket vs SSE vs Long Polling: Choosing the Right Protocol</title><link href="https://www.fratepietro.com/2025/websocket-sse-longpolling/" rel="alternate" type="text/html" title="WebSocket vs SSE vs Long Polling: Choosing the Right Protocol" /><published>2025-10-08T00:00:00+02:00</published><updated>2025-10-08T00:00:00+02:00</updated><id>https://www.fratepietro.com/2025/websocket-sse-longpolling</id><content type="html" xml:base="https://www.fratepietro.com/2025/websocket-sse-longpolling/"><![CDATA[<p>Real-time communication on the web comes in three flavors: <a href="https://datatracker.ietf.org/doc/html/rfc6455">WebSocket</a>, <a href="https://html.spec.whatwg.org/multipage/server-sent-events.html">Server-Sent Events (SSE)</a>, and Long Polling. Each has distinct trade-offs affecting latency, resource usage, and complexity.</p>

<p>I’ve built systems using all three. For a collaborative code editor, WebSocket was essential—bidirectional, low-latency updates. For a live dashboard showing server metrics, SSE was perfect—simple, unidirectional stream. For a notification system supporting old browsers, Long Polling grudgingly worked.</p>

<p>The right choice depends on your requirements: bidirectionality, browser support, firewall friendliness, and operational complexity.</p>

<h2 id="websocket-full-duplex-communication">WebSocket: Full Duplex Communication</h2>

<p><strong>How it works:</strong> The client sends an HTTP request asking to upgrade the connection. Once the server answers 101 Switching Protocols, the underlying TCP connection stays open, and both client and server can send messages at any time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Client                    Server
  |                         |
  |--- HTTP Upgrade -------&gt;|
  |&lt;-- 101 Switching -------|
  |                         |
  |&lt;-----&gt; Binary/Text &lt;---&gt;|  (Bidirectional messaging)
  |                         |
</code></pre></div></div>
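<p>The server proves it understood the upgrade by echoing a hash of the client's key. As a rough illustration of that step, here is a standard-library Python sketch of the Sec-WebSocket-Accept computation described in RFC 6455. The function names are my own, and a real server (such as the ws package used below) handles this for you.</p>

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455, section 1.3
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_response(sec_websocket_key: str) -> str:
    """Build the 101 response that completes the upgrade."""
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept_key(sec_websocket_key)}\r\n"
        "\r\n"
    )


# Sample key taken from RFC 6455
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

<p>After this response, no further HTTP is exchanged on the connection; everything that follows is WebSocket frames.</p>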

<h3 id="when-to-use-websocket">When to Use WebSocket</h3>

<ul>
  <li><strong>Real-time collaboration</strong> - Google Docs, Figma, VS Code Live Share</li>
  <li><strong>Chat applications</strong> - Slack, Discord, Telegram web</li>
  <li><strong>Multiplayer games</strong> - Real-time position updates, game state</li>
  <li><strong>Live trading platforms</strong> - Stock prices, order book updates</li>
  <li><strong>IoT dashboards</strong> - Sensor data streaming</li>
</ul>

<h3 id="nodejs-websocket-server">Node.js WebSocket Server</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">WebSocket</span> <span class="o">=</span> <span class="nf">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">ws</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">wss</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">WebSocket</span><span class="p">.</span><span class="nc">Server</span><span class="p">({</span> <span class="na">port</span><span class="p">:</span> <span class="mi">8080</span> <span class="p">});</span>

<span class="c1">// Track connected clients</span>
<span class="kd">const</span> <span class="nx">clients</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Set</span><span class="p">();</span>

<span class="nx">wss</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">connection</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">ws</span><span class="p">,</span> <span class="nx">req</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Client connected from</span><span class="dl">'</span><span class="p">,</span> <span class="nx">req</span><span class="p">.</span><span class="nx">socket</span><span class="p">.</span><span class="nx">remoteAddress</span><span class="p">);</span>
    <span class="nx">clients</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="nx">ws</span><span class="p">);</span>
    
    <span class="c1">// Send welcome message</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">send</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">({</span>
        <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">welcome</span><span class="dl">'</span><span class="p">,</span>
        <span class="na">timestamp</span><span class="p">:</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">()</span>
    <span class="p">}));</span>
    
    <span class="c1">// Handle messages</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">message</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">message</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Received:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">message</span><span class="p">.</span><span class="nf">toString</span><span class="p">());</span>
        
        <span class="k">try</span> <span class="p">{</span>
            <span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="nx">message</span><span class="p">);</span>
            
            <span class="c1">// Broadcast to all clients except sender</span>
            <span class="nx">clients</span><span class="p">.</span><span class="nf">forEach</span><span class="p">(</span><span class="nx">client</span> <span class="o">=&gt;</span> <span class="p">{</span>
                <span class="k">if </span><span class="p">(</span><span class="nx">client</span> <span class="o">!==</span> <span class="nx">ws</span> <span class="o">&amp;&amp;</span> <span class="nx">client</span><span class="p">.</span><span class="nx">readyState</span> <span class="o">===</span> <span class="nx">WebSocket</span><span class="p">.</span><span class="nx">OPEN</span><span class="p">)</span> <span class="p">{</span>
                    <span class="nx">client</span><span class="p">.</span><span class="nf">send</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">({</span>
                        <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">broadcast</span><span class="dl">'</span><span class="p">,</span>
                        <span class="na">data</span><span class="p">:</span> <span class="nx">data</span><span class="p">,</span>
                        <span class="na">timestamp</span><span class="p">:</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">()</span>
                    <span class="p">}));</span>
                <span class="p">}</span>
            <span class="p">});</span>
        <span class="p">}</span> <span class="k">catch </span><span class="p">(</span><span class="nx">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="nx">console</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="dl">'</span><span class="s1">Invalid message:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">error</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">});</span>
    
    <span class="c1">// Handle errors</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">error</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">error</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="dl">'</span><span class="s1">WebSocket error:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">error</span><span class="p">);</span>
    <span class="p">});</span>
    
    <span class="c1">// Handle disconnect</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">close</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">code</span><span class="p">,</span> <span class="nx">reason</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Client disconnected:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">code</span><span class="p">,</span> <span class="nx">reason</span><span class="p">.</span><span class="nf">toString</span><span class="p">());</span>
        <span class="nx">clients</span><span class="p">.</span><span class="k">delete</span><span class="p">(</span><span class="nx">ws</span><span class="p">);</span>
    <span class="p">});</span>
    
    <span class="c1">// Heartbeat to detect dead connections</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nx">isAlive</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">pong</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">ws</span><span class="p">.</span><span class="nx">isAlive</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
    <span class="p">});</span>
<span class="p">});</span>

<span class="c1">// Ping clients every 30 seconds</span>
<span class="kd">const</span> <span class="nx">interval</span> <span class="o">=</span> <span class="nf">setInterval</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">clients</span><span class="p">.</span><span class="nf">forEach</span><span class="p">(</span><span class="nx">ws</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="k">if </span><span class="p">(</span><span class="nx">ws</span><span class="p">.</span><span class="nx">isAlive</span> <span class="o">===</span> <span class="kc">false</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nx">ws</span><span class="p">.</span><span class="nf">terminate</span><span class="p">();</span>
        <span class="p">}</span>
        
        <span class="nx">ws</span><span class="p">.</span><span class="nx">isAlive</span> <span class="o">=</span> <span class="kc">false</span><span class="p">;</span>
        <span class="nx">ws</span><span class="p">.</span><span class="nf">ping</span><span class="p">();</span>
    <span class="p">});</span>
<span class="p">},</span> <span class="mi">30000</span><span class="p">);</span>

<span class="nx">wss</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">close</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nf">clearInterval</span><span class="p">(</span><span class="nx">interval</span><span class="p">);</span>
<span class="p">});</span>

<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">WebSocket server running on ws://localhost:8080</span><span class="dl">'</span><span class="p">);</span>
</code></pre></div></div>

<h3 id="browser-client">Browser Client</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">ws</span><span class="p">;</span>

<span class="c1">// Open the connection and attach handlers; called again after each disconnect</span>
<span class="kd">function</span> <span class="nf">connect</span><span class="p">()</span> <span class="p">{</span>
    <span class="nx">ws</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">WebSocket</span><span class="p">(</span><span class="dl">'</span><span class="s1">ws://localhost:8080</span><span class="dl">'</span><span class="p">);</span>

    <span class="nx">ws</span><span class="p">.</span><span class="nx">onopen</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Connected</span><span class="dl">'</span><span class="p">);</span>
        <span class="nx">ws</span><span class="p">.</span><span class="nf">send</span><span class="p">(</span><span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">({</span> <span class="na">action</span><span class="p">:</span> <span class="dl">'</span><span class="s1">subscribe</span><span class="dl">'</span><span class="p">,</span> <span class="na">channel</span><span class="p">:</span> <span class="dl">'</span><span class="s1">updates</span><span class="dl">'</span> <span class="p">}));</span>
    <span class="p">};</span>

    <span class="nx">ws</span><span class="p">.</span><span class="nx">onmessage</span> <span class="o">=</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="nx">event</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Received:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">data</span><span class="p">);</span>

        <span class="k">if </span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">type</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">welcome</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
            <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Welcome message received</span><span class="dl">'</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">};</span>

    <span class="nx">ws</span><span class="p">.</span><span class="nx">onerror</span> <span class="o">=</span> <span class="p">(</span><span class="nx">error</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="dl">'</span><span class="s1">WebSocket error:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">error</span><span class="p">);</span>
    <span class="p">};</span>

    <span class="nx">ws</span><span class="p">.</span><span class="nx">onclose</span> <span class="o">=</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Disconnected:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">code</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">reason</span><span class="p">);</span>

        <span class="c1">// Reconnect after a fixed 3-second delay</span>
        <span class="nf">setTimeout</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
            <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Reconnecting...</span><span class="dl">'</span><span class="p">);</span>
            <span class="nf">connect</span><span class="p">();</span>
        <span class="p">},</span> <span class="mi">3000</span><span class="p">);</span>
    <span class="p">};</span>
<span class="p">}</span>

<span class="nf">connect</span><span class="p">();</span>
</code></pre></div></div>

<h3 id="production-considerations">Production Considerations</h3>

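<p><strong>Load Balancing:</strong> Each connection is pinned to the server that accepted it, so a broadcast must also reach clients connected to other instances. Use sticky sessions or, for cross-server delivery, shared pub/sub:</p>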

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Shared pub/sub with Redis</span>
<span class="kd">const</span> <span class="nx">Redis</span> <span class="o">=</span> <span class="nf">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">ioredis</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">pub</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Redis</span><span class="p">();</span>
<span class="kd">const</span> <span class="nx">sub</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Redis</span><span class="p">();</span>

<span class="nx">sub</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="dl">'</span><span class="s1">messages</span><span class="dl">'</span><span class="p">);</span>
<span class="nx">sub</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">message</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">channel</span><span class="p">,</span> <span class="nx">message</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="c1">// Broadcast to local WebSocket clients</span>
    <span class="nx">clients</span><span class="p">.</span><span class="nf">forEach</span><span class="p">(</span><span class="nx">client</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="k">if </span><span class="p">(</span><span class="nx">client</span><span class="p">.</span><span class="nx">readyState</span> <span class="o">===</span> <span class="nx">WebSocket</span><span class="p">.</span><span class="nx">OPEN</span><span class="p">)</span> <span class="p">{</span>
            <span class="nx">client</span><span class="p">.</span><span class="nf">send</span><span class="p">(</span><span class="nx">message</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">});</span>
<span class="p">});</span>

<span class="c1">// When receiving from WebSocket client</span>
<span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">message</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">message</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="c1">// Publish to Redis (reaches all servers)</span>
    <span class="nx">pub</span><span class="p">.</span><span class="nf">publish</span><span class="p">(</span><span class="dl">'</span><span class="s1">messages</span><span class="dl">'</span><span class="p">,</span> <span class="nx">message</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>
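
<p>To see why the shared channel matters, here is a minimal in-process sketch. It assumes Node's built-in <code class="language-plaintext highlighter-rouge">EventEmitter</code> as a stand-in for the Redis channel, and <code class="language-plaintext highlighter-rouge">createServer</code>, <code class="language-plaintext highlighter-rouge">connect</code> and <code class="language-plaintext highlighter-rouge">handleIncoming</code> are hypothetical names used only for illustration. It is not a replacement for the Redis code above, just a way to watch the fan-out happen.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const { EventEmitter } = require('events');

// Stand-in for the Redis channel: every "server" shares this emitter.
// In production this role is played by Redis pub/sub, not an in-process object.
const channel = new EventEmitter();

// Hypothetical helper: builds one server with its own set of local clients
function createServer(name) {
    const clients = new Set();

    // Equivalent of sub.on('message', ...): fan out to local clients only
    channel.on('messages', (message) => {
        clients.forEach((client) => client.received.push(message));
    });

    return {
        name,
        connect(clientName) {
            const client = { name: clientName, received: [] };
            clients.add(client);
            return client;
        },
        // Equivalent of pub.publish('messages', message)
        handleIncoming(message) {
            channel.emit('messages', message);
        }
    };
}

const serverA = createServer('A');
const serverB = createServer('B');
const alice = serverA.connect('alice');
const bob = serverB.connect('bob');

// The message arrives at server A but reaches Bob on server B as well
serverA.handleIncoming('hello from alice');
console.log(alice.received, bob.received);
</code></pre></div></div>

<p>Note that the sender receives its own message back, exactly as in the Redis version. If that echo is unwanted, the message needs a sender id that the broadcasting side can filter on.</p>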

<p><strong>Monitoring:</strong></p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">metrics</span> <span class="o">=</span> <span class="p">{</span>
    <span class="na">connections</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
    <span class="na">messagesReceived</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
    <span class="na">messagesSent</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
    <span class="na">errors</span><span class="p">:</span> <span class="mi">0</span>
<span class="p">};</span>

<span class="nx">wss</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">connection</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">ws</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">metrics</span><span class="p">.</span><span class="nx">connections</span><span class="o">++</span><span class="p">;</span>
    
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">message</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="nx">metrics</span><span class="p">.</span><span class="nx">messagesReceived</span><span class="o">++</span><span class="p">);</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">error</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="nx">metrics</span><span class="p">.</span><span class="nx">errors</span><span class="o">++</span><span class="p">);</span>
    <span class="nx">ws</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">close</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="nx">metrics</span><span class="p">.</span><span class="nx">connections</span><span class="o">--</span><span class="p">);</span>
<span class="p">});</span>

<span class="c1">// Expose metrics endpoint</span>
<span class="nx">app</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">/metrics</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">req</span><span class="p">,</span> <span class="nx">res</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">json</span><span class="p">(</span><span class="nx">metrics</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>Read <a href="https://datatracker.ietf.org/doc/html/rfc6455">WebSocket RFC 6455</a> for protocol details.</p>

<h2 id="server-sent-events-unidirectional-streaming">Server-Sent Events: Unidirectional Streaming</h2>

<p><strong>How it works:</strong> The client opens a plain HTTP request, and the server keeps the response open and streams events as <code class="language-plaintext highlighter-rouge">text/event-stream</code>.</p>

<h3 id="when-to-use-sse">When to Use SSE</h3>

<ul>
  <li><strong>Live feeds</strong> - News, sports scores, social media updates</li>
  <li><strong>Monitoring dashboards</strong> - Metrics, logs, alerts</li>
  <li><strong>Notifications</strong> - Push notifications, status updates</li>
  <li><strong>Progress tracking</strong> - File upload progress, job status</li>
  <li><strong>Stock tickers</strong> - Price updates (when client doesn’t need to send)</li>
</ul>

<h3 id="express-sse-server">Express SSE Server</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">express</span> <span class="o">=</span> <span class="nf">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">express</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">app</span> <span class="o">=</span> <span class="nf">express</span><span class="p">();</span>

<span class="c1">// SSE endpoint</span>
<span class="nx">app</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">/events</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">req</span><span class="p">,</span> <span class="nx">res</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="c1">// Set SSE headers</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">setHeader</span><span class="p">(</span><span class="dl">'</span><span class="s1">Content-Type</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">text/event-stream</span><span class="dl">'</span><span class="p">);</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">setHeader</span><span class="p">(</span><span class="dl">'</span><span class="s1">Cache-Control</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">no-cache</span><span class="dl">'</span><span class="p">);</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">setHeader</span><span class="p">(</span><span class="dl">'</span><span class="s1">Connection</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">keep-alive</span><span class="dl">'</span><span class="p">);</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">setHeader</span><span class="p">(</span><span class="dl">'</span><span class="s1">Access-Control-Allow-Origin</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">*</span><span class="dl">'</span><span class="p">);</span>
    
    <span class="c1">// Set reconnection time</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="dl">'</span><span class="s1">retry: 10000</span><span class="se">\n\n</span><span class="dl">'</span><span class="p">);</span>
    
    <span class="c1">// Send initial message</span>
    <span class="nx">res</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="s2">`data: </span><span class="p">${</span><span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">connected</span><span class="dl">'</span><span class="p">,</span> <span class="na">time</span><span class="p">:</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">()</span> <span class="p">})}</span><span class="s2">\n\n`</span><span class="p">);</span>
    
    <span class="c1">// Send updates every second</span>
    <span class="kd">const</span> <span class="nx">interval</span> <span class="o">=</span> <span class="nf">setInterval</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{</span>
            <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">update</span><span class="dl">'</span><span class="p">,</span>
            <span class="na">value</span><span class="p">:</span> <span class="nb">Math</span><span class="p">.</span><span class="nf">random</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span>
            <span class="na">timestamp</span><span class="p">:</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">()</span>
        <span class="p">};</span>
        
        <span class="c1">// SSE format: "data: &lt;json&gt;\n\n"</span>
        <span class="nx">res</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="s2">`data: </span><span class="p">${</span><span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">(</span><span class="nx">data</span><span class="p">)}</span><span class="s2">\n\n`</span><span class="p">);</span>
    <span class="p">},</span> <span class="mi">1000</span><span class="p">);</span>
    
    <span class="c1">// Clean up on disconnect</span>
    <span class="nx">req</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">close</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nf">clearInterval</span><span class="p">(</span><span class="nx">interval</span><span class="p">);</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Client disconnected</span><span class="dl">'</span><span class="p">);</span>
    <span class="p">});</span>
<span class="p">});</span>

<span class="nx">app</span><span class="p">.</span><span class="nf">listen</span><span class="p">(</span><span class="mi">3000</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">SSE server running on http://localhost:3000</span><span class="dl">'</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<h3 id="browser-client-1">Browser Client</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">eventSource</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">EventSource</span><span class="p">(</span><span class="dl">'</span><span class="s1">http://localhost:3000/events</span><span class="dl">'</span><span class="p">);</span>

<span class="nx">eventSource</span><span class="p">.</span><span class="nx">onopen</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">SSE connection opened</span><span class="dl">'</span><span class="p">);</span>
<span class="p">};</span>

<span class="nx">eventSource</span><span class="p">.</span><span class="nx">onmessage</span> <span class="o">=</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="nx">event</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Received:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">data</span><span class="p">);</span>
    
    <span class="c1">// Update UI</span>
    <span class="nb">document</span><span class="p">.</span><span class="nf">getElementById</span><span class="p">(</span><span class="dl">'</span><span class="s1">value</span><span class="dl">'</span><span class="p">).</span><span class="nx">textContent</span> <span class="o">=</span> <span class="nx">data</span><span class="p">.</span><span class="nx">value</span><span class="p">.</span><span class="nf">toFixed</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="p">};</span>

<span class="nx">eventSource</span><span class="p">.</span><span class="nx">onerror</span> <span class="o">=</span> <span class="p">(</span><span class="nx">error</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="dl">'</span><span class="s1">SSE error:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">error</span><span class="p">);</span>
    
    <span class="k">if </span><span class="p">(</span><span class="nx">eventSource</span><span class="p">.</span><span class="nx">readyState</span> <span class="o">===</span> <span class="nx">EventSource</span><span class="p">.</span><span class="nx">CLOSED</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">SSE connection closed</span><span class="dl">'</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">};</span>

<span class="c1">// Named events</span>
<span class="nx">eventSource</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">custom-event</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Custom event:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">event</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<h3 id="named-events">Named Events</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Server: Send named events</span>
<span class="nx">res</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="dl">'</span><span class="s1">event: alert</span><span class="se">\n</span><span class="dl">'</span><span class="p">);</span>
<span class="nx">res</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="s2">`data: </span><span class="p">${</span><span class="nx">JSON</span><span class="p">.</span><span class="nf">stringify</span><span class="p">({</span> <span class="na">message</span><span class="p">:</span> <span class="dl">'</span><span class="s1">System alert!</span><span class="dl">'</span> <span class="p">})}</span><span class="s2">\n\n`</span><span class="p">);</span>

<span class="c1">// Client: Listen for specific events</span>
<span class="nx">eventSource</span><span class="p">.</span><span class="nf">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">alert</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">event</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">alert</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="nx">event</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>
    <span class="nf">showAlert</span><span class="p">(</span><span class="nx">alert</span><span class="p">.</span><span class="nx">message</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>
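
<p>The wire format used in the snippets above (optional <code class="language-plaintext highlighter-rouge">event:</code>, <code class="language-plaintext highlighter-rouge">id:</code> and <code class="language-plaintext highlighter-rouge">retry:</code> fields, one or more <code class="language-plaintext highlighter-rouge">data:</code> lines, then a blank line) can be wrapped in a small helper. This is a sketch, and <code class="language-plaintext highlighter-rouge">formatSseEvent</code> is a hypothetical name, not part of Express or of the browser API.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical helper: serialize one event into the text/event-stream format
function formatSseEvent({ event, id, retry, data }) {
    const lines = [];
    if (event !== undefined) lines.push(`event: ${event}`);
    if (id !== undefined) lines.push(`id: ${id}`);
    if (retry !== undefined) lines.push(`retry: ${retry}`);

    // Strings are sent as-is, everything else as JSON
    const payload = typeof data === 'string' ? data : JSON.stringify(data ?? null);

    // A payload with line breaks needs one "data:" line per line,
    // otherwise part of the payload would be parsed as a separate field
    payload.split(/\r\n|\r|\n/).forEach((line) => lines.push(`data: ${line}`));

    // A blank line terminates the event
    return lines.join('\n') + '\n\n';
}

console.log(formatSseEvent({ event: 'alert', data: { message: 'System alert!' } }));
</code></pre></div></div>

<p>On the server this would be used as <code class="language-plaintext highlighter-rouge">res.write(formatSseEvent({ event: 'alert', data: { message: 'System alert!' } }))</code>. Setting <code class="language-plaintext highlighter-rouge">id</code> is worth considering: after an automatic reconnect the browser sends the last id it saw in the <code class="language-plaintext highlighter-rouge">Last-Event-ID</code> request header, which lets the server resume from that point.</p>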

<p>Read <a href="https://html.spec.whatwg.org/multipage/server-sent-events.html">Server-Sent Events spec</a> for details.</p>

<h2 id="long-polling-request-response-loop">Long Polling: Request-Response Loop</h2>

<p><strong>How it works:</strong> The client makes a request, the server holds it open until data is available or a timeout expires, and the client immediately issues a new request.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Client                    Server
  |                         |
  |--- HTTP Request -------&gt;|
  |                         | (Wait for data or timeout)
  |&lt;-- Response with data --|
  |                         |
  |--- HTTP Request -------&gt;| (Immediately reconnect)
  |                         |
</code></pre></div></div>

<h3 id="when-to-use-long-polling">When to Use Long Polling</h3>

<ul>
  <li><strong>Legacy browser support</strong> - IE9, old mobile browsers</li>
  <li><strong>Firewall restrictions</strong> - Corporate networks blocking WebSocket</li>
  <li><strong>Simple notifications</strong> - Infrequent updates</li>
  <li><strong>Fallback mechanism</strong> - When WebSocket unavailable</li>
</ul>

<h3 id="express-long-polling-server">Express Long Polling Server</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">express</span> <span class="o">=</span> <span class="nf">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">express</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">app</span> <span class="o">=</span> <span class="nf">express</span><span class="p">();</span>

<span class="c1">// In-memory queue of pending messages</span>
<span class="kd">const</span> <span class="nx">messageQueues</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Map</span><span class="p">();</span>

<span class="nx">app</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">/poll</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">req</span><span class="p">,</span> <span class="nx">res</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">userId</span> <span class="o">=</span> <span class="nx">req</span><span class="p">.</span><span class="nx">query</span><span class="p">.</span><span class="nx">userId</span><span class="p">;</span>
    
    <span class="k">if </span><span class="p">(</span><span class="o">!</span><span class="nx">messageQueues</span><span class="p">.</span><span class="nf">has</span><span class="p">(</span><span class="nx">userId</span><span class="p">))</span> <span class="p">{</span>
        <span class="nx">messageQueues</span><span class="p">.</span><span class="nf">set</span><span class="p">(</span><span class="nx">userId</span><span class="p">,</span> <span class="p">[]);</span>
    <span class="p">}</span>
    
    <span class="kd">const</span> <span class="nx">queue</span> <span class="o">=</span> <span class="nx">messageQueues</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="nx">userId</span><span class="p">);</span>
    
    <span class="c1">// If messages available, send immediately</span>
    <span class="k">if </span><span class="p">(</span><span class="nx">queue</span><span class="p">.</span><span class="nx">length</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">res</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">messages</span><span class="p">:</span> <span class="nx">queue</span><span class="p">.</span><span class="nf">splice</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nx">queue</span><span class="p">.</span><span class="nx">length</span><span class="p">)</span> <span class="p">});</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    
    <span class="c1">// Otherwise, wait for new message or timeout</span>
    <span class="kd">const</span> <span class="nx">timeout</span> <span class="o">=</span> <span class="nf">setTimeout</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="c1">// Stop the queue check so it cannot respond a second time</span>
        <span class="nf">clearInterval</span><span class="p">(</span><span class="nx">checkInterval</span><span class="p">);</span>
        <span class="nx">res</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">messages</span><span class="p">:</span> <span class="p">[]</span> <span class="p">});</span>
    <span class="p">},</span> <span class="mi">30000</span><span class="p">);</span>  <span class="c1">// 30 second timeout</span>
    
    <span class="c1">// Check the queue every 100 ms and respond as soon as a message arrives</span>
    <span class="kd">const</span> <span class="nx">checkInterval</span> <span class="o">=</span> <span class="nf">setInterval</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="k">if </span><span class="p">(</span><span class="nx">queue</span><span class="p">.</span><span class="nx">length</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
            <span class="nf">clearTimeout</span><span class="p">(</span><span class="nx">timeout</span><span class="p">);</span>
            <span class="nf">clearInterval</span><span class="p">(</span><span class="nx">checkInterval</span><span class="p">);</span>
            <span class="nx">res</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">messages</span><span class="p">:</span> <span class="nx">queue</span><span class="p">.</span><span class="nf">splice</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nx">queue</span><span class="p">.</span><span class="nx">length</span><span class="p">)</span> <span class="p">});</span>
        <span class="p">}</span>
    <span class="p">},</span> <span class="mi">100</span><span class="p">);</span>
    
    <span class="c1">// Clean up on disconnect</span>
    <span class="nx">req</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">close</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nf">clearTimeout</span><span class="p">(</span><span class="nx">timeout</span><span class="p">);</span>
        <span class="nf">clearInterval</span><span class="p">(</span><span class="nx">checkInterval</span><span class="p">);</span>
    <span class="p">});</span>
<span class="p">});</span>

<span class="c1">// Endpoint to send message to user</span>
<span class="nx">app</span><span class="p">.</span><span class="nf">post</span><span class="p">(</span><span class="dl">'</span><span class="s1">/send</span><span class="dl">'</span><span class="p">,</span> <span class="nx">express</span><span class="p">.</span><span class="nf">json</span><span class="p">(),</span> <span class="p">(</span><span class="nx">req</span><span class="p">,</span> <span class="nx">res</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="p">{</span> <span class="nx">userId</span><span class="p">,</span> <span class="nx">message</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">req</span><span class="p">.</span><span class="nx">body</span><span class="p">;</span>
    
    <span class="k">if </span><span class="p">(</span><span class="o">!</span><span class="nx">messageQueues</span><span class="p">.</span><span class="nf">has</span><span class="p">(</span><span class="nx">userId</span><span class="p">))</span> <span class="p">{</span>
        <span class="nx">messageQueues</span><span class="p">.</span><span class="nf">set</span><span class="p">(</span><span class="nx">userId</span><span class="p">,</span> <span class="p">[]);</span>
    <span class="p">}</span>
    
    <span class="nx">messageQueues</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="nx">userId</span><span class="p">).</span><span class="nf">push</span><span class="p">({</span>
        <span class="nx">message</span><span class="p">,</span>
        <span class="na">timestamp</span><span class="p">:</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">()</span>
    <span class="p">});</span>
    
    <span class="nx">res</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">success</span><span class="p">:</span> <span class="kc">true</span> <span class="p">});</span>
<span class="p">});</span>

<span class="nx">app</span><span class="p">.</span><span class="nf">listen</span><span class="p">(</span><span class="mi">3000</span><span class="p">);</span>
</code></pre></div></div>
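
<p>The server above checks the queue every 100 ms. An alternative is to park the waiting request and wake it the moment a message is sent. The following is a minimal sketch of that idea under stated assumptions: <code class="language-plaintext highlighter-rouge">createMailbox</code>, <code class="language-plaintext highlighter-rouge">poll</code> and <code class="language-plaintext highlighter-rouge">send</code> are hypothetical names, and it keeps at most one parked poll per user.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical helper: per-user mailbox that parks waiting polls
// instead of checking the queue on an interval
function createMailbox() {
    const queues = new Map();   // userId -> messages not yet delivered
    const waiters = new Map();  // userId -> { callback, timer } of a parked poll

    // Answer the parked poll for this user, if there is one
    function release(userId, messages) {
        const waiter = waiters.get(userId);
        if (!waiter) return false;
        clearTimeout(waiter.timer);
        waiters.delete(userId);
        waiter.callback(messages);
        return true;
    }

    function poll(userId, callback, timeoutMs = 30000) {
        const queue = queues.get(userId) || [];
        if (queue.length !== 0) {
            // Messages are already waiting: deliver them immediately
            queues.delete(userId);
            callback(queue);
            return () => {};
        }
        // An older parked poll for the same user gets an empty response
        release(userId, []);
        const timer = setTimeout(() => release(userId, []), timeoutMs);
        const waiter = { callback, timer };
        waiters.set(userId, waiter);
        // The returned function cancels this specific poll (call it on disconnect)
        return () => {
            if (waiters.get(userId) === waiter) {
                clearTimeout(timer);
                waiters.delete(userId);
            }
        };
    }

    function send(userId, message) {
        // Deliver straight to a parked poll, otherwise queue for the next one
        if (!release(userId, [message])) {
            const queue = queues.get(userId) || [];
            queue.push(message);
            queues.set(userId, queue);
        }
    }

    return { poll, send };
}
</code></pre></div></div>

<p>In the Express handler this could be wired up as <code class="language-plaintext highlighter-rouge">const cancel = mailbox.poll(userId, (messages) =&gt; res.json({ messages }));</code> followed by <code class="language-plaintext highlighter-rouge">res.on('close', cancel);</code>. Each response is sent exactly once, and no timer keeps running after the request has been answered.</p>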

<h3 id="browser-client-2">Browser Client</h3>

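<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">polling</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>

<span class="c1">// Helper used for the error backoff; userId is assumed to be defined elsewhere on the page</span>
<span class="kd">const</span> <span class="nx">sleep</span> <span class="o">=</span> <span class="p">(</span><span class="nx">ms</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="k">new</span> <span class="nc">Promise</span><span class="p">(</span><span class="nx">resolve</span> <span class="o">=&gt;</span> <span class="nf">setTimeout</span><span class="p">(</span><span class="nx">resolve</span><span class="p">,</span> <span class="nx">ms</span><span class="p">));</span>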

<span class="k">async</span> <span class="kd">function</span> <span class="nf">poll</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">while </span><span class="p">(</span><span class="nx">polling</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">try</span> <span class="p">{</span>
            <span class="kd">const</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nf">fetch</span><span class="p">(</span><span class="s2">`/poll?userId=</span><span class="p">${</span><span class="nx">userId</span><span class="p">}</span><span class="s2">`</span><span class="p">);</span>
            <span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">response</span><span class="p">.</span><span class="nf">json</span><span class="p">();</span>
            
            <span class="k">if </span><span class="p">(</span><span class="nx">data</span><span class="p">.</span><span class="nx">messages</span><span class="p">.</span><span class="nx">length</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                <span class="nx">data</span><span class="p">.</span><span class="nx">messages</span><span class="p">.</span><span class="nf">forEach</span><span class="p">(</span><span class="nx">handleMessage</span><span class="p">);</span>
            <span class="p">}</span>
            
        <span class="p">}</span> <span class="k">catch </span><span class="p">(</span><span class="nx">error</span><span class="p">)</span> <span class="p">{</span>
            <span class="nx">console</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="dl">'</span><span class="s1">Polling error:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">error</span><span class="p">);</span>
            <span class="k">await</span> <span class="nf">sleep</span><span class="p">(</span><span class="mi">5000</span><span class="p">);</span>  <span class="c1">// Back off on error</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="kd">function</span> <span class="nf">handleMessage</span><span class="p">(</span><span class="nx">message</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Received:</span><span class="dl">'</span><span class="p">,</span> <span class="nx">message</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Start polling</span>
<span class="nf">poll</span><span class="p">();</span>

<span class="c1">// Stop polling</span>
<span class="kd">function</span> <span class="nf">stopPolling</span><span class="p">()</span> <span class="p">{</span>
    <span class="nx">polling</span> <span class="o">=</span> <span class="kc">false</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
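
<p>The client above backs off for a fixed 5 seconds after an error. When many clients fail at once, a growing delay with some randomness spreads the retries out. A minimal sketch, assuming a hypothetical <code class="language-plaintext highlighter-rouge">backoffDelay</code> helper with illustrative defaults (1 second base, 30 second cap):</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical helper: delay in ms before retry number `attempt` (0-based).
// The window doubles from baseMs up to maxMs. `random` is injectable so the
// jitter can be made deterministic in tests.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
    const exponential = Math.min(maxMs, baseMs * 2 ** attempt);

    // Pick a point in the upper half of the window so that clients
    // which failed together do not all retry at the same instant
    return Math.round(exponential / 2 + random() * (exponential / 2));
}

// Upper bounds of the first few delays (random pinned to 1)
console.log([0, 1, 2, 3].map((n) => backoffDelay(n, { random: () => 1 })));
</code></pre></div></div>

<p>In the polling loop this would replace the fixed sleep: keep a <code class="language-plaintext highlighter-rouge">failures</code> counter, call <code class="language-plaintext highlighter-rouge">await sleep(backoffDelay(failures++))</code> in the <code class="language-plaintext highlighter-rouge">catch</code> block, and reset the counter to 0 after a successful response.</p>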

<h2 id="comparison-table">Comparison Table</h2>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>WebSocket</th>
      <th>SSE</th>
      <th>Long Polling</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Direction</strong></td>
      <td>Bidirectional</td>
      <td>Server → Client</td>
      <td>Bidirectional (via requests)</td>
    </tr>
    <tr>
      <td><strong>Protocol</strong></td>
      <td>WebSocket (ws/wss, after HTTP upgrade)</td>
      <td>HTTP</td>
      <td>HTTP</td>
    </tr>
    <tr>
      <td><strong>Latency</strong></td>
      <td>&lt;10ms</td>
      <td>&lt;50ms</td>
      <td>100-500ms</td>
    </tr>
    <tr>
      <td><strong>Overhead</strong></td>
      <td>Low</td>
      <td>Low</td>
      <td>High (HTTP headers each poll)</td>
    </tr>
    <tr>
      <td><strong>Browser Support</strong></td>
      <td>IE10+</td>
      <td>IE and legacy Edge (polyfill), others native</td>
      <td>Universal</td>
    </tr>
    <tr>
      <td><strong>Firewall Friendly</strong></td>
      <td>Sometimes blocked</td>
      <td>Yes (HTTP)</td>
      <td>Yes (HTTP)</td>
    </tr>
    <tr>
      <td><strong>Reconnection</strong></td>
      <td>Manual</td>
      <td>Automatic</td>
      <td>Manual</td>
    </tr>
    <tr>
      <td><strong>Binary Data</strong></td>
      <td>Native</td>
      <td>Base64 encoding</td>
      <td>Native (binary HTTP response body)</td>
    </tr>
    <tr>
      <td><strong>Complexity</strong></td>
      <td>High</td>
      <td>Low</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td><strong>Max Connections</strong></td>
      <td>Bounded by file descriptors and memory</td>
      <td>Bounded by file descriptors and memory</td>
      <td>Limited by request rate</td>
    </tr>
  </tbody>
</table>
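
<p>The Base64 entry for SSE follows from the wire format: an event stream is UTF-8 text, so a binary payload has to be text-encoded before it is sent. A minimal sketch in Python, assuming a hypothetical helper named <code>format_sse_event</code>; only the <code>event:</code> and <code>data:</code> field names and the blank-line terminator come from the Server-Sent Events spec:</p>

```python
import base64


def format_sse_event(payload: bytes, event: str = "message") -> str:
    """Frame a binary payload as one Server-Sent Events message.

    format_sse_event is a hypothetical helper name used for illustration.
    """
    # SSE streams are UTF-8 text, so binary data is Base64-encoded first.
    encoded = base64.b64encode(payload).decode("ascii")
    # Each field goes on its own line; a blank line terminates the event.
    return f"event: {event}\ndata: {encoded}\n\n"


frame = format_sse_event(b"\x00\x01\xff", event="blob")
print(repr(frame))
```

<p>The client has to reverse the encoding after receiving the event, which is the extra cost compared with WebSocket's native binary frames.</p>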

<h2 id="scaling-patterns">Scaling Patterns</h2>

<h3 id="pubsub-for-websocketsse">Pub/Sub for WebSocket/SSE</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Using NATS for pub/sub (nats.js v2 API)</span>
<span class="kd">const</span> <span class="p">{</span> <span class="nx">connect</span><span class="p">,</span> <span class="nx">JSONCodec</span> <span class="p">}</span> <span class="o">=</span> <span class="nf">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">nats</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">jc</span> <span class="o">=</span> <span class="nf">JSONCodec</span><span class="p">();</span>

<span class="k">async</span> <span class="kd">function</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">nc</span> <span class="o">=</span> <span class="k">await</span> <span class="nf">connect</span><span class="p">({</span> <span class="na">servers</span><span class="p">:</span> <span class="dl">'</span><span class="s1">nats://localhost:4222</span><span class="dl">'</span> <span class="p">});</span>

    <span class="c1">// Subscribe to messages; the loop runs in the background so it does not block publishing</span>
    <span class="kd">const</span> <span class="nx">sub</span> <span class="o">=</span> <span class="nx">nc</span><span class="p">.</span><span class="nf">subscribe</span><span class="p">(</span><span class="dl">'</span><span class="s1">messages</span><span class="dl">'</span><span class="p">);</span>
    <span class="p">(</span><span class="k">async </span><span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="k">for</span> <span class="k">await </span><span class="p">(</span><span class="kd">const</span> <span class="nx">msg</span> <span class="k">of</span> <span class="nx">sub</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// msg.data is a Uint8Array, so decode it before use</span>
            <span class="kd">const</span> <span class="nx">data</span> <span class="o">=</span> <span class="nx">jc</span><span class="p">.</span><span class="nf">decode</span><span class="p">(</span><span class="nx">msg</span><span class="p">.</span><span class="nx">data</span><span class="p">);</span>

            <span class="c1">// Broadcast to WebSocket clients</span>
            <span class="nf">broadcastToClients</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">})();</span>

    <span class="c1">// Publish message</span>
    <span class="nx">nc</span><span class="p">.</span><span class="nf">publish</span><span class="p">(</span><span class="dl">'</span><span class="s1">messages</span><span class="dl">'</span><span class="p">,</span> <span class="nx">jc</span><span class="p">.</span><span class="nf">encode</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">update</span><span class="dl">'</span><span class="p">,</span> <span class="na">data</span><span class="p">:</span> <span class="p">{...}</span> <span class="p">}));</span>
<span class="p">}</span>

<span class="nf">main</span><span class="p">().</span><span class="k">catch</span><span class="p">(</span><span class="nx">console</span><span class="p">.</span><span class="nx">error</span><span class="p">);</span>
</code></pre></div></div>

<h3 id="graceful-shutdown">Graceful Shutdown</h3>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">shuttingDown</span> <span class="o">=</span> <span class="kc">false</span><span class="p">;</span>

<span class="nx">process</span><span class="p">.</span><span class="nf">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">SIGTERM</span><span class="dl">'</span><span class="p">,</span> <span class="k">async </span><span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">shuttingDown</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
    <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Shutting down gracefully...</span><span class="dl">'</span><span class="p">);</span>
    
    <span class="c1">// Force exit if connections have not drained within 30 seconds</span>
    <span class="kd">const</span> <span class="nx">timeout</span> <span class="o">=</span> <span class="nf">setTimeout</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Forcing shutdown</span><span class="dl">'</span><span class="p">);</span>
        <span class="nx">process</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="p">},</span> <span class="mi">30000</span><span class="p">);</span>
    
    <span class="c1">// Stop accepting new connections; exit once the server has closed</span>
    <span class="nx">wss</span><span class="p">.</span><span class="nf">close</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Server closed</span><span class="dl">'</span><span class="p">);</span>
        <span class="nf">clearTimeout</span><span class="p">(</span><span class="nx">timeout</span><span class="p">);</span>
        <span class="nx">process</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
    <span class="p">});</span>
    
    <span class="c1">// Ask every client to close gracefully</span>
    <span class="k">for </span><span class="p">(</span><span class="kd">const</span> <span class="nx">client</span> <span class="k">of</span> <span class="nx">clients</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">client</span><span class="p">.</span><span class="nf">close</span><span class="p">(</span><span class="mi">1001</span><span class="p">,</span> <span class="dl">'</span><span class="s1">Server shutting down</span><span class="dl">'</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">});</span>
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p><strong>Choose WebSocket</strong> for interactive, bidirectional communication (chat, games, collaboration).</p>

<p><strong>Choose SSE</strong> for server-to-client streams (dashboards, feeds, notifications). It’s simpler than WebSocket and works over HTTP.</p>

<p><strong>Choose Long Polling</strong> only as a fallback for old browsers or restrictive networks. The overhead is significant.</p>

<p>In practice, implement WebSocket with SSE as fallback. Long Polling is rarely worth the complexity in 2025.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc6455">WebSocket RFC 6455</a> - Protocol specification</li>
  <li><a href="https://html.spec.whatwg.org/multipage/server-sent-events.html">Server-Sent Events Spec</a> - HTML standard</li>
  <li><a href="https://socket.io/">Socket.IO</a> - WebSocket library with fallbacks</li>
  <li><a href="https://github.com/websockets/ws">ws npm package</a> - Fast WebSocket implementation</li>
  <li><a href="https://www.npmjs.com/package/sse">SSE npm package</a> - SSE server utilities</li>
  <li><a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers">Writing WebSocket servers</a> - MDN guide</li>
</ul>

<hr />

<p><em>WebSocket vs SSE vs Long Polling from October 2025, updated with production patterns.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Architecture" /><category term="WebSocket" /><category term="SSE" /><category term="Long Polling" /><category term="Real-time" /><summary type="html"><![CDATA[Compare real-time protocols: WebSocket, Server-Sent Events, and Long Polling. Learn when to use each, trade-offs, and implementation patterns.]]></summary></entry><entry><title type="html">Databricks for Data Engineers: Getting Started</title><link href="https://www.fratepietro.com/2025/databricks-data-engineers/" rel="alternate" type="text/html" title="Databricks for Data Engineers: Getting Started" /><published>2025-09-18T00:00:00+02:00</published><updated>2025-09-18T00:00:00+02:00</updated><id>https://www.fratepietro.com/2025/databricks-data-engineers</id><content type="html" xml:base="https://www.fratepietro.com/2025/databricks-data-engineers/"><![CDATA[<p><a href="https://www.databricks.com/">Databricks</a> is a unified analytics platform built on <a href="https://spark.apache.org/">Apache Spark</a>. Founded by Spark’s creators (Matei Zaharia, Ali Ghodsi, and others from UC Berkeley), it’s become the standard for big data processing and machine learning at scale.</p>

<p>I moved to Databricks after struggling with self-managed Spark clusters. Maintaining Spark—tuning configs, managing resources, debugging failed jobs—consumed more time than actual data engineering. Databricks handles the infrastructure, letting you focus on transforming data.</p>

<p>The platform combines notebooks for exploration, production-grade job scheduling, Delta Lake for reliable storage, and MLflow for ML workflows. It’s opinionated, but that opinion is informed by years of Spark expertise.</p>

<h2 id="core-components">Core Components</h2>

<p><strong>Databricks Workspace</strong> - Web-based environment for notebooks, jobs, and clusters.</p>

<p><strong>Apache Spark</strong> - Distributed processing engine (3.5+ as of 2025). See <a href="https://spark.apache.org/docs/latest/">Spark documentation</a>.</p>

<p><strong>Delta Lake</strong> - ACID transactions on data lakes. Open source project: <a href="https://github.com/delta-io/delta">delta-io/delta</a>.</p>

<p><strong>MLflow</strong> - ML lifecycle management. Track experiments, package models, deploy. <a href="https://mlflow.org/docs/latest/index.html">MLflow docs</a>.</p>

<p><strong>Unity Catalog</strong> - Centralized governance, lineage, and access control.</p>

<p>Read <a href="https://docs.databricks.com/en/getting-started/overview.html">Databricks architecture</a> for details.</p>

<h2 id="notebooks-interactive-development">Notebooks: Interactive Development</h2>

<p>Databricks notebooks support Python, SQL, Scala, and R:</p>

<h3 id="python-notebook">Python Notebook</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Read data from Delta Lake
</span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">).</span><span class="nf">load</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/events</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Show schema
</span><span class="n">df</span><span class="p">.</span><span class="nf">printSchema</span><span class="p">()</span>

<span class="c1"># Quick stats
</span><span class="n">df</span><span class="p">.</span><span class="nf">describe</span><span class="p">().</span><span class="nf">show</span><span class="p">()</span>

<span class="c1"># Transform data
</span><span class="kn">from</span> <span class="n">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span><span class="p">,</span> <span class="n">countDistinct</span><span class="p">,</span> <span class="n">window</span>

<span class="n">daily_active_users</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span>
    <span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">event_type</span><span class="sh">"</span><span class="p">)</span> <span class="o">==</span> <span class="sh">"</span><span class="s">login</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">groupBy</span><span class="p">(</span><span class="nf">window</span><span class="p">(</span><span class="sh">"</span><span class="s">timestamp</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">1 day</span><span class="sh">"</span><span class="p">))</span>
    <span class="c1"># Count each user once per day, not once per login
</span>    <span class="p">.</span><span class="nf">agg</span><span class="p">(</span><span class="nf">countDistinct</span><span class="p">(</span><span class="sh">"</span><span class="s">user_id</span><span class="sh">"</span><span class="p">).</span><span class="nf">alias</span><span class="p">(</span><span class="sh">"</span><span class="s">daily_active_users</span><span class="sh">"</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">orderBy</span><span class="p">(</span><span class="sh">"</span><span class="s">window</span><span class="sh">"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Display in notebook
</span><span class="nf">display</span><span class="p">(</span><span class="n">daily_active_users</span><span class="p">)</span>

<span class="c1"># Write results
</span><span class="n">daily_active_users</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">).</span><span class="nf">mode</span><span class="p">(</span><span class="sh">"</span><span class="s">overwrite</span><span class="sh">"</span><span class="p">).</span><span class="nf">save</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/dau</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
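
<p>A daily-active-users metric is normally a count of distinct users per day rather than a count of login rows. The difference is easy to see on a handful of rows; this is a plain-Python sketch with made-up sample events whose field names mirror the columns used in the notebook:</p>

```python
from collections import defaultdict

# Made-up sample events; field names mirror the columns in the notebook above.
events = [
    {"user_id": "u1", "event_type": "login", "date": "2025-09-01"},
    {"user_id": "u1", "event_type": "login", "date": "2025-09-01"},
    {"user_id": "u2", "event_type": "login", "date": "2025-09-01"},
    {"user_id": "u1", "event_type": "click", "date": "2025-09-01"},
    {"user_id": "u2", "event_type": "login", "date": "2025-09-02"},
]

logins_per_day = defaultdict(int)  # counts every login row
users_per_day = defaultdict(set)   # keeps each user once per day

for e in events:
    if e["event_type"] != "login":
        continue
    logins_per_day[e["date"]] += 1
    users_per_day[e["date"]].add(e["user_id"])

daily_active_users = {day: len(users) for day, users in users_per_day.items()}
print(dict(logins_per_day))
print(daily_active_users)
```

<p>On 2025-09-01 there are three login rows but only two active users, so a plain row count would overstate the metric.</p>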

<h3 id="sql-notebook">SQL Notebook</h3>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Create or replace table using Delta Lake</span>
<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">analytics</span><span class="p">.</span><span class="n">user_activity</span>
<span class="k">USING</span> <span class="n">DELTA</span>
<span class="k">LOCATION</span> <span class="s1">'/mnt/data/user_activity'</span>
<span class="k">AS</span>
<span class="k">SELECT</span> 
    <span class="n">user_id</span><span class="p">,</span>
    <span class="nb">DATE</span><span class="p">(</span><span class="nb">timestamp</span><span class="p">)</span> <span class="k">as</span> <span class="n">activity_date</span><span class="p">,</span>
    <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">as</span> <span class="n">event_count</span><span class="p">,</span>
    <span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span> <span class="n">session_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">session_count</span>
<span class="k">FROM</span> <span class="n">events</span>
<span class="k">WHERE</span> <span class="nb">timestamp</span> <span class="o">&gt;=</span> <span class="k">current_date</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="mi">30</span> <span class="n">DAYS</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="nb">timestamp</span><span class="p">);</span>

<span class="c1">-- Query with visualization</span>
<span class="k">SELECT</span> 
    <span class="n">activity_date</span><span class="p">,</span>
    <span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span> <span class="n">user_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">active_users</span><span class="p">,</span>
    <span class="k">SUM</span><span class="p">(</span><span class="n">event_count</span><span class="p">)</span> <span class="k">as</span> <span class="n">total_events</span>
<span class="k">FROM</span> <span class="n">analytics</span><span class="p">.</span><span class="n">user_activity</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">activity_date</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">activity_date</span><span class="p">;</span>
</code></pre></div></div>

<p>Notebooks support inline visualizations—click “Visualization” to create charts.</p>

<h3 id="widgets-for-parameterization">Widgets for Parameterization</h3>

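<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span>

<span class="c1"># Create text and dropdown widgets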
</span><span class="n">dbutils</span><span class="p">.</span><span class="n">widgets</span><span class="p">.</span><span class="nf">text</span><span class="p">(</span><span class="sh">"</span><span class="s">start_date</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">2025-01-01</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">Start Date</span><span class="sh">"</span><span class="p">)</span>
<span class="n">dbutils</span><span class="p">.</span><span class="n">widgets</span><span class="p">.</span><span class="nf">dropdown</span><span class="p">(</span><span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">US</span><span class="sh">"</span><span class="p">,</span> <span class="p">[</span><span class="sh">"</span><span class="s">US</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">EU</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">APAC</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">Region</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Read widget values
</span><span class="n">start_date</span> <span class="o">=</span> <span class="n">dbutils</span><span class="p">.</span><span class="n">widgets</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">start_date</span><span class="sh">"</span><span class="p">)</span>
<span class="n">region</span> <span class="o">=</span> <span class="n">dbutils</span><span class="p">.</span><span class="n">widgets</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Use in queries
</span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">).</span><span class="nf">load</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/events</span><span class="sh">"</span><span class="p">)</span>
<span class="n">filtered</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nf">filter</span><span class="p">(</span>
    <span class="p">(</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">date</span><span class="sh">"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">start_date</span><span class="p">)</span> <span class="o">&amp;</span> 
    <span class="p">(</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">)</span> <span class="o">==</span> <span class="n">region</span><span class="p">)</span>
<span class="p">)</span>

<span class="nf">display</span><span class="p">(</span><span class="n">filtered</span><span class="p">)</span>
</code></pre></div></div>

<p>Widgets make notebooks reusable for different parameters.</p>
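
<p>Widget values always come back as strings, so it can help to validate them before they reach a query. A minimal sketch, assuming a hypothetical helper named <code>parse_params</code> (it is not part of <code>dbutils</code>) and the same region choices as the dropdown above:</p>

```python
from datetime import date

ALLOWED_REGIONS = ("US", "EU", "APAC")  # same choices as the dropdown above


def parse_params(start_date_raw: str, region_raw: str) -> tuple[date, str]:
    """Validate raw widget strings before they reach a query.

    parse_params is a hypothetical helper, not part of dbutils.
    """
    # date.fromisoformat raises ValueError if the string is not a valid ISO date.
    start = date.fromisoformat(start_date_raw.strip())
    region = region_raw.strip().upper()
    if region not in ALLOWED_REGIONS:
        raise ValueError(f"unknown region: {region_raw!r}")
    return start, region


# In a notebook the raw values would come from dbutils.widgets.get(...).
start_date, region = parse_params("2025-01-01", "eu")
print(start_date, region)
```

<p>Failing early with a clear error is usually cheaper than letting a malformed date silently produce an empty result.</p>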

<h2 id="data-pipelines">Data Pipelines</h2>

<h3 id="etl-pipeline">ETL Pipeline</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>

<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="nf">appName</span><span class="p">(</span><span class="sh">"</span><span class="s">ETL</span><span class="sh">"</span><span class="p">).</span><span class="nf">getOrCreate</span><span class="p">()</span>

<span class="c1"># Extract
</span><span class="n">raw_data</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nf">csv</span><span class="p">(</span><span class="sh">"</span><span class="s">s3://bucket/data.csv</span><span class="sh">"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># Transform
</span><span class="n">transformed</span> <span class="o">=</span> <span class="n">raw_data</span><span class="p">.</span><span class="nf">select</span><span class="p">(</span><span class="sh">"</span><span class="s">id</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">name</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">age</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Load
</span><span class="n">transformed</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">).</span><span class="nf">save</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/processed</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="best-practices">Best Practices</h2>

<ol>
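  <li><strong>Use notebooks</strong> - Develop interactively and parameterize runs with widgets</li>
  <li><strong>Leverage Spark</strong> - Express transformations as DataFrame or SQL operations so they run distributed</li>
  <li><strong>Use Delta Lake</strong> - ACID transactions, schema enforcement, and time travel</li>
  <li><strong>Monitor</strong> - Track job performance and failures</li>
  <li><strong>Optimize</strong> - Compact small files with OPTIMIZE and tune slow queries</li>
  <li><strong>Test</strong> - Verify pipelines on sample data before scheduling them</li>
  <li><strong>Document</strong> - Describe each pipeline's inputs, outputs, and schedule</li>
  <li><strong>Scale</strong> - Size clusters for the data volume you actually process</li>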
</ol>

<h2 id="delta-lake-acid-on-data-lakes">Delta Lake: ACID on Data Lakes</h2>

<p><a href="https://delta.io/">Delta Lake</a> brings database reliability to data lakes—ACID transactions, schema enforcement, time travel.</p>

<h3 id="writing-delta-tables">Writing Delta Tables</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="kn">from</span> <span class="n">delta.tables</span> <span class="kn">import</span> <span class="n">DeltaTable</span>

<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="nf">appName</span><span class="p">(</span><span class="sh">"</span><span class="s">ETL</span><span class="sh">"</span><span class="p">).</span><span class="nf">getOrCreate</span><span class="p">()</span>

<span class="c1"># Write with Delta Lake
</span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nf">csv</span><span class="p">(</span><span class="sh">"</span><span class="s">s3://bucket/raw-data.csv</span><span class="sh">"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="p">(</span><span class="n">df</span>
    <span class="p">.</span><span class="n">write</span>
    <span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">mode</span><span class="p">(</span><span class="sh">"</span><span class="s">overwrite</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">option</span><span class="p">(</span><span class="sh">"</span><span class="s">overwriteSchema</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">true</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">save</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/processed/users</span><span class="sh">"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Append new data
</span><span class="n">new_data</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nf">csv</span><span class="p">(</span><span class="sh">"</span><span class="s">s3://bucket/new-data.csv</span><span class="sh">"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">new_data</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">).</span><span class="nf">mode</span><span class="p">(</span><span class="sh">"</span><span class="s">append</span><span class="sh">"</span><span class="p">).</span><span class="nf">save</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/processed/users</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="time-travel">Time Travel</h3>

<p>Query historical versions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Read current version
</span><span class="n">df_current</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">).</span><span class="nf">load</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/processed/users</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Read the table as it was at a given timestamp
</span><span class="n">df_old</span> <span class="o">=</span> <span class="p">(</span><span class="n">spark</span><span class="p">.</span><span class="n">read</span>
    <span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">option</span><span class="p">(</span><span class="sh">"</span><span class="s">timestampAsOf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">2025-09-01</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/processed/users</span><span class="sh">"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Compare: rows in the current version that are new or changed since the old one
</span><span class="n">changed_users</span> <span class="o">=</span> <span class="n">df_current</span><span class="p">.</span><span class="nf">subtract</span><span class="p">(</span><span class="n">df_old</span><span class="p">)</span>
<span class="nf">display</span><span class="p">(</span><span class="n">changed_users</span><span class="p">)</span>

<span class="c1"># Read specific version number
</span><span class="n">df_v5</span> <span class="o">=</span> <span class="p">(</span><span class="n">spark</span><span class="p">.</span><span class="n">read</span>
    <span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">option</span><span class="p">(</span><span class="sh">"</span><span class="s">versionAsOf</span><span class="sh">"</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">"</span><span class="s">/mnt/data/processed/users</span><span class="sh">"</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
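
<p>Time travel works because a Delta table is a set of data files plus an ordered transaction log, and reading version N amounts to replaying the log up to commit N. The toy model below illustrates the idea in plain Python. The file names and the <code>files_at_version</code> helper are hypothetical, and the real log under <code>_delta_log/</code> stores richer JSON actions than this sketch:</p>

```python
# Toy transaction log: each commit lists data files added and removed.
# File names are made up; the real log holds richer JSON actions.
log = [
    {"add": ["part-000.parquet"], "remove": []},                    # version 0
    {"add": ["part-001.parquet"], "remove": []},                    # version 1: append
    {"add": ["part-002.parquet"], "remove": ["part-000.parquet"]},  # version 2: rewrite
]


def files_at_version(commits: list, version: int) -> set:
    """Replay commits 0..version and return the files visible at that version."""
    if version not in range(len(commits)):
        raise ValueError(f"version {version} does not exist")
    live = set()
    for commit in commits[: version + 1]:
        live |= set(commit["add"])
        live -= set(commit["remove"])
    return live


print(sorted(files_at_version(log, 1)))
print(sorted(files_at_version(log, 2)))
```

<p>Version 1 still sees <code>part-000.parquet</code> even though version 2 removed it, which is why old data files have to stay on disk for time travel to work.</p>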

<h3 id="merge-upserts">MERGE (Upserts)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">delta.tables</span> <span class="kn">import</span> <span class="n">DeltaTable</span>

<span class="c1"># Load Delta table
</span><span class="n">delta_table</span> <span class="o">=</span> <span class="n">DeltaTable</span><span class="p">.</span><span class="nf">forPath</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="sh">"</span><span class="s">/mnt/data/processed/users</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Merge updates
</span><span class="n">delta_table</span><span class="p">.</span><span class="nf">alias</span><span class="p">(</span><span class="sh">"</span><span class="s">target</span><span class="sh">"</span><span class="p">).</span><span class="nf">merge</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="n">new_data</span><span class="p">.</span><span class="nf">alias</span><span class="p">(</span><span class="sh">"</span><span class="s">source</span><span class="sh">"</span><span class="p">),</span>
    <span class="n">condition</span><span class="o">=</span><span class="sh">"</span><span class="s">target.user_id = source.user_id</span><span class="sh">"</span>
<span class="p">).</span><span class="nf">whenMatchedUpdateAll</span><span class="p">().</span><span class="nf">whenNotMatchedInsertAll</span><span class="p">().</span><span class="nf">execute</span><span class="p">()</span>
</code></pre></div></div>

<p>MERGE applies inserts and updates in a single atomic commit, which makes it the usual way to apply CDC (Change Data Capture) batches to a Delta table.</p>
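
<p>The matched and not-matched branches of the merge above can be sketched on plain dictionaries. This is an illustration of the semantics only, with a hypothetical <code>merge_upsert</code> helper and made-up rows, not how Delta executes the operation:</p>

```python
def merge_upsert(target: dict, source_rows: list, key: str = "user_id") -> dict:
    """Return a new table with source rows upserted by key.

    Mirrors whenMatchedUpdateAll + whenNotMatchedInsertAll on plain dicts.
    """
    merged = dict(target)  # leave the input table untouched
    for row in source_rows:
        # Matched key: replace the whole row. Unmatched key: insert it.
        merged[row[key]] = row
    return merged


target = {
    1: {"user_id": 1, "name": "Ada"},
    2: {"user_id": 2, "name": "Bob"},
}
updates = [
    {"user_id": 2, "name": "Robert"},  # matches an existing row
    {"user_id": 3, "name": "Cy"},      # new row
]

result = merge_upsert(target, updates)
print(result)
```

<p>One difference worth noting: this sketch lets the last duplicate win, whereas Delta's MERGE fails when several source rows match the same target row, so the source should be deduplicated first.</p>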

<h3 id="optimize-and-vacuum">OPTIMIZE and VACUUM</h3>

<p>Maintain table performance:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Optimize: compact small files into larger ones
</span><span class="n">spark</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span><span class="sh">"</span><span class="s">OPTIMIZE delta.`/mnt/data/processed/users`</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Z-ordering: colocate related data
</span><span class="n">spark</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span><span class="sh">"</span><span class="s">OPTIMIZE delta.`/mnt/data/processed/users` ZORDER BY (user_id, created_date)</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Vacuum: delete data files the table no longer references once they are older than the retention period (default 7 days)
</span><span class="n">spark</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span><span class="sh">"</span><span class="s">VACUUM delta.`/mnt/data/processed/users` RETAIN 168 HOURS</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>As a starting point, run OPTIMIZE weekly and VACUUM monthly, then tune the cadence to your write volume. See <a href="https://docs.delta.io/latest/optimizations-oss.html">Delta Lake performance tuning</a>.</p>
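
<p>The retention window is easier to reason about with a small model. The sketch below is plain Python, not Delta code: <code class="language-plaintext highlighter-rouge">vacuum_candidates</code>, the file names, and the timestamps are all hypothetical. It assumes each file records when it stopped being referenced by the table (or <code class="language-plaintext highlighter-rouge">None</code> if it is still live) and shows which files a 168-hour retention would make eligible for deletion.</p>

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

RETAIN_HOURS = 168  # 7 days, matching the RETAIN clause above


def vacuum_candidates(
    removed_at: Dict[str, Optional[datetime]],
    now: datetime,
    retain_hours: int = RETAIN_HOURS,
) -> List[str]:
    """Return files that are unreferenced and older than the retention window."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        name
        for name, ts in removed_at.items()
        if ts is not None and ts < cutoff  # live files (None) are never candidates
    )


now = datetime(2025, 9, 30, 12, 0)
files = {
    "part-000.parquet": None,                      # still referenced by the table
    "part-001.parquet": now - timedelta(days=10),  # compacted away 10 days ago
    "part-002.parquet": now - timedelta(days=2),   # compacted away 2 days ago
}
print(vacuum_candidates(files, now))  # ['part-001.parquet']
```

<p>Once a file is vacuumed, time travel to any version that needed it stops working, which is the trade-off behind the retention setting.</p>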

<h2 id="production-etl-pipeline">Production ETL Pipeline</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="kn">from</span> <span class="n">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">col</span><span class="p">,</span> <span class="n">current_timestamp</span><span class="p">,</span> <span class="n">sha2</span><span class="p">,</span> <span class="n">concat_ws</span>
<span class="kn">from</span> <span class="n">delta.tables</span> <span class="kn">import</span> <span class="n">DeltaTable</span>

<span class="k">class</span> <span class="nc">ProductionETL</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Production-grade ETL pipeline.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="nf">appName</span><span class="p">(</span><span class="sh">"</span><span class="s">ETL</span><span class="sh">"</span><span class="p">).</span><span class="nf">getOrCreate</span><span class="p">()</span>
    
    <span class="k">def</span> <span class="nf">extract</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">source_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Extract from source.</span><span class="sh">"""</span>
        <span class="nf">return </span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">spark</span><span class="p">.</span><span class="n">read</span>
            <span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">parquet</span><span class="sh">"</span><span class="p">)</span>
            <span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">source_path</span><span class="p">)</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">df</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Transform data with validation.</span><span class="sh">"""</span>
        <span class="c1"># Add processing metadata
</span>        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nf">withColumn</span><span class="p">(</span><span class="sh">"</span><span class="s">processed_at</span><span class="sh">"</span><span class="p">,</span> <span class="nf">current_timestamp</span><span class="p">())</span>
        
        <span class="c1"># Data quality checks
</span>        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">user_id</span><span class="sh">"</span><span class="p">).</span><span class="nf">isNotNull</span><span class="p">())</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nf">filter</span><span class="p">(</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">amount</span><span class="sh">"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
        
        <span class="c1"># Hash sensitive fields
</span>        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nf">withColumn</span><span class="p">(</span>
            <span class="sh">"</span><span class="s">email_hash</span><span class="sh">"</span><span class="p">,</span>
            <span class="nf">sha2</span><span class="p">(</span><span class="nf">col</span><span class="p">(</span><span class="sh">"</span><span class="s">email</span><span class="sh">"</span><span class="p">),</span> <span class="mi">256</span><span class="p">)</span>
        <span class="p">)</span>
        
        <span class="c1"># Deduplicate on the merge key so each target row matches at most one source row
</span>        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="nf">dropDuplicates</span><span class="p">([</span><span class="sh">"</span><span class="s">transaction_id</span><span class="sh">"</span><span class="p">])</span>
        
        <span class="k">return</span> <span class="n">df</span>
    
    <span class="k">def</span> <span class="nf">load</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">target_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Load to Delta Lake with merge.</span><span class="sh">"""</span>
        <span class="c1"># Check if table exists
</span>        <span class="k">if</span> <span class="n">DeltaTable</span><span class="p">.</span><span class="nf">isDeltaTable</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">spark</span><span class="p">,</span> <span class="n">target_path</span><span class="p">):</span>
            <span class="c1"># Merge into existing table
</span>            <span class="n">delta_table</span> <span class="o">=</span> <span class="n">DeltaTable</span><span class="p">.</span><span class="nf">forPath</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">spark</span><span class="p">,</span> <span class="n">target_path</span><span class="p">)</span>
            
            <span class="n">delta_table</span><span class="p">.</span><span class="nf">alias</span><span class="p">(</span><span class="sh">"</span><span class="s">target</span><span class="sh">"</span><span class="p">).</span><span class="nf">merge</span><span class="p">(</span>
                <span class="n">source</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="nf">alias</span><span class="p">(</span><span class="sh">"</span><span class="s">source</span><span class="sh">"</span><span class="p">),</span>
                <span class="n">condition</span><span class="o">=</span><span class="sh">"</span><span class="s">target.transaction_id = source.transaction_id</span><span class="sh">"</span>
            <span class="p">).</span><span class="nf">whenMatchedUpdateAll</span><span class="p">().</span><span class="nf">whenNotMatchedInsertAll</span><span class="p">().</span><span class="nf">execute</span><span class="p">()</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># Create new table
</span>            <span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">write</span>
                <span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="sh">"</span><span class="s">delta</span><span class="sh">"</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">mode</span><span class="p">(</span><span class="sh">"</span><span class="s">overwrite</span><span class="sh">"</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">save</span><span class="p">(</span><span class="n">target_path</span><span class="p">)</span>
            <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">source_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">target_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Run complete ETL.</span><span class="sh">"""</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="c1"># Extract
</span>            <span class="n">raw_df</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">extract</span><span class="p">(</span><span class="n">source_path</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Extracted </span><span class="si">{</span><span class="n">raw_df</span><span class="p">.</span><span class="nf">count</span><span class="p">()</span><span class="si">}</span><span class="s"> rows</span><span class="sh">"</span><span class="p">)</span>
            
            <span class="c1"># Transform
</span>            <span class="n">clean_df</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">transform</span><span class="p">(</span><span class="n">raw_df</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Cleaned to </span><span class="si">{</span><span class="n">clean_df</span><span class="p">.</span><span class="nf">count</span><span class="p">()</span><span class="si">}</span><span class="s"> rows</span><span class="sh">"</span><span class="p">)</span>
            
            <span class="c1"># Load
</span>            <span class="n">self</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">clean_df</span><span class="p">,</span> <span class="n">target_path</span><span class="p">)</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Loaded to </span><span class="si">{</span><span class="n">target_path</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            
            <span class="c1"># Optimize after load
</span>            <span class="n">self</span><span class="p">.</span><span class="n">spark</span><span class="p">.</span><span class="nf">sql</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">OPTIMIZE delta.`</span><span class="si">{</span><span class="n">target_path</span><span class="si">}</span><span class="s">`</span><span class="sh">"</span><span class="p">)</span>
            
        <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">ETL failed: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">raise</span>

<span class="c1"># Usage
</span><span class="n">etl</span> <span class="o">=</span> <span class="nc">ProductionETL</span><span class="p">()</span>
<span class="n">etl</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span>
    <span class="n">source_path</span><span class="o">=</span><span class="sh">"</span><span class="s">s3://bucket/raw/transactions/</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">target_path</span><span class="o">=</span><span class="sh">"</span><span class="s">/mnt/data/processed/transactions</span><span class="sh">"</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="best-practices-from-production">Best Practices from Production</h2>

<ol>
  <li><strong>Use Delta Lake always</strong> - ACID guarantees are worth it</li>
  <li><strong>Partition large tables</strong> - By date or other low-cardinality keys (high-cardinality keys create many small files)</li>
  <li><strong>Z-order frequently queried columns</strong> - Colocates data</li>
  <li><strong>Set retention policies</strong> - Balance time travel vs storage costs</li>
  <li><strong>Monitor cluster metrics</strong> - CPU, memory, I/O utilization</li>
  <li><strong>Right-size clusters</strong> - Match to workload (don’t over-provision)</li>
  <li><strong>Use auto-termination</strong> - Clusters shut down after idle time</li>
  <li><strong>Enable Photon</strong> - Vectorized execution engine (up to 2-5x faster, depending on the workload)</li>
  <li><strong>Cache frequently accessed data</strong> - <code class="language-plaintext highlighter-rouge">df.cache()</code> for reuse</li>
  <li><strong>Test with small samples</strong> - <code class="language-plaintext highlighter-rouge">.limit(1000)</code> for development</li>
</ol>

<h2 id="conclusion">Conclusion</h2>

<p>Databricks simplifies data engineering by handling Spark cluster management, providing excellent notebook UX, and offering Delta Lake for reliable storage. The managed platform lets you focus on transforming data rather than managing infrastructure.</p>

<p>The combination of Spark’s distributed processing, Delta Lake’s ACID guarantees, and notebook-based development creates a productive environment for data teams. ETL pipelines that took days to build on self-managed Spark can be prototyped in hours on Databricks.</p>

<p>Cost management is crucial—clusters can get expensive. Use autoscaling, right-size instances, and terminate idle clusters. The productivity gains typically justify the costs for teams processing terabytes of data.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://docs.databricks.com/">Databricks Documentation</a> - Comprehensive guides</li>
  <li><a href="https://spark.apache.org/docs/latest/">Apache Spark Docs</a> - Core engine</li>
  <li><a href="https://delta.io/">Delta Lake</a> - Open format and engine</li>
  <li><a href="https://mlflow.org/">MLflow</a> - ML lifecycle management</li>
  <li><a href="https://www.databricks.com/learn/training">Databricks Academy</a> - Free courses</li>
  <li><a href="https://github.com/delta-io/delta">Delta Lake GitHub</a> - Open source repo</li>
  <li><a href="https://docs.databricks.com/en/data-governance/unity-catalog/index.html">Unity Catalog</a> - Data governance</li>
</ul>

<hr />

<p><em>Databricks for data engineers from September 2025 — updated with production guidance.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="How-To" /><category term="Databricks" /><category term="Data Engineering" /><category term="Spark" /><category term="Analytics" /><summary type="html"><![CDATA[Get started with Databricks: notebooks, Spark, data pipelines, Delta Lake, and how Databricks enables data engineering at scale.]]></summary></entry><entry><title type="html">Container Orchestration at the Edge: New Paradigms</title><link href="https://www.fratepietro.com/2025/container-orchestration-edge/" rel="alternate" type="text/html" title="Container Orchestration at the Edge: New Paradigms" /><published>2025-08-23T00:00:00+02:00</published><updated>2025-08-23T00:00:00+02:00</updated><id>https://www.fratepietro.com/2025/container-orchestration-edge</id><content type="html" xml:base="https://www.fratepietro.com/2025/container-orchestration-edge/"><![CDATA[<p>Edge computing promises low latency by running workloads close to users. But orchestrating containers at thousands of edge locations isn’t the same as managing a data center cluster. Resource constraints, intermittent connectivity, and distributed management demand new approaches.</p>

<p>I deployed a CDN edge service on standard Kubernetes, and the control plane used 2GB of RAM before running any workload. At 500 edge locations, that’s 1TB just for orchestration. We switched to <a href="https://k3s.io/">K3s</a>, Rancher’s lightweight Kubernetes distribution: 512MB for the control plane and agents. Same APIs, 75% less overhead.</p>
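
<p>The overhead figures above work out as follows. This is just the arithmetic from the paragraph written as Python; the constant and function names are my own, and the per-location numbers are the ones quoted in the text.</p>

```python
LOCATIONS = 500
K8S_CONTROL_PLANE_MB = 2048  # ~2GB per location, as quoted above
K3S_FOOTPRINT_MB = 512       # control plane + agents, as quoted above


def fleet_overhead_gb(per_location_mb: int, locations: int = LOCATIONS) -> float:
    """Total orchestration memory across the fleet, in GB."""
    return per_location_mb * locations / 1024


k8s_total_gb = fleet_overhead_gb(K8S_CONTROL_PLANE_MB)
k3s_total_gb = fleet_overhead_gb(K3S_FOOTPRINT_MB)
savings = 1 - K3S_FOOTPRINT_MB / K8S_CONTROL_PLANE_MB

print(f"Kubernetes: {k8s_total_gb:.0f} GB across {LOCATIONS} locations")  # 1000 GB
print(f"K3s:        {k3s_total_gb:.0f} GB across {LOCATIONS} locations")  # 250 GB
print(f"Reduction:  {savings:.0%}")                                       # 75%
```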

<p>Edge orchestration challenges three Kubernetes assumptions: abundant resources, reliable networking, and centralized control. Solutions require rethinking each.</p>

<h2 id="the-edge-is-different">The Edge is Different</h2>

<p><strong>Resource constraints:</strong></p>
<ul>
  <li>Edge nodes: 2-4 CPU cores, 4-8GB RAM</li>
  <li>Data center nodes: 32-96 cores, 128-512GB RAM</li>
  <li>Difference: at least 8x fewer cores and 16x less RAM per node</li>
</ul>

<p><strong>Network reality:</strong></p>
<ul>
  <li>Data center: 10Gbps+ local, &lt;1ms latency</li>
  <li>Edge: 10-100Mbps WAN, 50-200ms latency, periodic disconnects</li>
</ul>

<p><strong>Management scale:</strong></p>
<ul>
  <li>Data center: 10-1000 nodes, centralized</li>
  <li>Edge: 100-10,000 nodes, geographically distributed</li>
</ul>

<p>Traditional Kubernetes doesn’t fit. New solutions emerged: K3s, MicroK8s, KubeEdge.</p>

<h2 id="lightweight-kubernetes-k3s">Lightweight Kubernetes: K3s</h2>

<p><a href="https://k3s.io/">K3s</a> is Kubernetes minus the bloat:</p>

<p><strong>What’s removed:</strong></p>
<ul>
  <li>Legacy alpha features</li>
  <li>Non-default admission controllers</li>
  <li>In-tree cloud providers</li>
  <li>In-tree storage plugins</li>
</ul>

<p><strong>What’s changed:</strong></p>
<ul>
  <li>etcd → SQLite (or Postgres/MySQL for HA)</li>
  <li>Docker → containerd (no Docker dependency)</li>
  <li>Single binary deployment</li>
</ul>

<p><strong>Result:</strong> 512MB RAM footprint vs 2GB+ for standard K8s.</p>

<h3 id="install-k3s">Install K3s</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Server (control-plane) node</span>
curl <span class="nt">-sfL</span> https://get.k3s.io | sh -

<span class="c"># Get node token</span>
<span class="nb">sudo cat</span> /var/lib/rancher/k3s/server/node-token

<span class="c"># Agent (worker) node</span>
curl <span class="nt">-sfL</span> https://get.k3s.io | <span class="nv">K3S_URL</span><span class="o">=</span>https://master-ip:6443 <span class="se">\</span>
  <span class="nv">K3S_TOKEN</span><span class="o">=</span>&lt;token&gt; sh -

<span class="c"># Verify</span>
<span class="nb">sudo </span>k3s kubectl get nodes
</code></pre></div></div>

<p>Production install (with external database):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># PostgreSQL HA</span>
curl <span class="nt">-sfL</span> https://get.k3s.io | sh <span class="nt">-s</span> - server <span class="se">\</span>
  <span class="nt">--datastore-endpoint</span><span class="o">=</span><span class="s2">"postgres://user:pass@postgres-host:5432/k3s"</span>
</code></pre></div></div>

<p>Read <a href="https://docs.k3s.io/architecture">K3s architecture</a> for details.</p>

<h3 id="deploy-edge-application">Deploy Edge Application</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># edge-app-deployment.yaml</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">edge-app</span>
  <span class="na">labels</span><span class="pi">:</span>
    <span class="na">app</span><span class="pi">:</span> <span class="s">edge-app</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">1</span>  <span class="c1"># Single replica per edge location</span>
  <span class="na">selector</span><span class="pi">:</span>
    <span class="na">matchLabels</span><span class="pi">:</span>
      <span class="na">app</span><span class="pi">:</span> <span class="s">edge-app</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">metadata</span><span class="pi">:</span>
      <span class="na">labels</span><span class="pi">:</span>
        <span class="na">app</span><span class="pi">:</span> <span class="s">edge-app</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="c1"># Resource limits for constrained edge</span>
      <span class="na">containers</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">app</span>
        <span class="na">image</span><span class="pi">:</span> <span class="s">my-edge-app:v1.2</span>
        <span class="na">ports</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">containerPort</span><span class="pi">:</span> <span class="m">8080</span>
        <span class="na">resources</span><span class="pi">:</span>
          <span class="na">requests</span><span class="pi">:</span>
            <span class="na">cpu</span><span class="pi">:</span> <span class="s">100m</span>      <span class="c1"># 0.1 CPU core</span>
            <span class="na">memory</span><span class="pi">:</span> <span class="s">128Mi</span>
          <span class="na">limits</span><span class="pi">:</span>
            <span class="na">cpu</span><span class="pi">:</span> <span class="s">500m</span>      <span class="c1"># 0.5 CPU core max</span>
            <span class="na">memory</span><span class="pi">:</span> <span class="s">512Mi</span>  <span class="c1"># Hard limit</span>
        
        <span class="c1"># Health checks</span>
        <span class="na">livenessProbe</span><span class="pi">:</span>
          <span class="na">httpGet</span><span class="pi">:</span>
            <span class="na">path</span><span class="pi">:</span> <span class="s">/health</span>
            <span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
          <span class="na">initialDelaySeconds</span><span class="pi">:</span> <span class="m">30</span>
          <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">10</span>
          <span class="na">failureThreshold</span><span class="pi">:</span> <span class="m">3</span>
        
        <span class="na">readinessProbe</span><span class="pi">:</span>
          <span class="na">httpGet</span><span class="pi">:</span>
            <span class="na">path</span><span class="pi">:</span> <span class="s">/ready</span>
            <span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
          <span class="na">initialDelaySeconds</span><span class="pi">:</span> <span class="m">5</span>
          <span class="na">periodSeconds</span><span class="pi">:</span> <span class="m">5</span>
        
        <span class="c1"># Environment config</span>
        <span class="na">env</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">REGION</span>
          <span class="na">valueFrom</span><span class="pi">:</span>
            <span class="na">fieldRef</span><span class="pi">:</span>
              <span class="na">fieldPath</span><span class="pi">:</span> <span class="s">metadata.labels['region']</span>  <span class="c1"># Assumes the pod template carries a "region" label (none is set above)</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">NODE_NAME</span>
          <span class="na">valueFrom</span><span class="pi">:</span>
            <span class="na">fieldRef</span><span class="pi">:</span>
              <span class="na">fieldPath</span><span class="pi">:</span> <span class="s">spec.nodeName</span>

<span class="nn">---</span>
<span class="c1"># Service with NodePort (for edge ingress)</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Service</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">edge-app</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s">NodePort</span>
  <span class="na">ports</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
    <span class="na">targetPort</span><span class="pi">:</span> <span class="m">8080</span>
    <span class="na">nodePort</span><span class="pi">:</span> <span class="m">30080</span>  <span class="c1"># Accessible on node IP</span>
  <span class="na">selector</span><span class="pi">:</span>
    <span class="na">app</span><span class="pi">:</span> <span class="s">edge-app</span>
</code></pre></div></div>
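
<p>A quick way to sanity-check those requests is to work out how many such pods an edge node can hold. The sketch below is a rough model under stated assumptions: the helper names are hypothetical, it only parses the quantity suffixes used in this manifest (<code class="language-plaintext highlighter-rouge">m</code>, <code class="language-plaintext highlighter-rouge">Mi</code>, <code class="language-plaintext highlighter-rouge">Gi</code>), and it reserves 512MiB for K3s itself. It counts requests rather than limits, since requests are what the scheduler packs against.</p>

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '100m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_mib(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '4Gi' to MiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def pods_that_fit(node_cpu: str, node_mem: str,
                  req_cpu: str, req_mem: str,
                  reserved_mem: str = "512Mi") -> int:
    """Pods schedulable on one node, limited by whichever resource runs out first."""
    cpu_fit = cpu_millicores(node_cpu) // cpu_millicores(req_cpu)
    usable_mem = memory_mib(node_mem) - memory_mib(reserved_mem)
    mem_fit = usable_mem // memory_mib(req_mem)
    return min(cpu_fit, mem_fit)


# Smallest edge node from the comparison above (2 cores, 4GB RAM)
# against the requests in the manifest (100m CPU, 128Mi memory).
print(pods_that_fit("2", "4Gi", "100m", "128Mi"))  # 20
```

<p>Here CPU is the binding constraint (20 pods by CPU, 28 by memory). If every pod ran at its limits instead of its requests, the same node would hold only 4.</p>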

<p>Deploy:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply <span class="nt">-f</span> edge-app-deployment.yaml

<span class="c"># Verify</span>
kubectl get pods
kubectl get svc
</code></pre></div></div>
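
<p>The probes in the manifest expect the container to answer <code class="language-plaintext highlighter-rouge">/health</code> and <code class="language-plaintext highlighter-rouge">/ready</code> on port 8080. As a minimal sketch of what that could look like, here is a standard-library Python server; <code class="language-plaintext highlighter-rouge">ProbeHandler</code>, <code class="language-plaintext highlighter-rouge">make_server</code>, and <code class="language-plaintext highlighter-rouge">serve</code> are hypothetical names, and a real service would wire these endpoints into its own framework.</p>

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers the liveness (/health) and readiness (/ready) probes."""

    ready = threading.Event()  # set once startup work has finished

    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to answer requests.
            self._reply(200, b"ok")
        elif self.path == "/ready":
            # Readiness: only accept traffic after startup has completed.
            if self.ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"starting")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def make_server(port: int = 8080) -> HTTPServer:
    """Bind the probe server; port 8080 matches containerPort in the manifest."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)


def serve(port: int = 8080) -> None:
    """Container entrypoint: mark the app ready, then serve until stopped."""
    server = make_server(port)
    ProbeHandler.ready.set()
    server.serve_forever()
```

<p>Keeping the two endpoints separate matters at the edge: a pod that is alive but still warming its cache should fail readiness without being restarted by the liveness probe.</p>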

<h2 id="offline-first-applications">Offline-First Applications</h2>

<p>Edge locations lose connectivity. Design for it:</p>

<h3 id="local-state--sync">Local State + Sync</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">StatefulSet</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">edge-cache</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">serviceName</span><span class="pi">:</span> <span class="s">edge-cache</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">selector</span><span class="pi">:</span>
    <span class="na">matchLabels</span><span class="pi">:</span>
      <span class="na">app</span><span class="pi">:</span> <span class="s">edge-cache</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">metadata</span><span class="pi">:</span>
      <span class="na">labels</span><span class="pi">:</span>
        <span class="na">app</span><span class="pi">:</span> <span class="s">edge-cache</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">redis</span>
        <span class="na">image</span><span class="pi">:</span> <span class="s">redis:7-alpine</span>
        <span class="na">ports</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">containerPort</span><span class="pi">:</span> <span class="m">6379</span>
        <span class="na">volumeMounts</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">data</span>
          <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/data</span>
        <span class="na">command</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">redis-server</span>
        <span class="pi">-</span> <span class="s">--save</span>
        <span class="pi">-</span> <span class="s2">"</span><span class="s">60</span><span class="nv"> </span><span class="s">1"</span>  <span class="c1"># Persist every 60s if 1+ keys changed</span>
        <span class="pi">-</span> <span class="s">--appendonly</span>
        <span class="pi">-</span> <span class="s2">"</span><span class="s">yes"</span>
        <span class="na">resources</span><span class="pi">:</span>
          <span class="na">requests</span><span class="pi">:</span>
            <span class="na">memory</span><span class="pi">:</span> <span class="s">256Mi</span>
          <span class="na">limits</span><span class="pi">:</span>
            <span class="na">memory</span><span class="pi">:</span> <span class="s">512Mi</span>
  
  <span class="na">volumeClaimTemplates</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">metadata</span><span class="pi">:</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">data</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">accessModes</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">ReadWriteOnce"</span><span class="pi">]</span>
      <span class="na">storageClassName</span><span class="pi">:</span> <span class="s">local-path</span>
      <span class="na">resources</span><span class="pi">:</span>
        <span class="na">requests</span><span class="pi">:</span>
          <span class="na">storage</span><span class="pi">:</span> <span class="s">1Gi</span>
</code></pre></div></div>

<p>Application uses local Redis, syncs to central database when online:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">redis</span>
<span class="kn">import</span> <span class="n">requests</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Optional</span>

<span class="k">class</span> <span class="nc">EdgeCache</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Offline-first cache with background sync.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">redis</span> <span class="o">=</span> <span class="n">redis</span><span class="p">.</span><span class="nc">Redis</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="sh">'</span><span class="s">edge-cache</span><span class="sh">'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">central_api</span> <span class="o">=</span> <span class="sh">'</span><span class="s">https://central.example.com/api</span><span class="sh">'</span>
    
    <span class="k">def</span> <span class="nf">get</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">key</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">bytes</span><span class="p">]:</span>
        <span class="sh">"""</span><span class="s">Get from local cache (redis-py returns bytes unless decode_responses=True).</span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">set</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Set in local cache and queue for sync.</span><span class="sh">"""</span>
        <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">set</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">rpush</span><span class="p">(</span><span class="sh">'</span><span class="s">sync_queue</span><span class="sh">'</span><span class="p">,</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">key</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">value</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">sync</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Sync pending changes to central (background task).</span><span class="sh">"""</span>
        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="n">item</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">lpop</span><span class="p">(</span><span class="sh">'</span><span class="s">sync_queue</span><span class="sh">'</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">item</span><span class="p">:</span>
                <span class="k">break</span>
            
            <span class="k">try</span><span class="p">:</span>
                <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="o">=</span> <span class="n">item</span><span class="p">.</span><span class="nf">decode</span><span class="p">().</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="s">:</span><span class="sh">'</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
                
                <span class="c1"># Upload to central
</span>                <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="nf">post</span><span class="p">(</span>
                    <span class="sa">f</span><span class="sh">'</span><span class="si">{</span><span class="n">self</span><span class="p">.</span><span class="n">central_api</span><span class="si">}</span><span class="s">/sync</span><span class="sh">'</span><span class="p">,</span>
                    <span class="n">json</span><span class="o">=</span><span class="p">{</span><span class="sh">'</span><span class="s">key</span><span class="sh">'</span><span class="p">:</span> <span class="n">key</span><span class="p">,</span> <span class="sh">'</span><span class="s">value</span><span class="sh">'</span><span class="p">:</span> <span class="n">value</span><span class="p">},</span>
                    <span class="n">timeout</span><span class="o">=</span><span class="mi">5</span>
                <span class="p">)</span>
                <span class="n">response</span><span class="p">.</span><span class="nf">raise_for_status</span><span class="p">()</span>
                
            <span class="k">except</span> <span class="n">requests</span><span class="p">.</span><span class="n">RequestException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="c1"># Network error - requeue
</span>                <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">lpush</span><span class="p">(</span><span class="sh">'</span><span class="s">sync_queue</span><span class="sh">'</span><span class="p">,</span> <span class="n">item</span><span class="p">)</span>
                <span class="k">break</span>  <span class="c1"># Stop syncing, try again later
</span></code></pre></div></div>
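
<p>The requeue-on-failure behaviour is the part most worth testing in isolation. The sketch below mirrors the loop in <code class="language-plaintext highlighter-rouge">sync()</code> but swaps Redis for an in-memory <code class="language-plaintext highlighter-rouge">deque</code> and the HTTP call for an injectable function. The names <code class="language-plaintext highlighter-rouge">drain_queue</code> and <code class="language-plaintext highlighter-rouge">flaky_upload_factory</code> are hypothetical helpers for illustration, not part of the class above.</p>

```python
from collections import deque
from typing import Callable, Deque, Tuple

Item = Tuple[str, str]


def drain_queue(queue: Deque[Item], upload: Callable[[str, str], None]) -> int:
    """Upload queued items in order; on failure, put the item back and stop.

    Returns the number of items uploaded successfully.
    """
    synced = 0
    while queue:
        key, value = queue.popleft()
        try:
            upload(key, value)
        except ConnectionError:
            # Requeue at the head so ordering is preserved for the next run.
            queue.appendleft((key, value))
            break
        synced += 1
    return synced


def flaky_upload_factory(fail_on_call: int) -> Callable[[str, str], None]:
    """Hypothetical uploader that raises on the Nth call to mimic an outage."""
    calls = {"n": 0}

    def upload(key: str, value: str) -> None:
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise ConnectionError("central API unreachable")

    return upload


if __name__ == "__main__":
    pending: Deque[Item] = deque([("a", "1"), ("b", "2"), ("c", "3")])
    done = drain_queue(pending, flaky_upload_factory(fail_on_call=2))
    print(done, list(pending))  # 1 [('b', '2'), ('c', '3')]
```

<p>Because the failed item goes back to the head of the queue, the next run resumes exactly where this one stopped, which is the property the Redis version depends on.</p>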

<p>Run the sync as a Kubernetes CronJob:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">batch/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">CronJob</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">sync-job</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">schedule</span><span class="pi">:</span> <span class="s2">"</span><span class="s">*/5</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*"</span>  <span class="c1"># Every 5 minutes</span>
  <span class="na">jobTemplate</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">template</span><span class="pi">:</span>
        <span class="na">spec</span><span class="pi">:</span>
          <span class="na">containers</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">sync</span>
            <span class="na">image</span><span class="pi">:</span> <span class="s">my-edge-app:v1.2</span>
            <span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">python"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">sync.py"</span><span class="pi">]</span>
          <span class="na">restartPolicy</span><span class="pi">:</span> <span class="s">OnFailure</span>
</code></pre></div></div>

<h2 id="image-optimization-for-edge">Image Optimization for Edge</h2>

<p>Bandwidth is limited. Minimize image sizes:</p>

<h3 id="multi-stage-builds">Multi-Stage Builds</h3>

<div class="language-dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Build stage</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">golang:1.21</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="s">builder</span>

<span class="k">WORKDIR</span><span class="s"> /app</span>
<span class="k">COPY</span><span class="s"> go.mod go.sum ./</span>
<span class="k">RUN </span>go mod download

<span class="k">COPY</span><span class="s"> . .</span>
<span class="k">RUN </span><span class="nv">CGO_ENABLED</span><span class="o">=</span>0 <span class="nv">GOOS</span><span class="o">=</span>linux go build <span class="nt">-a</span> <span class="nt">-installsuffix</span> cgo <span class="nt">-o</span> app .

<span class="c"># Runtime stage (distroless)</span>
<span class="k">FROM</span><span class="s"> gcr.io/distroless/static-debian12</span>

<span class="k">COPY</span><span class="s"> --from=builder /app/app /app</span>

<span class="k">EXPOSE</span><span class="s"> 8080</span>
<span class="k">USER</span><span class="s"> nonroot:nonroot</span>

<span class="k">ENTRYPOINT</span><span class="s"> ["/app"]</span>
</code></pre></div></div>

<p>Result: an image of roughly 10MB, versus 300MB+ with the full golang base.</p>
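
<p>To put those sizes in bandwidth terms, here is a back-of-the-envelope calculation. The 2 Mbit/s uplink is an assumed figure for a constrained edge site, not a measurement, and the estimate ignores protocol overhead and layer caching.</p>

```python
def transfer_seconds(image_mb: float, link_mbit_per_s: float) -> float:
    """Seconds to pull an image, ignoring protocol overhead and layer caching."""
    return image_mb * 8 / link_mbit_per_s


if __name__ == "__main__":
    link = 2.0  # assumed uplink in Mbit/s; substitute your site's measured rate
    for label, size_mb in [("distroless", 10), ("golang base", 300)]:
        print(f"{label}: {transfer_seconds(size_mb, link):.0f} s")
    # distroless: 40 s
    # golang base: 1200 s
```

<p>At that rate the smaller image pulls in well under a minute, while the full base image takes about twenty minutes per node.</p>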

<h3 id="pre-pull-images">Pre-pull Images</h3>

<p>Use a DaemonSet to pre-pull images on every node:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">DaemonSet</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">image-puller</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">selector</span><span class="pi">:</span>
    <span class="na">matchLabels</span><span class="pi">:</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">image-puller</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">metadata</span><span class="pi">:</span>
      <span class="na">labels</span><span class="pi">:</span>
        <span class="na">name</span><span class="pi">:</span> <span class="s">image-puller</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">initContainers</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">pull-app-image</span>
        <span class="na">image</span><span class="pi">:</span> <span class="s">my-edge-app:v1.2</span>
        <span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">sh'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">-c'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">echo</span><span class="nv"> </span><span class="s">"Image</span><span class="nv"> </span><span class="s">pulled"'</span><span class="pi">]</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">pull-cache-image</span>
        <span class="na">image</span><span class="pi">:</span> <span class="s">redis:7-alpine</span>
        <span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">sh'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">-c'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">echo</span><span class="nv"> </span><span class="s">"Image</span><span class="nv"> </span><span class="s">pulled"'</span><span class="pi">]</span>
      <span class="na">containers</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">pause</span>
        <span class="na">image</span><span class="pi">:</span> <span class="s">registry.k8s.io/pause:3.9</span>
</code></pre></div></div>

<h2 id="multi-cluster-management">Multi-Cluster Management</h2>

<p>Managing 100+ edge clusters requires automation. <a href="https://www.rancher.com/">Rancher</a> and <a href="https://argo-cd.readthedocs.io/">ArgoCD</a> help:</p>

<h3 id="gitops-with-argocd">GitOps with ArgoCD</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># argocd-app.yaml - One Application per edge cluster (us-west shown)</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">argoproj.io/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Application</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">edge-app-us-west</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">argocd</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">project</span><span class="pi">:</span> <span class="s">default</span>
  <span class="na">source</span><span class="pi">:</span>
    <span class="na">repoURL</span><span class="pi">:</span> <span class="s">https://github.com/company/edge-apps</span>
    <span class="na">targetRevision</span><span class="pi">:</span> <span class="s">HEAD</span>
    <span class="na">path</span><span class="pi">:</span> <span class="s">apps/edge-app</span>
    <span class="na">helm</span><span class="pi">:</span>
      <span class="na">values</span><span class="pi">:</span> <span class="pi">|</span>
        <span class="s">region: us-west</span>
        <span class="s">replicas: 1</span>
        <span class="s">image:</span>
          <span class="s">tag: v1.2</span>
  <span class="na">destination</span><span class="pi">:</span>
    <span class="na">server</span><span class="pi">:</span> <span class="s">https://edge-cluster-us-west.example.com</span>
    <span class="na">namespace</span><span class="pi">:</span> <span class="s">default</span>
  <span class="na">syncPolicy</span><span class="pi">:</span>
    <span class="na">automated</span><span class="pi">:</span>
      <span class="na">prune</span><span class="pi">:</span> <span class="kc">true</span>
      <span class="na">selfHeal</span><span class="pi">:</span> <span class="kc">true</span>
</code></pre></div></div>

<p>Generate apps for all clusters programmatically:</p>

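<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate-apps.py
</span><span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="n">regions</span> <span class="o">=</span> <span class="p">[</span><span class="sh">'</span><span class="s">us-west</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">us-east</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">eu-west</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">ap-southeast</span><span class="sh">'</span><span class="p">]</span>

<span class="c1"># Assumed template file (hypothetical name) with {name}, {region} and {server} placeholders
</span><span class="n">template</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">'</span><span class="s">argocd-app.yaml.tmpl</span><span class="sh">'</span><span class="p">).</span><span class="nf">read_text</span><span class="p">()</span>
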
<span class="k">for</span> <span class="n">region</span> <span class="ow">in</span> <span class="n">regions</span><span class="p">:</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="sa">f</span><span class="sh">'</span><span class="s">argocd-app-</span><span class="si">{</span><span class="n">region</span><span class="si">}</span><span class="s">.yaml</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">w</span><span class="sh">'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">template</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span>
            <span class="n">name</span><span class="o">=</span><span class="sa">f</span><span class="sh">'</span><span class="s">edge-app-</span><span class="si">{</span><span class="n">region</span><span class="si">}</span><span class="sh">'</span><span class="p">,</span>
            <span class="n">region</span><span class="o">=</span><span class="n">region</span><span class="p">,</span>
            <span class="n">server</span><span class="o">=</span><span class="sa">f</span><span class="sh">'</span><span class="s">https://edge-cluster-</span><span class="si">{</span><span class="n">region</span><span class="si">}</span><span class="s">.example.com</span><span class="sh">'</span>
        <span class="p">))</span>
</code></pre></div></div>
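
<p>The script above formats a <code class="language-plaintext highlighter-rouge">template</code> string with <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">region</code> and <code class="language-plaintext highlighter-rouge">server</code>. A minimal sketch of what that template might look like, trimmed from the Application manifest shown earlier. The <code class="language-plaintext highlighter-rouge">TEMPLATE</code> constant and <code class="language-plaintext highlighter-rouge">render_app</code> helper are illustrative names, not something the original script defines.</p>

```python
# Hypothetical template; only the placeholders used by generate-apps.py are required.
TEMPLATE = """\
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: {name}
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/edge-apps
    targetRevision: HEAD
    path: apps/edge-app
    helm:
      values: |
        region: {region}
  destination:
    server: {server}
    namespace: default
"""


def render_app(region: str) -> str:
    """Fill the template for a single region."""
    return TEMPLATE.format(
        name=f"edge-app-{region}",
        region=region,
        server=f"https://edge-cluster-{region}.example.com",
    )


if __name__ == "__main__":
    print(render_app("eu-west"))
```

<p>Plain <code class="language-plaintext highlighter-rouge">str.format</code> works here only because the YAML contains no other curly braces. A manifest with literal braces would need them doubled or a different templating approach.</p>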

<h2 id="monitoring-distributed-edge">Monitoring Distributed Edge</h2>

<p>Centralize metrics from all edge locations:</p>

<h3 id="prometheus-federation">Prometheus Federation</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># prometheus-config.yaml</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ConfigMap</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">prometheus-config</span>
<span class="na">data</span><span class="pi">:</span>
  <span class="na">prometheus.yml</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">global:</span>
      <span class="s">scrape_interval: 15s</span>
      <span class="s">evaluation_interval: 15s</span>
      <span class="s">external_labels:</span>
        <span class="s">cluster: edge-us-west</span>
        <span class="s">region: us-west</span>
    
    <span class="s"># Scrape local metrics</span>
    <span class="s">scrape_configs:</span>
    <span class="s">- job_name: 'edge-apps'</span>
      <span class="s">kubernetes_sd_configs:</span>
      <span class="s">- role: pod</span>
      <span class="s">relabel_configs:</span>
      <span class="s">- source_labels: [__meta_kubernetes_pod_label_app]</span>
        <span class="s">action: keep</span>
        <span class="s">regex: edge-app</span>
    
    <span class="s"># Push samples to central Prometheus (receiver must enable remote write)</span>
    <span class="s">remote_write:</span>
    <span class="s">- url: https://central-prometheus.example.com/api/v1/write</span>
      <span class="s">basic_auth:</span>
        <span class="s">username: edge</span>
        <span class="s">password: secret</span>
</code></pre></div></div>

<p>Query across all edge locations from central Prometheus:</p>

<pre><code class="language-promql"># Total requests per region
sum(http_requests_total) by (region)

# P95 latency per region
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (region, le)
)
</code></pre>
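
<p>The P95 query is easier to reason about once you know that <code class="language-plaintext highlighter-rouge">histogram_quantile</code> estimates the value by linear interpolation inside the bucket containing the target rank. The sketch below is a simplified re-implementation for illustration only, not Prometheus's code. The bucket bounds and counts are invented, and plain counts stand in for the per-second rates the real query uses.</p>

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified: assumes buckets are sorted by upper bound, the last one is +Inf,
    and the lowest bucket starts at 0.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # no observations


if __name__ == "__main__":
    # Made-up latency buckets in seconds: le=0.1, le=0.5, le=1.0, le=+Inf
    sample = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, sample))
```

<p>The practical consequence is that the estimate is only as precise as the bucket boundaries: with these buckets, any P95 between 0.5s and 1.0s is interpolated rather than measured.</p>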

<h2 id="best-practices">Best Practices</h2>

<ol>
  <li><strong>Right-size resources</strong> - Edge nodes are constrained. Profile actual usage:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl top pods
kubectl top nodes
</code></pre></div>    </div>
  </li>
  <li><strong>Use local storage</strong> - Network storage adds latency. Use K3s local-path provisioner:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">storageClassName</span><span class="pi">:</span> <span class="s">local-path</span>
</code></pre></div>    </div>
  </li>
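  <li><strong>Design for network failures</strong> - Test disconnected mode:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Simulate network partition</span>
sudo iptables <span class="nt">-A</span> OUTPUT <span class="nt">-p</span> tcp <span class="nt">--dport</span> 6443 <span class="nt">-j</span> DROP

<span class="c"># App should continue working offline</span>
</code></pre></div>    </div>
  </li>
  <li><strong>Automate updates</strong> - Manual updates don’t scale to 100+ clusters. Use GitOps.</li>
  <li><strong>Monitor everything</strong> - Metrics, logs, traces. Edge issues are hard to debug remotely.</li>
  <li><strong>Security at edge</strong> - Edge nodes may be physically accessible:
    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Enable Pod Security Standards</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Namespace</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">default</span>
  <span class="na">labels</span><span class="pi">:</span>
    <span class="na">pod-security.kubernetes.io/enforce</span><span class="pi">:</span> <span class="s">restricted</span>
</code></pre></div>    </div>
  </li>
</ol>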

<h2 id="conclusion">Conclusion</h2>

<p>Edge container orchestration requires rethinking traditional patterns. Lightweight runtimes (K3s), offline-first applications, optimized images, and centralized management make it practical.</p>

<p>The paradigm shift: from assuming abundant resources and reliable networking to designing for constraints and intermittency. K3s proves Kubernetes APIs work at edge scale—if you remove the bloat.</p>

<p>For 10-100 edge locations, this approach works. Beyond that, consider specialized edge platforms (AWS Wavelength, Cloudflare Workers) that abstract orchestration entirely.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://docs.k3s.io/">K3s Documentation</a> - Lightweight Kubernetes</li>
  <li><a href="https://kubeedge.io/">KubeEdge</a> - Edge-native Kubernetes</li>
  <li><a href="https://microk8s.io/">MicroK8s</a> - Minimal Kubernetes</li>
  <li><a href="https://argo-cd.readthedocs.io/">ArgoCD</a> - GitOps continuous delivery</li>
  <li><a href="https://www.rancher.com/">Rancher</a> - Multi-cluster management</li>
  <li><a href="https://www.cncf.io/blog/2021/09/14/edge-computing-a-cncf-perspective/">CNCF Edge Computing</a> - Architecture patterns</li>
  <li><a href="https://github.com/k3s-io/k3s">K3s GitHub</a> - Source and issues</li>
</ul>

<hr />

<p><em>Container orchestration at edge from August 2025 — updated with production guidance.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Architecture" /><category term="Edge Computing" /><category term="Containers" /><category term="Orchestration" /><category term="Kubernetes" /><summary type="html"><![CDATA[Orchestrate containers at the edge: edge Kubernetes, lightweight runtimes, distributed orchestration, and new paradigms for edge container management.]]></summary></entry><entry><title type="html">Cloudflare D1: SQLite at the Edge</title><link href="https://www.fratepietro.com/2025/cloudflare-d1-sqlite-edge/" rel="alternate" type="text/html" title="Cloudflare D1: SQLite at the Edge" /><published>2025-07-21T00:00:00+02:00</published><updated>2025-07-21T00:00:00+02:00</updated><id>https://www.fratepietro.com/2025/cloudflare-d1-sqlite-edge</id><content type="html" xml:base="https://www.fratepietro.com/2025/cloudflare-d1-sqlite-edge/"><![CDATA[<p><a href="https://developers.cloudflare.com/d1/">Cloudflare D1</a> is SQLite running at the edge—familiar SQL, global replication, sub-10ms queries from anywhere. It’s SQLite’s simplicity combined with Cloudflare’s distribution network.</p>

<p>I built a user preferences system with D1 and was surprised by how well it worked. Query latency from Sydney? 6ms. From São Paulo? 8ms. The same database, automatically replicated to edge locations, responding fast everywhere. No sharding configuration, no multi-region complexity—just SQLite that runs globally.</p>

<p>D1 makes sense for read-heavy workloads that benefit from geographic proximity: user settings, feature flags, product catalogs, metadata stores. It’s less suited for write-heavy transactional systems (those need stronger consistency guarantees).</p>

<p>Based on SQLite (the <a href="https://www.sqlite.org/mostdeployed.html">most widely deployed database</a>), D1 brings that reliability to the edge.</p>

<h2 id="why-d1">Why D1?</h2>

<p><strong>Familiar SQL</strong> - If you know SQLite, you know D1. Standard SQL syntax, no new query language.</p>

<p><strong>Global replication</strong> - Writes propagate to edge locations automatically. Reads are always local and fast.</p>

<p><strong>Workers integration</strong> - First-class integration with Cloudflare Workers. No connection pooling, no ORMs—just direct access.</p>

<p><strong>Cost-effective</strong> - $0.75/million reads, $5.00/GB storage. No per-database charges.</p>

<p><strong>Zero-configuration scaling</strong> - No sharding, no read replicas, no deployment topology. It just works.</p>

<p>Read the <a href="https://blog.cloudflare.com/introducing-d1/">D1 announcement</a> for Cloudflare’s vision.</p>

<h2 id="using-d1-with-workers">Using D1 with Workers</h2>

<p>D1 integrates natively with Cloudflare Workers:</p>

<h3 id="create-database">Create Database</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install Wrangler</span>
npm <span class="nb">install</span> <span class="nt">-g</span> wrangler

<span class="c"># Create D1 database</span>
wrangler d1 create my-database

<span class="c"># Output: database_name = "my-database", database_id = "xxx-xxx-xxx"</span>
</code></pre></div></div>

<p>Add to <code class="language-plaintext highlighter-rouge">wrangler.toml</code>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">[[</span><span class="n">d1_databases</span><span class="k">]]</span>
<span class="n">binding</span> <span class="o">=</span><span class="w"> </span><span class="s">"DB"</span>  <span class="c"># Available as env.DB in Workers</span>
<span class="n">database_name</span> <span class="o">=</span><span class="w"> </span><span class="s">"my-database"</span>
<span class="n">database_id</span> <span class="o">=</span><span class="w"> </span><span class="s">"xxx-xxx-xxx"</span>
</code></pre></div></div>

<h3 id="create-tables">Create Tables</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create schema file</span>
<span class="nb">cat</span> <span class="o">&gt;</span> schema.sql <span class="o">&lt;&lt;</span> <span class="sh">'</span><span class="no">EOF</span><span class="sh">'
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    created_at INTEGER NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);

CREATE TABLE IF NOT EXISTS preferences (
    user_id INTEGER NOT NULL,
    key TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (user_id, key),
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
</span><span class="no">EOF

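</span><span class="c"># Apply schema to the deployed database (omit --remote to target the local dev copy)</span>
wrangler d1 execute my-database <span class="nt">--remote</span> <span class="nt">--file</span><span class="o">=</span>schema.sql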
</code></pre></div></div>
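
<p>Since D1 is built on SQLite, the schema can be smoke-tested locally before it is applied. The sketch below uses Python's built-in <code class="language-plaintext highlighter-rouge">sqlite3</code> module to check that <code class="language-plaintext highlighter-rouge">ON DELETE CASCADE</code> behaves as intended. It is a local approximation under that assumption, not a D1 client: the <code class="language-plaintext highlighter-rouge">open_db</code> helper and the sample row are made up, and D1's runtime behaviour may differ from stock SQLite.</p>

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    created_at INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS preferences (
    user_id INTEGER NOT NULL,
    key TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (user_id, key),
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the schema applied."""
    conn = sqlite3.connect(":memory:")
    # Stock SQLite leaves foreign keys off unless enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    db = open_db()
    cur = db.execute(
        "INSERT INTO users (email, name, created_at) VALUES (?, ?, ?)",
        ("ada@example.com", "Ada", int(time.time() * 1000)),
    )
    user_id = cur.lastrowid
    db.execute(
        "INSERT INTO preferences (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, "theme", "dark"),
    )
    db.execute("DELETE FROM users WHERE id = ?", (user_id,))
    remaining = db.execute("SELECT COUNT(*) FROM preferences").fetchone()[0]
    print(remaining)  # 0: the preference row was removed with its user
```

<p>Without the <code class="language-plaintext highlighter-rouge">PRAGMA</code> line, stock SQLite would leave the orphaned preference row in place, so the check is worth keeping in any local test harness.</p>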

<h3 id="query-from-workers">Query from Workers</h3>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// worker.ts</span>
<span class="k">export</span> <span class="kr">interface</span> <span class="nx">Env</span> <span class="p">{</span>
    <span class="nl">DB</span><span class="p">:</span> <span class="nx">D1Database</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">export</span> <span class="k">default</span> <span class="p">{</span>
    <span class="k">async</span> <span class="nf">fetch</span><span class="p">(</span><span class="na">request</span><span class="p">:</span> <span class="nx">Request</span><span class="p">,</span> <span class="na">env</span><span class="p">:</span> <span class="nx">Env</span><span class="p">):</span> <span class="nb">Promise</span><span class="o">&lt;</span><span class="nx">Response</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">url</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">URL</span><span class="p">(</span><span class="nx">request</span><span class="p">.</span><span class="nx">url</span><span class="p">);</span>
        
        <span class="k">if </span><span class="p">(</span><span class="nx">url</span><span class="p">.</span><span class="nx">pathname</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">/api/users</span><span class="dl">'</span> <span class="o">&amp;&amp;</span> <span class="nx">request</span><span class="p">.</span><span class="nx">method</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">GET</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// List users</span>
            <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
                <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">SELECT id, email, name, created_at FROM users ORDER BY created_at DESC LIMIT 10</span><span class="dl">'</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">all</span><span class="p">();</span>
            
            <span class="k">return</span> <span class="nx">Response</span><span class="p">.</span><span class="nf">json</span><span class="p">(</span><span class="nx">result</span><span class="p">.</span><span class="nx">results</span><span class="p">);</span>
        <span class="p">}</span>
        
        <span class="k">if </span><span class="p">(</span><span class="nx">url</span><span class="p">.</span><span class="nx">pathname</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">/api/users</span><span class="dl">'</span> <span class="o">&amp;&amp;</span> <span class="nx">request</span><span class="p">.</span><span class="nx">method</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">POST</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
            <span class="c1">// Create user</span>
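            <span class="c1">// request.json() is untyped; assert the expected shape (no runtime validation)</span>
            <span class="kd">const</span> <span class="nx">body</span> <span class="o">=</span> <span class="p">(</span><span class="k">await</span> <span class="nx">request</span><span class="p">.</span><span class="nf">json</span><span class="p">())</span> <span class="k">as</span> <span class="p">{</span> <span class="nl">email</span><span class="p">:</span> <span class="kr">string</span><span class="p">;</span> <span class="nl">name</span><span class="p">:</span> <span class="kr">string</span> <span class="p">};</span>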
            
            <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
                <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">INSERT INTO users (email, name, created_at) VALUES (?, ?, ?) RETURNING id</span><span class="dl">'</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="nx">body</span><span class="p">.</span><span class="nx">email</span><span class="p">,</span> <span class="nx">body</span><span class="p">.</span><span class="nx">name</span><span class="p">,</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">())</span>
                <span class="p">.</span><span class="nf">first</span><span class="o">&lt;</span><span class="p">{</span> <span class="nl">id</span><span class="p">:</span> <span class="kr">number</span> <span class="p">}</span><span class="o">&gt;</span><span class="p">();</span>
            
            <span class="k">return</span> <span class="nx">Response</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">id</span><span class="p">:</span> <span class="nx">result</span><span class="p">?.</span><span class="nx">id</span> <span class="p">},</span> <span class="p">{</span> <span class="na">status</span><span class="p">:</span> <span class="mi">201</span> <span class="p">});</span>
        <span class="p">}</span>
        
        <span class="k">if </span><span class="p">(</span><span class="nx">url</span><span class="p">.</span><span class="nx">pathname</span><span class="p">.</span><span class="nf">startsWith</span><span class="p">(</span><span class="dl">'</span><span class="s1">/api/users/</span><span class="dl">'</span><span class="p">))</span> <span class="p">{</span>
            <span class="kd">const</span> <span class="nx">userId</span> <span class="o">=</span> <span class="nx">url</span><span class="p">.</span><span class="nx">pathname</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="dl">'</span><span class="s1">/</span><span class="dl">'</span><span class="p">)[</span><span class="mi">3</span><span class="p">];</span>
            
            <span class="c1">// Get user with preferences</span>
            <span class="kd">const</span> <span class="nx">user</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
                <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">SELECT * FROM users WHERE id = ?</span><span class="dl">'</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="nx">userId</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">first</span><span class="p">();</span>
            
            <span class="k">if </span><span class="p">(</span><span class="o">!</span><span class="nx">user</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">return</span> <span class="nx">Response</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">error</span><span class="p">:</span> <span class="dl">'</span><span class="s1">User not found</span><span class="dl">'</span> <span class="p">},</span> <span class="p">{</span> <span class="na">status</span><span class="p">:</span> <span class="mi">404</span> <span class="p">});</span>
            <span class="p">}</span>
            
            <span class="kd">const</span> <span class="nx">prefs</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
                <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">SELECT key, value FROM preferences WHERE user_id = ?</span><span class="dl">'</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="nx">userId</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">all</span><span class="p">();</span>
            
            <span class="k">return</span> <span class="nx">Response</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span>
                <span class="p">...</span><span class="nx">user</span><span class="p">,</span>
                <span class="na">preferences</span><span class="p">:</span> <span class="nb">Object</span><span class="p">.</span><span class="nf">fromEntries</span><span class="p">(</span>
                    <span class="nx">prefs</span><span class="p">.</span><span class="nx">results</span><span class="p">.</span><span class="nf">map</span><span class="p">(</span><span class="nx">p</span> <span class="o">=&gt;</span> <span class="p">[</span><span class="nx">p</span><span class="p">.</span><span class="nx">key</span><span class="p">,</span> <span class="nx">p</span><span class="p">.</span><span class="nx">value</span><span class="p">])</span>
                <span class="p">),</span>
            <span class="p">});</span>
        <span class="p">}</span>
        
        <span class="k">return</span> <span class="nx">Response</span><span class="p">.</span><span class="nf">json</span><span class="p">({</span> <span class="na">error</span><span class="p">:</span> <span class="dl">'</span><span class="s1">Not found</span><span class="dl">'</span> <span class="p">},</span> <span class="p">{</span> <span class="na">status</span><span class="p">:</span> <span class="mi">404</span> <span class="p">});</span>
    <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>
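<p>The handler above takes the fourth path segment as the user id and binds it as-is. A minimal sketch of validating that segment first, assuming ids are positive integers; <code class="language-plaintext highlighter-rouge">parseUserId</code> is a hypothetical helper, not part of the Workers or D1 API:</p>

```typescript
// Hypothetical helper: extract a positive integer id from /api/users/:id.
// Returns null when the path does not match, so the caller can answer with
// a 400 or 404 without touching the database.
function parseUserId(pathname: string): number | null {
    const match = /^\/api\/users\/(\d+)\/?$/.exec(pathname);
    if (match === null) {
        return null;
    }
    const id = Number(match[1]);
    // Reject zero and ids too large to represent exactly.
    return Number.isSafeInteger(id) && id > 0 ? id : null;
}
```

<p>This is stricter than the <code class="language-plaintext highlighter-rouge">startsWith</code> check: paths with extra segments or non-numeric ids are rejected before any query runs.</p>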

<h3 id="batch-operations">Batch Operations</h3>

<p>Batch multiple statements for efficiency:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="kd">function</span> <span class="nf">createUserWithPreferences</span><span class="p">(</span>
    <span class="nx">db</span><span class="p">:</span> <span class="nx">D1Database</span><span class="p">,</span>
    <span class="nx">email</span><span class="p">:</span> <span class="kr">string</span><span class="p">,</span>
    <span class="nx">name</span><span class="p">:</span> <span class="kr">string</span><span class="p">,</span>
    <span class="nx">preferences</span><span class="p">:</span> <span class="nb">Record</span><span class="o">&lt;</span><span class="kr">string</span><span class="p">,</span> <span class="kr">string</span><span class="o">&gt;</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Prepare statements</span>
    <span class="kd">const</span> <span class="nx">insertUser</span> <span class="o">=</span> <span class="nx">db</span>
        <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">INSERT INTO users (email, name, created_at) VALUES (?, ?, ?) RETURNING id</span><span class="dl">'</span><span class="p">)</span>
        <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="nx">email</span><span class="p">,</span> <span class="nx">name</span><span class="p">,</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">());</span>
    
    <span class="c1">// Execute in batch (returns array of results)</span>
    <span class="kd">const</span> <span class="nx">results</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">db</span><span class="p">.</span><span class="nf">batch</span><span class="p">([</span>
        <span class="nx">insertUser</span><span class="p">,</span>
        <span class="c1">// Bound values are always data, never SQL, so the user id is resolved with a</span>
        <span class="c1">// subquery on the email (assumes users.email is UNIQUE) instead of being bound</span>
        <span class="p">...</span><span class="nb">Object</span><span class="p">.</span><span class="nf">entries</span><span class="p">(</span><span class="nx">preferences</span><span class="p">).</span><span class="nf">map</span><span class="p">(([</span><span class="nx">key</span><span class="p">,</span> <span class="nx">value</span><span class="p">])</span> <span class="o">=&gt;</span>
            <span class="nx">db</span><span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">INSERT INTO preferences (user_id, key, value) VALUES ((SELECT id FROM users WHERE email = ?), ?, ?)</span><span class="dl">'</span><span class="p">)</span>
                <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="nx">email</span><span class="p">,</span> <span class="nx">key</span><span class="p">,</span> <span class="nx">value</span><span class="p">)</span>
        <span class="p">),</span>
    <span class="p">]);</span>
    
    <span class="kd">const</span> <span class="nx">userId</span> <span class="o">=</span> <span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">results</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="nx">id</span><span class="p">;</span>
    <span class="k">return</span> <span class="nx">userId</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

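<p>Batching reduces round trips, which is crucial for edge performance: all statements reach the database in one call and run in order as a single transaction, so a failing statement rolls back the whole batch.</p>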

<h3 id="prepared-statements">Prepared Statements</h3>

<p>Always use prepared statements (parameterized queries):</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Good: Prepared statement (prevents SQL injection)</span>
<span class="kd">const</span> <span class="nx">user</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
    <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">SELECT * FROM users WHERE email = ?</span><span class="dl">'</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="nx">email</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">first</span><span class="p">();</span>

<span class="c1">// Bad: String interpolation (SQL injection risk!)</span>
<span class="kd">const</span> <span class="nx">unsafeUser</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
    <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="s2">`SELECT * FROM users WHERE email = '</span><span class="p">${</span><span class="nx">email</span><span class="p">}</span><span class="s2">'`</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">first</span><span class="p">();</span>
</code></pre></div></div>

<p>D1 automatically caches prepared statement plans for performance.</p>
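<p>Queries with a variable number of values are where string interpolation is most tempting. A minimal sketch of building an <code class="language-plaintext highlighter-rouge">IN (...)</code> list with one placeholder per value; <code class="language-plaintext highlighter-rouge">buildInQuery</code> is a hypothetical helper, and the <code class="language-plaintext highlighter-rouge">users</code> columns are assumed from the earlier examples:</p>

```typescript
// Build "email IN (?, ?, ...)" with one placeholder per value, so the values
// travel as bound parameters instead of being spliced into the SQL text.
function buildInQuery(emails: string[]): { sql: string; params: string[] } {
    if (emails.length === 0) {
        // Nothing to look up: return a query that matches no rows.
        return { sql: 'SELECT id, email FROM users WHERE 0', params: [] };
    }
    const placeholders = emails.map(() => '?').join(', ');
    return {
        sql: `SELECT id, email FROM users WHERE email IN (${placeholders})`,
        params: emails,
    };
}

// Usage with a D1 binding (not executed here):
//   const { sql, params } = buildInQuery(['a@example.com', 'b@example.com']);
//   const rows = await env.DB.prepare(sql).bind(...params).all();
```

<p>Only the number of placeholders depends on the input; the values themselves never become part of the SQL string.</p>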

<h2 id="production-best-practices">Production Best Practices</h2>

<h3 id="1-schema-design">1. Schema Design</h3>

<p>Keep schemas simple and focused:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Good: Compact rows, appropriate types</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">products</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">sku</span> <span class="nb">TEXT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">UNIQUE</span><span class="p">,</span>
    <span class="n">name</span> <span class="nb">TEXT</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">price</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>  <span class="c1">-- Store cents as INTEGER</span>
    <span class="n">active</span> <span class="nb">INTEGER</span> <span class="k">DEFAULT</span> <span class="mi">1</span><span class="p">,</span>  <span class="c1">-- SQLite uses INTEGER for booleans</span>
    <span class="n">created_at</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>

<span class="c1">-- sku needs no extra index: the UNIQUE constraint already creates one</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_products_active</span> <span class="k">ON</span> <span class="n">products</span><span class="p">(</span><span class="n">active</span><span class="p">)</span> <span class="k">WHERE</span> <span class="n">active</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

<span class="c1">-- Avoid: Large TEXT/BLOB columns at edge</span>
<span class="c1">-- Store large data (images, documents) in R2, keep references in D1</span>
</code></pre></div></div>

<p><strong>Guidelines:</strong></p>
<ul>
  <li>Normalize appropriately—D1 supports JOINs efficiently</li>
  <li>Use INTEGER for timestamps (Unix epoch)</li>
  <li>Index foreign keys and frequently queried columns</li>
  <li>Keep row sizes under 1KB when possible</li>
  <li>Store large blobs in R2, reference by key</li>
</ul>
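<p>Storing cents, 0/1 flags, and epoch timestamps as INTEGER means the application converts them on the way out. A minimal sketch of that mapping for the <code class="language-plaintext highlighter-rouge">products</code> table above; the <code class="language-plaintext highlighter-rouge">Product</code> shape and <code class="language-plaintext highlighter-rouge">toProduct</code> are hypothetical, and the timestamp is assumed to be in milliseconds (as produced by <code class="language-plaintext highlighter-rouge">Date.now()</code>):</p>

```typescript
// Row as stored in the products table.
interface ProductRow {
    id: number;
    sku: string;
    name: string;
    price: number;      // cents, assumed non-negative
    active: number;     // 0 or 1
    created_at: number; // epoch milliseconds (assumption)
}

// Application-facing shape (hypothetical).
interface Product {
    id: number;
    sku: string;
    name: string;
    priceCents: number;
    displayPrice: string;
    active: boolean;
    createdAt: Date;
}

function toProduct(row: ProductRow): Product {
    const units = Math.floor(row.price / 100);
    const cents = row.price % 100;
    return {
        id: row.id,
        sku: row.sku,
        name: row.name,
        priceCents: row.price,
        // Format from integers to avoid floating-point rounding surprises.
        displayPrice: `${units}.${cents >= 10 ? '' : '0'}${cents}`,
        active: row.active === 1,
        createdAt: new Date(row.created_at),
    };
}
```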

<h3 id="2-query-optimization">2. Query Optimization</h3>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Use EXPLAIN QUERY PLAN to understand queries</span>
<span class="kd">const</span> <span class="nx">plan</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">env</span><span class="p">.</span><span class="nx">DB</span>
    <span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="dl">'</span><span class="s1">EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?</span><span class="dl">'</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">bind</span><span class="p">(</span><span class="dl">'</span><span class="s1">test@example.com</span><span class="dl">'</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">all</span><span class="p">();</span>

<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="nx">plan</span><span class="p">.</span><span class="nx">results</span><span class="p">);</span>
<span class="c1">// Shows if query uses indexes or table scans</span>
</code></pre></div></div>

<p><strong>Optimization tips:</strong></p>
<ul>
  <li>Use indexes for WHERE, ORDER BY, and JOIN columns</li>
  <li>Avoid <code class="language-plaintext highlighter-rouge">SELECT *</code>—specify columns you need</li>
  <li>Use LIMIT for pagination</li>
  <li>Consider denormalization for read-heavy workloads</li>
  <li>Profile queries in development with EXPLAIN</li>
</ul>

<p>See <a href="https://www.sqlite.org/optoverview.html">SQLite query optimization</a> for deep dives.</p>
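<p>Reading plans by eye does not scale to a test suite. A minimal sketch of flagging full scans in <code class="language-plaintext highlighter-rouge">EXPLAIN QUERY PLAN</code> output, relying on SQLite’s convention that the <code class="language-plaintext highlighter-rouge">detail</code> column starts with <code class="language-plaintext highlighter-rouge">SCAN</code> when every row is visited and with <code class="language-plaintext highlighter-rouge">SEARCH</code> when only a subset is; <code class="language-plaintext highlighter-rouge">findScans</code> is a hypothetical helper:</p>

```typescript
// One row of EXPLAIN QUERY PLAN output; only the "detail" column is used.
interface PlanRow {
    detail: string;
}

// Return the plan steps that visit every row of a table or index.
function findScans(plan: PlanRow[]): string[] {
    return plan
        .map((row) => row.detail)
        .filter((detail) => detail.indexOf('SCAN') === 0);
}

// Usage with a D1 binding (not executed here):
//   const plan = await env.DB.prepare('EXPLAIN QUERY PLAN ' + sql).all();
//   const scans = findScans(plan.results as PlanRow[]);
```

<p>A scan is not always wrong (small lookup tables are fine), so treat the result as a prompt to look closer rather than a hard failure.</p>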

<h3 id="3-migrations">3. Migrations</h3>

<p>Version your schema changes:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create migrations directory</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> migrations

<span class="c"># Migration 001: Initial schema</span>
<span class="nb">cat</span> <span class="o">&gt;</span> migrations/001_initial.sql <span class="o">&lt;&lt;</span> <span class="sh">'</span><span class="no">EOF</span><span class="sh">'
CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    created_at INTEGER NOT NULL
);
</span><span class="no">EOF

</span><span class="c"># Migration 002: Add name column</span>
<span class="nb">cat</span> <span class="o">&gt;</span> migrations/002_add_name.sql <span class="o">&lt;&lt;</span> <span class="sh">'</span><span class="no">EOF</span><span class="sh">'
ALTER TABLE users ADD COLUMN name TEXT;
</span><span class="no">EOF

</span><span class="c"># Apply migrations in order (fresh database only: both files fail if re-run against an existing schema)</span>
<span class="k">for </span>migration <span class="k">in </span>migrations/<span class="k">*</span>.sql<span class="p">;</span> <span class="k">do
    </span><span class="nb">echo</span> <span class="s2">"Applying </span><span class="nv">$migration</span><span class="s2">..."</span>
    wrangler d1 execute my-database <span class="nt">--file</span><span class="o">=</span><span class="s2">"</span><span class="nv">$migration</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>

<p><strong>Migration best practices:</strong></p>
<ul>
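  <li>Make migrations idempotent where SQLite allows it (<code class="language-plaintext highlighter-rouge">CREATE TABLE IF NOT EXISTS</code>); <code class="language-plaintext highlighter-rouge">ALTER TABLE ... ADD COLUMN</code> has no such guard, so record which migrations have run instead of re-applying every file</li>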
  <li>Test in staging first</li>
  <li>Keep migrations small and focused</li>
  <li>Never delete migrations after deployment</li>
  <li>Document breaking changes</li>
</ul>
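<p>Applying each migration exactly once comes down to comparing the files on disk with a record of what has already run. Wrangler’s built-in <code class="language-plaintext highlighter-rouge">wrangler d1 migrations apply</code> keeps such a record for you; the sketch below only illustrates the idea, and <code class="language-plaintext highlighter-rouge">pendingMigrations</code> is a hypothetical helper:</p>

```typescript
// Given every file in the migrations directory and the names already recorded
// as applied, return the migrations still to run, in filename order.
// Lexicographic order matches numeric order only with zero-padded prefixes
// such as 001_, 002_.
function pendingMigrations(allFiles: string[], applied: string[]): string[] {
    return allFiles
        .filter((name) => name.slice(-4) === '.sql')
        .filter((name) => applied.indexOf(name) === -1)
        .sort();
}
```

<p>The applied list would typically live in a small bookkeeping table in the same database, written in the same batch as the migration itself.</p>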

<h3 id="4-backups">4. Backups</h3>

<p>D1 keeps automatic point-in-time backups through Time Travel, but export critical data yourself as well:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Export database to SQL</span>
wrangler d1 <span class="nb">export </span>my-database <span class="nt">--output</span><span class="o">=</span>backup.sql

<span class="c"># Or dump a single table as JSON</span>
wrangler d1 execute my-database <span class="se">\</span>
    <span class="nt">--command</span><span class="o">=</span><span class="s2">"SELECT * FROM users"</span> <span class="se">\</span>
    <span class="nt">--json</span> <span class="o">&gt;</span> users_backup.json
</code></pre></div></div>

<p>Schedule regular exports to R2 for redundancy.</p>
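<p>One way to schedule such exports is a Worker with a cron trigger that dumps a table to R2. This is a rough sketch under several assumptions: the binding names <code class="language-plaintext highlighter-rouge">DB</code> and <code class="language-plaintext highlighter-rouge">BACKUPS</code>, the key layout, and the small stand-in interfaces (a real project would use the types from <code class="language-plaintext highlighter-rouge">@cloudflare/workers-types</code>) are all illustrative:</p>

```typescript
// Minimal stand-ins for the runtime types, so the sketch is self-contained.
interface D1Like {
    prepare(sql: string): { all(): Promise<{ results: unknown[] }> };
}
interface R2Like {
    put(key: string, value: string): Promise<unknown>;
}
interface Env {
    DB: D1Like;      // assumed D1 binding name
    BACKUPS: R2Like; // assumed R2 binding name
}

// Date-stamped object key (UTC), e.g. "d1/users/2025-07-01.json".
function backupKey(table: string, when: Date): string {
    return `d1/${table}/${when.toISOString().slice(0, 10)}.json`;
}

// In a Worker module this object would be the default export.
const worker = {
    // Runs on the cron schedule configured for the Worker.
    async scheduled(_event: unknown, env: Env) {
        const { results } = await env.DB.prepare('SELECT * FROM users').all();
        await env.BACKUPS.put(backupKey('users', new Date()), JSON.stringify(results));
    },
};
```

<p>Loading a whole table into memory only suits small tables; larger ones would need paging, or the <code class="language-plaintext highlighter-rouge">wrangler d1 export</code> command run from CI.</p>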

<h3 id="5-monitoring">5. Monitoring</h3>

<p>Track query performance:</p>

<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="kd">function</span> <span class="nf">queryWithMetrics</span><span class="p">(</span>
    <span class="nx">db</span><span class="p">:</span> <span class="nx">D1Database</span><span class="p">,</span>
    <span class="nx">query</span><span class="p">:</span> <span class="kr">string</span><span class="p">,</span>
    <span class="p">...</span><span class="nx">params</span><span class="p">:</span> <span class="kr">any</span><span class="p">[]</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="kd">const</span> <span class="nx">start</span> <span class="o">=</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">();</span>
    
    <span class="k">try</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">db</span><span class="p">.</span><span class="nf">prepare</span><span class="p">(</span><span class="nx">query</span><span class="p">).</span><span class="nf">bind</span><span class="p">(...</span><span class="nx">params</span><span class="p">).</span><span class="nf">all</span><span class="p">();</span>
        <span class="kd">const</span> <span class="nx">duration</span> <span class="o">=</span> <span class="nb">Date</span><span class="p">.</span><span class="nf">now</span><span class="p">()</span> <span class="o">-</span> <span class="nx">start</span><span class="p">;</span>
        
        <span class="c1">// Log slow queries</span>
        <span class="k">if </span><span class="p">(</span><span class="nx">duration</span> <span class="o">&gt;</span> <span class="mi">100</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// 100ms threshold</span>
            <span class="nx">console</span><span class="p">.</span><span class="nf">warn</span><span class="p">(</span><span class="dl">'</span><span class="s1">Slow query</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span>
                <span class="nx">query</span><span class="p">,</span>
                <span class="nx">duration</span><span class="p">,</span>
                <span class="na">rows</span><span class="p">:</span> <span class="nx">result</span><span class="p">.</span><span class="nx">results</span><span class="p">.</span><span class="nx">length</span><span class="p">,</span>
            <span class="p">});</span>
        <span class="p">}</span>
        
        <span class="k">return</span> <span class="nx">result</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">catch </span><span class="p">(</span><span class="nx">error</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nf">error</span><span class="p">(</span><span class="dl">'</span><span class="s1">Query failed</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span> <span class="nx">query</span><span class="p">,</span> <span class="nx">error</span> <span class="p">});</span>
        <span class="k">throw</span> <span class="nx">error</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Monitor:</p>
<ul>
  <li>Query latency (p50, p95, p99)</li>
  <li>Slow query frequency</li>
  <li>Error rates</li>
  <li>Database size growth</li>
  <li>Read/write ratio</li>
</ul>
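<p>The latency percentiles in that list can be computed from raw durations with the nearest-rank method. A minimal sketch, assuming the durations have already been collected somewhere (for example, parsed from logs); <code class="language-plaintext highlighter-rouge">percentile</code> and <code class="language-plaintext highlighter-rouge">latencySummary</code> are hypothetical helpers:</p>

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) {
        throw new Error('percentile() needs at least one sample');
    }
    const sorted = samples.slice().sort((a, b) => a - b);
    // Multiply before dividing to keep the rank exact for whole-number inputs.
    const rank = Math.ceil((p * sorted.length) / 100);
    // rank is 1-based; clamp so p = 0 still maps to the first sample.
    return sorted[Math.max(rank, 1) - 1];
}

function latencySummary(durationsMs: number[]) {
    return {
        p50: percentile(durationsMs, 50),
        p95: percentile(durationsMs, 95),
        p99: percentile(durationsMs, 99),
    };
}
```

<p>Tail percentiles need enough samples to mean anything: with fewer than a hundred measurements, p99 is simply the slowest query seen.</p>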

<h2 id="d1-vs-other-databases">D1 vs Other Databases</h2>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>D1</th>
      <th>PlanetScale</th>
      <th>Neon</th>
      <th>Traditional RDS</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Latency (read)</td>
      <td>5-10ms</td>
      <td>50-200ms</td>
      <td>30-100ms</td>
      <td>50-300ms</td>
    </tr>
    <tr>
      <td>Global read</td>
      <td>✅ Automatic</td>
      <td>❌ Single region</td>
      <td>❌ Single region</td>
      <td>❌ Manual setup</td>
    </tr>
    <tr>
      <td>SQL dialect</td>
      <td>SQLite</td>
      <td>MySQL</td>
      <td>Postgres</td>
      <td>Various</td>
    </tr>
    <tr>
      <td>Scaling</td>
      <td>Automatic</td>
      <td>Automatic</td>
      <td>Automatic</td>
      <td>Manual</td>
    </tr>
    <tr>
      <td>Cost (10GB + 100M reads)</td>
      <td>~$80/mo</td>
      <td>~$40/mo</td>
      <td>~$60/mo</td>
      <td>~$200/mo</td>
    </tr>
    <tr>
      <td>Edge integration</td>
      <td>✅ Native</td>
      <td>❌</td>
      <td>❌</td>
      <td>❌</td>
    </tr>
  </tbody>
</table>

<p><strong>Choose D1 when:</strong></p>
<ul>
  <li>Read-heavy workloads at global scale</li>
  <li>Sub-10ms latency requirements</li>
  <li>Using Cloudflare Workers</li>
  <li>Simple to moderate complexity queries</li>
</ul>

<p><strong>Choose alternatives when:</strong></p>
<ul>
  <li>Strong consistency critical (use Postgres/MySQL)</li>
  <li>Complex transactions required</li>
  <li>Existing Postgres/MySQL dependency</li>
  <li>Need advanced features (triggers, stored procedures)</li>
</ul>

<h2 id="limitations">Limitations</h2>

<p>D1 is early—know the constraints:</p>

<p><strong>Size limits (as of 2025):</strong></p>
<ul>
  <li>Database size: 10GB (soft limit)</li>
  <li>Row size: ~1MB</li>
  <li>Query execution time: 30s max</li>
  <li>Batch size: 50 statements</li>
</ul>

<p><strong>Feature gaps:</strong></p>
<ul>
  <li>Full-text search only through SQLite’s FTS5 virtual tables</li>
  <li>Limited geospatial support</li>
  <li>No database-level replication control</li>
  <li>Eventual consistency for writes (typically &lt;1s propagation)</li>
</ul>

<p>Check <a href="https://developers.cloudflare.com/d1/platform/limits/">D1 limits documentation</a> for current constraints.</p>
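<p>A cap on statements per batch means large writes have to be split. A minimal sketch of a chunking helper, taking the 50-statement figure listed above as an assumption to re-check against the current limits; <code class="language-plaintext highlighter-rouge">chunk</code> is a hypothetical helper:</p>

```typescript
// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
    if (!Number.isInteger(size) || size < 1) {
        throw new Error('size must be a positive integer');
    }
    const groups: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        groups.push(items.slice(i, i + size));
    }
    return groups;
}

// Usage with a D1 binding (not executed here):
//   for (const group of chunk(statements, 50)) {
//       await db.batch(group);
//   }
```

<p>Each batch is its own transaction, so splitting gives up atomicity across chunks; order the statements so that a partial write is safe to retry.</p>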

<h2 id="practical-use-cases">Practical Use Cases</h2>

<p><strong>1. User preferences/settings</strong></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">user_settings</span> <span class="p">(</span>
    <span class="n">user_id</span> <span class="nb">TEXT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">theme</span> <span class="nb">TEXT</span> <span class="k">DEFAULT</span> <span class="s1">'light'</span><span class="p">,</span>
    <span class="k">language</span> <span class="nb">TEXT</span> <span class="k">DEFAULT</span> <span class="s1">'en'</span><span class="p">,</span>
    <span class="n">notifications</span> <span class="nb">INTEGER</span> <span class="k">DEFAULT</span> <span class="mi">1</span><span class="p">,</span>
    <span class="n">updated_at</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
</code></pre></div></div>

<p><strong>2. Feature flags</strong></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">feature_flags</span> <span class="p">(</span>
    <span class="n">flag_key</span> <span class="nb">TEXT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">enabled</span> <span class="nb">INTEGER</span> <span class="k">DEFAULT</span> <span class="mi">0</span><span class="p">,</span>
    <span class="n">rollout_percentage</span> <span class="nb">INTEGER</span> <span class="k">DEFAULT</span> <span class="mi">0</span><span class="p">,</span>
    <span class="n">updated_at</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
</code></pre></div></div>
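<p>A percentage rollout needs each user to land in the same bucket on every request. A minimal sketch that hashes the flag key and user id into a bucket from 0 to 99; it assumes <code class="language-plaintext highlighter-rouge">enabled</code> acts as a master switch and <code class="language-plaintext highlighter-rouge">rollout_percentage</code> as the share of users, and the function names are hypothetical:</p>

```typescript
// Row shape of the feature_flags table above.
interface FlagRow {
    flag_key: string;
    enabled: number;            // 0 or 1
    rollout_percentage: number; // 0..100
}

// Map flag + user to a stable bucket in 0..99 with a 32-bit FNV-1a style hash
// (computed over UTF-16 code units rather than bytes).
function bucketFor(flagKey: string, userId: string): number {
    let hash = 0x811c9dc5;
    const input = `${flagKey}:${userId}`;
    for (let i = 0; i < input.length; i++) {
        hash ^= input.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash % 100;
}

function isEnabledFor(flag: FlagRow, userId: string): boolean {
    if (flag.enabled !== 1) {
        return false;
    }
    return bucketFor(flag.flag_key, userId) < flag.rollout_percentage;
}
```

<p>Including the flag key in the hash keeps buckets independent across flags, so the same users are not always first in every rollout.</p>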

<p><strong>3. Product catalog</strong></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">products</span> <span class="p">(</span>
    <span class="n">id</span> <span class="nb">TEXT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="n">name</span> <span class="nb">TEXT</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">description</span> <span class="nb">TEXT</span><span class="p">,</span>
    <span class="n">price</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">inventory</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">metadata</span> <span class="nb">TEXT</span>  <span class="c1">-- JSON string</span>
<span class="p">);</span>
</code></pre></div></div>

<p><strong>4. API rate limiting</strong></p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">rate_limits</span> <span class="p">(</span>
    <span class="k">key</span> <span class="nb">TEXT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
    <span class="k">count</span> <span class="nb">INTEGER</span> <span class="k">DEFAULT</span> <span class="mi">0</span><span class="p">,</span>
    <span class="n">window_start</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
    <span class="n">expires_at</span> <span class="nb">INTEGER</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="n">idx_rate_limits_expires</span> <span class="k">ON</span> <span class="n">rate_limits</span><span class="p">(</span><span class="n">expires_at</span><span class="p">);</span>
</code></pre></div></div>
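<p>The table above fits a fixed-window counter. A minimal sketch of the decision logic as a pure function, assuming timestamps in epoch milliseconds and a limit of at least 1; <code class="language-plaintext highlighter-rouge">checkRateLimit</code> and the <code class="language-plaintext highlighter-rouge">Decision</code> shape are hypothetical:</p>

```typescript
// Row shape of the rate_limits table above (timestamps in epoch milliseconds).
interface RateLimitRow {
    key: string;
    count: number;
    window_start: number;
    expires_at: number;
}

interface Decision {
    allowed: boolean;
    next: RateLimitRow; // state to write back
}

// Fixed-window counter: start a fresh window when none exists or the old one
// has expired, otherwise count the request against the current window.
function checkRateLimit(
    key: string,
    existing: RateLimitRow | null,
    now: number,
    limit: number,
    windowMs: number,
): Decision {
    if (existing === null || now >= existing.expires_at) {
        return {
            allowed: true,
            next: { key, count: 1, window_start: now, expires_at: now + windowMs },
        };
    }
    if (existing.count >= limit) {
        // Over the limit: reject and leave the stored row unchanged.
        return { allowed: false, next: existing };
    }
    return { allowed: true, next: { ...existing, count: existing.count + 1 } };
}
```

<p>Reading the row, deciding, and writing it back is not atomic under concurrent requests, so counts can drift slightly. Where that matters, a single <code class="language-plaintext highlighter-rouge">INSERT ... ON CONFLICT(key) DO UPDATE</code> statement moves the increment into the database.</p>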

<h2 id="conclusion">Conclusion</h2>

<p>D1 brings SQLite’s simplicity and proven reliability to the edge. For read-heavy workloads that benefit from global distribution, it’s compelling: sub-10ms queries from anywhere with zero configuration.</p>

<p>The developer experience is excellent: familiar SQL, direct Workers integration, no connection pooling headaches. The automatic replication is invisible and just works.</p>

<p>D1 is young. Features are still rolling out. But for the right use case—globally distributed, read-heavy data with moderate write frequency—it’s hard to beat. SQLite at the edge is a powerful primitive.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://developers.cloudflare.com/d1/">Cloudflare D1 Documentation</a> - Official docs</li>
  <li><a href="https://developers.cloudflare.com/d1/get-started/">D1 Get Started Guide</a> - Quick start</li>
  <li><a href="https://developers.cloudflare.com/workers/wrangler/">Wrangler CLI</a> - D1 management</li>
  <li><a href="https://www.sqlite.org/docs.html">SQLite Documentation</a> - SQL reference</li>
  <li><a href="https://www.sqlite.org/optoverview.html">SQLite Query Optimization</a> - Performance tuning</li>
  <li><a href="https://developers.cloudflare.com/d1/platform/pricing/">D1 Pricing</a> - Cost details</li>
  <li><a href="https://developers.cloudflare.com/d1/platform/limits/">D1 Limits</a> - Current constraints</li>
</ul>

<hr />

<p><em>Cloudflare D1 from July 2025 — updated with production guidance.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="How-To" /><category term="Cloudflare D1" /><category term="SQLite" /><category term="Edge" /><category term="Database" /><summary type="html"><![CDATA[Use Cloudflare D1 for edge databases: SQLite at the edge, global replication, query patterns, and how D1 enables low-latency database access.]]></summary></entry><entry><title type="html">Generative AI Engineering: Best Practices</title><link href="https://www.fratepietro.com/2025/generative-ai-engineering/" rel="alternate" type="text/html" title="Generative AI Engineering: Best Practices" /><published>2025-06-09T00:00:00+02:00</published><updated>2025-06-09T00:00:00+02:00</updated><id>https://www.fratepietro.com/2025/generative-ai-engineering</id><content type="html" xml:base="https://www.fratepietro.com/2025/generative-ai-engineering/"><![CDATA[<p>Generative AI engineering is less about the models (they’re commodities now) and more about the <strong>systems around them</strong>: prompts, retrieval, evaluation, caching, and monitoring. The difference between a demo and production is these unglamorous layers.</p>

<p>I’ve built multiple production GenAI systems—chatbots, coding assistants, document analysis. The models (GPT-4, Claude, Gemini) are interchangeable. The hard parts are: getting the right context into prompts, handling failures gracefully, managing costs, and measuring quality. This post covers patterns that work at scale.</p>

<p>This post draws on <a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview">Anthropic’s prompt engineering guide</a>, <a href="https://platform.openai.com/docs/guides/prompt-engineering">OpenAI’s best practices</a>, and real production experience.</p>

<h2 id="prompt-engineering-the-core-skill">Prompt Engineering: The Core Skill</h2>

<p>Prompts are your interface to LLMs. Good prompts are specific, structured, and include examples.</p>

<h3 id="the-six-principles">The Six Principles</h3>

<p>From <a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview">Anthropic’s guide</a>:</p>

<ol>
  <li><strong>Give Claude a role</strong> - Context shapes behavior</li>
  <li><strong>Use XML tags</strong> - Structure improves parsing</li>
  <li><strong>Be specific</strong> - Vague prompts get vague outputs</li>
  <li><strong>Use examples</strong> - Few-shot examples are powerful</li>
  <li><strong>Let Claude think</strong> - Chain-of-thought improves reasoning</li>
  <li><strong>Use prefill</strong> - Control output format</li>
</ol>

<h3 id="structured-prompts">Structured Prompts</h3>

<p>Always structure prompts with clear sections:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">anthropic</span> <span class="kn">import</span> <span class="n">Anthropic</span>

<span class="n">client</span> <span class="o">=</span> <span class="nc">Anthropic</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="sh">'</span><span class="s">your-key</span><span class="sh">'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">analyze_document</span><span class="p">(</span><span class="n">document</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Analyze document with structured prompt.</span><span class="sh">"""</span>
    
    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">You are an expert document analyst. Your task is to answer questions about documents accurately and concisely.

&lt;document&gt;
</span><span class="si">{</span><span class="n">document</span><span class="si">}</span><span class="s">
&lt;/document&gt;

&lt;question&gt;
</span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">
&lt;/question&gt;

Instructions:
1. Read the document carefully
2. Identify relevant information
3. Answer the question based only on the document
4. If the answer isn</span><span class="sh">'</span><span class="s">t in the document, say </span><span class="sh">"</span><span class="s">Not found in document</span><span class="sh">"</span><span class="s">
5. Cite specific passages when possible

Think through your answer step-by-step, then provide your final answer.</span><span class="sh">"""</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}]</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span>
</code></pre></div></div>

<p><strong>Why this works:</strong></p>
<ul>
  <li>Clear role definition</li>
  <li>XML tags separate inputs</li>
  <li>Explicit instructions</li>
  <li>Step-by-step thinking</li>
  <li>Constraints on output</li>
</ul>

<h3 id="few-shot-examples">Few-Shot Examples</h3>

<p>Examples often convey a pattern more reliably than instructions alone:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">classify_sentiment</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Classify sentiment with examples.</span><span class="sh">"""</span>
    
    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Classify the sentiment of the following text as positive, negative, or neutral.

Examples:

Text: </span><span class="sh">"</span><span class="s">This product exceeded my expectations! Amazing quality.</span><span class="sh">"</span><span class="s">
Sentiment: positive

Text: </span><span class="sh">"</span><span class="s">Terrible experience. Would not recommend.</span><span class="sh">"</span><span class="s">
Sentiment: negative

Text: </span><span class="sh">"</span><span class="s">The item arrived on time.</span><span class="sh">"</span><span class="s">
Sentiment: neutral

Now classify this text:

Text: </span><span class="sh">"</span><span class="si">{</span><span class="n">text</span><span class="si">}</span><span class="sh">"</span><span class="s">
Sentiment:</span><span class="sh">"""</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}]</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="p">.</span><span class="nf">strip</span><span class="p">()</span>
</code></pre></div></div>

<p>Three examples are enough to show the model the pattern here. For more complex tasks, 5-10 examples that cover the edge cases tend to work better.</p>
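<p>Once the example set grows, it is easier to keep the examples as data and assemble the prompt from them. A minimal sketch, assuming a hypothetical helper named <code>build_few_shot_prompt</code> (not part of any SDK) that reproduces the prompt layout used above:</p>

```python
from typing import List, Tuple


def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]], text: str) -> str:
    """Assemble a few-shot prompt from (text, label) pairs.

    Hypothetical helper: the layout mirrors the hand-written prompt above.
    """
    parts = [task, "", "Examples:", ""]
    for example_text, label in examples:
        parts.append(f'Text: "{example_text}"')
        parts.append(f"Sentiment: {label}")
        parts.append("")
    parts.append("Now classify this text:")
    parts.append("")
    parts.append(f'Text: "{text}"')
    parts.append("Sentiment:")
    return "\n".join(parts)


# Examples live in one place, so adding a fourth or fifth is a one-line change.
EXAMPLES = [
    ("This product exceeded my expectations! Amazing quality.", "positive"),
    ("Terrible experience. Would not recommend.", "negative"),
    ("The item arrived on time.", "neutral"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of the following text as positive, negative, or neutral.",
    EXAMPLES,
    "Works fine, nothing special.",
)
```

<p>The resulting string can be passed as the user message exactly like the hand-written prompt.</p>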

<h3 id="chain-of-thought-prompting">Chain-of-Thought Prompting</h3>

<p>For reasoning tasks, ask the model to think step-by-step:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">solve_math_problem</span><span class="p">(</span><span class="n">problem</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Solve with chain-of-thought reasoning.</span><span class="sh">"""</span>
    
    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Solve this math problem step-by-step.

Problem: </span><span class="si">{</span><span class="n">problem</span><span class="si">}</span><span class="s">

Let</span><span class="sh">'</span><span class="s">s solve this step by step:
1. First, identify what we</span><span class="sh">'</span><span class="s">re looking for
2. Then, break down the problem
3. Show your work
4. Finally, state the answer

Begin:</span><span class="sh">"""</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}]</span>
    <span class="p">)</span>
    
    <span class="n">reasoning</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span>
    
    <span class="c1"># Extract final answer (simplified: takes the text after the last 'answer';
</span>    <span class="c1"># if the word never appears, this returns the full reasoning)
</span>    <span class="n">answer</span> <span class="o">=</span> <span class="n">reasoning</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="s">answer</span><span class="sh">'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="nf">strip</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">reasoning</span><span class="sh">'</span><span class="p">:</span> <span class="n">reasoning</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">answer</span><span class="sh">'</span><span class="p">:</span> <span class="n">answer</span><span class="p">,</span>
    <span class="p">}</span>
</code></pre></div></div>

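<p>Chain-of-thought prompting can substantially improve accuracy on multi-step reasoning tasks; gains in the 20-40% range are often cited, though the size of the effect depends on the model and the benchmark. See <a href="https://arxiv.org/abs/2201.11903">Google’s CoT paper</a>.</p>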
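<p>Splitting on the word “answer” is fragile, because the word can appear anywhere in the reasoning. One option is to ask the model to finish with a fixed marker line and parse only that. The sketch below assumes the prompt has been extended with an instruction such as “End with a line starting with <code>Final answer:</code>”; the marker and the function name <code>extract_final_answer</code> are illustrative choices, not part of the original prompt.</p>

```python
import re
from typing import Optional

# Assumed convention: the model ends its response with "Final answer: ...".
FINAL_ANSWER_RE = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)


def extract_final_answer(reasoning: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(reasoning)
    if not matches:
        return None
    # Use the last marker in case the model restates the answer.
    return matches[-1].strip()


sample = "Step 1: 12 * 3 = 36\nStep 2: 36 + 4 = 40\nFinal answer: 40\n"
answer = extract_final_answer(sample)
```

<p>Returning <code>None</code> when the marker is missing lets the caller retry or flag the response instead of silently treating the whole reasoning as the answer.</p>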

<h3 id="prompt-templates">Prompt Templates</h3>

<p>Use templates for consistency:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">string</span> <span class="kn">import</span> <span class="n">Template</span>

<span class="c1"># Define template once
</span><span class="n">SUMMARIZATION_TEMPLATE</span> <span class="o">=</span> <span class="nc">Template</span><span class="p">(</span><span class="sh">"""</span><span class="s">Summarize the following ${document_type} in ${length} words or less.

Focus on:
${focus_areas}

${document_type}:
${content}

Summary:</span><span class="sh">"""</span><span class="p">)</span>

<span class="c1"># Use with different parameters
</span><span class="n">prompt</span> <span class="o">=</span> <span class="n">SUMMARIZATION_TEMPLATE</span><span class="p">.</span><span class="nf">substitute</span><span class="p">(</span>
    <span class="n">document_type</span><span class="o">=</span><span class="sh">'</span><span class="s">research paper</span><span class="sh">'</span><span class="p">,</span>
    <span class="n">length</span><span class="o">=</span><span class="sh">'</span><span class="s">100</span><span class="sh">'</span><span class="p">,</span>
    <span class="n">focus_areas</span><span class="o">=</span><span class="sh">'</span><span class="s">- Main findings</span><span class="se">\n</span><span class="s">- Methodology</span><span class="se">\n</span><span class="s">- Conclusions</span><span class="sh">'</span><span class="p">,</span>
    <span class="n">content</span><span class="o">=</span><span class="n">paper_text</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Templates keep prompts consistent across calls and make A/B testing easier, because each variant is a named, versionable string.</p>
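<p>As a rough illustration of the A/B-testing point, two variants can sit side by side and be rendered with the same parameters. The variant wording and the <code>render</code> helper below are made up for the example; the only library behavior relied on is that <code>Template.substitute</code> raises <code>KeyError</code> when a placeholder has no value.</p>

```python
from string import Template

# Two illustrative prompt variants sharing the same placeholders.
VARIANTS = {
    "a": Template("Summarize the following ${document_type} in ${length} words or less.\n\n${content}"),
    "b": Template("Write a ${length}-word summary of this ${document_type} for a busy reader.\n\n${content}"),
}


def render(variant: str, **params: str) -> str:
    """Render one prompt variant; raises KeyError if a placeholder is missing."""
    return VARIANTS[variant].substitute(**params)


params = {"document_type": "research paper", "length": "100", "content": "Example text."}
prompt_a = render("a", **params)
prompt_b = render("b", **params)

# substitute() fails loudly when a placeholder has no value,
# which catches half-filled prompts before they reach the model.
try:
    render("a", document_type="memo", length="50")
    missing = None
except KeyError as exc:
    missing = exc.args[0]
```

<p><code>Template.safe_substitute</code> would leave the unknown placeholder in the text instead of raising, which is rarely what you want for a prompt.</p>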

<h2 id="rag-retrieval-augmented-generation">RAG: Retrieval-Augmented Generation</h2>

<p>RAG mitigates the knowledge-cutoff and hallucination problems by retrieving relevant context before generation, so the model answers from supplied text rather than from memory alone.</p>

<h3 id="basic-rag-pipeline">Basic RAG Pipeline</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="kn">import</span> <span class="n">pinecone</span>

<span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="sh">'</span><span class="s">your-key</span><span class="sh">'</span><span class="p">)</span>
<span class="n">pc</span> <span class="o">=</span> <span class="n">pinecone</span><span class="p">.</span><span class="nc">Pinecone</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="sh">'</span><span class="s">your-key</span><span class="sh">'</span><span class="p">)</span>
<span class="n">index</span> <span class="o">=</span> <span class="n">pc</span><span class="p">.</span><span class="nc">Index</span><span class="p">(</span><span class="sh">'</span><span class="s">knowledge-base</span><span class="sh">'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">rag_query</span><span class="p">(</span><span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Answer question using RAG.</span><span class="sh">"""</span>
    
    <span class="c1"># 1. Embed the question
</span>    <span class="n">question_embedding</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">embeddings</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">text-embedding-3-small</span><span class="sh">'</span><span class="p">,</span>
        <span class="nb">input</span><span class="o">=</span><span class="n">question</span>
    <span class="p">).</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">embedding</span>
    
    <span class="c1"># 2. Retrieve relevant documents
</span>    <span class="n">results</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">query</span><span class="p">(</span>
        <span class="n">vector</span><span class="o">=</span><span class="n">question_embedding</span><span class="p">,</span>
        <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span><span class="p">,</span>
        <span class="n">include_metadata</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>
    
    <span class="c1"># 3. Format context
</span>    <span class="n">context</span> <span class="o">=</span> <span class="sh">"</span><span class="se">\n\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">([</span>
        <span class="sa">f</span><span class="sh">"</span><span class="s">Document </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">:</span><span class="se">\n</span><span class="si">{</span><span class="k">match</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="sh">'</span><span class="s">text</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="k">match</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">.</span><span class="n">matches</span><span class="p">)</span>
    <span class="p">])</span>
    
    <span class="c1"># 4. Generate answer with context
</span>    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Answer the question based on the provided context.

Context:
</span><span class="si">{</span><span class="n">context</span><span class="si">}</span><span class="s">

Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">

Answer based only on the context above. If the answer isn</span><span class="sh">'</span><span class="s">t in the context, say </span><span class="sh">"</span><span class="s">I don</span><span class="sh">'</span><span class="s">t have enough information to answer that.</span><span class="sh">"</span><span class="s">

Answer:</span><span class="sh">"""</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}],</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>  <span class="c1"># Low temperature for factual answers
</span>    <span class="p">)</span>
    
    <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
</code></pre></div></div>
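<p>To make the retrieval step less of a black box, here is a toy in-memory version of what a vector index does conceptually: score every stored vector against the query and keep the best <code>k</code>. It assumes cosine similarity as the metric (a real index may be configured differently and uses approximate search rather than a full scan), and the document IDs and three-dimensional vectors are invented for the example.</p>

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours: score every document, keep the best k."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings" purely for illustration.
DOCS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}
hits = top_k([0.8, 0.2, 0.0], DOCS, k=2)
```

<p>A full scan like this is fine for a few thousand vectors; a managed index earns its keep once the collection no longer fits that budget.</p>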

<h3 id="advanced-rag-reranking">Advanced RAG: Reranking</h3>

<p>Simple vector search isn’t always accurate. Rerank with a cross-encoder:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">sentence_transformers</span> <span class="kn">import</span> <span class="n">CrossEncoder</span>

<span class="n">reranker</span> <span class="o">=</span> <span class="nc">CrossEncoder</span><span class="p">(</span><span class="sh">'</span><span class="s">cross-encoder/ms-marco-MiniLM-L-6-v2</span><span class="sh">'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">rag_with_reranking</span><span class="p">(</span><span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">RAG with reranking for better accuracy.</span><span class="sh">"""</span>
    
    <span class="c1"># 1. Retrieve more candidates than needed
</span>    <span class="n">results</span> <span class="o">=</span> <span class="nf">vector_search</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span>
    
    <span class="c1"># 2. Rerank using cross-encoder
</span>    <span class="n">pairs</span> <span class="o">=</span> <span class="p">[[</span><span class="n">question</span><span class="p">,</span> <span class="k">match</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="sh">'</span><span class="s">text</span><span class="sh">'</span><span class="p">]]</span> <span class="k">for</span> <span class="k">match</span> <span class="ow">in</span> <span class="n">results</span><span class="p">.</span><span class="n">matches</span><span class="p">]</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">reranker</span><span class="p">.</span><span class="nf">predict</span><span class="p">(</span><span class="n">pairs</span><span class="p">)</span>
    
    <span class="c1"># 3. Sort by reranker scores
</span>    <span class="n">reranked</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="nf">zip</span><span class="p">(</span><span class="n">results</span><span class="p">.</span><span class="n">matches</span><span class="p">,</span> <span class="n">scores</span><span class="p">),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="c1"># 4. Use top-k after reranking
</span>    <span class="n">top_results</span> <span class="o">=</span> <span class="p">[</span><span class="n">match</span> <span class="k">for</span> <span class="n">match</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">reranked</span><span class="p">[:</span><span class="n">top_k</span><span class="p">]]</span>
    
    <span class="c1"># 5. Generate with reranked context
</span>    <span class="n">context</span> <span class="o">=</span> <span class="nf">format_context</span><span class="p">(</span><span class="n">top_results</span><span class="p">)</span>
    <span class="k">return</span> <span class="nf">generate_answer</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
</code></pre></div></div>

<p>Reranking improves accuracy by 10-20% in my experience. See <a href="https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/CohereRerank/">LlamaIndex’s reranking guide</a>.</p>
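<p>The reranking snippet calls <code>format_context</code> without defining it. A plausible version, mirroring step 3 of the basic pipeline, might look like the sketch below. The <code>max_chars</code> budget is an added assumption to keep the prompt bounded, and <code>SimpleNamespace</code> objects stand in for real index matches.</p>

```python
from types import SimpleNamespace
from typing import Any, Iterable


def format_context(matches: Iterable[Any], max_chars: int = 4000) -> str:
    """Join match texts into numbered blocks, stopping before max_chars is exceeded.

    Assumes each match exposes its text as match.metadata['text'].
    The first block is always kept, even if it is longer than the budget.
    """
    blocks = []
    used = 0
    for i, match in enumerate(matches, start=1):
        block = f"Document {i}:\n{match.metadata['text']}"
        if blocks and used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


# Stand-ins for index matches, used only to exercise the helper.
fake_matches = [
    SimpleNamespace(metadata={"text": "Refunds are issued within 14 days."}),
    SimpleNamespace(metadata={"text": "Shipping takes 3 to 5 business days."}),
]
context = format_context(fake_matches)
```

<p>Because reranked results arrive best-first, truncating from the end drops the least relevant passages.</p>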

<h3 id="hyde-hypothetical-document-embeddings">HyDE: Hypothetical Document Embeddings</h3>

<p>For complex queries, generate a hypothetical answer first:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hyde_rag</span><span class="p">(</span><span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">RAG with hypothetical document embeddings.</span><span class="sh">"""</span>
    
    <span class="c1"># 1. Generate hypothetical answer
</span>    <span class="n">hypothetical_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Generate a detailed answer to this question:

</span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">

Write as if you</span><span class="sh">'</span><span class="s">re answering from authoritative sources.</span><span class="sh">"""</span>

    <span class="n">hypothetical_answer</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">hypothetical_prompt</span><span class="p">}],</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
    <span class="p">).</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
    
    <span class="c1"># 2. Embed and search using hypothetical answer
</span>    <span class="c1"># (hypothetical answer better matches document style)
</span>    <span class="n">embedding</span> <span class="o">=</span> <span class="nf">embed</span><span class="p">(</span><span class="n">hypothetical_answer</span><span class="p">)</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">query</span><span class="p">(</span><span class="n">vector</span><span class="o">=</span><span class="n">embedding</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">include_metadata</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="c1"># 3. Generate final answer with retrieved context
</span>    <span class="n">context</span> <span class="o">=</span> <span class="nf">format_context</span><span class="p">(</span><span class="n">results</span><span class="p">.</span><span class="n">matches</span><span class="p">)</span>
    <span class="k">return</span> <span class="nf">generate_answer</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
</code></pre></div></div>

<p>HyDE can improve retrieval for questions whose wording doesn’t match the phrasing of the documents. Paper: <a href="https://arxiv.org/abs/2212.10496">Precise Zero-Shot Dense Retrieval without Relevance Labels</a>.</p>
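<p>A single hypothetical answer can be an unlucky sample. A common variation, and as far as I understand the approach the paper itself describes, is to generate several hypothetical answers, embed each one, and search with the average vector. The helper below is a minimal sketch of the averaging step only; the name <code>average_embeddings</code> and the toy vectors are illustrative.</p>

```python
from typing import List, Sequence


def average_embeddings(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy vectors standing in for embeddings of two hypothetical answers.
hypothetical_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
query_vector = average_embeddings(hypothetical_embeddings)
```

<p>The averaged vector is then passed to the index query in place of a single embedding. The trade-off is one extra generation and embedding call per sample.</p>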

<h2 id="evaluation-measuring-quality">Evaluation: Measuring Quality</h2>

<p>LLM outputs are probabilistic. You need systematic evaluation.</p>

<h3 id="automated-evaluation-metrics">Automated Evaluation Metrics</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">()</span>

<span class="k">class</span> <span class="nc">LLMEvaluator</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Evaluate LLM outputs systematically.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">evaluate_answer</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">answer</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">ground_truth</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Evaluate answer quality.</span><span class="sh">"""</span>
        
        <span class="c1"># 1. Semantic similarity (embeddings)
</span>        <span class="n">answer_emb</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">embed</span><span class="p">(</span><span class="n">answer</span><span class="p">)</span>
        <span class="n">truth_emb</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">embed</span><span class="p">(</span><span class="n">ground_truth</span><span class="p">)</span>
        <span class="n">similarity</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">dot</span><span class="p">(</span><span class="n">answer_emb</span><span class="p">,</span> <span class="n">truth_emb</span><span class="p">)</span>  <span class="c1"># dot product equals cosine similarity only for unit-length embeddings</span>
        
        <span class="c1"># 2. LLM-as-judge
</span>        <span class="n">judge_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Evaluate the quality of this answer on a scale of 1-5.

Question: </span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">

Expected Answer: </span><span class="si">{</span><span class="n">ground_truth</span><span class="si">}</span><span class="s">

Actual Answer: </span><span class="si">{</span><span class="n">answer</span><span class="si">}</span><span class="s">

Rate the answer considering:
- Accuracy (is it factually correct?)
- Completeness (does it fully answer the question?)
- Relevance (does it stay on topic?)

Provide a score (1-5) and brief explanation.

Format:
Score: [1-5]
Explanation: [your reasoning]</span><span class="sh">"""</span>

        <span class="n">judge_response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">judge_prompt</span><span class="p">}],</span>
            <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
        <span class="p">).</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
        
        <span class="c1"># Parse score
</span>        <span class="n">score</span> <span class="o">=</span> <span class="nf">int</span><span class="p">(</span><span class="n">judge_response</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="s">Score:</span><span class="sh">'</span><span class="p">)[</span><span class="mi">1</span><span class="p">].</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="se">\n</span><span class="sh">'</span><span class="p">)[</span><span class="mi">0</span><span class="p">].</span><span class="nf">strip</span><span class="p">())</span>
        
        <span class="k">return</span> <span class="p">{</span>
            <span class="sh">'</span><span class="s">semantic_similarity</span><span class="sh">'</span><span class="p">:</span> <span class="n">similarity</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">llm_judge_score</span><span class="sh">'</span><span class="p">:</span> <span class="n">score</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">judge_explanation</span><span class="sh">'</span><span class="p">:</span> <span class="n">judge_response</span><span class="p">,</span>
        <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">embed</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Get embedding.</span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">client</span><span class="p">.</span><span class="n">embeddings</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">text-embedding-3-small</span><span class="sh">'</span><span class="p">,</span>
            <span class="nb">input</span><span class="o">=</span><span class="n">text</span>
        <span class="p">).</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">embedding</span>
</code></pre></div></div>
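<p>The score parsing in <code>evaluate_answer</code> raises an exception if the judge deviates even slightly from the requested format, for example by writing <code>Score: 4/5</code> or <code>Score: [4]</code>. A more forgiving parser could look like this sketch; <code>parse_judge_score</code> is a hypothetical helper, and returning <code>None</code> for unparseable responses is a design choice rather than anything the evaluator above does.</p>

```python
import re
from typing import Optional

# Accepts "Score: 4", "Score: [4]" and "Score: 4/5"; rejects values outside 1-5.
SCORE_RE = re.compile(r"Score:\s*\[?\s*([1-5])\b")


def parse_judge_score(judge_response: str) -> Optional[int]:
    """Extract a 1-5 score from the judge output, or None if no valid score is found."""
    match = SCORE_RE.search(judge_response)
    return int(match.group(1)) if match else None


score = parse_judge_score("Score: 4\nExplanation: mostly correct")
```

<p>Counting how often the parser returns <code>None</code> is a useful metric in its own right: a rising rate usually means the judge prompt needs tightening.</p>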

<h3 id="test-sets">Test Sets</h3>

<p>Build curated test sets:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_cases</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">What is the capital of France?</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">expected</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Paris</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">category</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">factual</span><span class="sh">'</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Explain photosynthesis simply</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">expected</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">Plants convert sunlight into energy...</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">category</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">explanation</span><span class="sh">'</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="c1"># ... more test cases
</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">run_evaluation</span><span class="p">(</span><span class="n">system</span><span class="p">,</span> <span class="n">test_cases</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Run systematic evaluation.</span><span class="sh">"""</span>
    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="k">for</span> <span class="n">test</span> <span class="ow">in</span> <span class="n">test_cases</span><span class="p">:</span>
        <span class="n">answer</span> <span class="o">=</span> <span class="n">system</span><span class="p">.</span><span class="nf">answer</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">])</span>
        
        <span class="n">metrics</span> <span class="o">=</span> <span class="n">evaluator</span><span class="p">.</span><span class="nf">evaluate_answer</span><span class="p">(</span>
            <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">],</span>
            <span class="n">answer</span><span class="p">,</span>
            <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">expected</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>
        
        <span class="n">results</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">:</span> <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">],</span>
            <span class="sh">'</span><span class="s">answer</span><span class="sh">'</span><span class="p">:</span> <span class="n">answer</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">metrics</span><span class="sh">'</span><span class="p">:</span> <span class="n">metrics</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">category</span><span class="sh">'</span><span class="p">:</span> <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">category</span><span class="sh">'</span><span class="p">],</span>
        <span class="p">})</span>
    
    <span class="c1"># Aggregate by category
</span>    <span class="n">by_category</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
        <span class="n">cat</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">category</span><span class="sh">'</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">cat</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">by_category</span><span class="p">:</span>
            <span class="n">by_category</span><span class="p">[</span><span class="n">cat</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">by_category</span><span class="p">[</span><span class="n">cat</span><span class="p">].</span><span class="nf">append</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">metrics</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">llm_judge_score</span><span class="sh">'</span><span class="p">])</span>
    
    <span class="c1"># Print summary
</span>    <span class="k">for</span> <span class="n">category</span><span class="p">,</span> <span class="n">scores</span> <span class="ow">in</span> <span class="n">by_category</span><span class="p">.</span><span class="nf">items</span><span class="p">():</span>
        <span class="n">avg</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">category</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">avg</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">/5.0</span><span class="sh">"</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">results</span>
</code></pre></div></div>

<h3 id="ab-testing">A/B Testing</h3>

<p>Compare prompt variants:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ab_test_prompts</span><span class="p">(</span><span class="n">variant_a</span><span class="p">,</span> <span class="n">variant_b</span><span class="p">,</span> <span class="n">test_cases</span><span class="p">,</span> <span class="n">sample_size</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">A/B test two prompt variants.</span><span class="sh">"""</span>
    
    <span class="n">results_a</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">results_b</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="k">for</span> <span class="n">test</span> <span class="ow">in</span> <span class="n">test_cases</span><span class="p">[:</span><span class="n">sample_size</span><span class="p">]:</span>
        <span class="c1"># Test variant A
</span>        <span class="n">answer_a</span> <span class="o">=</span> <span class="nf">generate_with_prompt</span><span class="p">(</span><span class="n">variant_a</span><span class="p">,</span> <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">])</span>
        <span class="n">score_a</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">],</span> <span class="n">answer_a</span><span class="p">,</span> <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">expected</span><span class="sh">'</span><span class="p">])</span>
        <span class="n">results_a</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">score_a</span><span class="p">)</span>
        
        <span class="c1"># Test variant B
</span>        <span class="n">answer_b</span> <span class="o">=</span> <span class="nf">generate_with_prompt</span><span class="p">(</span><span class="n">variant_b</span><span class="p">,</span> <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">])</span>
        <span class="n">score_b</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">question</span><span class="sh">'</span><span class="p">],</span> <span class="n">answer_b</span><span class="p">,</span> <span class="n">test</span><span class="p">[</span><span class="sh">'</span><span class="s">expected</span><span class="sh">'</span><span class="p">])</span>
        <span class="n">results_b</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">score_b</span><span class="p">)</span>
    
    <span class="c1"># Paired t-test: both variants are scored on the same test cases
</span>    <span class="kn">from</span> <span class="n">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
    <span class="n">t_stat</span><span class="p">,</span> <span class="n">p_value</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="nf">ttest_rel</span><span class="p">(</span><span class="n">results_a</span><span class="p">,</span> <span class="n">results_b</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">variant_a_mean</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">results_a</span><span class="p">),</span>
        <span class="sh">'</span><span class="s">variant_b_mean</span><span class="sh">'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">results_b</span><span class="p">),</span>
        <span class="sh">'</span><span class="s">p_value</span><span class="sh">'</span><span class="p">:</span> <span class="n">p_value</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">winner</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">A</span><span class="sh">'</span> <span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">results_a</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">results_b</span><span class="p">)</span> <span class="k">else</span> <span class="sh">'</span><span class="s">B</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">significant</span><span class="sh">'</span><span class="p">:</span> <span class="n">p_value</span> <span class="o">&lt;</span> <span class="mf">0.05</span><span class="p">,</span>
    <span class="p">}</span>
</code></pre></div></div>
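
<p>Because both variants answer the same questions, the comparison is naturally paired: what matters is the per-question score difference. The sketch below computes a paired t statistic with the standard library only. The function name <code>paired_t_statistic</code> and the scores are illustrative assumptions, not part of the evaluation code above.</p>

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (mean_difference, t_statistic) for scores on the same test cases."""
    if len(scores_a) != len(scores_b):
        raise ValueError("need one score per test case for each variant")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n in (0, 1):
        raise ValueError("need at least two paired scores")
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    # Standard error of the mean difference is sd / sqrt(n).
    return mean_diff, mean_diff / (sd / math.sqrt(n))


# Made-up judge scores for five test cases (illustration only).
scores_a = [4.0, 3.5, 4.5, 4.0, 5.0]
scores_b = [3.5, 3.0, 4.5, 3.5, 4.0]
mean_diff, t_stat = paired_t_statistic(scores_a, scores_b)
print(f"mean difference: {mean_diff:.2f}, t = {t_stat:.2f}")
```

<p>The statistic is compared against a t distribution with <code>n - 1</code> degrees of freedom. <code>scipy.stats.ttest_rel</code> performs the same calculation and also returns the p-value.</p>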

<p>Use <a href="https://wandb.ai/">Weights &amp; Biases</a> or <a href="https://www.langchain.com/langsmith">LangSmith</a> for experiment tracking.</p>

<h2 id="production-best-practices">Production Best Practices</h2>

<h3 id="1-cost-optimization">1. Cost Optimization</h3>

<p>LLM costs scale with token volume and model choice, so optimize aggressively:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CostOptimizedLLM</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">LLM client with cost optimization.</span><span class="sh">"""</span>
    
    <span class="n">PRICING</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0025</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.010</span><span class="p">,</span> <span class="sh">'</span><span class="s">quality</span><span class="sh">'</span><span class="p">:</span> <span class="mi">5</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.00015</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0006</span><span class="p">,</span> <span class="sh">'</span><span class="s">quality</span><span class="sh">'</span><span class="p">:</span> <span class="mi">4</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">claude-3-5-sonnet</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.003</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.015</span><span class="p">,</span> <span class="sh">'</span><span class="s">quality</span><span class="sh">'</span><span class="p">:</span> <span class="mi">5</span><span class="p">},</span>
    <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="n">self</span><span class="p">.</span><span class="n">total_cost</span> <span class="o">=</span> <span class="mi">0</span>
    
    <span class="k">def</span> <span class="nf">generate</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">task_complexity</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="sh">'</span><span class="s">medium</span><span class="sh">'</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Generate with cost optimization.</span><span class="sh">"""</span>
        
        <span class="c1"># 1. Check cache
</span>        <span class="n">cache_key</span> <span class="o">=</span> <span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">task_complexity</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">cache_key</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">cache</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">cache</span><span class="p">[</span><span class="n">cache_key</span><span class="p">]</span>
        
        <span class="c1"># 2. Select model based on task
</span>        <span class="n">model</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_select_model</span><span class="p">(</span><span class="n">task_complexity</span><span class="p">)</span>
        
        <span class="c1"># 3. Minimize token usage
</span>        <span class="n">optimized_prompt</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_optimize_prompt</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
        
        <span class="c1"># 4. Generate
</span>        <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">optimized_prompt</span><span class="p">}],</span>
            <span class="n">max_tokens</span><span class="o">=</span><span class="n">self</span><span class="p">.</span><span class="nf">_calculate_max_tokens</span><span class="p">(</span><span class="n">task_complexity</span><span class="p">),</span>
        <span class="p">)</span>
        
        <span class="c1"># 5. Track cost
</span>        <span class="n">cost</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_calculate_cost</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">response</span><span class="p">.</span><span class="n">usage</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">total_cost</span> <span class="o">+=</span> <span class="n">cost</span>
        
        <span class="c1"># 6. Cache result
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
        <span class="n">self</span><span class="p">.</span><span class="n">cache</span><span class="p">[</span><span class="n">cache_key</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span>
        
        <span class="k">return</span> <span class="n">result</span>
    
    <span class="k">def</span> <span class="nf">_select_model</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">complexity</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Choose cheapest model that meets quality needs.</span><span class="sh">"""</span>
        <span class="k">if</span> <span class="n">complexity</span> <span class="o">==</span> <span class="sh">'</span><span class="s">simple</span><span class="sh">'</span><span class="p">:</span>
            <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span>
        <span class="k">elif</span> <span class="n">complexity</span> <span class="o">==</span> <span class="sh">'</span><span class="s">medium</span><span class="sh">'</span><span class="p">:</span>
            <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span>  <span class="c1"># Try cheap first
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span>
    
    <span class="k">def</span> <span class="nf">_optimize_prompt</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Remove unnecessary tokens.</span><span class="sh">"""</span>
        <span class="c1"># Remove extra whitespace
</span>        <span class="n">optimized</span> <span class="o">=</span> <span class="sh">'</span><span class="s"> </span><span class="sh">'</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">prompt</span><span class="p">.</span><span class="nf">split</span><span class="p">())</span>
        <span class="c1"># Truncate if too long
</span>        <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">optimized</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">10000</span><span class="p">:</span>
            <span class="n">optimized</span> <span class="o">=</span> <span class="n">optimized</span><span class="p">[:</span><span class="mi">10000</span><span class="p">]</span> <span class="o">+</span> <span class="sh">'</span><span class="s">...</span><span class="sh">'</span>
        <span class="k">return</span> <span class="n">optimized</span>
    
    <span class="k">def</span> <span class="nf">_calculate_max_tokens</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">complexity</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Set appropriate max_tokens.</span><span class="sh">"""</span>
        <span class="n">limits</span> <span class="o">=</span> <span class="p">{</span><span class="sh">'</span><span class="s">simple</span><span class="sh">'</span><span class="p">:</span> <span class="mi">256</span><span class="p">,</span> <span class="sh">'</span><span class="s">medium</span><span class="sh">'</span><span class="p">:</span> <span class="mi">512</span><span class="p">,</span> <span class="sh">'</span><span class="s">complex</span><span class="sh">'</span><span class="p">:</span> <span class="mi">2048</span><span class="p">}</span>
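        <span class="k">return</span> <span class="n">limits</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">complexity</span><span class="p">,</span> <span class="mi">512</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_calculate_cost</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">usage</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="sh">"""</span><span class="s">Estimate request cost, assuming PRICING is USD per 1K tokens.</span><span class="sh">"""</span>
        <span class="n">price</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">PRICING</span><span class="p">[</span><span class="n">model</span><span class="p">]</span>
        <span class="k">return</span> <span class="p">(</span>
            <span class="n">usage</span><span class="p">.</span><span class="n">prompt_tokens</span> <span class="o">*</span> <span class="n">price</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">]</span>
            <span class="o">+</span> <span class="n">usage</span><span class="p">.</span><span class="n">completion_tokens</span> <span class="o">*</span> <span class="n">price</span><span class="p">[</span><span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span> <span class="o">/</span> <span class="mi">1000</span>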
</code></pre></div></div>

<p><strong>Cost reduction strategies:</strong></p>
<ul>
  <li>Use cheaper models (GPT-4o-mini) for simple tasks</li>
  <li>Cache aggressively (30-50% cache hit rate typical)</li>
  <li>Minimize prompt tokens (context compression)</li>
  <li>Set appropriate max_tokens</li>
  <li>Batch requests where possible</li>
</ul>
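
<p>To check whether caching pays off for a given workload, it helps to measure the hit rate directly. The following is a minimal sketch, assuming an in-memory cache keyed by model and prompt. <code>ResponseCache</code>, <code>make_key</code>, and <code>fake_llm</code> are hypothetical names used for illustration, and the lambda stands in for a real API call.</p>

```python
import hashlib
import json


class ResponseCache:
    """In-memory response cache keyed by model and prompt, with hit-rate tracking."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model, prompt):
        # SHA-256 is stable across processes, unlike the built-in hash() on strings,
        # so the same key scheme also works for a shared cache such as Redis.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model, prompt, compute):
        """Return the cached response, calling compute(prompt) only on a miss."""
        key = self.make_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(prompt)
        self._store[key] = result
        return result

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = ResponseCache()
fake_llm = lambda prompt: prompt.upper()  # stand-in for a real API call
for prompt in ["hello", "hello", "world", "hello"]:
    cache.get_or_compute("gpt-4o-mini", prompt, fake_llm)
print(f"hit rate: {cache.hit_rate:.0%}")
```

<p>Including the model name in the key keeps a response generated by one model from being served for a request routed to another.</p>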

<h3 id="2-reliability-and-error-handling">2. Reliability and Error Handling</h3>

<p>LLMs fail. Handle it gracefully:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">openai</span>
<span class="kn">from</span> <span class="n">tenacity</span> <span class="kn">import</span> <span class="n">retry</span><span class="p">,</span> <span class="n">stop_after_attempt</span><span class="p">,</span> <span class="n">wait_exponential</span>

<span class="nd">@retry</span><span class="p">(</span>
    <span class="n">stop</span><span class="o">=</span><span class="nf">stop_after_attempt</span><span class="p">(</span><span class="mi">3</span><span class="p">),</span>
    <span class="n">wait</span><span class="o">=</span><span class="nf">wait_exponential</span><span class="p">(</span><span class="n">multiplier</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">min</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">generate_with_retry</span><span class="p">(</span><span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Generate with automatic retries.</span><span class="sh">"""</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span><span class="p">,</span>
            <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}],</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>
    
    <span class="k">except</span> <span class="n">openai</span><span class="p">.</span><span class="n">RateLimitError</span><span class="p">:</span>
        <span class="c1"># Rate limited: re-raise so tenacity backs off and retries
</span>        <span class="k">raise</span>
    
    <span class="k">except</span> <span class="n">openai</span><span class="p">.</span><span class="n">APIError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="c1"># API error: log, then re-raise so tenacity retries
</span>        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">API error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">raise</span>
    
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="c1"># Unexpected error
</span>        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Unexpected error: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="sh">"</span><span class="s">I apologize, but I</span><span class="sh">'</span><span class="s">m having trouble processing your request.</span><span class="sh">"</span>
</code></pre></div></div>
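
<p>The retry policy above can also be sketched without a library, which makes the mechanics visible: exponential backoff, random jitter, and a fallback once attempts are exhausted. This is a rough illustration under stated assumptions, not a replacement for tenacity. <code>call_with_backoff</code> and <code>FlakyService</code> are hypothetical names, and the injectable <code>sleep</code> parameter exists only so the example runs instantly.</p>

```python
import random
import time


def call_with_backoff(fn, max_attempts=3, base_delay=1.0, max_delay=10.0,
                      retry_on=(Exception,), fallback=None, sleep=time.sleep):
    """Call fn(); retry on the given exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                # Retries exhausted: degrade gracefully instead of raising.
                return fallback
            # The delay cap doubles each attempt (base, 2*base, ...), bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the cap so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    # Exceptions not listed in retry_on propagate to the caller unchanged.


class FlakyService:
    """Stand-in for an LLM call that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls > self.failures:
            return "ok"
        raise TimeoutError("simulated transient failure")


delays = []  # records requested sleeps instead of actually sleeping
service = FlakyService(failures=2)
result = call_with_backoff(service, retry_on=(TimeoutError,), sleep=delays.append)
print(result, service.calls, len(delays))
```

<p>Restricting <code>retry_on</code> to transient errors matters: retrying an invalid request or an authentication failure only adds latency and cost.</p>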

<h3 id="3-monitoring">3. Monitoring</h3>

<p>Track what matters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">structlog</span>
<span class="kn">from</span> <span class="n">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
<span class="kn">from</span> <span class="n">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">structlog</span><span class="p">.</span><span class="nf">get_logger</span><span class="p">()</span>

<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">LLMMetrics</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Track LLM usage metrics.</span><span class="sh">"""</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">prompt_tokens</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">completion_tokens</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">latency_ms</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">cost_usd</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">success</span><span class="p">:</span> <span class="nb">bool</span>
    <span class="n">error</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>

<span class="k">def</span> <span class="nf">log_llm_request</span><span class="p">(</span><span class="n">metrics</span><span class="p">:</span> <span class="n">LLMMetrics</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Log for analysis.</span><span class="sh">"""</span>
    <span class="n">logger</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">llm_request</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">request_id</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">request_id</span><span class="p">,</span>
        <span class="n">model</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
        <span class="n">prompt_tokens</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">prompt_tokens</span><span class="p">,</span>
        <span class="n">completion_tokens</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">completion_tokens</span><span class="p">,</span>
        <span class="n">latency_ms</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">latency_ms</span><span class="p">,</span>
        <span class="n">cost</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">cost_usd</span><span class="p">,</span>
        <span class="n">success</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">success</span><span class="p">,</span>
        <span class="n">error</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">error</span><span class="p">,</span>
    <span class="p">)</span>

<span class="c1"># Track aggregate metrics:
# - Requests per minute
# - Latency percentiles (p50, p95, p99)
# - Token usage per user/endpoint
# - Cost per day/user
# - Error rate by type
# - Cache hit rate
</span></code></pre></div></div>

<p>Use <a href="https://www.helicone.ai/">Helicone</a>, <a href="https://www.langchain.com/langsmith">LangSmith</a>, or <a href="https://wandb.ai/">Weights &amp; Biases</a> for LLM observability.</p>
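
<p>Latency percentiles are worth computing explicitly, because a mean hides tail behavior. Below is a minimal sketch using the nearest-rank method. The helper names <code>percentile</code> and <code>summarize_latency</code> and the sample latencies are assumptions for illustration, and <code>pct</code> is expected to be an integer.</p>

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer numerator keeps the rank exact for whole-number percentiles.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(latencies_ms):
    """Return the p50, p95, and p99 latency for a list of samples."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Made-up latencies in milliseconds, with one slow outlier.
latencies = [120, 135, 150, 160, 180, 210, 240, 300, 450, 2000]
summary = summarize_latency(latencies)
print(summary, "mean:", statistics.mean(latencies))
```

<p>With a single slow outlier, the mean lands well above the median while p95 and p99 expose the outlier itself, which is why dashboards usually track percentiles rather than averages.</p>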

<h3 id="4-security">4. Security</h3>

<p>Protect against prompt injection and data leakage:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">sanitize_input</span><span class="p">(</span><span class="n">user_input</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Reject input that contains known prompt-injection phrases.</span><span class="sh">"""</span>
    <span class="c1"># Phrases that commonly appear in injection attempts (not exhaustive)
</span>    <span class="n">dangerous_patterns</span> <span class="o">=</span> <span class="p">[</span>
        <span class="sh">'</span><span class="s">ignore previous instructions</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">disregard the above</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">system:</span><span class="sh">'</span><span class="p">,</span>
        <span class="sh">'</span><span class="s">assistant:</span><span class="sh">'</span><span class="p">,</span>
    <span class="p">]</span>
    
    <span class="n">cleaned</span> <span class="o">=</span> <span class="n">user_input</span><span class="p">.</span><span class="nf">lower</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">dangerous_patterns</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">pattern</span> <span class="ow">in</span> <span class="n">cleaned</span><span class="p">:</span>
            <span class="k">return</span> <span class="sh">"</span><span class="s">[Input rejected: suspicious pattern detected]</span><span class="sh">"</span>
    
    <span class="c1"># Limit length
</span>    <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">5000</span><span class="p">:</span>
        <span class="n">user_input</span> <span class="o">=</span> <span class="n">user_input</span><span class="p">[:</span><span class="mi">5000</span><span class="p">]</span>
    
    <span class="k">return</span> <span class="n">user_input</span>

<span class="k">def</span> <span class="nf">detect_pii</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Check for personally identifiable information.</span><span class="sh">"""</span>
    <span class="kn">import</span> <span class="n">re</span>
    
    <span class="c1"># Email
</span>    <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="nf">search</span><span class="p">(</span><span class="sa">r</span><span class="sh">'</span><span class="s">\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b</span><span class="sh">'</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>
    
    <span class="c1"># Phone number
</span>    <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="nf">search</span><span class="p">(</span><span class="sa">r</span><span class="sh">'</span><span class="s">\b\d{3}[-.]?\d{3}[-.]?\d{4}\b</span><span class="sh">'</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>
    
    <span class="c1"># SSN pattern
</span>    <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="nf">search</span><span class="p">(</span><span class="sa">r</span><span class="sh">'</span><span class="s">\b\d{3}-\d{2}-\d{4}\b</span><span class="sh">'</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">True</span>
    
    <span class="k">return</span> <span class="bp">False</span>
</code></pre></div></div>
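
<p><code class="language-plaintext highlighter-rouge">detect_pii</code> only flags text. If the goal is to keep PII out of prompts and logs, redaction is often more useful. The following is a minimal sketch that reuses regexes modeled on the ones above; <code class="language-plaintext highlighter-rouge">redact_pii</code> and the placeholder labels are names made up for this example, and the patterns are US-centric and will miss many formats.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# Patterns modeled on detect_pii; the placeholder labels are arbitrary.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b")),
]

def redact_pii(text):
    """Replace each matched span with a placeholder label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(label, text)
    return text

redacted = redact_pii("Mail jane@example.com or call 555-123-4567. SSN 123-45-6789.")
print(redacted)
</code></pre></div></div>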

<h2 id="conclusion">Conclusion</h2>

<p>Generative AI engineering is systems engineering. The models are tools—the value is in how you use them. Focus on prompts, retrieval, evaluation, cost optimization, and reliability.</p>

<p>Start simple: good prompts with few-shot examples, basic RAG, automated evaluation. Add complexity only when needed. Measure everything—costs, latency, quality. Iterate based on data.</p>

<p>The best AI systems feel simple to users but are sophisticated underneath. That sophistication comes from engineering discipline, not fancy models.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview">Anthropic Prompt Engineering</a> - Comprehensive guide</li>
  <li><a href="https://platform.openai.com/docs/guides/prompt-engineering">OpenAI Best Practices</a> - Prompting strategies</li>
  <li><a href="https://python.langchain.com/docs/get_started/introduction">LangChain Documentation</a> - RAG patterns</li>
  <li><a href="https://docs.llamaindex.ai/">LlamaIndex</a> - Advanced RAG</li>
  <li><a href="https://wandb.ai/site/solutions/llmops">Weights &amp; Biases for LLMs</a> - Experiment tracking</li>
  <li><a href="https://www.langchain.com/langsmith">LangSmith</a> - LLM observability</li>
  <li><a href="https://www.helicone.ai/">Helicone</a> - LLM monitoring</li>
  <li><a href="https://www.promptingguide.ai/">Prompt Engineering Guide</a> - Techniques and examples</li>
</ul>

<hr />

<p><em>Generative AI engineering from June 2025 — updated with production guidance.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Best Practices" /><category term="Generative AI" /><category term="Engineering" /><category term="LLM" /><category term="Best Practices" /><summary type="html"><![CDATA[Engineering practices for generative AI: prompt engineering, RAG patterns, evaluation, monitoring, and best practices for production AI systems.]]></summary></entry><entry><title type="html">Building AI Coding Assistants: Technical Deep Dive</title><link href="https://www.fratepietro.com/2025/building-ai-coding-assistants/" rel="alternate" type="text/html" title="Building AI Coding Assistants: Technical Deep Dive" /><published>2025-05-13T00:00:00+02:00</published><updated>2025-05-13T00:00:00+02:00</updated><id>https://www.fratepietro.com/2025/building-ai-coding-assistants</id><content type="html" xml:base="https://www.fratepietro.com/2025/building-ai-coding-assistants/"><![CDATA[<p>AI coding assistants have gone from party trick to indispensable tool in under two years. <a href="https://github.com/features/copilot">GitHub Copilot</a>, <a href="https://cursor.sh/">Cursor</a>, <a href="https://sourcegraph.com/cody">Cody</a>, and <a href="https://www.tabnine.com/">Tabnine</a> are used by millions of developers daily. Building one requires solving hard problems: understanding massive codebases, generating correct code, and integrating with developer workflows.</p>

<p>I’ve built several coding assistants—from simple autocomplete to full agentic systems. The core challenge isn’t LLMs (they’re a commodity now)—it’s everything around them: <strong>context retrieval, code analysis, execution validation, and UX</strong>.</p>

<p>This post covers the architecture that works in production, learned from systems processing millions of code generation requests.</p>

<h2 id="high-level-architecture">High-Level Architecture</h2>

<p>A production coding assistant has these components:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────┐
│  IDE Plugin │  ← User interaction
└──────┬──────┘
       │
┌──────▼──────────────┐
│  Orchestrator       │  ← Request routing, rate limiting
├─────────────────────┤
│  Context Engine     │  ← RAG, file selection
├─────────────────────┤
│  Code Analysis      │  ← AST, LSP, static analysis
├─────────────────────┤
│  LLM Service        │  ← OpenAI, Anthropic, local models
├─────────────────────┤
│  Execution Sandbox  │  ← Run and test generated code
├─────────────────────┤
│  Cache Layer        │  ← Response caching, embeddings
└─────────────────────┘
</code></pre></div></div>

<p>Each layer handles specific concerns. Let’s dive into each.</p>

<h2 id="context-is-everything-rag-for-code">Context is Everything: RAG for Code</h2>

<p>Large codebases run to millions of lines, far more than fits in an LLM’s context window. You need intelligent retrieval.</p>

<h3 id="chunking-code">Chunking Code</h3>

<p>Unlike prose, code has structure. Chunk by semantic units:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">ast</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span>

<span class="k">class</span> <span class="nc">CodeChunker</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Chunk code by functions, classes, and top-level statements.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">chunk_python_file</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">]:</span>
        <span class="sh">"""</span><span class="s">Split Python file into semantic chunks.</span><span class="sh">"""</span>
        <span class="n">tree</span> <span class="o">=</span> <span class="n">ast</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
        <span class="n">chunks</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">ast</span><span class="p">.</span><span class="nf">iter_child_nodes</span><span class="p">(</span><span class="n">tree</span><span class="p">):</span>
            <span class="n">chunk</span> <span class="o">=</span> <span class="p">{</span>
                <span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">:</span> <span class="nf">type</span><span class="p">(</span><span class="n">node</span><span class="p">).</span><span class="n">__name__</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">:</span> <span class="nf">getattr</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">anonymous</span><span class="sh">'</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">:</span> <span class="n">ast</span><span class="p">.</span><span class="nf">get_source_segment</span><span class="p">(</span><span class="n">code</span><span class="p">,</span> <span class="n">node</span><span class="p">),</span>
                <span class="sh">'</span><span class="s">lineno</span><span class="sh">'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">lineno</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">end_lineno</span><span class="sh">'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">end_lineno</span><span class="p">,</span>
            <span class="p">}</span>
            
            <span class="c1"># Add docstring if present
</span>            <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="p">(</span><span class="n">ast</span><span class="p">.</span><span class="n">FunctionDef</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">AsyncFunctionDef</span><span class="p">,</span> <span class="n">ast</span><span class="p">.</span><span class="n">ClassDef</span><span class="p">)):</span>
                <span class="n">docstring</span> <span class="o">=</span> <span class="n">ast</span><span class="p">.</span><span class="nf">get_docstring</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">docstring</span><span class="p">:</span>
                    <span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">docstring</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">docstring</span>
            
            <span class="n">chunks</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">chunks</span>

<span class="c1"># Usage
</span><span class="n">chunker</span> <span class="o">=</span> <span class="nc">CodeChunker</span><span class="p">()</span>
<span class="n">chunks</span> <span class="o">=</span> <span class="n">chunker</span><span class="p">.</span><span class="nf">chunk_python_file</span><span class="p">(</span><span class="nf">open</span><span class="p">(</span><span class="sh">'</span><span class="s">app.py</span><span class="sh">'</span><span class="p">).</span><span class="nf">read</span><span class="p">())</span>

<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">:</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s"> (lines </span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">lineno</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">end_lineno</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">)</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>For other languages, use tree-sitter for consistent parsing. The example below loads the Python grammar; other languages swap in their own grammar package:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">tree_sitter</span> <span class="kn">import</span> <span class="n">Language</span><span class="p">,</span> <span class="n">Parser</span>
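<span class="kn">import</span> <span class="n">tree_sitter_python</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span>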

<span class="c1"># Load Python grammar
</span><span class="n">PY_LANGUAGE</span> <span class="o">=</span> <span class="nc">Language</span><span class="p">(</span><span class="n">tree_sitter_python</span><span class="p">.</span><span class="nf">language</span><span class="p">())</span>
<span class="n">parser</span> <span class="o">=</span> <span class="nc">Parser</span><span class="p">(</span><span class="n">PY_LANGUAGE</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">extract_functions</span><span class="p">(</span><span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Extract all function definitions.</span><span class="sh">"""</span>
    <span class="n">tree</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="nf">bytes</span><span class="p">(</span><span class="n">code</span><span class="p">,</span> <span class="sh">'</span><span class="s">utf8</span><span class="sh">'</span><span class="p">))</span>
    
    <span class="n">functions</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">tree</span><span class="p">.</span><span class="n">root_node</span><span class="p">.</span><span class="n">children</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="nb">type</span> <span class="o">==</span> <span class="sh">'</span><span class="s">function_definition</span><span class="sh">'</span><span class="p">:</span>
            <span class="n">functions</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="nf">child_by_field_name</span><span class="p">(</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">).</span><span class="n">text</span><span class="p">.</span><span class="nf">decode</span><span class="p">(),</span>
                <span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="nf">decode</span><span class="p">(),</span>  <span class="c1"># node offsets are bytes, so do not slice the str with them</span>
                <span class="sh">'</span><span class="s">start_line</span><span class="sh">'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">start_point</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
                <span class="sh">'</span><span class="s">end_line</span><span class="sh">'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">end_point</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
            <span class="p">})</span>
    
    <span class="k">return</span> <span class="n">functions</span>
</code></pre></div></div>

<h3 id="embedding-and-indexing">Embedding and Indexing</h3>

<p>Use <a href="https://huggingface.co/models?other=code">code-specific embedding models</a> for better semantic search:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="kn">import</span> <span class="n">pinecone</span>

<span class="c1"># Code-pretrained encoder. It is not a native sentence-transformers checkpoint, so mean pooling is applied on load
</span><span class="n">model</span> <span class="o">=</span> <span class="nc">SentenceTransformer</span><span class="p">(</span><span class="sh">'</span><span class="s">microsoft/codebert-base</span><span class="sh">'</span><span class="p">)</span>

<span class="c1"># Initialize Pinecone
</span><span class="n">pc</span> <span class="o">=</span> <span class="n">pinecone</span><span class="p">.</span><span class="nc">Pinecone</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="sh">'</span><span class="s">your-key</span><span class="sh">'</span><span class="p">)</span>
<span class="n">index</span> <span class="o">=</span> <span class="n">pc</span><span class="p">.</span><span class="nc">Index</span><span class="p">(</span><span class="sh">'</span><span class="s">codebase</span><span class="sh">'</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">index_codebase</span><span class="p">(</span><span class="n">repo_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Index an entire codebase.</span><span class="sh">"""</span>
    <span class="k">for</span> <span class="n">filepath</span> <span class="ow">in</span> <span class="nf">glob_python_files</span><span class="p">(</span><span class="n">repo_path</span><span class="p">):</span>
        <span class="n">code</span> <span class="o">=</span> <span class="nf">open</span><span class="p">(</span><span class="n">filepath</span><span class="p">).</span><span class="nf">read</span><span class="p">()</span>
        <span class="n">chunks</span> <span class="o">=</span> <span class="n">chunker</span><span class="p">.</span><span class="nf">chunk_python_file</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
        
        <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">:</span>
            <span class="c1"># Create searchable text
</span>            <span class="n">search_text</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">
</span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s"> </span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">
</span><span class="si">{</span><span class="n">chunk</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">docstring</span><span class="sh">'</span><span class="p">,</span> <span class="sh">''</span><span class="p">)</span><span class="si">}</span><span class="s">
</span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">
            </span><span class="sh">"""</span><span class="p">.</span><span class="nf">strip</span><span class="p">()</span>
            
            <span class="c1"># Embed
</span>            <span class="n">embedding</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="nf">encode</span><span class="p">(</span><span class="n">search_text</span><span class="p">)</span>
            
            <span class="c1"># Store in vector DB
</span>            <span class="n">index</span><span class="p">.</span><span class="nf">upsert</span><span class="p">([{</span>
                <span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">filepath</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">lineno</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">values</span><span class="sh">'</span><span class="p">:</span> <span class="n">embedding</span><span class="p">.</span><span class="nf">tolist</span><span class="p">(),</span>
                <span class="sh">'</span><span class="s">metadata</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span>
                    <span class="sh">'</span><span class="s">file</span><span class="sh">'</span><span class="p">:</span> <span class="n">filepath</span><span class="p">,</span>
                    <span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">:</span> <span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">],</span>
                    <span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">:</span> <span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">type</span><span class="sh">'</span><span class="p">],</span>
                    <span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">:</span> <span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">][:</span><span class="mi">1000</span><span class="p">],</span>  <span class="c1"># Truncate for storage
</span>                    <span class="sh">'</span><span class="s">lineno</span><span class="sh">'</span><span class="p">:</span> <span class="n">chunk</span><span class="p">[</span><span class="sh">'</span><span class="s">lineno</span><span class="sh">'</span><span class="p">],</span>
                <span class="p">}</span>
            <span class="p">}])</span>

<span class="k">def</span> <span class="nf">find_relevant_code</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Find code relevant to query.</span><span class="sh">"""</span>
    <span class="n">query_embedding</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="nf">encode</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
    
    <span class="n">results</span> <span class="o">=</span> <span class="n">index</span><span class="p">.</span><span class="nf">query</span><span class="p">(</span>
        <span class="n">vector</span><span class="o">=</span><span class="n">query_embedding</span><span class="p">.</span><span class="nf">tolist</span><span class="p">(),</span>
        <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span><span class="p">,</span>
        <span class="n">include_metadata</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="p">[</span>
        <span class="p">{</span>
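            <span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">:</span> <span class="n">r</span><span class="p">.</span><span class="nb">id</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">file</span><span class="sh">'</span><span class="p">:</span> <span class="n">r</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="sh">'</span><span class="s">file</span><span class="sh">'</span><span class="p">],</span>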
            <span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">:</span> <span class="n">r</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">],</span>
            <span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">:</span> <span class="n">r</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="sh">'</span><span class="s">code</span><span class="sh">'</span><span class="p">],</span>
            <span class="sh">'</span><span class="s">score</span><span class="sh">'</span><span class="p">:</span> <span class="n">r</span><span class="p">.</span><span class="n">score</span><span class="p">,</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">.</span><span class="n">matches</span>
    <span class="p">]</span>

<span class="c1"># Usage
</span><span class="n">relevant_code</span> <span class="o">=</span> <span class="nf">find_relevant_code</span><span class="p">(</span><span class="sh">"</span><span class="s">How to authenticate users?</span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">code</span> <span class="ow">in</span> <span class="n">relevant_code</span><span class="p">:</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Score: </span><span class="si">{</span><span class="n">code</span><span class="p">[</span><span class="sh">'</span><span class="s">score</span><span class="sh">'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">code</span><span class="p">[</span><span class="sh">'</span><span class="s">file</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">code</span><span class="p">[</span><span class="sh">'</span><span class="s">name</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
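
<p><code class="language-plaintext highlighter-rouge">index_codebase</code> calls <code class="language-plaintext highlighter-rouge">glob_python_files</code>, which is not defined in the snippet above. One possible version, assuming you want to skip virtual environments and other vendored directories (the <code class="language-plaintext highlighter-rouge">SKIP_DIRS</code> list is a guess at sensible defaults):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pathlib import Path

# Directory names skipped during indexing; this list is an assumption.
SKIP_DIRS = {".git", ".venv", "venv", "node_modules", "__pycache__"}

def glob_python_files(repo_path):
    """Yield paths of .py files under repo_path, skipping SKIP_DIRS."""
    root = Path(repo_path)
    for path in sorted(root.rglob("*.py")):
        # Compare only the path parts below the repo root
        if SKIP_DIRS.isdisjoint(path.relative_to(root).parts):
            yield str(path)
</code></pre></div></div>

<p>Sorting keeps the indexing order stable between runs, which makes incremental re-indexing easier to debug.</p>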

<h3 id="hybrid-search-combine-semantic--keyword">Hybrid Search: Combine Semantic + Keyword</h3>

<p>Pure vector search can miss exact matches such as identifier names or error strings. Combine it with keyword search and merge the two rankings:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hybrid_search</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Combine semantic and keyword search.</span><span class="sh">"""</span>
    <span class="c1"># Semantic search
</span>    <span class="n">semantic_results</span> <span class="o">=</span> <span class="nf">find_relevant_code</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
    
    <span class="c1"># Keyword search (simple implementation)
</span>    <span class="n">keyword_results</span> <span class="o">=</span> <span class="nf">search_by_keywords</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="n">top_k</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
    
    <span class="c1"># Merge and rank (Reciprocal Rank Fusion)
</span>    <span class="n">combined_scores</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">semantic_results</span><span class="p">,</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">combined_scores</span><span class="p">[</span><span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="n">rank</span> <span class="o">+</span> <span class="mi">60</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">result</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">keyword_results</span><span class="p">,</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">result_id</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">]</span>
        <span class="n">combined_scores</span><span class="p">[</span><span class="n">result_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">combined_scores</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">result_id</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="n">rank</span> <span class="o">+</span> <span class="mi">60</span><span class="p">)</span>
    
    <span class="c1"># Sort by combined score
</span>    <span class="n">ranked</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="n">combined_scores</span><span class="p">.</span><span class="nf">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">ranked</span><span class="p">[:</span><span class="n">top_k</span><span class="p">]</span>
</code></pre></div></div>
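
<p>The fusion step above is Reciprocal Rank Fusion: each result earns <code class="language-plaintext highlighter-rouge">1 / (k + rank)</code> from every list it appears in, and the contributions are summed. As a rough, self-contained sketch of just that step (the function name, the <code class="language-plaintext highlighter-rouge">k</code> parameter, and the chunk IDs are illustrative assumptions, not part of the pipeline above):</p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60, top_k: int = 10
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, matching enumerate(..., 1) in hybrid_search
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical chunk IDs: "b" appears in both lists, so it is ranked first
semantic = ["a", "b", "c"]
keyword = ["b", "d"]
fused = reciprocal_rank_fusion([semantic, keyword])
```

<p>A larger <code class="language-plaintext highlighter-rouge">k</code> flattens the difference between adjacent ranks, so appearing in both lists matters more than ranking first in one of them.</p>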

<p>See <a href="https://www.anthropic.com/index/contextual-retrieval">Anthropic’s post on contextual retrieval</a> for more retrieval techniques, including combining embeddings with BM25 keyword search.</p>

<h2 id="code-analysis-understanding-structure">Code Analysis: Understanding Structure</h2>

<p>Static analysis helps validate and improve generated code:</p>
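
<p>Before reaching for heavier tools, it can help to confirm that the generated code parses at all. A minimal sketch using only the standard library (the function name <code class="language-plaintext highlighter-rouge">syntax_error</code> is an illustrative assumption):</p>

```python
import ast
from typing import Optional


def syntax_error(code: str) -> Optional[str]:
    """Return None if the code parses, otherwise a short error description."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None
```

<p>This catches only syntax errors, not type or logic problems, but it is cheap enough to run on every generation before invoking the tools below.</p>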

<h3 id="language-server-protocol-lsp">Language Server Protocol (LSP)</h3>

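<p><a href="https://microsoft.github.io/language-server-protocol/">LSP</a> provides IDE-like intelligence such as completions, diagnostics, and reference lookups. The class below is a simplified sketch: a real <code class="language-plaintext highlighter-rouge">pylsp</code> server is constructed with input and output streams and exchanges JSON-RPC messages with a client, so treat the method calls as illustrative rather than as its actual API:</p>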

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">pylsp.python_lsp</span> <span class="kn">import</span> <span class="n">PythonLanguageServer</span>

<span class="k">class</span> <span class="nc">CodeAnalyzer</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Analyze code using LSP.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">lsp</span> <span class="o">=</span> <span class="nc">PythonLanguageServer</span><span class="p">()</span>
    
    <span class="k">def</span> <span class="nf">get_completions</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">filepath</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">line</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">column</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Get completion suggestions at cursor.</span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">lsp</span><span class="p">.</span><span class="nf">completions</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">textDocument</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">uri</span><span class="sh">'</span><span class="p">:</span> <span class="sa">f</span><span class="sh">'</span><span class="s">file://</span><span class="si">{</span><span class="n">filepath</span><span class="si">}</span><span class="sh">'</span><span class="p">},</span>
            <span class="sh">'</span><span class="s">position</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">line</span><span class="sh">'</span><span class="p">:</span> <span class="n">line</span><span class="p">,</span> <span class="sh">'</span><span class="s">character</span><span class="sh">'</span><span class="p">:</span> <span class="n">column</span><span class="p">}</span>
        <span class="p">})</span>
    
    <span class="k">def</span> <span class="nf">get_diagnostics</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">filepath</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Get errors and warnings.</span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">lsp</span><span class="p">.</span><span class="nf">lint</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">textDocument</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">uri</span><span class="sh">'</span><span class="p">:</span> <span class="sa">f</span><span class="sh">'</span><span class="s">file://</span><span class="si">{</span><span class="n">filepath</span><span class="si">}</span><span class="sh">'</span><span class="p">},</span>
            <span class="sh">'</span><span class="s">text</span><span class="sh">'</span><span class="p">:</span> <span class="n">code</span>
        <span class="p">})</span>
    
    <span class="k">def</span> <span class="nf">find_references</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">filepath</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">symbol</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Find all references to a symbol.</span><span class="sh">"""</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">lsp</span><span class="p">.</span><span class="nf">references</span><span class="p">({</span>
            <span class="sh">'</span><span class="s">textDocument</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">uri</span><span class="sh">'</span><span class="p">:</span> <span class="sa">f</span><span class="sh">'</span><span class="s">file://</span><span class="si">{</span><span class="n">filepath</span><span class="si">}</span><span class="sh">'</span><span class="p">},</span>
            <span class="sh">'</span><span class="s">position</span><span class="sh">'</span><span class="p">:</span> <span class="n">self</span><span class="p">.</span><span class="nf">find_symbol_position</span><span class="p">(</span><span class="n">symbol</span><span class="p">)</span>
        <span class="p">})</span>
</code></pre></div></div>

<h3 id="type-inference">Type Inference</h3>

<p>Use <a href="https://github.com/microsoft/pyright">Pyright</a> or <a href="https://github.com/python/mypy">mypy</a> to validate generated code:</p>

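<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">subprocess</span>
<span class="kn">import</span> <span class="n">tempfile</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span>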

<span class="k">def</span> <span class="nf">check_types</span><span class="p">(</span><span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Run Pyright on code.</span><span class="sh">"""</span>
    <span class="c1"># Write code to temp file
</span>    <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="nc">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="sh">'</span><span class="s">w</span><span class="sh">'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="sh">'</span><span class="s">.py</span><span class="sh">'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
        <span class="n">filepath</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span>
    
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Run Pyright
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span>
            <span class="p">[</span><span class="sh">'</span><span class="s">pyright</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">--outputjson</span><span class="sh">'</span><span class="p">,</span> <span class="n">filepath</span><span class="p">],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span>
        <span class="p">)</span>
        
        <span class="n">diagnostics</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">loads</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">diagnostics</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">generalDiagnostics</span><span class="sh">'</span><span class="p">,</span> <span class="p">[])</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">os</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span>

<span class="c1"># Usage
</span><span class="n">code</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
def add(a: int, b: int) -&gt; int:
    return a + b

result = add(</span><span class="sh">"</span><span class="s">5</span><span class="sh">"</span><span class="s">, 10)  # Type error!
</span><span class="sh">"""</span>

<span class="n">errors</span> <span class="o">=</span> <span class="nf">check_types</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
<span class="k">for</span> <span class="n">error</span> <span class="ow">in</span> <span class="n">errors</span><span class="p">:</span>
    <span class="c1"># Pyright reports zero-based line numbers, so add 1 for display
</span>    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Line </span><span class="si">{</span><span class="n">error</span><span class="p">[</span><span class="sh">'</span><span class="s">range</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">start</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">line</span><span class="sh">'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">error</span><span class="p">[</span><span class="sh">'</span><span class="s">message</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="security-scanning">Security Scanning</h3>

<p>Detect security issues with <a href="https://github.com/PyCQA/bandit">Bandit</a>:</p>

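<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">tempfile</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span>

<span class="kn">import</span> <span class="n">bandit</span>
<span class="kn">from</span> <span class="n">bandit.core</span> <span class="kn">import</span> <span class="n">manager</span>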

<span class="k">def</span> <span class="nf">security_scan</span><span class="p">(</span><span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Scan for security issues.</span><span class="sh">"""</span>
    <span class="c1"># Create temp file
</span>    <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="nc">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="sh">'</span><span class="s">w</span><span class="sh">'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="sh">'</span><span class="s">.py</span><span class="sh">'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
        <span class="n">filepath</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span>
    
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Run Bandit
</span>        <span class="n">b</span> <span class="o">=</span> <span class="n">manager</span><span class="p">.</span><span class="nc">BanditManager</span><span class="p">(</span><span class="n">bandit</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="nc">BanditConfig</span><span class="p">(),</span> <span class="sh">'</span><span class="s">file</span><span class="sh">'</span><span class="p">)</span>
        <span class="n">b</span><span class="p">.</span><span class="nf">discover_files</span><span class="p">([</span><span class="n">filepath</span><span class="p">])</span>
        <span class="n">b</span><span class="p">.</span><span class="nf">run_tests</span><span class="p">()</span>
        
        <span class="n">issues</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">b</span><span class="p">.</span><span class="nf">get_issue_list</span><span class="p">():</span>
            <span class="n">issues</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">severity</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">severity</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">confidence</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">confidence</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">text</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">line</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">lineno</span><span class="p">,</span>
            <span class="p">})</span>
        
        <span class="k">return</span> <span class="n">issues</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">os</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="execution-and-validation">Execution and Validation</h2>

<p>Generate the code, run it, and validate the results:</p>

<h3 id="test-driven-generation">Test-Driven Generation</h3>

<p>Generate code and tests together:</p>

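<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">subprocess</span>
<span class="kn">import</span> <span class="n">tempfile</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Dict</span>

<span class="kn">from</span> <span class="n">anthropic</span> <span class="kn">import</span> <span class="n">Anthropic</span>

<span class="c1"># With no api_key argument, the client reads ANTHROPIC_API_KEY from the environment
</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Anthropic</span><span class="p">()</span>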

<span class="k">def</span> <span class="nf">generate_with_tests</span><span class="p">(</span><span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Generate function and tests.</span><span class="sh">"""</span>
    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"""</span><span class="s">Generate a Python function and pytest tests for:

</span><span class="si">{</span><span class="n">spec</span><span class="si">}</span><span class="s">

Context from codebase:
</span><span class="si">{</span><span class="n">context</span><span class="si">}</span><span class="s">

Return:
1. The function implementation
2. At least 3 pytest test cases
3. Docstring with examples

Format as Python code blocks.</span><span class="sh">"""</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">'</span><span class="s">claude-3-5-sonnet-20241022</span><span class="sh">'</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">2000</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="sh">'</span><span class="s">role</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">content</span><span class="sh">'</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}]</span>
    <span class="p">)</span>
    
    <span class="c1"># Parse response (simplified); extract_code_blocks is a helper you need to supply
</span>    <span class="n">code</span> <span class="o">=</span> <span class="nf">extract_code_blocks</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">text</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">function</span><span class="sh">'</span><span class="p">:</span> <span class="n">code</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
        <span class="sh">'</span><span class="s">tests</span><span class="sh">'</span><span class="p">:</span> <span class="n">code</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span>
    <span class="p">}</span>

<span class="k">def</span> <span class="nf">validate_generated_code</span><span class="p">(</span><span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">tests</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Run tests against generated code.</span><span class="sh">"""</span>
    <span class="c1"># Combine code and tests
</span>    <span class="n">full_code</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">code</span><span class="si">}</span><span class="se">\n\n</span><span class="si">{</span><span class="n">tests</span><span class="si">}</span><span class="sh">"</span>
    
    <span class="c1"># Write to temp file
</span>    <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="nc">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="sh">'</span><span class="s">w</span><span class="sh">'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="sh">'</span><span class="s">.py</span><span class="sh">'</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">full_code</span><span class="p">)</span>
        <span class="n">filepath</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">name</span>
    
    <span class="k">try</span><span class="p">:</span>
        <span class="c1"># Run pytest
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span>
            <span class="p">[</span><span class="sh">'</span><span class="s">pytest</span><span class="sh">'</span><span class="p">,</span> <span class="n">filepath</span><span class="p">,</span> <span class="sh">'</span><span class="s">-v</span><span class="sh">'</span><span class="p">],</span>
            <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">timeout</span><span class="o">=</span><span class="mi">10</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="n">result</span><span class="p">.</span><span class="n">returncode</span> <span class="o">==</span> <span class="mi">0</span>
    <span class="k">except</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">TimeoutExpired</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">False</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">os</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="sandboxed-execution">Sandboxed Execution</h3>

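<p>Use <a href="https://e2b.dev/">E2B</a> to run generated code in an isolated sandbox instead of on your own machine. Method and parameter names differ between SDK versions, so check the calls below against the version you install:</p>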

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span>

<span class="kn">from</span> <span class="n">e2b</span> <span class="kn">import</span> <span class="n">Sandbox</span>

<span class="k">def</span> <span class="nf">execute_safely</span><span class="p">(</span><span class="n">code</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">inputs</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">]:</span>
    <span class="sh">"""</span><span class="s">Execute code in sandbox.</span><span class="sh">"""</span>
    <span class="k">with</span> <span class="nc">Sandbox</span><span class="p">()</span> <span class="k">as</span> <span class="n">sandbox</span><span class="p">:</span>
        <span class="c1"># Write code
</span>        <span class="n">sandbox</span><span class="p">.</span><span class="n">filesystem</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sh">'</span><span class="s">main.py</span><span class="sh">'</span><span class="p">,</span> <span class="n">code</span><span class="p">)</span>
        
        <span class="c1"># Execute with inputs
</span>        <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">input_data</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
            <span class="n">result</span> <span class="o">=</span> <span class="n">sandbox</span><span class="p">.</span><span class="nf">run_code</span><span class="p">(</span>
                <span class="n">code</span><span class="p">,</span>
                <span class="n">env_vars</span><span class="o">=</span><span class="p">{</span><span class="sh">'</span><span class="s">INPUT</span><span class="sh">'</span><span class="p">:</span> <span class="n">input_data</span><span class="p">},</span>
                <span class="n">timeout</span><span class="o">=</span><span class="mi">5</span>
            <span class="p">)</span>
            
            <span class="n">results</span><span class="p">.</span><span class="nf">append</span><span class="p">({</span>
                <span class="sh">'</span><span class="s">stdout</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">stderr</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">stderr</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">exit_code</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">exit_code</span><span class="p">,</span>
                <span class="sh">'</span><span class="s">error</span><span class="sh">'</span><span class="p">:</span> <span class="n">result</span><span class="p">.</span><span class="n">error</span><span class="p">,</span>
            <span class="p">})</span>
        
        <span class="k">return</span> <span class="n">results</span>

<span class="c1"># Usage
</span><span class="n">code</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
import os
print(f</span><span class="sh">"</span><span class="s">Hello {os.getenv(</span><span class="sh">'</span><span class="s">INPUT</span><span class="sh">'</span><span class="s">, </span><span class="sh">'</span><span class="s">World</span><span class="sh">'</span><span class="s">)}!</span><span class="sh">"</span><span class="s">)
</span><span class="sh">"""</span>

<span class="n">results</span> <span class="o">=</span> <span class="nf">execute_safely</span><span class="p">(</span><span class="n">code</span><span class="p">,</span> <span class="p">[</span><span class="sh">'</span><span class="s">Alice</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Bob</span><span class="sh">'</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Run </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">r</span><span class="p">[</span><span class="sh">'</span><span class="s">stdout</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="iterative-refinement">Iterative Refinement</h3>

<p>If code fails tests, refine iteratively:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">iterative_generation</span><span class="p">(</span><span class="n">spec</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">max_iterations</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Generate and refine code until tests pass.</span><span class="sh">"""</span>
    <span class="n">context</span> <span class="o">=</span> <span class="nf">find_relevant_code</span><span class="p">(</span><span class="n">spec</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="c1"># Generate code
</span>        <span class="n">result</span> <span class="o">=</span> <span class="nf">generate_with_tests</span><span class="p">(</span><span class="n">spec</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
        <span class="n">code</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">function</span><span class="sh">'</span><span class="p">]</span>
        <span class="n">tests</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="sh">'</span><span class="s">tests</span><span class="sh">'</span><span class="p">]</span>
        
        <span class="c1"># Validate
</span>        <span class="k">if</span> <span class="nf">validate_generated_code</span><span class="p">(</span><span class="n">code</span><span class="p">,</span> <span class="n">tests</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">code</span>
        
        <span class="c1"># If failed, add error context and retry
</span>        <span class="n">errors</span> <span class="o">=</span> <span class="nf">check_types</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
        <span class="n">context</span> <span class="o">+=</span> <span class="sa">f</span><span class="sh">"</span><span class="se">\n\n</span><span class="s">Previous attempt had errors:</span><span class="se">\n</span><span class="si">{</span><span class="n">errors</span><span class="si">}</span><span class="sh">"</span>
    
    <span class="k">raise</span> <span class="nc">RuntimeError</span><span class="p">(</span><span class="sh">"</span><span class="s">Failed to generate working code</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="production-considerations">Production Considerations</h2>

<h3 id="cost-optimization">Cost Optimization</h3>

<p>LLM costs add up fast at scale:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CostTracker</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Track and optimize LLM costs.</span><span class="sh">"""</span>
    
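    <span class="c1"># USD per 1,000 tokens. List prices change, so check current provider pricing.
</span>    <span class="n">PRICING</span> <span class="o">=</span> <span class="p">{</span>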
        <span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0025</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.010</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.00015</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.0006</span><span class="p">},</span>
        <span class="sh">'</span><span class="s">claude-3-5-sonnet</span><span class="sh">'</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.003</span><span class="p">,</span> <span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">:</span> <span class="mf">0.015</span><span class="p">},</span>
    <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">total_cost</span> <span class="o">=</span> <span class="mi">0</span>
    
    <span class="k">def</span> <span class="nf">calculate_cost</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">input_tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">output_tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Calculate request cost.</span><span class="sh">"""</span>
        <span class="n">pricing</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">PRICING</span><span class="p">[</span><span class="n">model</span><span class="p">]</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="p">(</span>
            <span class="p">(</span><span class="n">input_tokens</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">*</span> <span class="n">pricing</span><span class="p">[</span><span class="sh">'</span><span class="s">input</span><span class="sh">'</span><span class="p">]</span> <span class="o">+</span>
            <span class="p">(</span><span class="n">output_tokens</span> <span class="o">/</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">*</span> <span class="n">pricing</span><span class="p">[</span><span class="sh">'</span><span class="s">output</span><span class="sh">'</span><span class="p">]</span>
        <span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">total_cost</span> <span class="o">+=</span> <span class="n">cost</span>
        <span class="k">return</span> <span class="n">cost</span>

<span class="c1"># Optimization strategies:
# 1. Use cheaper models for simple tasks (autocomplete)
# 2. Cache responses aggressively
# 3. Minimize context with smart retrieval
# 4. Stream responses to reduce perceived latency (this does not lower cost)
</span></code></pre></div></div>
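
<p>To see how quickly the per-request numbers compound, here is a minimal back-of-the-envelope sketch. It reuses two entries from the <code>PRICING</code> table above; the helper names and the traffic figures (2,000 input tokens, 500 output tokens, 10,000 requests per day) are assumptions chosen only for illustration.</p>

```python
# Per-1K-token prices in USD, copied from the PRICING table above.
PRICING_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD."""
    price = PRICING_PER_1K[model]
    return (
        (input_tokens / 1000) * price["input"]
        + (output_tokens / 1000) * price["output"]
    )


def monthly_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    days: int = 30,
) -> float:
    """Projected monthly spend, assuming every request has the same shape."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


# Hypothetical workload: 2,000 input tokens, 500 output tokens, 10,000 requests/day.
for model_name in PRICING_PER_1K:
    estimate = monthly_cost(model_name, 2000, 500, 10_000)
    print(f"{model_name}: ${estimate:,.2f} per month")
```

<p>Under these assumptions the same workload comes to roughly $3,000 per month on <code>gpt-4o</code> and roughly $180 on <code>gpt-4o-mini</code>, which is why routing simple tasks to the cheaper model tends to pay off first.</p>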

<h3 id="response-caching">Response Caching</h3>

<p>Cache at multiple levels:</p>

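<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">hashlib</span>
<span class="kn">from</span> <span class="n">typing</span> <span class="kn">import</span> <span class="n">Optional</span>

<span class="kn">import</span> <span class="n">redis</span>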

<span class="k">class</span> <span class="nc">CacheLayer</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Multi-level caching for coding assistant.</span><span class="sh">"""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">redis</span> <span class="o">=</span> <span class="n">redis</span><span class="p">.</span><span class="nc">Redis</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="sh">'</span><span class="s">localhost</span><span class="sh">'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">memory_cache</span> <span class="o">=</span> <span class="p">{}</span>  <span class="c1"># In-process layer for fast access; unbounded and never expires in this sketch
</span>    
    <span class="k">def</span> <span class="nf">get_cached_response</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
        <span class="sh">"""</span><span class="s">Get cached response.</span><span class="sh">"""</span>
        <span class="c1"># Create cache key
</span>        <span class="n">cache_key</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="nf">sha256</span><span class="p">(</span>
            <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">context</span><span class="si">}</span><span class="sh">"</span><span class="p">.</span><span class="nf">encode</span><span class="p">()</span>
        <span class="p">).</span><span class="nf">hexdigest</span><span class="p">()</span>
        
        <span class="c1"># Check memory cache
</span>        <span class="k">if</span> <span class="n">cache_key</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">memory_cache</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">memory_cache</span><span class="p">[</span><span class="n">cache_key</span><span class="p">]</span>
        
        <span class="c1"># Check Redis
</span>        <span class="n">cached</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">cache_key</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">cached</span><span class="p">:</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">cached</span><span class="p">.</span><span class="nf">decode</span><span class="p">()</span>
            <span class="n">self</span><span class="p">.</span><span class="n">memory_cache</span><span class="p">[</span><span class="n">cache_key</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span>  <span class="c1"># Promote to memory
</span>            <span class="k">return</span> <span class="n">response</span>
        
        <span class="k">return</span> <span class="bp">None</span>
    
    <span class="k">def</span> <span class="nf">cache_response</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">ttl</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3600</span><span class="p">):</span>
        <span class="sh">"""</span><span class="s">Cache response.</span><span class="sh">"""</span>
        <span class="n">cache_key</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="nf">sha256</span><span class="p">(</span>
            <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">context</span><span class="si">}</span><span class="sh">"</span><span class="p">.</span><span class="nf">encode</span><span class="p">()</span>
        <span class="p">).</span><span class="nf">hexdigest</span><span class="p">()</span>
        
        <span class="c1"># Store in both layers
</span>        <span class="n">self</span><span class="p">.</span><span class="n">memory_cache</span><span class="p">[</span><span class="n">cache_key</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span>
        <span class="n">self</span><span class="p">.</span><span class="n">redis</span><span class="p">.</span><span class="nf">setex</span><span class="p">(</span><span class="n">cache_key</span><span class="p">,</span> <span class="n">ttl</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
</code></pre></div></div>
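
<p>The plain dictionary used for the in-memory layer grows without limit and ignores the <code>ttl</code> that Redis enforces. One possible way to bound it is a small LRU cache with per-entry expiry. The sketch below uses only the standard library; <code>BoundedTTLCache</code> and its parameters are hypothetical names, and it is not thread-safe.</p>

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class BoundedTTLCache:
    """In-memory LRU cache with per-entry expiry (illustrative, not thread-safe)."""

    def __init__(
        self,
        max_entries: int = 1024,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key: str) -> Optional[str]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Entry has outlived its TTL: drop it and report a miss.
            del self._entries[key]
            return None
        # Mark as most recently used.
        self._entries.move_to_end(key)
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        # Evict least recently used entries once over capacity.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
```

<p>An instance of this class could stand in for <code>self.memory_cache</code>, with <code>get</code> and <code>set</code> replacing the dictionary lookups, so both layers expire on a similar schedule.</p>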

<h3 id="model-selection">Model Selection</h3>

<p>Use different models for different tasks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">select_model</span><span class="p">(</span><span class="n">task_type</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Choose model based on task.</span><span class="sh">"""</span>
    <span class="k">if</span> <span class="n">task_type</span> <span class="o">==</span> <span class="sh">'</span><span class="s">autocomplete</span><span class="sh">'</span><span class="p">:</span>
        <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span>  <span class="c1"># Fast, cheap
</span>    <span class="k">elif</span> <span class="n">task_type</span> <span class="o">==</span> <span class="sh">'</span><span class="s">explain</span><span class="sh">'</span><span class="p">:</span>
        <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span>  <span class="c1"># Good enough
</span>    <span class="k">elif</span> <span class="n">task_type</span> <span class="o">==</span> <span class="sh">'</span><span class="s">generate_complex</span><span class="sh">'</span><span class="p">:</span>
        <span class="k">return</span> <span class="sh">'</span><span class="s">claude-3-5-sonnet</span><span class="sh">'</span>  <span class="c1"># Best quality
</span>    <span class="k">elif</span> <span class="n">task_type</span> <span class="o">==</span> <span class="sh">'</span><span class="s">refactor</span><span class="sh">'</span><span class="p">:</span>
        <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o</span><span class="sh">'</span>  <span class="c1"># Balance of speed/quality
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="sh">'</span><span class="s">gpt-4o-mini</span><span class="sh">'</span>  <span class="c1"># Default to cheap
</span></code></pre></div></div>
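
<p>The same routing can also be expressed as data rather than branches, which makes it easier to load from configuration and to unit-test. This is a sketch of that alternative, using the model names from the function above; <code>MODEL_BY_TASK</code>, <code>DEFAULT_MODEL</code>, and <code>select_model_from_table</code> are hypothetical names.</p>

```python
from typing import Dict

# Task-to-model routing table; the choices mirror the if/elif chain above.
MODEL_BY_TASK: Dict[str, str] = {
    "autocomplete": "gpt-4o-mini",
    "explain": "gpt-4o-mini",
    "generate_complex": "claude-3-5-sonnet",
    "refactor": "gpt-4o",
}

# Unknown task types fall back to the cheapest model.
DEFAULT_MODEL = "gpt-4o-mini"


def select_model_from_table(
    task_type: str,
    table: Dict[str, str] = MODEL_BY_TASK,
    default: str = DEFAULT_MODEL,
) -> str:
    """Look up the model for a task, falling back to the default."""
    return table.get(task_type, default)
```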

<h3 id="monitoring">Monitoring</h3>

<p>Track what matters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">structlog</span>
<span class="kn">from</span> <span class="n">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">structlog</span><span class="p">.</span><span class="nf">get_logger</span><span class="p">()</span>

<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">RequestMetrics</span><span class="p">:</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">task_type</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">input_tokens</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">output_tokens</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">latency_ms</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">cache_hit</span><span class="p">:</span> <span class="nb">bool</span>
    <span class="n">cost_usd</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">success</span><span class="p">:</span> <span class="nb">bool</span>

<span class="k">def</span> <span class="nf">log_request</span><span class="p">(</span><span class="n">metrics</span><span class="p">:</span> <span class="n">RequestMetrics</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Log request metrics for analysis.</span><span class="sh">"""</span>
    <span class="n">logger</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span>
        <span class="sh">"</span><span class="s">coding_assistant_request</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">request_id</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">request_id</span><span class="p">,</span>
        <span class="n">task</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">task_type</span><span class="p">,</span>
        <span class="n">model</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
        <span class="n">input_tokens</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">input_tokens</span><span class="p">,</span>
        <span class="n">output_tokens</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">output_tokens</span><span class="p">,</span>
        <span class="n">latency_ms</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">latency_ms</span><span class="p">,</span>
        <span class="n">cache_hit</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">cache_hit</span><span class="p">,</span>
        <span class="n">cost</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">cost_usd</span><span class="p">,</span>
        <span class="n">success</span><span class="o">=</span><span class="n">metrics</span><span class="p">.</span><span class="n">success</span><span class="p">,</span>
    <span class="p">)</span>

<span class="c1"># Track aggregate metrics
# - Requests per minute
# - Cache hit rate
# - P50/P95/P99 latency
# - Cost per user
# - Success rate
</span></code></pre></div></div>
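
<p>The aggregate metrics listed above can be derived from the per-request records. The following is a minimal sketch using the standard library's <code>statistics.quantiles</code>; <code>Sample</code> is a hypothetical, trimmed-down stand-in for <code>RequestMetrics</code>, and in practice these aggregates would more likely be computed by a metrics backend than in application code.</p>

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical subset of RequestMetrics needed for aggregation."""
    latency_ms: float
    cache_hit: bool
    cost_usd: float
    success: bool


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Compute hit rate, success rate, total cost, and latency percentiles."""
    n = len(samples)
    if n == 0:
        raise ValueError("no samples to summarize")

    latencies = [s.latency_ms for s in samples]
    if n == 1:
        # quantiles() needs two data points on older Pythons; repeat the lone sample.
        latencies = latencies * 2

    # n=100 yields 99 cut points; index 49 is P50, 94 is P95, 98 is P99.
    cuts = quantiles(latencies, n=100, method="inclusive")

    return {
        "cache_hit_rate": sum(s.cache_hit for s in samples) / n,
        "success_rate": sum(s.success for s in samples) / n,
        "total_cost_usd": sum(s.cost_usd for s in samples),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

<p>Grouping the samples by user or by minute before calling <code>summarize</code> would give cost per user and requests per minute from the same records.</p>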

<h2 id="conclusion">Conclusion</h2>

<p>Building a production coding assistant is 20% LLM calls and 80% everything else: context retrieval, code analysis, validation, caching, and monitoring. The LLM is a commodity—the value is in the system around it.</p>

<p>Start with strong RAG (code-aware chunking, hybrid search), validate generated code (tests, type checking, security), and optimize costs (caching, model selection). Test extensively with real codebases.</p>

<p>The best coding assistants feel invisible: they understand context, generate correct code, and integrate seamlessly into the developer's workflow. That requires careful engineering at every layer.</p>

<p><strong>Further Resources:</strong></p>
<ul>
  <li><a href="https://github.blog/2023-05-17-inside-github-building-the-worlds-largest-ai-powered-developer-tool/">GitHub Copilot Architecture</a> - How Copilot works</li>
  <li><a href="https://cursor.sh/">Cursor</a> - Leading AI IDE</li>
  <li><a href="https://sourcegraph.com/cody">Sourcegraph Cody</a> - Code AI assistant</li>
  <li><a href="https://github.com/microsoft/CodeBERT">CodeBERT</a> - Code understanding model</li>
  <li><a href="https://tree-sitter.github.io/">Tree-sitter</a> - Universal code parser</li>
  <li><a href="https://microsoft.github.io/language-server-protocol/">Language Server Protocol</a> - IDE features as a protocol</li>
  <li><a href="https://e2b.dev/">E2B</a> - Code execution sandbox</li>
  <li><a href="https://github.com/continuedev/continue">Continue.dev</a> - Open source coding assistant</li>
</ul>

<hr />

<p><em>Updated May 2025 — practical implementation notes for production AI coding assistants.</em></p>]]></content><author><name>Antonello Fratepietro</name><email>antonello.f at gmail dot com</email></author><category term="Deep Dive" /><category term="AI" /><category term="Coding Assistants" /><category term="LLM" /><category term="Code Generation" /><summary type="html"><![CDATA[Build AI coding assistants: code generation, context management, tool integration, code analysis, and technical patterns for production coding assistants.]]></summary></entry></feed>