MARS: GPU-Resident Memory for Real-Time Embodied AI
Every real-time AI system – autonomous vehicles, humanoid robots, AR/VR headsets – needs to answer the same question thousands of times per second: what do I already know that’s relevant to what I’m seeing right now?
That’s a memory retrieval problem. And every production stack today solves it with a bespoke C++ circular buffer, hand-tuned per project, per modality, per deadline. Waymo has one. Figure has one. Apple Vision Pro has one. Tesla Optimus has one.
I built MARS to investigate whether a general-purpose GPU-resident substrate can replace them.
The gap nobody talks about
FAISS GPU and cuVS CAGRA are excellent at what they do: find the K most similar vectors in a static corpus. But real-time perception isn’t a static corpus. Sensor data streams in at 30-1000 Hz. Older detections become irrelevant. New ones must be queryable immediately.
I ran two experiments on an A100 SXM4 to quantify this:
Experiment 1 – Temporal relevance. 9,000 memories across 200 tracked objects in an AV perception scenario. FAISS Flat returned results where 49.3% of top-10 entries were stale – older than the current tracking window. Its Temporal Precision@10 was 0.218. MARS, with temporal decay fused directly into the retrieval kernel, scored 0.910.
Experiment 2 – Streaming insertion. 60 Hz frame rate, 10 new detections per frame. FAISS, rebuilding its index once per second, missed 93.2% of recent detections because pending inserts stay invisible until the next rebuild. MARS sees 100% of them immediately.
These aren’t edge cases. They’re the default behavior of every GPU search library when you put it in a sensor-rate loop.
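The staleness measurement in Experiment 1 reduces to a simple check: how many of the top-K results fall inside the current tracking window? A minimal sketch, assuming Temporal Precision@K is the fraction of fresh results (the function name and exact definition are illustrative; the paper has the precise metric):

```python
import numpy as np

def temporal_precision_at_k(retrieved_ages, window_s, k=10):
    """Fraction of the top-k results whose age falls inside the
    current tracking window. Assumed definition for illustration."""
    top = np.asarray(retrieved_ages[:k])
    return float(np.mean(top <= window_s))

# Toy example: ages (seconds) of 10 retrieved memories, 2.0 s window
ages = [0.1, 0.5, 3.2, 0.9, 7.7, 1.4, 0.2, 5.0, 2.5, 0.8]
print(temporal_precision_at_k(ages, window_s=2.0))  # 0.6 – four stale hits
```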
Four CUDA kernels, sub-millisecond
MARS stores text, audio, image, and sensor embeddings in a shared 768-D space as nodes in a Neural Shortcut Network (NSN) with cross-modal bridges. The retrieval pipeline runs entirely on GPU-resident data:
- cuBLAS SGEMV – cosine similarity via matrix-vector multiply
- Temporal rerank – score * exp(-lambda * age), fused per-element
- CUB radix sort – parallel top-K selection
- Warp-cooperative BFS – cross-modal graph expansion via the NSN
The key insight: temporal decay and importance scoring happen inside the retrieval kernels, not as a post-processing step. This eliminates the extra round-trip that makes “FAISS + post-hoc rerank” impractical at 60 Hz.
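The fused pipeline can be sketched on the CPU with NumPy (the real implementation is cuBLAS SGEMV plus custom CUDA kernels; the lambda value and function names here are illustrative assumptions):

```python
import numpy as np

def retrieve(query, memories, ages, k=10, lam=0.1):
    """CPU sketch of the MARS retrieval pipeline. On the GPU the
    three steps are cuBLAS SGEMV, a fused decay kernel, and CUB sort."""
    # 1. Cosine similarity via matrix-vector product on normalized rows
    q = query / np.linalg.norm(query)
    M = memories / np.linalg.norm(memories, axis=1, keepdims=True)
    sim = M @ q
    # 2. Temporal rerank fused per-element: score * exp(-lambda * age)
    score = sim * np.exp(-lam * ages)
    # 3. Top-K selection (argsort stands in for CUB radix sort)
    return np.argsort(-score)[:k]

rng = np.random.default_rng(0)
mem = rng.normal(size=(1000, 768)).astype(np.float32)
ages = rng.uniform(0, 10, size=1000).astype(np.float32)
idx = retrieve(mem[42], mem, ages)
print(idx[0])  # 42 – the query's own (decayed) entry still dominates
```

The point of the sketch is step 2: the decay multiplies scores before top-K selection, so no second kernel launch or host round-trip is needed.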
Measured results
Same-hardware comparison on A100 SXM4 80GB (D=768, K=10, single-query p99):
| System | N=2.4K | N=10K | N=50K | Temporal | Streaming |
|---|---|---|---|---|---|
| FAISS GPU Flat | 0.10 ms | 0.12 ms | 0.35 ms | No | No |
| FAISS GPU IVF | 0.13 ms | 0.15 ms | 0.28 ms | No | No |
| cuVS CAGRA | 2.60 ms | 2.29 ms | 2.47 ms | No | No |
| MARS | 0.26 ms | 0.34 ms | 0.44 ms | Yes | Yes |
MARS GPU kernel time (0.10 ms at N=2.4K) matches FAISS Flat. The wall-clock gap is kernel launch overhead from the additional temporal rerank and importance stages.
All four demonstrator p99 deadlines met:
| Workload | Rate | Budget | Measured p99 |
|---|---|---|---|
| AV perception | 60 Hz | 1 ms | 0.87 ms |
| Humanoid robot | 1 kHz | 1 ms | 0.76 ms |
| AR/VR spatial | 90 Hz | 5 ms | 1.56 ms |
| Voice agent | 30 Hz | 20 ms | 0.88 ms |
The Neural Shortcut Network
What makes MARS more than “FAISS with a timestamp column” is the graph structure. Memories are nodes in a CSR-format graph built in five phases:
- Ring lattice (k=6 local neighbors)
- Hierarchical skip connections (powers of 2)
- Hub supernodes at sqrt(N) intervals
- Small-world rewiring (Watts-Strogatz, p=0.15)
- Cross-modal bridges – every node gets one edge to each other modality
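The five phases can be sketched as an edge-list builder followed by CSR packing. This is a host-side illustration under stated assumptions (function names, the rewiring details, and bridge selection are mine, not the MARS API):

```python
import numpy as np

def build_nsn_edges(n, modality, k=6, p=0.15, seed=0):
    """Sketch of the five-phase NSN construction."""
    rng = np.random.default_rng(seed)
    edges = set()
    # Phase 1: ring lattice with k local neighbors
    for i in range(n):
        for d in range(1, k // 2 + 1):
            edges.add((i, (i + d) % n))
    # Phase 2: hierarchical skip connections at powers of 2
    for i in range(n):
        step = 2
        while step < n:
            edges.add((i, (i + step) % n))
            step *= 2
    # Phase 3: hub supernodes every sqrt(n) positions
    hubs = list(range(0, n, max(1, int(np.sqrt(n)))))
    for i in range(n):
        edges.add((i, hubs[rng.integers(len(hubs))]))
    # Phase 4: Watts-Strogatz rewiring with probability p
    rewired = set()
    for (u, v) in edges:
        if rng.random() < p:
            v = int(rng.integers(n))
        if u != v:
            rewired.add((u, v))
    edges = rewired
    # Phase 5: one bridge per node to each other modality
    for i in range(n):
        for m in set(modality) - {modality[i]}:
            peers = [j for j in range(n) if modality[j] == m]
            edges.add((i, peers[rng.integers(len(peers))]))
    return sorted(edges)

def to_csr(n, edges):
    """Pack a source-sorted edge list into CSR offsets + columns."""
    counts = np.bincount([u for u, _ in edges], minlength=n)
    row = np.concatenate(([0], np.cumsum(counts)))
    col = np.array([v for _, v in edges])
    return row, col

edges = build_nsn_edges(64, modality=[i % 3 for i in range(64)])
row, col = to_csr(64, edges)
print(len(edges), row[-1])
```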
Phase 5 is the critical one. It means a query that starts with an audio embedding can reach relevant visual and text memories through graph traversal, without maintaining separate per-modality indices. The warp-cooperative BFS kernel explores these bridges in the same sub-millisecond budget.
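The traversal itself is a level-synchronous frontier BFS over the CSR arrays. A host-side sketch of the logic (on the GPU, the warp-cooperative kernel assigns each frontier node's adjacency list to a warp and expands it in parallel; this sequential version shows only the algorithm):

```python
import numpy as np

def bfs_frontier(row, col, seeds, max_hops=2):
    """Level-synchronous BFS over a CSR graph, bounded by hop count."""
    n = len(row) - 1
    visited = np.zeros(n, dtype=bool)
    frontier = list(seeds)
    visited[frontier] = True
    for _ in range(max_hops):
        nxt = []
        for u in frontier:
            # Expand u's adjacency list (a warp does this in parallel)
            for v in col[row[u]:row[u + 1]]:
                if not visited[v]:
                    visited[v] = True
                    nxt.append(v)
        frontier = nxt
    return np.flatnonzero(visited)

# Toy CSR graph: 0->1, 0->2, 1->3, 2->3, 3->4
row = np.array([0, 2, 3, 4, 5, 5])
col = np.array([1, 2, 3, 3, 4])
print(bfs_frontier(row, col, seeds=[0], max_hops=2).tolist())  # [0, 1, 2, 3]
```

Bounding the hop count is what keeps the cross-modal expansion inside the sub-millisecond budget.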
What it’s not
MARS is not a vector database. It’s not competing with Pinecone or pgvector. Same conceptual layer – indexing, similarity, retrieval – but different latency envelope, different durability model, different deployment target. Think cuBLAS vs LAPACK: same operations, different hardware.
The working set is seconds to hours of recent sensor data, bounded to fit in GPU VRAM. For billion-record archives, use a vector DB. For the hot working memory of a system that makes decisions 60 times per second, use MARS.
Scaling to 1M memories
The FP16 + CUDA Graph extension scales to 1M memories at 6.5 ms p99 on a $449 RTX 5060 Ti. At that point you’re covering roughly 15 minutes of multi-sensor data at AV rates – more than enough working memory for any real-time loop.
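The 1M figure is consistent with a quick footprint check. Counting embeddings only (the NSN graph, timestamps, and metadata add overhead on top):

```python
# Back-of-envelope VRAM for 1M 768-D FP16 embeddings
n, d, bytes_fp16 = 1_000_000, 768, 2
gib = n * d * bytes_fp16 / 2**30
print(f"{gib:.2f} GiB")  # ~1.43 GiB, well inside a 16 GB card
```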
Try it
git clone https://github.com/antonellof/MARS.git
cd MARS
make tests # host-only unit tests, no GPU needed
make && make check # full build + hardware validation on any CUDA GPU
The code is MIT licensed. The paper has the full methodology, kernel pseudocode, and ablation studies.
I’m particularly interested in feedback from anyone building real-time perception pipelines. The hypothesis – that a general-purpose GPU-resident substrate can replace bespoke circular buffers – needs validation from people who’ve actually shipped those buffers.
Links: