MemCompiler: Compile, Don't Inject — State-Conditioned Memory for Embodied Agents

Xin Ding¹*, Xinrui Wang²*, Yifan Yang³, Hao Wu⁴, Shiqi Jiang³, Qianxi Zhang³, Liang Mi⁴, Hanxin Zhu¹, Kun Li⁵, Yunxin Liu⁵, Zhibo Chen¹†, Ting Cao⁵†

¹University of Science and Technology of China ²Huazhong University of Science and Technology
³Microsoft Research ⁴Nanjing University ⁵Institute for AI Industry Research (AIR), Tsinghua University

* Equal contribution † Corresponding author


Abstract

Existing memory systems for embodied agents typically inject retrieved memory as static context at episode start, a paradigm we term Ahead-of-time Monolithic Memory Injection (AMMI). This static design quickly becomes misaligned with the agent's evolving state and may degrade lightweight executors below the no-memory baseline. To address this, we propose MemCompiler, which reframes memory utilization as State-Conditioned Memory Compilation (SCMC). A learned Memory Compiler reads a structured Brief State that captures the agent's current execution state, then dynamically selects and compiles only the relevant memory into executable guidance. This guidance is delivered through two channels: a text channel and a latent Soft-Mem channel that preserves perceptual information not expressible in text. Across ALFWorld, EmbodiedBench, and ScienceWorld, MemCompiler consistently improves over the no-memory baseline across open-source backbones (up to +129%), matches or approaches frontier closed-source systems, and reduces per-step latency by ~60%, demonstrating that state-aware memory compilation improves both effectiveness and efficiency.

Method

Figure 1: Two paradigms for memory utilization in embodied agents. (a) AMMI injects the full task memory M directly at episode start, so the Executor must attend over all of M and valuable experience is buried under irrelevant entries. (b) SCMC (Ours) retains M as a source library and compiles only state-relevant content m^{*,t} at each step, so the Executor receives only what it needs for the current state. The colored bars denote Qwen, GPT-5.2, Gemini-3-Flash, and our method, respectively, with performance measured across benchmarks.
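The contrast in Figure 1 can be made concrete with a small sketch. Everything here is illustrative: `MemoryEntry`, `ammi_context`, and `scmc_context` are hypothetical names, and the tag-overlap score is only a toy stand-in for the learned Memory Compiler, whose actual selection mechanism is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One entry of the task memory M (hypothetical structure)."""
    tags: frozenset
    text: str


def ammi_context(memory):
    """AMMI: build one static context from all of M at episode start."""
    return "\n".join(entry.text for entry in memory)


def scmc_context(memory, state_tags, top_k=1):
    """SCMC: at each step, keep only entries that match the current state.

    Tag overlap is an assumption used for illustration; the paper's
    Memory Compiler is a learned model.
    """
    scored = [(len(entry.tags & state_tags), entry) for entry in memory]
    relevant = [(score, entry) for score, entry in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return "\n".join(entry.text for _, entry in relevant[:top_k])


memory = [
    MemoryEntry(frozenset({"fridge", "cool"}),
                "Open the fridge before placing an object inside."),
    MemoryEntry(frozenset({"microwave", "heat"}),
                "Close the microwave door before heating."),
    MemoryEntry(frozenset({"sink", "clean"}),
                "Turn on the faucet to clean an object in the sink."),
]

# AMMI: the Executor sees every entry for the whole episode.
static_ctx = ammi_context(memory)

# SCMC: the Executor sees only what matches the current step's state.
step_ctx = scmc_context(memory, state_tags={"holding", "apple", "fridge"})
```

In this toy setting the static context carries all three entries at every step, while the per-step context carries one entry, or nothing when no entry matches the state.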

Figure 2: Overview of MemCompiler at step t. The Memory Compiler reads the runtime state s_t = (o_t, b_t) and the task memory M (retrieved once at episode start), then delivers the compiled result m^{*,t} to the Executor through two parallel channels: a text channel (m^{*,t}_text) and a latent soft channel (m^{*,t}_soft), which are fused at the Executor's embedding level. The Brief State b_t is dynamically maintained via structured operations and fed back as part of the next step's state. At each step the Memory Compiler chooses one of four output types (Section 3.2): EXPERIENCE (m^{*,t} only), BRIEF (Δb_t only), HYBRID (both jointly), or NOACTION (neither, when no entry in M is currently applicable).
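The four output types in Figure 2 can be read as a small dispatch over what the Memory Compiler emits. The sketch below is a minimal illustration under stated assumptions: `CompilerOutput`, `output_type`, and `step` are hypothetical names, the Brief State is modeled as a plain dictionary, and Δb_t is assumed to be a key-wise overwrite. The Soft-Mem channel and the embedding-level fusion are omitted because their details are not given here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Tuple


class OutputType(Enum):
    EXPERIENCE = "EXPERIENCE"  # compiled memory only
    BRIEF = "BRIEF"            # Brief State update only
    HYBRID = "HYBRID"          # both jointly
    NOACTION = "NOACTION"      # neither


@dataclass
class CompilerOutput:
    """Hypothetical container for one step of Memory Compiler output."""
    guidance: Optional[str] = None                             # stands in for m^{*,t}
    brief_delta: Dict[str, str] = field(default_factory=dict)  # stands in for Δb_t


def output_type(out: CompilerOutput) -> OutputType:
    """Classify an output by which of the two parts is present."""
    has_guidance = out.guidance is not None
    has_delta = bool(out.brief_delta)
    if has_guidance and has_delta:
        return OutputType.HYBRID
    if has_guidance:
        return OutputType.EXPERIENCE
    if has_delta:
        return OutputType.BRIEF
    return OutputType.NOACTION


def step(brief: Dict[str, str],
         out: CompilerOutput) -> Tuple[Dict[str, str], Optional[str]]:
    """Apply Δb_t (assumed: key-wise overwrite) and pass guidance to the Executor."""
    new_brief = {**brief, **out.brief_delta}
    return new_brief, out.guidance


brief = {"goal": "cool apple", "holding": "nothing"}
out = CompilerOutput(guidance="Open the fridge first.",
                     brief_delta={"holding": "apple"})
kind = output_type(out)
brief, guidance = step(brief, out)
```

The updated `brief` is what would be fed back as b_{t+1} in the next step's state, and a `None` guidance corresponds to the BRIEF and NOACTION cases, where the Executor receives no compiled memory.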

Experiments

EmbodiedBench Results

ALFWorld & ScienceWorld Results

Attention Analysis

Mean pre-softmax attention logit of the Executor over memory tokens across ALFWorld episodes (steps 0–29); higher values correspond to more attention weight on memory. AMMI (blue) exhibits monotonic decay, reflecting increasing misalignment between static memory and the agent's evolving state. In contrast, SCMC (orange) maintains stable, high attention throughout, indicating consistent alignment between memory and the Executor's current needs.
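As a rough illustration of the metric, the sketch below computes a mean pre-softmax logit over memory-token positions for a single query and a single attention head. It assumes standard scaled dot-product logits q·k/√d; how the reported numbers are aggregated over heads, layers, and episodes is not specified here, and `attention_logits`, `mean_memory_logit`, and `memory_mask` are hypothetical names.

```python
import math


def attention_logits(query, keys):
    """Scaled dot-product logits q.k / sqrt(d), taken before the softmax."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def mean_memory_logit(query, keys, memory_mask):
    """Average the pre-softmax logits over key positions flagged as memory tokens."""
    logits = attention_logits(query, keys)
    selected = [logit for logit, is_mem in zip(logits, memory_mask) if is_mem]
    if not selected:
        raise ValueError("memory_mask selects no tokens")
    return sum(selected) / len(selected)


query = [1.0, 0.0, 1.0, 0.0]
keys = [
    [2.0, 0.0, 2.0, 0.0],  # memory token
    [0.0, 1.0, 0.0, 1.0],  # non-memory token
    [1.0, 0.0, 1.0, 0.0],  # memory token
]
memory_mask = [True, False, True]

value = mean_memory_logit(query, keys, memory_mask)
```

Tracking this quantity per step and averaging over episodes would give one curve per method, which is the kind of comparison the plot describes.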

Citation

@misc{ding2026memcompiler,
  title={MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents},
  author={Xin Ding and Xinrui Wang and Yifan Yang and Hao Wu and Shiqi Jiang and Qianxi Zhang and Liang Mi and Hanxin Zhu and Kun Li and Yunxin Liu and Zhibo Chen and Ting Cao},
  year={2026},
  eprint={2605.07594},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.07594},
}