Our brain-inspired planning and manipulation framework models cognition as a multimodal multi-agent system whose structure emulates neural specialization in the human brain, as illustrated in the figure. For complex, long-horizon tasks, the system dynamically activates distinct agents for processing language, vision, and episodic memory inputs. These agents function analogously to sensory cortices, extracting salient features and contextual information. The fused multimodal representations are routed to a central planning module inspired by the prefrontal cortex, where high-level decision-making and task decomposition occur. The planner's output is then refined by a correction mechanism modeled on the inferior olivary nucleus, which introduces predictive error feedback for robust adaptation.

In contrast, simple or reactive tasks bypass high-level planning and use a streamlined single-agent pathway akin to spinal reflex arcs in biological systems. This pathway relies on a 'past work memory' module that recalls previously successful execution patterns and slightly modifies them to fit the current context, ensuring both rapid response and task relevance. Routing between the two pathways is governed by a context-aware task classifier that estimates cognitive load and task complexity. The architecture thus integrates deliberative and reflexive behaviors, supporting real-time adaptability and efficiency. The structure graph accompanying this section outlines these modules and their interactions, reflecting both the functional and anatomical parallels to the human brain.

The system's implementation integrates three interacting components: (1) a multi-agent neural structure for high-level planning, (2) an asynchronous pipeline for hierarchical task management, and (3) a reactive VLA system for real-time control execution.
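The two-pathway routing described above can be sketched in a few lines. This is only an illustrative toy, not the system's implementation: the names `Task`, `PastWorkMemory`, `classify`, `deliberative_pathway`, and `route`, the subgoal count used as a complexity estimate, and the threshold value are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    instruction: str
    num_subgoals: int  # hypothetical stand-in for estimated task complexity


@dataclass
class PastWorkMemory:
    """Hypothetical store of previously successful execution patterns."""
    patterns: Dict[str, List[str]] = field(default_factory=dict)

    def recall(self, instruction: str) -> List[str]:
        # Return a copy so callers can adapt the pattern to the current context.
        return list(self.patterns.get(instruction, []))


def classify(task: Task, threshold: int = 2) -> str:
    """Toy task classifier: more subgoals than `threshold` means deliberative."""
    return "deliberative" if task.num_subgoals > threshold else "reflexive"


def deliberative_pathway(task: Task) -> List[str]:
    """Stand-in for the multi-agent pathway: fuse modalities, then decompose."""
    fused = "+".join(("language", "vision", "episodic_memory"))
    # The correction stage is omitted here; a real system would refine this plan.
    return [
        f"subgoal {i} of '{task.instruction}' [{fused}]"
        for i in range(1, task.num_subgoals + 1)
    ]


def route(task: Task, memory: PastWorkMemory) -> List[str]:
    """Send complex tasks to the planner and simple ones to the reflex pathway."""
    if classify(task) == "deliberative":
        return deliberative_pathway(task)
    return memory.recall(task.instruction)


memory = PastWorkMemory({"pick up cup": ["reach", "grasp", "lift"]})
simple_steps = route(Task("pick up cup", num_subgoals=1), memory)
complex_steps = route(Task("set the table", num_subgoals=4), memory)
```

In this sketch the classifier is a fixed threshold; the text describes a context-aware classifier, which would replace `classify` without changing `route`.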
An effective simulation-based evaluation should correlate well with real-world evaluation, both in how it ranks policies and in the performance it reports for them.
To measure such correlations, one can apply the traditional Pearson correlation coefficient ("r"), but it has two limitations: (1) Pearson correlation only assesses the linear fit between real and simulated performance, whereas simulated evaluation does not need to be linearly correlated with real evaluation as long as it reflects real-world performance improvements between policies (middle-right); (2) Pearson correlation does not reflect the range of values it is computed over. For policy sets that perform similarly in the real world (far-right), Pearson r may change drastically with small differences in real-world performance, which can often be attributed to the inherent noise in real-world evaluations.
Thus, we introduce the Mean Maximum Rank Violation (MMRV) metric (lower is better) to better assess the consistency of policy rankings between real and simulated evaluation. The key underlying quantity is the rank violation between two policies, which weighs the significance of the simulator ranking the pair incorrectly by the corresponding margin in real-world performance. MMRV then aggregates the N^2 pairwise rank violations among the N policies by averaging the worst-case rank violation of each policy.
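Both limitations can be reproduced with a few made-up numbers. The sketch below is illustrative only: the success rates are invented for this example, and `pearson_r` is a plain textbook implementation rather than anything taken from the evaluation code.

```python
import math


def pearson_r(xs, ys):
    """Textbook Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Limitation (1): sim preserves the real ranking exactly, but the relationship
# is nonlinear, so r falls noticeably below 1.
real = [0.1, 0.3, 0.5, 0.7, 0.9]
sim = [r ** 4 for r in real]
r_nonlinear = pearson_r(real, sim)

# Limitation (2): three policies within 2 points of each other in real.
# Reordering them, which real-world noise could easily do, flips the sign of r.
sim_close = [0.50, 0.52, 0.54]
r_same_order = pearson_r([0.50, 0.51, 0.52], sim_close)
r_flipped = pearson_r([0.52, 0.51, 0.50], sim_close)
```

In the second case the real-world success rates differ by at most 0.02, yet r moves from one extreme of its range to the other.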
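A minimal sketch of the metric follows, under one reading of the description above: a pair of policies incurs a violation equal to its real-world performance gap whenever the simulator orders the pair differently from the real evaluation, and zero otherwise. The function names, the handling of ties via strict `<` comparisons, and the example numbers are assumptions of this sketch.

```python
def rank_violation(real_i, real_j, sim_i, sim_j):
    """Real-world margin between two policies if sim orders them differently."""
    ordered_differently = (sim_i < sim_j) != (real_i < real_j)
    return abs(real_i - real_j) if ordered_differently else 0.0


def mmrv(real, sim):
    """Mean over policies of each policy's worst-case rank violation."""
    n = len(real)
    worst_case = [
        max(rank_violation(real[i], real[j], sim[i], sim[j]) for j in range(n))
        for i in range(n)
    ]
    return sum(worst_case) / n


# Hypothetical success rates for three policies.
real_success = [0.2, 0.5, 0.9]
sim_success = [0.1, 0.6, 0.4]  # sim swaps the order of the last two policies

score = mmrv(real_success, sim_success)
```

Here only the pair (second, third) is misordered, with a real-world margin of 0.4, so the per-policy worst cases are 0, 0.4, and 0.4 and the score is their mean. A simulator that preserves the real ranking scores 0 no matter how nonlinear the relationship is.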
Visual discrepancies between real-world and simulated environments can constitute a distribution shift that adversely affects a learned policy’s behavior, rendering simulated evaluation unreliable. Our goal is to match the simulator visuals to those of the real-world environment with only a modest amount of manual effort. Our proposed Visual Matching consists of (1) green screening, i.e., segmenting out interactive simulated assets and overlaying them onto real-world backgrounds; and (2) texture matching, which involves projecting real object textures onto simulation assets and tuning robot arm colors using real videos.
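The green-screening step amounts to a per-pixel choice between the simulated render and the real background. The sketch below illustrates this on tiny nested-list images; the function name `green_screen`, the image representation, and the assumption that the simulator provides a boolean foreground mask for interactive assets are all choices made for this example.

```python
def green_screen(sim_image, foreground_mask, real_background):
    """Keep simulated pixels where the mask is True, real pixels elsewhere.

    All three inputs are row-major nested lists with the same height and width.
    """
    if not (len(sim_image) == len(foreground_mask) == len(real_background)):
        raise ValueError("inputs must have the same number of rows")
    composite = []
    for sim_row, mask_row, real_row in zip(sim_image, foreground_mask, real_background):
        if not (len(sim_row) == len(mask_row) == len(real_row)):
            raise ValueError("inputs must have the same number of columns")
        composite.append(
            [s if is_fg else r for s, is_fg, r in zip(sim_row, mask_row, real_row)]
        )
    return composite


# 2x2 toy example with RGB tuples: the robot occupies the top-left pixel only.
sim = [[(255, 0, 0), (0, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
mask = [[True, False], [False, False]]
real = [[(9, 9, 9), (10, 20, 30)], [(40, 50, 60), (70, 80, 90)]]

frame = green_screen(sim, mask, real)
```

In practice the same selection would be applied to full-resolution image arrays on every rendered frame.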
The goal of mitigating the control gap between simulated and real-world environments is to ensure that policy actions executed in simulation yield effects on the robot’s end-effector comparable to those observed when the same actions are executed on the real robot. To close this gap, we perform system identification (SysID) on a small sample of trajectories from the real-world dataset.
[Video panels: Real World Rollout; Control without SysID; Control with SysID]
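As a rough illustration of what such a system identification step can look like, the sketch below replays the same commanded targets through a toy simulator and picks the controller parameters whose simulated end-effector trajectory best matches the real one. Everything here is an assumption made for the example: the 1-D unit-mass PD model in `simulate`, the stiffness and damping parameters `kp` and `kd`, the grid search in `fit_gains`, and the synthetic "real" trajectory. It is not the procedure or parameterization used in the source.

```python
def simulate(targets, kp, kd, dt=0.1):
    """Toy 1-D end-effector driven toward each target by a PD controller."""
    position, velocity, trajectory = 0.0, 0.0, []
    for target in targets:
        acceleration = kp * (target - position) - kd * velocity  # unit mass
        velocity += acceleration * dt
        position += velocity * dt
        trajectory.append(position)
    return trajectory


def tracking_error(sim_trajectory, real_trajectory):
    """Mean squared end-effector position error between two trajectories."""
    return sum((s - r) ** 2 for s, r in zip(sim_trajectory, real_trajectory)) / len(
        real_trajectory
    )


def fit_gains(targets, real_trajectory, kp_grid, kd_grid):
    """Grid search for the (kp, kd) pair that minimizes tracking error."""
    candidates = [(kp, kd) for kp in kp_grid for kd in kd_grid]
    return min(
        candidates,
        key=lambda gains: tracking_error(simulate(targets, *gains), real_trajectory),
    )


# Synthetic stand-in for a recorded real trajectory (hypothetical true gains).
commands = [1.0] * 20
recorded = simulate(commands, kp=4.0, kd=2.0)

best_gains = fit_gains(commands, recorded, kp_grid=[2.0, 4.0, 6.0], kd_grid=[1.0, 2.0, 3.0])
```

With real data the minimum error would not be zero, and the error would typically be averaged over the whole sample of trajectories rather than computed on a single one.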