Abstract

Robots operating in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume that reasoning and control are temporally aligned, even though semantic inference is inherently delayed relative to real-time action.

We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments.
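The latency-consistent training idea above can be sketched concretely: during imitation learning, each sample pairs the current observation and action target with semantic features computed several steps in the past, mimicking slow VLM inference at deployment. The function and field names below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def make_latency_consistent_batch(trajectory, max_delay_steps=8):
    """Illustrative delay injection for imitation learning: each sample
    conditions on *stale* semantic features plus explicit latency metadata.
    All names here are hypothetical stand-ins."""
    batch = []
    for t in range(len(trajectory)):
        delay = random.randint(0, min(t, max_delay_steps))  # simulated VLM latency
        batch.append({
            "obs": trajectory[t]["obs"],               # real-time observation
            "state": trajectory[t]["state"],           # robot proprioception
            "semantic": trajectory[t - delay]["sem"],  # delayed reasoning output
            "latency": delay,                          # explicit latency metadata
            "action": trajectory[t]["action"],         # imitation target
        })
    return batch
```

The key design point is that the policy never sees a perfectly synchronized semantic state during training, so it learns to compensate for the asynchrony it will face at deployment.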

Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency.

DynaNav Overview

TIC-VLA Framework

The architecture adopts a decoupled dual-system design with a fast action expert and a slow reasoning VLM. A shared vision encoder provides real-time observations to the policy and time-lagged observations to the VLM, where the delay arises naturally from slow inference. The delayed semantic-control interface (including delayed VLM KV-cache features and latency metadata) is explicitly recorded. The Transformer-based action expert generates actions from learnable action queries via cross-attention over the current observation, robot state, and the delayed semantic-control interface. Multi-stage training combines imitation learning under injected inference delays with reinforcement learning to ensure robustness under real-world, time-sensitive conditions.
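The forward pass of the action expert can be sketched in a few lines: learnable action queries cross-attend over a context built from current observation tokens, the robot state, delayed VLM features, and a latency embedding. Shapes, the latency-embedding scheme, and the toy output head below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def action_expert_step(obs_tokens, state_vec, delayed_kv, latency,
                       n_queries=4, d=32, rng=None):
    """Minimal sketch of the decoupled design: action queries cross-attend
    over [current obs tokens | robot state | delayed VLM features | latency
    token]. Weights are random here; in practice they are learned."""
    rng = rng or np.random.default_rng(0)
    latency_token = np.full((1, d), latency, dtype=float)   # stand-in latency embedding
    context = np.concatenate(
        [obs_tokens, state_vec[None, :], delayed_kv, latency_token], axis=0)
    queries = rng.standard_normal((n_queries, d))           # learnable action queries
    attn = softmax(queries @ context.T / np.sqrt(d))        # cross-attention weights
    fused = attn @ context                                  # attended features
    W_out = rng.standard_normal((d, 2))                     # toy head -> 2-D velocity command
    return fused @ W_out                                    # (n_queries, 2) action chunk
```

Because the latency value enters the context explicitly, the policy can in principle modulate its actions by how stale the semantic features are, which is the core of the delayed semantic-control interface.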

DynaNav Benchmark

DynaNav is a physics-accurate, photo-realistic benchmark for language-guided robot navigation in dynamic, human-centric environments. It features 85 episodes across Hospital, Office, Warehouse, and Outdoor scenes, with controlled variation in crowd density, navigation distance, and robot platform (Nova Carter or Spot). Built in Isaac Sim with realistic human motion and physics-based control, DynaNav evaluates robustness to dense human-robot interaction under realistic physical constraints.
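The controlled episode variation described above can be captured by a small configuration record. The class, field names, and the example instruction string below are hypothetical, chosen only to mirror the benchmark axes named in the text (scene, crowd density, navigation distance, platform).

```python
from dataclasses import dataclass

# Benchmark axes stated in the text; units and field names are assumptions.
SCENES = ("Hospital", "Office", "Warehouse", "Outdoor")
PLATFORMS = ("Nova Carter", "Spot")

@dataclass(frozen=True)
class EpisodeConfig:
    """Hypothetical per-episode configuration for a DynaNav-style benchmark."""
    scene: str             # one of SCENES
    crowd_density: float   # relative crowding level (illustrative unit)
    goal_distance_m: float # navigation distance to the goal, in meters
    platform: str          # one of PLATFORMS
    instruction: str       # language goal, e.g. "go to the lobby" (made up)

    def __post_init__(self):
        assert self.scene in SCENES, f"unknown scene: {self.scene}"
        assert self.platform in PLATFORMS, f"unknown platform: {self.platform}"
```

Validating scene and platform at construction time keeps sweeps over the 85 episodes from silently running with a misspelled configuration.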


Results

Simulation (DynaNav)

Real-World Deployments