VLA Models: Reengineering the Autonomous Mobile Robot Pipeline: Part 3

Picking up from Part 2, we now focus on the final elements to round out the series:

VII. Challenges and Open Problems

A clear-eyed assessment of where VLAs stand relative to the requirements of production AMR deployment reveals a set of significant open challenges. Progress is rapid, but the gap between research demonstrations and certified industrial deployment remains substantial.

Technical Challenges

Inference latency: Large VLA models (7B–70B parameters) require 100–500ms per inference pass on current hardware. AMR control loops, particularly those involving obstacle avoidance in dynamic human-shared environments, require sub-50ms response times. Mitigation strategies include model distillation to deployable 1–3B parameter models, INT8/INT4 quantisation, speculative decoding, and the use of dedicated accelerators (NVIDIA Jetson Orin, Qualcomm Robotics RB6). The hybrid architecture described earlier in this series also mitigates this by reserving the VLA for latency-tolerant semantic decisions while the classical controller runs at full rate.
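The division of labour in the hybrid architecture can be illustrated with a small simulation. This is a minimal sketch, not a real control stack: the 20 ms control period, the 300 ms VLA latency, the 0.5 m stop distance, and the names Goal, vla_infer, classical_control, and run are all assumptions made for illustration.

```python
from dataclasses import dataclass

CONTROL_PERIOD_MS = 20   # assumed 50 Hz classical control loop
VLA_LATENCY_MS = 300     # assumed VLA inference time (within the 100-500 ms range)


@dataclass
class Goal:
    """Semantic output of the slow VLA layer (hypothetical structure)."""
    target_speed: float  # m/s


def vla_infer(time_ms: int) -> Goal:
    """Stand-in for a VLA forward pass; returns a fixed goal for illustration."""
    return Goal(target_speed=1.0)


def classical_control(goal: Goal, obstacle_distance_m: float) -> float:
    """Fast reactive layer: follow the latest goal, but stop near obstacles."""
    if obstacle_distance_m < 0.5:  # assumed stop distance
        return 0.0
    return goal.target_speed


def run(duration_ms: int, obstacle_at_ms: int) -> list:
    """Simulate the loop; returns (time_ms, commanded_speed) per control tick."""
    goal = Goal(target_speed=0.0)   # safe default until the first VLA result
    vla_ready_at = VLA_LATENCY_MS   # when the in-flight inference completes
    log = []
    for t in range(0, duration_ms, CONTROL_PERIOD_MS):
        if t >= vla_ready_at:
            goal = vla_infer(t)                 # adopt the new semantic goal
            vla_ready_at = t + VLA_LATENCY_MS   # start the next inference
        # An obstacle appears in front of the robot at obstacle_at_ms.
        distance = 0.3 if t >= obstacle_at_ms else 5.0
        log.append((t, classical_control(goal, distance)))
    return log
```

With run(1000, 400), the robot stays stopped until the first VLA result arrives at 300 ms, then drives at the requested speed, and the commanded speed drops to zero on the 400 ms control tick without waiting for another VLA pass. The reaction time to the obstacle is bounded by the control period rather than by model latency.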

3D spatial reasoning: Current VLAs are predominantly trained on 2D camera images and exhibit limited capability for precise 3D spatial reasoning — a core requirement for AMRs navigating tight corridors, docking with precision, or coordinating manipulation with base motion. Integrating depth cameras, LiDAR point clouds, and structured-light data natively into VLA architectures is an active research area, with tokenised 3D representations and BEV (bird’s-eye view) encoders under investigation.
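To make the BEV idea concrete, the sketch below rasterises a point cloud into a top-down max-height grid, the kind of structure a BEV encoder might start from. It is illustrative only: the function name points_to_bev, the grid extents, and the 0.5 m cell size are assumptions, and a learned encoder would compute features per cell rather than a single height value.

```python
def points_to_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell_m=0.5):
    """Rasterise (x, y, z) points into a 2D max-height grid seen from above.

    Heights are assumed to be non-negative (measured from the floor), so an
    empty cell is represented by 0.0.
    """
    cols = int((x_range[1] - x_range[0]) / cell_m)
    rows = int((y_range[1] - y_range[0]) / cell_m)
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        # Discard points outside the area covered by the grid.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        col = int((x - x_range[0]) / cell_m)
        row = int((y - y_range[0]) / cell_m)
        grid[row][col] = max(grid[row][col], z)  # keep the tallest return
    return grid


def bev_to_sequence(grid):
    """Flatten the grid row by row, the simplest way to turn it into a sequence."""
    return [cell for row in grid for cell in row]
```

Two nearby LiDAR returns that fall in the same cell collapse into one entry holding the taller height, and points beyond the grid extents are dropped.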

Continuous long-horizon planning: VLAs excel at short-horizon reactive behaviour but struggle with tasks requiring consistent, coherent plans over minutes or hours — such as a multi-stop delivery mission in a large facility. Approaches under investigation include hierarchical VLA architectures (a slow high-level planner feeding a fast low-level policy), memory-augmented transformers, and retrieval-augmented generation for long-context task state.

Hallucination and overconfidence: VLAs inherit the hallucination failure modes of their LLM backbones. A robot that confidently navigates toward a path that does not exist, or misidentifies an object with high confidence, represents a safety risk in AMR deployments. Conformal prediction, ensemble uncertainty estimation, and retrieval-augmented grounding are promising mitigations, but none are yet mature for safety-critical deployment.
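As one illustration of the conformal prediction idea, the sketch below applies split conformal prediction to an object-classification output and defers to an operator when the prediction set is not a single label. The calibration scores, class names, and the act_or_defer policy are hypothetical; the score used is one minus the probability the model assigned to the true label on held-out calibration data.

```python
import math


def conformal_threshold(cal_scores, alpha=0.2):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: 1 - p(true label) for each held-out calibration example.
    alpha: target miscoverage rate (0.2 aims for roughly 80% coverage).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if rank > n:
        return float("inf")  # too few calibration samples: keep every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score falls within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


def act_or_defer(probs, threshold):
    """Return the label if the set is a singleton, else None (ask an operator)."""
    labels = prediction_set(probs, threshold)
    return next(iter(labels)) if len(labels) == 1 else None


# Hypothetical calibration scores collected offline.
CAL_SCORES = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.80]
THRESHOLD = conformal_threshold(CAL_SCORES, alpha=0.2)
```

With these numbers, an output of 0.90 for "pallet" yields a single-label set and the robot proceeds, while a 0.50 / 0.45 split between "pallet" and "person" yields a two-label set and the decision is deferred. The coverage guarantee only holds when deployment data resembles the calibration data, which is part of why the approach is not yet mature for safety-critical use.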

Safety certification and regulatory compliance: Industrial AMR deployments are subject to ISO 3691-4 (driverless industrial trucks), IEC 62061 (functional safety of machinery control systems), and increasingly to emerging standards for AI-enabled safety systems. Current VLA architectures are difficult to certify under these frameworks because they do not expose the kind of deterministic, inspectable behaviour that traditional safety analysis requires. Runtime safety monitors (classical systems that supervise and can override VLA outputs) are the most promising near-term approach to bridging this gap.
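A runtime safety monitor of this kind can be very small. The sketch below is a deterministic supervisor that clamps or overrides velocity commands proposed by a learned policy. The limits and distances are illustrative values, not figures taken from ISO 3691-4 or IEC 62061, and the names VelocityCommand and supervise are hypothetical.

```python
from dataclasses import dataclass

# Illustrative limits only; real values come from the safety analysis.
MAX_LINEAR = 1.5      # m/s
MAX_ANGULAR = 1.0     # rad/s
STOP_DISTANCE = 0.5   # m, at or below this the robot must stop
SLOW_DISTANCE = 1.5   # m, below this the linear speed is scaled down


@dataclass(frozen=True)
class VelocityCommand:
    linear: float   # m/s
    angular: float  # rad/s


def clamp(value, low, high):
    return max(low, min(high, value))


def supervise(cmd, min_obstacle_distance_m):
    """Deterministic check applied to every command the VLA proposes."""
    # Hard override: anything inside the stop zone forces a full stop.
    if min_obstacle_distance_m <= STOP_DISTANCE:
        return VelocityCommand(0.0, 0.0)
    # Enforce absolute limits regardless of what the policy asked for.
    linear = clamp(cmd.linear, -MAX_LINEAR, MAX_LINEAR)
    angular = clamp(cmd.angular, -MAX_ANGULAR, MAX_ANGULAR)
    # Scale linear speed linearly between the stop and slow distances.
    if min_obstacle_distance_m < SLOW_DISTANCE:
        scale = (min_obstacle_distance_m - STOP_DISTANCE) / (SLOW_DISTANCE - STOP_DISTANCE)
        linear *= scale
    return VelocityCommand(linear, angular)
```

Because supervise is a pure function of the proposed command and a measured distance, its behaviour can be enumerated and tested exhaustively, which is the property that the learned policy itself lacks.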

Domain adaptation data efficiency: While VLAs reduce the per-task data requirement compared to training from scratch, fine-tuning a VLA on a specific facility, robot model, and task distribution still requires hundreds to thousands of demonstrations. Collecting this data is expensive, particularly in active facilities. Simulation-to-real transfer, data augmentation via generative world models, and active learning approaches to demonstration collection are all active research areas.

Multi-robot coordination: Most VLA work to date focuses on single-robot policies. AMR fleets — common in warehousing, logistics, and hospital environments — require coordinated behaviour across tens or hundreds of robots. How VLA policies compose in multi-agent settings, and how a centralised VLA-based dispatcher communicates with individual VLA-based robot policies, are largely open questions.

Readiness Assessment by Challenge

VIII. The Road Ahead — What to Watch

The VLA trajectory for autonomous mobile robots over the next 18–36 months is shaped by several converging research and commercial developments. The following represent the most consequential near-term transitions.

  1. Native 3D and multimodal sensor VLAs: The next generation of mobile robot VLAs will natively ingest tokenised LiDAR point clouds, depth maps, and event camera streams alongside RGB video — providing the geometric precision that current camera-centric VLAs lack. Work on 3D scene tokenisation (EmbodiedScan, OpenScene) is laying the foundation for this transition.
  2. Generative world models as inner simulators for planning: Models like UniSim and RoboDreamer demonstrate that a robot can use a generative world model to simulate the consequences of candidate actions before executing them — a capability that directly addresses the long-horizon planning weakness. VLAs with an integrated world model can evaluate ‘what would happen if I turned left here’ without physically executing the action.
  3. Continuous improvement from fleet data: AMR fleets generate enormous quantities of operational data. Reinforcement learning from human feedback (RLHF) with operators as the feedback source, inverse RL from expert teleoperation, and online fine-tuning from deployment logs can continuously improve deployed VLA policies, creating an improvement flywheel unavailable to classical modular stacks. The challenge is doing this safely, without introducing regressions in certified behaviours.
  4. Standardised VLA fine-tuning pipelines for industrial deployment: Frameworks including LeRobot (Hugging Face), NVIDIA Isaac GR00T, and emerging offerings from AMR platform vendors are building the tooling necessary to fine-tune general VLA models on facility-specific data with minimal ML expertise. These will be the delivery mechanism for VLA capabilities in commercial AMR deployments.
  5. Safety standards evolution: Standards bodies including ISO TC 299 (Robotics) and IEC TC 44 (Safety of Machinery) are actively working on frameworks for AI-enabled robot safety. The emergence of runtime monitoring standards — which treat the learned policy as a subsystem supervised by a certified safety layer — is likely to be the pathway that enables VLA-driven AMRs to reach ISO 3691-4 compliance without requiring full formal verification of the neural network.
  6. Human-robot collaboration at language level: As VLAs mature, the interaction model between operators and AMRs will shift from programming to conversation. Operators will describe tasks, exceptions, and preferences in natural language; robots will report progress, flag ambiguities, and ask for clarification through the same channel. This changes the skill profile required for AMR deployment and maintenance, with implications for workforce training and operator interface design.
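The "inner simulator" idea from item 2 can be sketched in a few lines. Here a toy grid model stands in for a learned generative world model, and the planner scores each candidate action by imagining its outcome before committing to it. Everything in this example (the grid, MOVES, toy_world_model, imagined_cost, choose_action) is a hypothetical illustration and is not how UniSim or RoboDreamer are implemented.

```python
# Candidate actions as grid displacements (x, y).
MOVES = {"left": (-1, 0), "right": (1, 0), "forward": (0, 1)}


def toy_world_model(state, action, obstacles):
    """Stand-in for a learned world model: predicts the next cell and
    whether entering it would be a collision."""
    dx, dy = MOVES[action]
    next_state = (state[0] + dx, state[1] + dy)
    return next_state, next_state in obstacles


def imagined_cost(state, action, goal, obstacles, horizon=3):
    """Roll the model forward without moving the robot and score the result."""
    for _ in range(horizon):
        state, collided = toy_world_model(state, action, obstacles)
        if collided:
            return float("inf")  # an imagined collision rules the action out
    # Manhattan distance to the goal from the imagined end state.
    return abs(goal[0] - state[0]) + abs(goal[1] - state[1])


def choose_action(state, goal, obstacles, horizon=3):
    """Pick the candidate whose imagined rollout ends closest to the goal."""
    return min(MOVES, key=lambda a: imagined_cost(state, a, goal, obstacles, horizon))
```

Starting at (0, 0) with a goal at (3, 3) and an obstacle at (0, 2), the imagined "forward" rollout collides on its second step, so the planner selects "right" without the robot ever physically trying the blocked path.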

IX. Conclusion

The classical AMR stack — SLAM, perception, planning, behaviour, control — is a remarkable engineering achievement that has enabled the deployment of hundreds of thousands of robots in structured environments around the world. It is also, increasingly, a ceiling. The environments that robot operators want to enter next — hospitals, construction sites, retail floors, outdoor campuses — are too dynamic, too semantically rich, and too variable for the modular, rule-based approach to scale without heroic and unsustainable engineering effort.

VLA models do not merely add a capability to the existing stack. They represent a different theory of how robot intelligence should be built: not engineered top-down from first principles, but trained bottom-up from large and diverse data, inheriting the generalisation capabilities that make large pre-trained models so powerful in language and vision. The evidence from systems like RT-2, pi-zero, and Mobile ALOHA suggests that this approach can produce behaviours that the classical stack simply cannot reach — novel instruction following, semantic scene reasoning, and cross-embodiment transfer — without sacrificing the action quality needed for real deployment.

The path from current research systems to production AMR deployments is not short. Latency, spatial reasoning, long-horizon planning, safety certification, and multi-robot coordination are all open problems requiring significant further work. The hybrid architecture — VLA for semantic reasoning, classical controller for real-time execution — is the most practical near-term bridge. But the direction of travel is clear.

For AMR engineers, the relevant question is no longer whether VLAs will reshape the stack, but how quickly, and how to position current development work to benefit from that transition rather than be disrupted by it. The most durable investment is in the data infrastructure, evaluation frameworks, and deployment tooling that will remain valuable regardless of which specific model architecture prevails — because the shift from engineered pipelines to trained systems is already underway.

Author Details

Rani Malhotra

Rani Malhotra heads the Applied Research Center for Autonomous Machines at the Infosys Center for Emerging Technology Solutions. She works across emerging technologies and their intersections, including Robotics, Physical AI, Human Machine Interactions and Smart Systems.
