Mobile delivery robots need to maneuver through unstructured environments and navigate around wavering bikes, parked vehicles, or other roadside hazards. This requires leveraging AI models that can absorb large amounts of information. Transformers have made it possible to do this in the context of Large Language Models (LLMs). We have found that the same is true for building capable neural robot navigation models. Analogously to how LLMs operate on word tokens, our neural navigation model operates on image patches. The model is trained to solve multiple pretext tasks, including learning a path planning policy in simulation. The resulting end-to-end trained mobility agent is not only capable of navigating in multiple challenging scenarios in simulation but also transfers surprisingly well to the real world and is capable of running in real time onboard our robot, powered by NVIDIA Accelerated Computing.

Interpretability Through Attention

A key challenge with end-to-end models is understanding why a model took a particular action. We have found that for a neural navigation model built from transformer blocks, the attention mechanism provides a window into the model’s “thinking” and can be used to interpret its actions. For instance, we trained a neural navigation model consisting of two parts:

  • A vision transformer, which learns how to see; and
  • A policy transformer, which learns how to move.

The vision transformer takes camera images divided into square patches as inputs and extracts a visual representation using self-attention. This learned visual representation is then queried via cross-attention in our policy transformer. During inference, the attention weights in the policy transformer show where the model was attending at each time step, which helps explain why the agent undertook a certain behavior. For instance, in the image below, we can see that the agent’s planned path is most strongly influenced by the parked car and the lane boundary.
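To make the mechanism concrete, here is a minimal NumPy sketch of how cross-attention weights over image patches can be turned into a per-patch attention map. It is an illustration under stated assumptions, not our production model: the image size, patch size, embedding width, single attention head, random weights, and the function names patchify and cross_attention are all hypothetical, and a plain linear projection stands in for the vision transformer.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c), (gh, gw)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention. Returns (outputs, attention weights)."""
    q, k, v = queries @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1 over patches
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))             # hypothetical camera frame
patches, grid = patchify(image, patch=16)   # 4 x 6 grid, 24 patches of 768 values

d_model = 32                                # hypothetical embedding width
w_embed = rng.normal(size=(patches.shape[1], d_model)) * 0.02
tokens = patches @ w_embed                  # stand-in for the vision transformer output

policy_query = rng.normal(size=(1, d_model))  # one hypothetical policy query token
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
_, weights = cross_attention(policy_query, tokens, w_q, w_k, w_v)

# Reshape the per-patch weights back onto the patch grid for visualization.
attention_map = weights.reshape(grid)
```

Because the softmax runs over the patch tokens, each entry of attention_map is the share of the policy query's attention that went to one image patch. A real model would have several heads and layers, so some choice of which weights to show (or how to average them) is also needed.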

Visualizing attention maps from Vayu Drive running in simulation for our delivery robot. Left: An image from the robot’s front camera. Right: The policy transformer’s attention map overlaid on the image. The color within each patch indicates the level of attention it received (dark blue = 0 and red = 1). The lane boundary on the left and the parked car on the right are strongly attended to.
The agent is attempting to circumvent a trash can placed in the bike lane. While doing so, it attends to the trash can and the lane boundaries.
As the agent goes around the trash can, it finds an orange traffic cone ahead. In this situation, the agent attends to the cone, along with the lane boundary and the trash can to find its way.
Attention map of the policy transformer overlaid on the front camera image of our robot driving in simulation.
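
An overlay like the ones in the captions above can be approximated with a few lines of NumPy. This is a sketch under assumptions rather than our actual visualization code: we assume min-max normalization to map the attention weights onto the 0 to 1 scale, a simple linear dark-blue-to-red ramp in place of whatever colormap the figures use, and a fixed blending factor alpha. The function names are hypothetical.

```python
import numpy as np

def normalize(attn):
    """Min-max scale attention weights to [0, 1] (assumed normalization)."""
    lo, hi = attn.min(), attn.max()
    if hi == lo:
        return np.zeros_like(attn, dtype=float)
    return (attn - lo) / (hi - lo)

def blue_to_red(values):
    """Map values in [0, 1] to RGB with a linear ramp: 0 is dark blue, 1 is red."""
    dark_blue = np.array([0.0, 0.0, 0.5])
    red = np.array([1.0, 0.0, 0.0])
    v = values[..., None]
    return (1.0 - v) * dark_blue + v * red

def overlay(image, attn, patch, alpha=0.5):
    """Blend a per-patch attention map onto an (H, W, 3) image with values in [0, 1]."""
    heat = blue_to_red(normalize(attn))                 # (grid_h, grid_w, 3)
    heat = np.repeat(np.repeat(heat, patch, axis=0), patch, axis=1)  # one color per patch
    assert heat.shape == image.shape, "attention grid must match image size"
    return (1.0 - alpha) * image + alpha * heat

image = np.full((32, 48, 3), 0.5)                       # hypothetical gray camera frame
attn = np.array([[0.00, 0.10, 0.20],
                 [0.05, 0.40, 0.25]])                   # hypothetical 2 x 3 attention map
blended = overlay(image, attn, patch=16)
```

Note that min-max normalization rescales every frame independently, so colors are comparable within one image but not necessarily across time steps.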

Sim2Real Transfer

Another key result is that the navigation policy also transfers to the real world. The image below shows that the model attends to the left lane boundary, parked cars and an oncoming vehicle when navigating through a bike lane.

Attention map from Vayu Drive as our delivery robot navigates in the real world. The lane boundary and cars in the scene are strongly attended to.
In another situation, the model can be seen attending to the bike lane boundary along with parked vehicles and a large construction vehicle.
The model can also avoid vehicles parked in the bike lane. For example, in the image below it is planning to go to the left of the parked truck.
After the maneuver is completed, the robot returns to the bike lane.
Real-world navigation showing the attention map of the policy transformer overlaid on the front camera image of our robot.

The ability to achieve sim2real transfer bodes well for the scalability of our approach: we can create virtually unlimited amounts of relevant and adversarial training data using open source mobility simulators, without relying on extensive real-world data collection and labeling infrastructure.


Finally, we have found that transformer-based driving agents can run quite efficiently on current state-of-the-art embedded SoCs. Real-time inference of our neural path planning model runs on an NVIDIA Jetson AGX Orin on board our delivery robot. The AGX Orin is capable of up to 275 TOPS in INT8, which provides sufficient compute power and low latencies for such advanced mobility agents.

We are hiring!

We are hiring ML and Simulation Engineers to scale up the training of our driving agent. If you are interested in working with us, please reach out at