Vision-Sound-Language-Action

HEAR.

A sound-centric manipulation framework that extends VLA toward robots that see, hear, remember, and react.

From VLA to VSLA

Robots need to listen during action, not only before action.

Standard VLA policies work well when the useful evidence stays visible: a cup remains on the table, a drawer stays open, or a target object remains in view. Sound-centric tasks are different. The decisive cue may be a short beep, a collision click, a process sound, or a spoken confirmation that occurs between two policy queries.

HEAR extends the control setting to Vision-Sound-Language-Action. The robot receives delayed multi-view vision, causal audio, language, and proprioception, then must preserve fleeting evidence across open-loop action chunks before deciding when to move.
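The gap between policy queries can be illustrated with a minimal sketch. It assumes audio arrives once per control step while the policy is queried only at chunk boundaries. The names `AudioBuffer`, `run_episode`, `chunk_len`, and `max_packets` are hypothetical and are not taken from the HEAR codebase.

```python
from collections import deque


class AudioBuffer:
    """Keeps the most recent audio packets between policy queries (illustrative only)."""

    def __init__(self, max_packets=50):
        # max_packets is an assumed capacity, not a value from the project.
        self.packets = deque(maxlen=max_packets)

    def push(self, packet):
        self.packets.append(packet)

    def snapshot(self):
        return list(self.packets)


def run_episode(audio_stream, chunk_len, buffer):
    """Ingest audio at every step, but query the policy only at chunk boundaries."""
    queries = []
    for step, packet in enumerate(audio_stream):
        buffer.push(packet)  # audio keeps arriving during open-loop execution
        if step % chunk_len == 0:
            # The policy sees everything buffered since earlier queries,
            # not just the packet that happens to arrive at query time.
            queries.append((step, buffer.snapshot()))
    return queries


# A beep at step 3 falls between the queries at steps 0 and 4.
stream = ["silence"] * 3 + ["beep"] + ["silence"] * 4
queries = run_episode(stream, chunk_len=4, buffer=AudioBuffer())
```

In this toy run the packet arriving at the second query is silence, yet the buffered context still contains the beep. A policy that sampled audio only at query time would have missed it.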

Figure: HEAR sound-centric manipulation overview.

Problem setting

Sound-causal manipulation

The policy must remember acoustic cues and avoid visually plausible but acoustically premature actions.

Long-horizon process

Moka coffee

A process-monitoring task where completion is more clearly indicated by sound than by a static camera view.

Framework

Memory, multimodal reasoning, temporal grounding, and smooth action.

Figure: HEAR full framework overview.

System overview

HEAR architecture

HEAR processes streaming sound, reasons over multimodal context, predicts near-future audio, and generates smooth action chunks.

Figure: HEAR Historizer module.

Historizer

Causal audio memory

A stateful transformer compresses recent audio packets so short cues can survive execution gaps.
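To show why a carried state helps short cues survive, here is a deliberately simplified stand-in. It is not a transformer: it swaps in a single peak-hold value with exponential decay, which has the same property of folding each new packet into a fixed-size state. `PeakHoldMemory` and `decay` are hypothetical names chosen for this illustration.

```python
class PeakHoldMemory:
    """Fixed-size audio memory: a decaying record of the loudest recent packet.

    Simplified stand-in for a learned stateful encoder, for illustration only.
    """

    def __init__(self, decay=0.9):
        self.decay = decay  # assumed per-packet retention factor
        self.state = 0.0

    def update(self, packet):
        # Mean energy of the incoming packet of samples.
        energy = sum(x * x for x in packet) / len(packet)
        # Keep whichever is larger: the faded memory or the new evidence.
        self.state = max(self.decay * self.state, energy)
        return self.state


memory = PeakHoldMemory(decay=0.9)
memory.update([1.0, 1.0])          # a short, loud cue
for _ in range(5):
    memory.update([0.0, 0.0])      # five silent packets during execution
trace_after_gap = memory.state     # the cue is faded but still detectable
```

The state size stays constant however long the robot waits, which is the point of compressing rather than storing raw audio. A learned encoder would decide what to retain instead of using a fixed decay.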

Figure: HEAR Envisioner module.

Envisioner

Multimodal reasoning

High-level and low-level reasoning fuse vision, language, proprioception, current sound, and audio memory.

Figure: HEAR Advancer module.

Advancer and Realizer

Temporal prediction to action chunks

The Advancer's future-audio prediction grounds long waiting phases, while the Realizer's conditional flow matching produces smooth robot actions.
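Sampling with flow matching can be sketched as integrating a velocity field from noise at t = 0 to an action chunk at t = 1. The example below is a toy under stated assumptions: `toy_velocity` is a closed-form stand-in for a learned, context-conditioned network, `target` stands in for the chunk the conditioning would imply, and plain Euler integration is an assumed solver. None of these names come from the HEAR implementation.

```python
import random


def toy_velocity(action, t, target):
    """Stand-in for a learned velocity network v(a, t | context).

    For a straight-line probability path, this field points from the
    current sample toward the target and reaches it exactly at t = 1.
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise (t=0) toward t=1."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        velocity = toy_velocity(action, t, target)
        action = [a + dt * v for a, v in zip(action, velocity)]  # Euler step
    return action


# Hypothetical 4-step chunk for a single joint.
target_chunk = [0.1, 0.2, 0.3, 0.4]
chunk = sample_action_chunk(target_chunk)
```

With this toy field each Euler step moves the sample an equal fraction of the way along a straight line, so the intermediate states change smoothly. A trained model would replace `toy_velocity` with a network conditioned on the fused multimodal features.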

VSLA

Delayed decisions

Each decision is conditioned on vision, causal audio, language, and robot state rather than a static audio snapshot.

OpenX-Sound

Audio-augmented pretraining

The project releases a synchronized sound extension of Open X-Embodiment for large-scale robot trajectory learning.

HEAR-Bench

Sound-causal evaluation

The benchmark penalizes premature visual actions and tests whether the policy truly waits for the required acoustic cue.
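One way such a sound-causal criterion could be scored is sketched below. The function name, its arguments, and the all-or-nothing scoring are assumptions made for illustration. The actual HEAR-Bench metric may differ.

```python
def score_episode(cue_time, action_time, task_completed):
    """Return 1.0 only if the robot acted after the cue and finished the task.

    cue_time: time (s) at which the required sound occurred.
    action_time: time (s) at which the robot began the gated action,
        or None if it never acted.
    task_completed: whether the final state satisfies the task goal.
    """
    if action_time is None:
        return 0.0  # the robot never reacted
    if action_time < cue_time:
        return 0.0  # premature: visually plausible but not sound-causal
    return 1.0 if task_completed else 0.0


premature = score_episode(cue_time=5.0, action_time=3.2, task_completed=True)
on_cue = score_episode(cue_time=5.0, action_time=5.4, task_completed=True)
```

Under this rule an episode that reaches the goal state too early scores the same as a failure, which removes any reward for acting on visual cues alone.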

Training and Evaluation

Sound-centric tasks cover transient events, speech, process sounds, and contact feedback.

The simulation benchmark includes alarm, microwave, speech timing, interruption, material checking, water pouring, and boiling tasks. These tasks stress waiting, transient triggers, prosody, impact acoustics, and long-horizon process monitoring.

Real-robot deployment adds room reverberation, background noise, mechanical ego-noise, and visual aliasing. The tasks include monitoring coffee, answering a phone, shaking bottles for active acoustic sensing, and reacting to a real alarm clock.

Simulation

81% average success

The source project reports strong gains on sound-causal simulation tasks.

Real robot

54% average success

The real-robot suite emphasizes acoustic domain shift and physically grounded sound cues.

Dataset

OpenX-Sound

Audio-augmented robot trajectories released on Hugging Face.

Experiment Videos

Selected sound-centric demonstrations.

Process sound

Moka Coffee

The robot must monitor the acoustic transition before pouring.

Transient trigger

Microwave

A waiting task where the robot should react only after the relevant sound occurs.

Speech timing

Check Yes

The policy conditions progress on spoken confirmation rather than a static visual state.

Interruption

Interrupt

The robot must preserve and react to a brief acoustic event during execution.

Full video set

External-view and onboard recordings

The original page includes additional large real-robot videos and robot-perspective input recordings.

Robustness

Added disturbances

Ablations test traffic noise, white noise, robot ego-noise, dialogue noise, microphone shifts, object shifts, distractors, and speech variants.
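Noise disturbances of this kind are commonly applied by mixing a noise recording into the clean signal at a chosen signal-to-noise ratio. The sketch below shows that standard calculation. Whether the HEAR ablations use SNR-controlled mixing is an assumption, and `mix_at_snr` is a hypothetical helper.

```python
import math


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mix has the requested SNR in dB."""
    # SNR_dB = 10 * log10(P_signal / P_scaled_noise); solve for the scale.
    scale = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


# Toy waveforms standing in for a cue and a disturbance recording.
clean = [1.0, -1.0, 1.0, -1.0]
disturbance = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(clean, disturbance, snr_db=10)
```

Lower `snr_db` values give a harsher disturbance, so sweeping the value is a simple way to probe how quickly a policy's cue detection degrades.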

Design principle

Reject premature action

The benchmark rewards policies that stay sound-causal rather than acting on visually tempting shortcuts.