A.D.A.M.O. (Agent for language-Driven Actions with Multimodal Observations): A Visual-Symbolic Framework for Virtual Humans / Pecora, Alessandro Emmanuel; Calzolari, Stefano; Strada, Francesco; Bottino, Andrea. - Electronic. - (In press). (International Conference on Computer Animation, Social Agents, and Extended Reality (CASAXR26), Geneva, Switzerland, 1-3 June 2026).

A.D.A.M.O. (Agent for language-Driven Actions with Multimodal Observations): A Visual-Symbolic Framework for Virtual Humans

Alessandro Emmanuel Pecora; Stefano Calzolari; Francesco Strada; Andrea Bottino
In press

Abstract

Creating believable Virtual Humans (VHs) requires the coherent integration of perception, reasoning, and action, mediated by language. A central challenge is to combine these components into a control loop grounded in interactive 3D environments. To this end, we present A.D.A.M.O. (Agent for language-Driven Actions with Multimodal Observations), a visual-symbolic framework for language-driven Virtual Humans that leverages a pretrained Vision-Language Model (VLM) with tool calling to unify perception, reasoning, and action within a single control loop. A.D.A.M.O. maintains a dual visual-symbolic world model that combines egocentric visual input with a synchronized symbolic state to support grounded, task-oriented behavior from natural language prompts. To support diagnostic evaluation, we introduce a controlled task suite organized according to a Capability-Difficulty (C-D) taxonomy, which decomposes spatial task difficulty into procedural and linguistic complexity. Experiments in controlled scenes show that semantic labeling strongly influences task completion and failure modes: it reduces perceptual ambiguity while shifting failures toward downstream execution, whereas reasoning errors remain comparatively rare.
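
The full paper is access-restricted, so the following is only a minimal, illustrative Python sketch of the kind of perception-reason-act loop the abstract describes: a pretrained VLM with tool calling drives a virtual human against a world model that pairs an egocentric frame with a synchronized symbolic state. Every name here (WorldModel, query_vlm, walk_to, grasp) is a hypothetical stand-in, not taken from the paper, and the VLM call is stubbed out to keep the sketch runnable.

from dataclasses import dataclass, field

# Hypothetical dual world model: the latest egocentric render plus a
# symbolic scene state kept in sync with the 3D environment. The real
# A.D.A.M.O. representation is not public.
@dataclass
class WorldModel:
    egocentric_frame: bytes = b""                        # first-person view
    symbolic_state: dict = field(default_factory=dict)   # e.g. {"mug": {"pos": (1, 0, 2)}}

def query_vlm(instruction, frame, state, tools):
    """Stub for a pretrained VLM with tool calling.

    A real system would send the egocentric frame and the serialized
    symbolic state to a multimodal model and parse a tool call from its
    reply. Here we always 'decide' the task is done, so the sketch runs.
    """
    return {"tool": "done", "args": {}}

def control_loop(instruction, world, tools, max_steps=10):
    """Single loop unifying perception, reasoning, and action."""
    for _ in range(max_steps):
        call = query_vlm(instruction, world.egocentric_frame,
                         world.symbolic_state, tools)
        if call["tool"] == "done":
            return True
        # Execute the chosen action tool, then loop back to re-perceive.
        tools[call["tool"]](world, **call["args"])
    return False

# Hypothetical action tools the virtual human can invoke.
def walk_to(world, target):
    world.symbolic_state["agent_at"] = target    # mirror the action into symbols

def grasp(world, obj):
    world.symbolic_state[obj] = {"held": True}

tools = {"walk_to": walk_to, "grasp": grasp}
print(control_loop("bring me the mug", WorldModel(), tools))

The point of the dual representation, as the abstract frames it, is that the model reasons over the symbolic state while the egocentric frame grounds it perceptually; in this sketch each executed tool writes its effect back into the symbols so the two views stay synchronized.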
Files in this record:
CASA_XR_2026_ADAMO.pdf - Post-print / Author's Accepted Manuscript; license: non-public (private/restricted access); 1.69 MB, Adobe PDF.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3010332