A.D.A.M.O. (Agent for language-Driven Actions with Multimodal Observations): A Visual-Symbolic Framework for Virtual Humans / Pecora, Alessandro Emmanuel; Calzolari, Stefano; Strada, Francesco; Bottino, Andrea. - Electronic. - (In press). (International Conference on Computer Animation, Social Agents, and Extended Reality (CASAXR26), Geneva (Switzerland), 1-3 June 2026.)
A.D.A.M.O. (Agent for language-Driven Actions with Multimodal Observations): A Visual-Symbolic Framework for Virtual Humans
Alessandro Emmanuel Pecora; Stefano Calzolari; Francesco Strada; Andrea Bottino
In press
Abstract
Creating believable Virtual Humans (VH) requires the coherent integration of perception, reasoning, and action mediated by language. A central challenge is to combine these components into a control loop grounded in interactive 3D environments. To this end, we present A.D.A.M.O. (Agent for language-Driven Actions with Multimodal Observations), a visual-symbolic framework for language-driven Virtual Humans that leverages a pretrained Vision-Language Model (VLM) with tool calling to unify perception, reasoning, and action within a single control loop. A.D.A.M.O. maintains a dual visual-symbolic world model that combines egocentric visual input with a synchronized symbolic state to support grounded, task-oriented behavior from natural language prompts. To support diagnostic evaluation, we introduce a controlled task suite organized according to a Capability-Difficulty (C-D) taxonomy, which decomposes spatial tasks into procedural and linguistic complexity. Experiments in controlled scenes show that semantic labeling strongly influences task completion and failure modes by reducing perceptual ambiguity while shifting failures toward downstream execution, whereas reasoning errors remain comparatively rare.

| File | Type | License | Size | Format |
|---|---|---|---|---|
| CASA_XR_2026_ADAMO.pdf | 2. Post-print / Author's Accepted Manuscript | Not public - Private/restricted access | 1.69 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3010332
