A Graph Attention Network Combining Multifaceted Element Relationships for Full Document-Level Understanding / Vaiani, Lorenzo; Napolitano, Davide; Cagliero, Luca. - In: COMPUTERS. - ISSN 2073-431X. - 14:9(2025). [10.3390/computers14090362]
A Graph Attention Network Combining Multifaceted Element Relationships for Full Document-Level Understanding
Vaiani, Lorenzo; Napolitano, Davide; Cagliero, Luca
2025
Abstract
Question answering from visually rich documents (VRDs) is the task of retrieving the correct answer to a natural language question by considering the content of textual and visual elements in the document, as well as the pages’ layout. Answering closed-ended questions that require a deep understanding of the hierarchical relationships between the elements is known as the full document-level understanding (FDU) task. State-of-the-art graph-based approaches to FDU model the pairwise element relationships in a graph. Although they incorporate logical links (e.g., a caption refers to a figure) and spatial ones (e.g., a caption is placed below the figure), they currently disregard the semantic similarity among multimodal document elements, thus potentially yielding suboptimal scoring of the elements’ relevance to the input question. In this paper, we propose GRAS-FDU, a new graph attention network tailored to FDU. GATS-FDU is trained to jointly consider multiple document facets, i.e., the logical, spatial, and semantic relationships among elements. The results show that our approach achieves superior performance compared to several baseline methods.

| File | Size | Format |
|---|---|---|
| computers-14-00362.pdf (open access; Type: 2a Post-print, publisher's version / Version of Record; License: Creative Commons) | 421.1 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3003704