This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine diverse multi-teacher knowledge distillation transfer process, presenting distribution gaps and a harmonised understanding of form documents. Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, showcasing its efficacy in handling the intricate structures and content of visually complex form documents.
3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding / Ding, Yihao; Vaiani, Lorenzo; Han, Caren; Lee, Jean; Garza, Paolo; Poon, Josiah; Cagliero, Luca. - (2024), pp. 15233-15244. (Intervento presentato al convegno Association for Computational Linguistics 2024 tenutosi a Bangkok, Thailand and virtual meeting nel August 11-16, 2024).
3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding
Lorenzo Vaiani;Paolo Garza;Luca Cagliero
2024
Abstract
This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine diverse multi-teacher knowledge distillation transfer process, presenting distribution gaps and a harmonised understanding of form documents. Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, showcasing its efficacy in handling the intricate structures and content of visually complex form documents.File | Dimensione | Formato | |
---|---|---|---|
2402.17983v2.pdf
accesso aperto
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
Pubblico - Tutti i diritti riservati
Dimensione
3.86 MB
Formato
Adobe PDF
|
3.86 MB | Adobe PDF | Visualizza/Apri |
2024.findings-acl.903.pdf
accesso aperto
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Creative commons
Dimensione
3.93 MB
Formato
Adobe PDF
|
3.93 MB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2990379