Idiomatic expressions present significant challenges for natural language understanding systems as their meaning often diverge from the literal interpretation. While prior works have focused on textual idiom detection, the role of visual content in reasoning about idiomaticity remains underexplored. This study introduces a Chain-of-Thought reasoning framework that enhances idiomatic comprehension by ranking images based on their relevance to a compound expression in context, requiring the system to distinguish between idiomatic and literal meanings.We comprehensively evaluate our approach by quantitatively analyzing the performance improvements achieved integrating textual and visual information in the ranking process through different prompting settings. Our empirical findings provide insights into the capabilities of visual Large Language Models to establish meaningful correlations between idiomatic content and its visual counterpart, suggesting promising directions for multimodal language understanding.
PoliTo at SemEval-2025 Task 1: Beyond Literal Meaning: A Chain-of-Though Approach for Multimodal Idiomacity Understanding / Napolitano, Davide; Vaiani, Lorenzo; Cagliero, Luca. - (2025), pp. 2071-2076. (Intervento presentato al convegno 19th International Workshop on Semantic Evaluation tenutosi a Vienna (AT) nel July 27-August 1, 2025).
PoliTo at SemEval-2025 Task 1: Beyond Literal Meaning: A Chain-of-Though Approach for Multimodal Idiomacity Understanding
Napolitano, Davide;Vaiani, Lorenzo;Cagliero, Luca
2025
Abstract
Idiomatic expressions present significant challenges for natural language understanding systems as their meaning often diverge from the literal interpretation. While prior works have focused on textual idiom detection, the role of visual content in reasoning about idiomaticity remains underexplored. This study introduces a Chain-of-Thought reasoning framework that enhances idiomatic comprehension by ranking images based on their relevance to a compound expression in context, requiring the system to distinguish between idiomatic and literal meanings.We comprehensively evaluate our approach by quantitatively analyzing the performance improvements achieved integrating textual and visual information in the ranking process through different prompting settings. Our empirical findings provide insights into the capabilities of visual Large Language Models to establish meaningful correlations between idiomatic content and its visual counterpart, suggesting promising directions for multimodal language understanding.File | Dimensione | Formato | |
---|---|---|---|
2025.semeval-1.269.pdf
accesso aperto
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Creative commons
Dimensione
170.31 kB
Formato
Adobe PDF
|
170.31 kB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/3003705