Large Language Models are increasingly used to generate web interfaces from natural language specifications. Automated accessibility tools evaluate these interfaces by detecting syntactic violations such as missing attributes, but cannot assess whether accessibility content is actually meaningful—an image with alt="image" passes every check yet conveys nothing to screen reader users. We investigate the prevalence of such semantic accessibility violations in LLM-generated interfaces. Analyzing 300 UIs produced by three commercial models, we identify 541 semantic violations across six fault types. We validate an LLM-as-judge approach through controlled fault injection, achieving recall rates of 80–92%, and triangulate with a preliminary human annotation study. Our findings suggest that LLM judges can extend accessibility evaluation into the semantic dimension that automated tools miss, opening opportunities for their integration as reward signals in accessibility-aware development workflows.

Measuring the Semantic Accessibility Gap in LLM-Generated Web UIs / Calò, Tommaso; Gurita, Alexandra-Elena; De Russis, Luigi. - ELECTRONIC. - (2026), pp. 1-5. (CHI '26: CHI Conference on Human Factors in Computing Systems, Barcelona (ESP), 13–17 April 2026) [10.1145/3772363.3799364].

Measuring the Semantic Accessibility Gap in LLM-Generated Web UIs

Calò, Tommaso; De Russis, Luigi
2026

ISBN: 979-8-4007-2281-3
File: 3772363.3799364.pdf — open access — Post-print / Version of Record — Creative Commons license — 5.34 MB, Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3008330