Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering / Arnaudo, Anna; Coppola, Riccardo; Morisio, Maurizio; Giobergia, Flavio; Nguyen, Van-Thanh; Chen, Enrico; Ma, Xiaoning; Ji, Xiaoquan; Mai, Minh-Thai. - ELECTRONIC. - (2026). (19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon, Republic of Korea, 18-22 May 2026).
Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering
Anna Arnaudo; Riccardo Coppola; Maurizio Morisio; Flavio Giobergia; Van-Thanh Nguyen; Enrico Chen; Xiaoning Ma; Xiaoquan Ji; Minh-Thai Mai
2026
Abstract
Automated black-box test case generation remains challenging due to diverse implementation behaviours and complex boundary conditions. This work investigates the effectiveness of multi-agent Large Language Model (LLM) architectures employing collaborative and competitive interaction patterns, compared against a single-agent baseline, in the context of function-level black-box testing. Experiments on the HumanEval benchmark show that prompt engineering is the primary driver of performance: rule-augmented few-shot prompting yields improvements of up to 20-30% in both coverage and Execution Success Rate (ESR) over the other tested strategies, namely zero-shot and conventional few-shot prompting. Although multi-agent architectures achieve superior test coverage, peaking at 99.54%, they yield an ESR comparable to the single-agent framework (96.88% versus 96.89%), at the cost of a threefold to fourfold increase in token expenditure. In conclusion, the single-agent configuration paired with optimised rule-augmented few-shot prompting provides the most effective balance between accuracy, coverage, and computational efficiency.
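
The abstract reports coverage and Execution Success Rate (ESR) as its headline metrics without defining them formally here. The sketch below illustrates one plausible way such metrics could be aggregated over a HumanEval-style run; the TaskResult fields and the pooled-ESR and mean-coverage definitions are illustrative assumptions, not the paper's actual evaluation code.

    # Illustrative sketch (assumed semantics, not the paper's definitions):
    # ESR is taken here as the fraction of generated test cases that execute
    # without error, and coverage as the mean statement coverage per task.
    from dataclasses import dataclass

    @dataclass
    class TaskResult:          # hypothetical per-task record
        tests_generated: int   # test cases the LLM produced for one function
        tests_executed_ok: int # of those, how many ran without raising
        coverage: float        # statement coverage achieved (0..1)

    def execution_success_rate(results: list[TaskResult]) -> float:
        """Pooled ESR over all generated tests in a benchmark run."""
        generated = sum(r.tests_generated for r in results)
        succeeded = sum(r.tests_executed_ok for r in results)
        return succeeded / generated if generated else 0.0

    def mean_coverage(results: list[TaskResult]) -> float:
        """Average per-task statement coverage."""
        return sum(r.coverage for r in results) / len(results) if results else 0.0

    if __name__ == "__main__":
        run = [TaskResult(10, 10, 1.00), TaskResult(8, 7, 0.95)]
        print(f"ESR: {execution_success_rate(run):.2%}")   # 94.44%
        print(f"Coverage: {mean_coverage(run):.2%}")       # 97.50%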
https://hdl.handle.net/11583/3010756
