Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering / Arnaudo, Anna; Coppola, Riccardo; Morisio, Maurizio; Giobergia, Flavio; Nguyen, Van-Thanh; Chen, Enrico; Ma, Xiaoning; Ji, Xiaoquan; Mai, Minh-Thai. - ELECTRONIC. - (In press). (19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon (KOR), 18-22 May 2026).
Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering
Anna Arnaudo; Riccardo Coppola; Maurizio Morisio; Flavio Giobergia; Van-Thanh Nguyen; Enrico Chen; Xiaoning Ma; Xiaoquan Ji; Minh-Thai Mai
In press
Abstract
Automated black-box test case generation remains challenging due to diverse implementation behaviours and complex boundary conditions. This work investigates the effectiveness of multi-agent Large Language Model (LLM) architectures employing collaborative and competitive interaction patterns, compared against a single-agent baseline, within the context of function-level black-box testing. Experiments on the HumanEval benchmark show that prompt engineering is the primary driver of performance, with rule-augmented few-shot prompting yielding improvements of up to 20-30% in both coverage and Execution Success Rate (ESR) over the other tested strategies, namely zero-shot and conventional few-shot prompting. Although multi-agent architectures achieve superior test coverage, peaking at 99.54%, they yield an ESR comparable to that of single-agent frameworks (96.88% versus 96.89%). Crucially, the coverage gain comes at the cost of a threefold to fourfold increase in token expenditure. In conclusion, the single-agent configuration paired with optimised rule-augmented few-shot prompting provides the most effective balance between accuracy, coverage, and computational efficiency.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| ASTA_2026____Automated_Black_Box_Testing__A_Comparative_Study_of_LLM_Agent_Architectures_and_Prompt_Engineering (5).pdf | Open access | 2. Post-print / Author's Accepted Manuscript | Public - All rights reserved | 445.63 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3010756
