Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering / Arnaudo, Anna; Coppola, Riccardo; Morisio, Maurizio; Giobergia, Flavio; Nguyen, Van-Thanh; Chen, Enrico; Ma, Xiaoning; Ji, Xiaoquan; Mai, Minh-Thai. - Electronic. - (In press). (19th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2026, Daejeon (KOR), 18-22 May 2026).

Automated Black-Box Testing: A Comparative Study of LLM Agent Architectures and Prompt Engineering

Anna Arnaudo; Riccardo Coppola; Maurizio Morisio; Flavio Giobergia; Van-Thanh Nguyen; Enrico Chen; Xiaoning Ma; Xiaoquan Ji; Minh-Thai Mai
In press

Abstract

Automated black-box test case generation remains challenging due to diverse implementation behaviours and complex boundary conditions. This work investigates the effectiveness of multi-agent Large Language Model (LLM) architectures that employ collaborative and competitive interaction patterns, compared against a single-agent baseline, in the context of function-level black-box testing. Experiments on the HumanEval benchmark show that prompt engineering is the primary driver of performance: rule-augmented few-shot prompting yields improvements of up to 20-30% in both coverage and Execution Success Rate (ESR) over the other tested strategies, namely zero-shot and conventional few-shot prompting. Although multi-agent architectures achieve superior test coverage, peaking at 99.54%, they yield an ESR comparable to single-agent frameworks (96.88% versus 96.89%). Crucially, this coverage gain comes at the expense of a threefold to fourfold increase in token expenditure. In conclusion, the single-agent configuration paired with optimised rule-augmented few-shot prompting provides the most effective balance between accuracy, coverage, and computational efficiency.
Files in this item:
File: ASTA_2026____Automated_Black_Box_Testing__A_Comparative_Study_of_LLM_Agent_Architectures_and_Prompt_Engineering (5).pdf
Access: open access
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 445.63 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3010756