Evaluating the Role of Large Language Models in Test Configuration Code Generation: An Empirical Study
2025 (English) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Background: Automating software development tasks is becoming increasingly relevant as software systems grow in complexity. Large Language Models (LLMs) have gained popularity for their ability to generate code from textual descriptions, offering potential benefits in various coding scenarios.
Problem Statement: Despite progress in using LLMs for code generation, research on XML-style test configuration code generation remains limited. This study explores how LLMs can be leveraged for this task.
Objectives: This study explores using LLMs to generate XML-style test configuration code from test specifications. By fine-tuning existing LLMs, the research evaluates how effectively the models generate such code. It further investigates the efficacy of using the LLM-generated code as a starting point in a real-world industrial setting, compared to writing the code manually.
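To make the fine-tuning setup concrete, the following is a minimal sketch of parameter-efficient QLoRA fine-tuning with the Hugging Face PEFT library, in line with the "PEFT, QLoRA" keywords of this record. The model name, target modules, and hyperparameters are illustrative assumptions, not values taken from the thesis.

```python
# Sketch: QLoRA fine-tuning setup (assumed configuration, not the thesis's exact one).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-v0.1"  # one of the three models studied

# Load the base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are trained,
# while the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training on (test specification -> XML configuration) pairs would follow,
# e.g. with transformers.Trainer over a tokenized prompt/completion dataset.
```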
Methods: This study employs a multi-method empirical study, incorporating an experiment and a coding workshop, to assess the effectiveness of LLMs at generating XML configuration code and to understand their impact in a real-world context.
Results: The results indicate that Mistral-7B outperforms Phi-3 and Code LLaMA in both model performance and structural similarity to the ground-truth code. The coding workshop showed that using LLM-generated code as a starting point reduced coding time by an average of 40.75 minutes for Code 1 and 4 minutes for Code 2, compared to coding from scratch. It also resulted in a lower Tree Edit Distance, though the improvements were not always consistent. Developers raised concerns about trust, reliability, and domain-specific optimization: while LLMs could generate code quickly, the output required additional effort to comprehend and refine.
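For illustration, structural similarity via Tree Edit Distance can be computed by converting XML documents into labeled trees and running an ordered tree edit distance algorithm. The sketch below uses the zss (Zhang-Shasha) Python library as an assumed implementation choice; the example XML snippets are invented and the thesis may have used a different distance tool or labeling scheme.

```python
# Sketch: Tree Edit Distance between generated and ground-truth XML
# (zss library and example snippets are assumptions, not from the thesis).
import xml.etree.ElementTree as ET
from zss import Node, simple_distance

def to_zss(element):
    """Convert an ElementTree element into a zss Node tree, using tags as labels."""
    node = Node(element.tag)
    for child in element:
        node.addkid(to_zss(child))
    return node

generated = ET.fromstring("<suite><case id='1'/><case id='2'/></suite>")
reference = ET.fromstring("<suite><case id='1'/><case id='3'/><setup/></suite>")

# Lower distance means the generated configuration is structurally closer
# to the ground truth; 0 would mean identical tag trees.
print(simple_distance(to_zss(generated), to_zss(reference)))
```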
Conclusions: The study finds Mistral-7B to be the most effective of the three LLMs for XML-style test configuration code generation. While it may reduce initial effort when used as a starting point, manual refinement is still needed for accuracy and domain alignment, and developers therefore do not fully trust this approach. Future research could explore more advanced LLMs, improved validation, and alternative fine-tuning methods.
Place, publisher, year, edition, pages
2025, p. 50.
Keywords [en]
Large Language Models, Generative AI, PEFT, QLoRA, Test Configuration Code Generation, Test Code Automation, Fine-Tuning.
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:bth-27683
OAI: oai:DiVA.org:bth-27683
DiVA, id: diva2:1949520
External cooperation
Ericsson
Subject / course
PA2534 Master's Thesis (120 credits) in Software Engineering
Educational program
PAADA Master Qualification Plan in Software Engineering 120,0 hp
Available from: 2025-04-07 Created: 2025-04-02 Last updated: 2025-04-07 Bibliographically approved