Digitala Vetenskapliga Arkivet

Measuring the quality of generative AI systems: Mapping metrics to quality characteristics — Snowballing literature review
Blekinge Institute of Technology, Faculty of Computing, Department of Software Engineering. ORCID iD: 0000-0001-5949-1375
Blekinge Institute of Technology, Faculty of Computing, Department of Software Engineering. ORCID iD: 0000-0001-7526-3727
Örebro University.
Blekinge Institute of Technology, Faculty of Computing, Department of Software Engineering. ORCID iD: 0000-0002-3646-235X
2025 (English). In: Information and Software Technology, ISSN 0950-5849, E-ISSN 1873-6025, Vol. 186, article id 107802. Article, review/survey (Refereed). Published.
Abstract [en]

Context: Generative Artificial Intelligence (GenAI) and the use of Large Language Models (LLMs) have revolutionized tasks that previously required significant human effort, which has attracted considerable interest from industry stakeholders. This growing interest has accelerated the integration of AI models into various industrial applications. However, model integration introduces challenges to product quality, as conventional quality-measurement methods may fail to assess GenAI systems. Consequently, evaluation techniques for GenAI systems need to be adapted and refined. Examining the current state and applicability of evaluation techniques for GenAI system outputs is essential.

Objective: This study aims to explore the current metrics, methods, and processes for assessing the outputs of GenAI systems and the potential risks posed by those outputs.

Method: We performed a snowballing literature review to identify metrics, evaluation methods, and evaluation processes from 43 selected papers.

Results: We identified 28 metrics and mapped these metrics to four quality characteristics defined by the ISO/IEC 25023 standard for software systems. Additionally, we discovered three types of evaluation methods to measure the quality of system outputs and a three-step process to assess faulty system outputs. Based on these insights, we suggested a five-step framework for measuring system quality while utilizing GenAI models.

Conclusion: Our findings provide a mapping of candidate metrics to the quality characteristics of GenAI systems, accompanied by step-by-step processes to help practitioners conduct quality assessments.

Place, publisher, year, edition, pages
Elsevier, 2025. Vol. 186, article id 107802
Keywords [en]
Evaluation, GenAI, Generative AI, Large language model, LLM, Metric, Quality characteristics, Artificial intelligence, Computer software, ISO Standards, Mapping, Quality control, Artificial intelligence systems, Generative artificial intelligence, Language model, Quality characteristic, Reviews
National Category
Artificial Intelligence
Identifiers
URN: urn:nbn:se:bth-28306
DOI: 10.1016/j.infsof.2025.107802
ISI: 001519902000001
Scopus ID: 2-s2.0-105008505516
OAI: oai:DiVA.org:bth-28306
DiVA, id: diva2:1981380
Part of project
SERT- Software Engineering ReThought, Knowledge Foundation
Funder
Knowledge Foundation, 20180010
Available from: 2025-07-04 Created: 2025-07-04 Last updated: 2025-12-03
Bibliographically approved
In thesis
1. Quality Evaluation of Generative AI Systems: Processes, Metrics, Methods, and Frameworks for Industrial Software Engineering
2026 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Generative Artificial Intelligence (GenAI) is being rapidly adopted in software engineering, introducing a paradigm shift toward human-AI co-creation. However, the non-deterministic, probabilistic, and often black-box nature of GenAI models presents challenges for traditional software quality assurance. Conventional verification and validation techniques are insufficient to handle outputs that are neither predictably correct nor incorrect, but rather stochastically plausible. This discrepancy creates an urgent need for practical processes, metrics, and new governance frameworks to evaluate and manage the quality of GenAI systems in industrial environments.

This thesis examines how industrial organizations adopt GenAI, identify metrics, and evaluate system qualities in alignment with ISO quality standards. Case studies were employed to explore real-world adoption processes, identify context-specific industrial metrics, and uncover practical insights within organizations. A snowballing literature review was conducted to systematically identify, categorize, and synthesize academic metrics for evaluating the output of GenAI systems. Finally, a controlled experiment was designed to quantitatively test the efficiency (e.g., E2E generation time) and effectiveness (e.g., accuracy) of GenAI agent choices.

The main contributions of this thesis are a synthesized actionable model and framework grounded in both industrial practice and quality standards. The first contribution is a four-stage adoption model, denoted as the IMRM model (Innovate → considerations, Measure → metrics, Realize → values, Manage → improvements), that integrates early-stage risk assessment (e.g., legal, security, and licensing) and quality evaluation throughout GenAI adoption and usage. The second contribution presents a detailed framework that connects risks and metrics to concrete decision support, justifying the business value (e.g., quality gates) and technical trade-offs of GenAI solutions.
The third contribution provides a structured mapping of GenAI quality to ISO/IEC 25010, 25023, and 25059 characteristics, attempting to ground practical evaluation needs within a standardized vocabulary. This thesis concludes that a structured quality evaluation process, which prioritizes risks and context, is a valuable approach intended to support building the business confidence required to leverage GenAI for efficient and effective software engineering in industry.

Place, publisher, year, edition, pages
Karlskrona: Blekinge Tekniska Högskola, 2026. p. 232
Series
Blekinge Institute of Technology Doctoral Dissertation Series, ISSN 1653-2090 ; 2026:01
Keywords
Quality Evaluation, Metrics, Artificial Intelligence, AI, Generative AI, Empirical Software Engineering
National Category
Software Engineering
Identifiers
urn:nbn:se:bth-28958 (URN)
Public defence
2026-01-29, J1630, Karlskrona, 13:00 (English)
Opponent
Supervisors
Available from: 2025-12-08 Created: 2025-12-03 Last updated: 2025-12-18
Bibliographically approved

Open Access in DiVA

fulltext (3375 kB), 348 downloads
File information
File name: FULLTEXT01.pdf
File size: 3375 kB
Checksum (SHA-512):
8cbd2dd2795297e6529ab3a9e623e9680d589a68c37acd5387675c087e06e6ca4def138b053eb0545d8f32a2274a58aa883b98206087de288de4c65e3a7d0a25
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text
Scopus

Search in DiVA

By author/editor
Yu, Liang; Alégroth, Emil; Gorschek, Tony
By organisation
Department of Software Engineering
Total: 348 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are now no longer available.

Total: 539 hits