Software

18 pages, 312 KB

Open Access Article

Investigating the Refactoring Capabilities of Small Open-Weight Language Models

by Tamás Márton, Balázs Szalontai, Balázs Pintér and Tibor Gregorics

Software 2026, 5(2), 19; https://doi.org/10.3390/software5020019 - 29 Apr 2026



Refactoring is essential for developing maintainable software. Using Large Language Models in software engineering is widespread, but compared to well-established domains such as code generation, reliable refactoring is still relatively underexplored. In this paper, we perform a broad analysis of the refactoring capabilities of small open-weight language models (SLMs) by evaluating 12 models on 3453 Python programs. Our study focuses on the two defining aspects of refactoring: behavior preservation and code quality improvement. We evaluate these properties using unit tests and various code metrics. Across models ranging from 0.5B to 8B parameters, most models improve code quality. Larger models are more reliable, as they preserve behavior more consistently. Reasoning models often make more significant changes while refactoring. Allowing models to generate reasoning traces improves performance, but only for models larger than 4B. For smaller models, reasoning in fact reduces refactoring reliability. The difficulty of the underlying task affects refactoring performance, with more complex tasks associated with higher failure rates. Our results indicate that current open SLMs can support refactoring tasks, especially larger ones with reasoning capabilities, but they are best used with human oversight.
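The two properties named in the abstract, behavior preservation checked with unit tests and code quality measured with code metrics, can be illustrated with a minimal Python sketch. This is not the authors' evaluation harness: the branch-count metric, the function names, and the `sign` example are all assumptions chosen for illustration.

```python
import ast

# Node types counted as branches; a rough stand-in for a real complexity metric.
BRANCHING_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)


def branch_count(source):
    """Return 1 plus the number of branching nodes in the parsed source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCHING_NODES) for node in ast.walk(tree))


def passes_tests(source, func_name, cases):
    """Execute the source in a fresh namespace and check every test case."""
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def evaluate_refactoring(original, refactored, func_name, cases):
    """Report whether behavior is preserved and how the metric changed."""
    preserved = (passes_tests(original, func_name, cases)
                 and passes_tests(refactored, func_name, cases))
    delta = branch_count(refactored) - branch_count(original)
    return {"behavior_preserved": preserved, "complexity_delta": delta}


# Hypothetical program and a candidate refactoring of it.
ORIGINAL = '''
def sign(x):
    if x > 0:
        return 1
    else:
        if x < 0:
            return -1
        else:
            return 0
'''

REFACTORED = '''
def sign(x):
    return (x > 0) - (x < 0)
'''

CASES = [((5,), 1), ((-3,), -1), ((0,), 0)]
result = evaluate_refactoring(ORIGINAL, REFACTORED, "sign", CASES)
print(result)
```

A negative `complexity_delta` together with `behavior_preserved` being true corresponds to a refactoring that simplified the code without changing what the tests observe.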

(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)


36 pages, 1713 KB

Open Access Article

Software Unfairness Detection in Machine Learning-Based Systems: A Systematic Mapping Study

by Roa Alharbi and Noureddine Abbadeni

Software 2026, 5(2), 18; https://doi.org/10.3390/software5020018 - 27 Apr 2026


Abstract


Machine learning-based systems are increasingly deployed in high-stakes domains, such as healthcare, finance, law, and e-commerce, where their predictions directly influence critical decisions. Although these systems offer powerful data-driven support, they also introduce serious concerns related to fairness, bias, and discrimination. As a result, detecting and addressing unfairness in machine learning software has become a central research challenge. This study presents a systematic mapping of research on software unfairness detection in machine learning systems, with the aim of consolidating existing fairness definitions, identifying major problem types, examining testing approaches, reviewing commonly used datasets, and highlighting open research gaps. A structured search was conducted across five major digital libraries and additional sources, covering publications from 2010 to 2025. From 1805 initially identified records, 67 primary studies met the inclusion and quality assessment criteria. The findings show that research activity has grown significantly since 2019, reaching a peak in 2022. Most studies were published in conference proceedings, accounting for 52% of the primary studies, followed by journals and workshop proceedings, which accounted for 42% and 6% of the primary studies, respectively. The literature encompasses multiple research themes, with 36% of the primary studies focusing on the analysis of existing fairness methods, 22% addressing bias mitigation strategies, 30% investigating testing techniques, and 12% proposing or assessing evaluation frameworks. Fairness testing was conducted across multiple testing levels, including unit, integration, and system testing. Integration-level testing was the most prevalent, accounting for approximately 37.9% of the studies, followed by system-level testing at 27.3% and unit-level testing at 12.1%. Additionally, 22.7% of the studies applied fairness testing across more than one testing level. Frequently used datasets included COMPAS, Adult Census Income, and German Credit. Widely adopted tools, such as IBM AI Fairness 360, Themis, and Aequitas, were also identified. Overall, the systematic mapping study (SMS) highlights the progress made in fairness research while emphasizing the need for stronger integration of fairness into practical machine learning development.
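As a concrete illustration of what a unit-level fairness check might look like, the sketch below computes statistical parity difference, one widely used group-fairness definition. The abstract does not say which definitions the mapped studies use, so the metric choice, the toy data, and the 0.1 tolerance are assumptions, and the code does not call any of the tools named above.

```python
def selection_rate(predictions, groups, group):
    """Fraction of positive predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def statistical_parity_difference(predictions, groups, unprivileged, privileged):
    """Selection rate of the unprivileged group minus that of the privileged group."""
    return (selection_rate(predictions, groups, unprivileged)
            - selection_rate(predictions, groups, privileged))


# Toy binary predictions (1 = favorable outcome) for two hypothetical groups.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(predictions, groups,
                                    unprivileged="b", privileged="a")

TOLERANCE = 0.1  # assumed acceptance threshold, not taken from the study
within_tolerance = abs(spd) <= TOLERANCE
print(spd, within_tolerance)
```

A value near zero means both groups receive favorable predictions at similar rates; here group "b" is selected far less often than group "a", so the check fails.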

(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)


42 pages, 754 KB

Open Access Systematic Review

Decision-Making in Agile Software Engineering: A Systematic Literature Review of Models, Methods, Actors and Lifecycle Contexts

by Hannes Salin and Yves Rybarczyk

Software 2026, 5(2), 17; https://doi.org/10.3390/software5020017 - 20 Apr 2026


Abstract


Decision-making is a central activity in agile software engineering (SE), yet research on how decisions are made and supported in agile contexts remains fragmented across models, methods, roles, and lifecycle stages. While prior studies have examined isolated aspects such as prioritization or planning, a comprehensive synthesis of decision-making as a phenomenon in agile SE is lacking. This systematic literature review addresses this gap by consolidating and structuring existing research on agile decision-making and by identifying dominant patterns, gaps, and future research directions. A systematic search was conducted in IEEE Xplore, ACM Digital Library, Scopus, and Web of Science, complemented by backward and forward snowballing, covering publications from 2014 to 2024. In total, 42 studies were included and analyzed using a structured coding scheme covering decision models, methods, actors, lifecycle contexts, and research methodologies. The results reveal a strong concentration of analytical and hybrid decision-making models in planning and requirements activities, while decision-making in coding, testing, and operations remains underexplored. Software developers are the most frequently studied decision-making actors, whereas managers are mainly discussed as external stakeholders rather than active decision-makers within agile workflows. The main contributions of this study are the following: a structured synthesis of agile decision-making research over multiple analytical dimensions, the identification of key research gaps in lifecycle coverage and actor perspectives, and the proposal of a coherent nomenclature for decision-making in agile SE. These contributions provide a foundation for future empirical studies and support the development of more comprehensive theories of decision-making in agile software engineering organizations.


45 pages, 1655 KB

Open Access Article

Adaptive Self-Prompting in Agentic LLM Frameworks for Code Fault Detection

by Maher Muhtadi, Qusay H. Mahmoud and Akramul Azim

Software 2026, 5(2), 16; https://doi.org/10.3390/software5020016 - 16 Apr 2026


Abstract


Large language models (LLMs) have demonstrated strong capability for code understanding and vulnerability detection. However, most existing approaches rely on static prompting and treat the model as a passive predictor, limiting adaptability under uncertainty, particularly in embedded and cyber-physical systems (CPS). This paper introduces adaptive self-prompting as a core mechanism for agentic LLM-based fault detection in C-language embedded code. We propose two complementary frameworks: Agentic Retrieval-Augmented Generation (A-RAG), which performs confidence-triggered, reasoning-conditioned retrieval from CWE and SEI CERT knowledge bases at inference time, and Agentic Supervised Fine-Tuning (A-SFT), which internalizes improvements through a self-evaluation sweep that refines instructions and training exemplars during fine-tuning. Experiments are conducted on a unified dataset constructed from the Toyota ITC benchmark and a curated subset of Big-Vul aligned to embedded code-relevant CWE categories. Results show that adaptive self-prompting substantially improves predictive performance and error calibration compared to static Retrieval-Augmented Generation (RAG), conventional fine-tuning, and encoder-based baselines, achieving up to 86.3% F1 score while significantly reducing high-confidence misclassifications. These findings demonstrate that confidence-aware reflection and adaptive reasoning enhance both robustness and safety in LLM-based fault detection for embedded and CPS software.
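The idea of confidence-triggered retrieval can be sketched as a small control loop: classify first, and consult a knowledge base only when the model's confidence falls below a threshold. The sketch below is an assumption-laden illustration, not the A-RAG framework itself. The stub model, the keyword lookup, the knowledge-base notes, the 0.8 threshold, and the round limit are all hypothetical placeholders.

```python
# Hypothetical notes standing in for entries from a CWE / SEI CERT knowledge base.
KNOWLEDGE_BASE = {
    "strcpy": "Unbounded string copies can overflow the destination buffer.",
    "malloc": "Allocation results should be checked before use and freed later.",
}


def retrieve(code):
    """Toy keyword lookup; a real system would use semantic retrieval."""
    return [note for key, note in KNOWLEDGE_BASE.items() if key in code]


def detect_fault(code, model, threshold=0.8, max_rounds=2):
    """Classify code, retrieving extra context only while confidence is low."""
    context = []
    label, confidence = model(code, context)
    rounds = 0
    while confidence < threshold and rounds < max_rounds:
        new_notes = [note for note in retrieve(code) if note not in context]
        if not new_notes:
            break  # nothing further to add, so stop re-prompting
        context.extend(new_notes)
        label, confidence = model(code, context)
        rounds += 1
    return label, confidence, rounds


def stub_model(code, context):
    """Placeholder for an LLM call; it is more confident when given context."""
    return ("faulty", 0.9) if context else ("faulty", 0.5)


snippet = "char buf[8]; strcpy(buf, input);"
label, confidence, rounds = detect_fault(snippet, stub_model)
print(label, confidence, rounds)
```

Gating retrieval on confidence keeps the common case to a single model call and spends extra context only on uncertain inputs.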


26 pages, 833 KB

Open Access Article

Design of a RAG-Based Customer Service Chatbot Enhanced with Knowledge Graph and GPT Evaluation: A Case Study in the Import Trade Industry

by Nien-Lin Hsueh and Wei-Che Lin

Software 2026, 5(2), 15; https://doi.org/10.3390/software5020015 - 2 Apr 2026


Abstract


Amid the wave of digital transformation and customer service automation, traditional chatbots are increasingly challenged by their inability to handle unstructured data and complex queries. This issue is particularly critical in the import trade industry, where customer service representatives must respond promptly to diverse inquiries involving quality anomalies, order tracking, and product substitution. Existing rule-based or keyword-driven chatbots often fail to provide accurate responses, resulting in reduced customer satisfaction and increased operational burdens. This study proposes and implements a “Retrieval-Augmented Generation (RAG)-based Customer Service Chatbot,” integrating the RAG framework with a Neo4j-based knowledge graph, specifically tailored for the import trade domain. The system constructs a dedicated QA dataset, knowledge graph, and dynamic learning mechanism. It semantically vectorizes internal documents, meeting records, quality assurance procedures, and historical dialogues, establishing interrelated knowledge nodes to enhance the chatbot’s comprehension and response accuracy. The study also incorporates GPT-based response evaluation and a high-score caching strategy, enabling dynamic learning and knowledge enhancement. Experiments were conducted using 101 representative enterprise-level queries across six categories, reflecting real-world operational scenarios and inquiry needs. The results demonstrate that the combination of knowledge graphs and RAG technology effectively reduces AI hallucinations and improves response coverage and accuracy, thereby addressing complex problems in customer service applications. This paper not only presents a feasible AI implementation model for the import trade industry but also offers a practical architectural reference for domain-specific knowledge management in the import trade and allied sectors.
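A high-score caching strategy of the kind mentioned in the abstract can be sketched as follows: generate an answer, have an evaluator score it, and store it for reuse only if the score clears a threshold. This is a simplified illustration under stated assumptions, not the system described in the paper. The class name, the stub generator and evaluator, the assumed 0 to 10 scale, the threshold of 8.0, and the use of normalized query text as the cache key (rather than semantic vectors) are all hypothetical.

```python
class ScoredAnswerCache:
    """Reuse only answers that an evaluator rated at or above min_score."""

    def __init__(self, generate, evaluate, min_score=8.0):
        self._generate = generate    # hypothetical RAG pipeline: query -> answer
        self._evaluate = evaluate    # hypothetical judge: (query, answer) -> score
        self._min_score = min_score  # assumed threshold on an assumed 0-10 scale
        self._cache = {}

    @staticmethod
    def _key(query):
        """Normalize case and whitespace so trivially different queries match."""
        return " ".join(query.lower().split())

    def answer(self, query):
        """Return (answer, served_from_cache)."""
        key = self._key(query)
        if key in self._cache:
            return self._cache[key], True
        response = self._generate(query)
        if self._evaluate(query, response) >= self._min_score:
            self._cache[key] = response
        return response, False


calls = []  # records each time the stub generator is invoked


def stub_generate(query):
    calls.append(query)
    return f"Stub answer about: {query.lower()}"


def stub_evaluate(query, response):
    # Toy scoring rule standing in for a GPT-based evaluation.
    return 9.0 if "order" in response else 3.0


bot = ScoredAnswerCache(stub_generate, stub_evaluate)
first = bot.answer("Where is my order?")
second = bot.answer("where is  my ORDER?")
weak = bot.answer("Hello")
weak_again = bot.answer("Hello")
print(first, second, weak_again, len(calls))
```

Only the high-scoring answer is served from the cache on the repeated query; the low-scoring one is regenerated each time, so weak answers are never reused.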

(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)


27 pages, 423 KB

Open Access Article

Wybe: Design of a Programming Language

by Peter Schachte

Software 2026, 5(2), 14; https://doi.org/10.3390/software5020014 - 31 Mar 2026


Abstract


We propose a set of design principles to guide the design of a programming language intended for general, practical use. These principles center around supporting the development of robust programs, supporting the independent development and evolution of separate components of an application, and developing programs with adequate performance. We identify one key principle, interface integrity, as the most important characteristic of declarative programming languages. Following these principles has led to the development of the Wybe programming language, which provides a range of features common in functional, procedural, and logic programming languages. In particular, we argue that it provides much of the benefit of declarative programming languages, while providing much of the flexibility of imperative programming.

(This article belongs to the Topic Software Engineering and Applications)


32 pages, 1440 KB

Open Access Article

Interaction Effects of Team Attributes and System Attributes in Software Maintenance Productivity

by Michel Benaroch

Software 2026, 5(2), 13; https://doi.org/10.3390/software5020013 - 31 Mar 2026


Abstract


We investigate interaction effects of system attributes (age and volatility) and team attributes (instability and skill-diversity) on software maintenance productivity in the context of lifecycle maintenance involving multiple serial tasks (projects), unlike extant work’s focus on single maintenance tasks. Given the knowledge intensity of software maintenance, we apply knowledge creation theory to identify knowledge needs and challenges that system and team attributes create, then develop two theoretical predictions. First, team attributes adversely affect maintenance productivity while system attributes do not exhibit direct negative effects. Second, interactions of system and team attributes have offsetting (substitutive) effects on productivity since their knowledge needs and challenges overlap. We test our predictions using archival data on three years of maintenance work across 426 mission-critical systems at a Fortune 100 company, encompassing over 7500 maintenance tasks executed by thousands of maintainers. Our analysis yields two key insights. First, interactions substantially diminish the strong negative direct effects of team instability and skill-diversity on maintenance productivity, by as much as 20%. System volatility exhibits a small direct effect (0.38% productivity decline), while system age shows no significant direct effect. Second, interaction effects indicate that the productivity declines associated with unstable and skill-diverse teams are sharper when those teams work on younger and less volatile systems, the opposite of conventional wisdom. For example, assigning a team with above-average instability to an above-average age system improves productivity by 3.06% through substitutive effects. Our findings demonstrate that congruence between system and team attributes can improve maintenance productivity, with substantial economic implications: organizations should strategically match team configurations to system characteristics rather than attempting to eliminate team instability or diversity universally.



Software, Volume 5, Issue 2 (June 2026) – 7 articles
