Review

A Scoping Review and Assessment Framework for Technical Debt in the Development and Operation of AI/ML Competition Platforms †

by
Dionysios Sklavenitis
* and
Dimitris Kalles
School of Science & Technology, Hellenic Open University, 26335 Patras, Greece
*
Author to whom correspondence should be addressed.
This article is a revised and expanded version of a conference paper entitled “Measuring Technical Debt in AI-Based Competition Platforms”, presented at the SETN 2024 Conference, Athens, Greece, 11–13 September 2024.
Appl. Sci. 2025, 15(13), 7165; https://doi.org/10.3390/app15137165
Submission received: 18 May 2025 / Revised: 18 June 2025 / Accepted: 19 June 2025 / Published: 25 June 2025

Featured Application

The typology and stakeholder-oriented assessment framework proposed in this study provide practical tools for organizers and participants of Artificial Intelligence/Machine Learning (AI/ML) competition platforms to systematically identify and manage technical debt. Educational institutions can integrate the framework into AI/ML courses to raise awareness about sustainability challenges, while organizers of research-driven or industrial competitions can utilize the questionnaire to proactively detect and mitigate potential platform-level liabilities. Furthermore, the methodology is adaptable for broader AI development environments beyond competition settings, promoting more maintainable, transparent, and inclusive AI ecosystems.

Abstract

Technical debt (TD) has emerged as a significant concern in the development of AI/ML applications, where rapid experimentation, evolving objectives, and complex data pipelines often introduce hidden quality and maintainability issues. Within this broader context, AI/ML competition platforms face heightened risks due to time-constrained environments and evolving requirements. Despite its relevance, TD in such competitive settings remains underexplored and lacks systematic investigation. This study addresses two research questions: (RQ1) What are the most significant types of technical debt recorded in AI-based systems? and (RQ2) How can we measure the technical debt of an AI-based competition platform? We present a scoping review of 100 peer-reviewed publications related to AI/ML competitions, aiming to map the landscape of TD manifestations and management practices. Through thematic analysis, the study identifies 18 distinct types of technical debt, each accompanied by a definition, rationale, and example grounded in competition scenarios. Based on this typology, a stakeholder-oriented assessment framework is proposed, including a detailed questionnaire and a methodology for the quantitative evaluation of TD across multiple categories. A novel contribution is the introduction of Accessibility Debt, which addresses the challenges associated with the ease and speed of immediate use of the AI/ML competition platforms. The review also incorporates bibliometric insights, revealing the fragmented and uneven treatment of TD across the literature. The findings offer a unified conceptual foundation for future work and provide practical tools for both organizers and participants to systematically detect, interpret, and address technical debt in competitive AI settings, ultimately promoting more sustainable and trustworthy AI research environments.

1. Introduction

Recent advances in Artificial Intelligence (AI) have led to the integration of AI components into an increasing number of software engineering projects, giving rise to what are now referred to as AI-based systems [1,2]. Although these projects are often executed by experienced development teams, they introduce several new challenges distinct from those encountered in traditional software systems [3]. A key emerging challenge is the accumulation of new forms of technical debt (TD) specific to AI development, which compounds the existing debt typically present in conventional software [4], ultimately affecting system sustainability, maintainability, and evolvability. Technical debt in AI systems encompasses long-term maintenance difficulties and hidden costs resulting from rapid development, insufficient abstractions, and evolving data dependencies [5]. It is increasingly recognized that addressing such debt requires robust modularization, continuous monitoring, and improved design practices, all of which complicate the long-term adaptation of AI-based systems. Several recent studies have sought to describe and classify the types and characteristics of technical debt in AI systems [6], proposing guidelines, best practices, and software engineering principles to mitigate these risks [7].
A particular area of interest within this landscape is the development and operation of AI/ML competition platforms, whether commercial, academic, research-oriented, or hybrid [6,8,9,10,11,12]. An AI/ML competition platform is a structured software environment designed to host and evaluate artificial intelligence challenges, typically involving tasks in machine learning or reinforcement learning. While such platforms are often delivered online, they may also function as standalone frameworks. Common features include predefined datasets, submission mechanisms, scoring systems, and access to computational resources. These platforms’ primary aim is to identify effective AI solutions through competitive evaluation, while simultaneously serving educational and research purposes [13,14,15]. They vary widely by scientific domain (e.g., medicine, robotics, environmental science), target community (e.g., researchers, students, professionals), platform infrastructure (e.g., open-source, closed-source, cloud-based), feature set (e.g., code sharing, custom metrics), and pricing model (free or paid) [6]. Incentives offered to participants include monetary prizes, career opportunities, and academic credit. Competitions typically span predefined periods, ranging from hours to several months [12,16], and emphasize rapid prototyping and iterative solution development. Participants often prioritize experimentation and responsiveness over producing polished final products. Upon competition completion, submissions are evaluated against predefined criteria such as algorithmic efficiency and resource optimization [16]. Among academic institutions, AI/ML competitions have gained significant popularity due to their engaging, experiential learning approach [17]. They not only accelerate research but also enhance student participation and instructional quality [14,16,18]. Nonetheless, a recurring observation is that student participants, particularly in academic settings, often engage only once to meet course requirements, frequently without adhering to established software engineering practices for AI (SE4AI) [4,19,20,21]. This limited engagement and absence of structured development processes may lead to technical debt accumulation both in participant codebases and in the hosting competition platforms. Moreover, competition organizers currently lack structured methods for quantifying technical debt present within their platforms, further impeding their ability to maintain long-term system sustainability and credibility.
Motivated by these challenges, the present study seeks to systematically explore and address technical debt in AI/ML competition platforms. We conduct a scoping review of 100 peer-reviewed publications related to AI/ML systems and competition settings. Through thematic analysis, we identify and define 18 distinct types of technical debt, each accompanied by rationale, definition, and practical examples grounded in AI/ML competitions. A novel contribution of this work is the proposal of Accessibility Debt as a new category, which addresses the challenges associated with the ease and speed of immediate use of AI/ML competition platforms.
Building on the developed typology, we introduce a stakeholder-oriented assessment framework, including a structured questionnaire designed to facilitate the quantitative evaluation of technical debt across the identified categories. This dual-pronged approach—conceptual mapping and practical measurement—aims to support competition organizers, participants, and educators in systematically detecting, prioritizing, and mitigating technical debt in competitive AI development environments.
To clarify the scope of this work, our contribution combines (i) a new typology of 18 technical debt categories, including the novel concept of Accessibility Debt, and (ii) a lightweight, stakeholder-specific questionnaire that enables the assessment of these debt types in practical settings. This study thus offers a unified framework that bridges conceptual categorization and applied evaluation, enabling stakeholders to better understand and manage technical debt in AI/ML competitions.
The remainder of this paper is structured as follows: Section 2 reviews related work on technical debt in AI/ML systems and competition platforms. Section 3 describes the scoping review methodology and the process for identifying and categorizing debt types. Section 4 presents the main findings across the identified debt categories. Section 5 introduces the stakeholder-oriented questionnaire and the proposed method for quantifying technical debt. Section 6 discusses the broader implications of our findings, while Section 7 concludes the paper and outlines directions for future research.
This article is a revised and expanded version of a peer-reviewed conference paper entitled “Measuring Technical Debt in AI-Based Competition Platforms,” which was presented at the 13th Hellenic Conference on Artificial Intelligence (SETN 2024), Athens, Greece, 11–13 September 2024 [22]. The current version includes a significantly expanded literature review, a comprehensive scoping review methodology, the proposal of an additional technical debt category (Accessibility Debt), and a quantitative assessment framework tailored to AI/ML competition platforms.

Research Questions

To guide our scoping review and assessment framework development, we defined the following research questions:
  • RQ1: What are the most significant types of technical debt recorded in AI-based systems?
  • RQ2: How can we measure the technical debt of an AI-based competition platform?

2. Background and Related Work

As AI technologies have been increasingly integrated into software applications, developers have encountered new technical challenges that extend beyond traditional software engineering concerns [23]. This intersection of AI and software engineering has motivated the emergence of new fields, notably Software Engineering for AI (SE4AI) and Software Engineering for Machine Learning (SE4ML) [1], aimed at adapting and extending conventional engineering practices. AI/ML competition platforms, in parallel, have played a critical role by providing dynamic environments that not only facilitate AI research and innovation but also foster the development of essential soft skills [14,24,25] such as collaboration, communication, ethical reasoning [26], and human-centered design.

2.1. Technical Debt in AI/ML Systems

Technical debt in AI/ML systems extends beyond traditional code-related issues. It also includes challenges linked to data quality [27,28], model interpretability and maintainability, reproducibility, infrastructure scalability, and ethical considerations [5,29]. As Artificial Intelligence systems evolve continuously, they require frequent retraining, adaptation to new data, and ongoing reassessment of their operational environments. This dynamic nature makes them particularly vulnerable to accumulating multiple forms of technical debt. The idea of “hidden technical debt” in machine learning was first introduced by Sculley et al. [5], who highlighted key risks such as pipeline jungles, entanglement, and boundary erosion. Later research [20,30,31] extended this concept by proposing taxonomies that classify AI-specific forms of technical debt, including data debt, model debt, infrastructure debt, and ethics-related debt.
The 18 debt types analyzed in this paper are grounded in prior work on AI-specific technical debt [5,20,30,31] and were further shaped through a thematic scoping review of 100 peer-reviewed studies (see Section 3.1). Our typology builds upon foundational classifications proposed by Li et al. [32] and Rios et al. [33], which primarily address traditional software engineering contexts. Among the few efforts aiming to extend these frameworks to AI-based systems, the taxonomy by Bogner et al. [30] stands out, having introduced four AI-specific debt types: data debt, model debt, configuration debt, and ethics debt. Our work complements and significantly extends these prior efforts by offering a broader and competition-specific taxonomy composed of 18 plus one distinct categories. Notably, we introduce Algorithm Debt, Self-Admitted Technical Debt (SATD), and Accessibility Debt—types not previously addressed in existing frameworks. Furthermore, we refine closely related categories such as Architecture and Design Debt, incorporate stakeholder roles for each debt type, and introduce a severity-based weighting system to support operationalization within real-world assessment processes. By aligning debt categories with concrete platform examples and assessment logic, our framework bridges conceptual classification with actionable diagnostics, directly addressing key gaps in the literature.
The following subsections discuss the major categories of technical debt observed in AI/ML systems.

2.1.1. Requirements Debt in AI/ML Systems

Requirements debt arises from the misalignment between traditional requirements engineering (RE) approaches and the uncertainty-prone, data-driven nature of AI/ML systems [3,34,35]. Unlike conventional systems where specifications can be precisely defined, AI applications require iterative refinement based on model behavior and evolving datasets. Gaps in data quality, transparency, and ethical considerations introduce long-term challenges. To mitigate requirements debt, researchers propose continuous requirements engineering, early stakeholder involvement, and dynamic adaptation mechanisms throughout the AI system lifecycle.

2.1.2. Architectural Debt in AI/ML Systems

Architectural debt arises when ad hoc design decisions, absence of AI-specific architectural patterns, or tight coupling between components hinder maintainability and scalability in AI/ML competition platforms. Unlike traditional systems, ML-enabled infrastructures often suffer from rigid orchestration of data pipelines, model serving layers, and submission logic, impeding system evolution [36,37]. Empirical studies highlight issues such as lack of abstraction, inadequate layering, and missing separation between learning and inference stages [37,38]. Suggested mitigation strategies include use of modular design patterns, containerization of pipeline stages, and architectural refactoring driven by operational feedback [39].

2.1.3. Design Debt in AI/ML Systems

Design debt refers to suboptimal design choices in software components, APIs, or interaction flows that reduce clarity, reusability, and consistency. In AI/ML systems, this includes poor alignment between training and inference paths, tangled input/output handling, and undocumented design rationales [36,38]. Design patterns tailored for ML systems—such as early preprocessing filters or result inspectors—are often missing or misapplied [37]. Over time, such choices increase rework and hinder maintainability, especially when design concerns are entangled with model logic or competition-specific artifacts [31].

2.1.4. Data Debt in AI/ML Systems

Data debt reflects the accumulation of quality, availability, and governance issues in datasets, which directly affect model performance and generalization [27,40]. Outdated, biased, or insufficiently curated datasets can lead to model drift, reproducibility failures, and degraded decision-making capabilities [41,42]. To address data debt, proactive data validation, continuous data pipeline monitoring, bias detection frameworks, and automated schema enforcement techniques are recommended [28,43,44].
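To make the recommended schema enforcement concrete, the following minimal Python sketch shows one way a competition pipeline could validate an incoming dataset before training; the column names, expected dtypes, and null-handling rules are hypothetical and are not drawn from any reviewed platform.

```python
import pandas as pd

# Hypothetical schema for a competition dataset: column -> (expected dtype, nulls allowed)
SCHEMA = {
    "user_id": ("int64", False),
    "feature_1": ("float64", True),
    "label": ("int64", False),
}

def validate_dataset(df: pd.DataFrame, schema=SCHEMA) -> list:
    """Return a list of schema violations; an empty list means the data passes."""
    problems = []
    for column, (expected_dtype, nulls_allowed) in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
        if not nulls_allowed and df[column].isna().any():
            problems.append(f"{column}: contains null values")
    return problems

if __name__ == "__main__":
    # The missing label below triggers both a dtype and a null violation.
    data = pd.DataFrame({"user_id": [1, 2], "feature_1": [0.3, None], "label": [0, None]})
    for issue in validate_dataset(data):
        print("data debt warning:", issue)
```

Checks of this kind are deliberately cheap to run, so they can be attached to every dataset refresh rather than performed only once at competition launch.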

2.1.5. Algorithm Debt in AI/ML Systems

Algorithm debt refers to the challenges that emerge when learning algorithms are unnecessarily complex, poorly tuned, or inefficiently implemented, ultimately affecting system performance, reproducibility, and long-term maintainability [29,45]. Common causes include non-optimized code, dependence on unstable algorithmic frameworks, and the use of non-transparent or ad hoc hyperparameter tuning strategies. To reduce algorithm debt, best practices involve designing modular and reusable algorithmic components, establishing reproducible experimental protocols, benchmarking rigorously against reliable baselines, and incorporating automated hyperparameter optimization pipelines.
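As one illustration of an automated hyperparameter optimization pipeline with reproducible, cross-validated benchmarking, the sketch below uses scikit-learn's GridSearchCV on synthetic data; the estimator, search space, and scoring metric are arbitrary choices made for the example rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a competition task.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# A declarative search space replaces ad hoc, undocumented manual tuning.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,                 # fixed cross-validation protocol for reproducibility
    scoring="accuracy",
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Because the search space and evaluation protocol are stated explicitly, the tuning procedure itself can be versioned and reviewed like any other code artifact.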

2.1.6. Model Debt in AI/ML Systems

Model debt refers to long-term maintainability challenges that arise throughout the lifecycle of machine learning systems, including their development, deployment, and monitoring [1,5,46]. It is often caused by issues such as concept drift, overfitting, feedback loops, and insufficient versioning of models and associated datasets [47]. Addressing model debt typically involves implementing automated monitoring tools, establishing retraining pipelines, managing metadata to track model lineage, and validating performance continuously against evolving real-world data [48,49].
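A minimal version of the monitoring described above can be expressed as a check that compares live accuracy against a recorded baseline and flags when retraining should be scheduled; the threshold, baseline value, and sample predictions in the Python sketch below are hypothetical.

```python
import numpy as np

def performance_drift(baseline_accuracy: float,
                      recent_predictions: np.ndarray,
                      recent_labels: np.ndarray,
                      tolerance: float = 0.05) -> bool:
    """Flag model debt when live accuracy falls more than `tolerance` below the baseline."""
    live_accuracy = float(np.mean(recent_predictions == recent_labels))
    return (baseline_accuracy - live_accuracy) > tolerance

# Hypothetical call inside a platform's scoring or monitoring service.
preds = np.array([1, 0, 1, 1, 0, 0, 1, 0])
labels = np.array([1, 1, 1, 0, 0, 0, 1, 1])
if performance_drift(baseline_accuracy=0.90, recent_predictions=preds, recent_labels=labels):
    print("accuracy drop detected: schedule retraining and record the new model lineage")
```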

2.1.7. Infrastructure Debt in AI/ML Systems

Infrastructure debt in AI/ML systems often stems from ad hoc integration of machine learning components, limited scalability in deployment pipelines, poor reproducibility, and underdeveloped monitoring tools [30,50]. These challenges are further compounded by manual setup processes, inconsistent tooling, and fragmented DevOps practices that increase the likelihood of operational failures. Mitigating infrastructure debt involves adopting MLOps principles, including automated resource provisioning, containerization, CI/CD pipelines tailored for ML workflows, and the use of scalable, cloud-based environments to support continuous delivery and observability.

2.1.8. Test Debt and Quality Assurance in AI/ML Systems

Test debt in AI/ML systems often arises from the limitations of traditional software testing methods when applied to non-deterministic machine learning models [2,7,51,52]. Issues such as the Oracle Problem [53], challenges in validating robustness, and vulnerability to adversarial examples can lead to incomplete or insufficient test coverage [2,54]. To address this form of debt, researchers emphasize the need for AI-specific testing approaches, including metamorphic testing, property-based validation, adversarial robustness evaluation, and the integration of continuous validation pipelines within ML deployment workflows [51,55,56].
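As a brief illustration of metamorphic testing, the Python sketch below encodes a single metamorphic relation: a k-nearest-neighbours classifier's predictions should not depend on the order in which training examples are presented. The dataset, model, and relation are chosen for simplicity and do not correspond to any specific study in the corpus.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Source test case and its follow-up: the same training data in a different order.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test = X[:250], y[:250], X[250:]

perm = np.random.default_rng(0).permutation(len(X_train))

original = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)
shuffled = KNeighborsClassifier(n_neighbors=3).fit(X_train[perm], y_train[perm]).predict(X_test)

# The metamorphic relation: both runs must yield identical predictions.
# A violation points to hidden order dependence or non-determinism in the pipeline.
assert np.array_equal(original, shuffled), "metamorphic relation violated"
print("metamorphic check passed")
```

Such relations sidestep the Oracle Problem because they compare two outputs of the system with each other instead of requiring the expected output to be known in advance.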

2.1.9. Build Debt in AI/ML Systems

Build debt in AI/ML systems often results from fragmented and manually managed build and deployment pipelines, which hinder maintainability and introduce delays in operational workflows [29,57]. In AI development, this form of debt is further intensified by complex software dependencies, frequent changes in ML frameworks, and the absence of standardized execution environments. Effective mitigation strategies include the use of containerized builds, reproducible configurations, automated dependency resolution, and continuous integration practices tailored to the needs of machine learning systems.

2.1.10. Versioning Debt in AI/ML Systems

Versioning debt arises when inadequate version control mechanisms for code, datasets, models, and pipelines undermine reproducibility, traceability, and auditability [48,58,59]. Unlike traditional software projects, AI systems require fine-grained tracking of training data, model artifacts, feature transformations, and experimental setups [60]. Mitigating versioning debt involves integrating ML-specific version control tools, automated metadata tracking systems, and structured model lifecycle management frameworks [61].
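A lightweight form of such metadata tracking can be sketched by hashing the exact data and model artifacts used in a run and recording them alongside the hyperparameters; the file names and manifest layout below are hypothetical placeholders for fuller ML-specific version control tooling.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content hash that ties a result to the exact artifact that produced it."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_run(dataset: Path, model: Path, params: dict,
               out: Path = Path("run_manifest.json")) -> None:
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset_sha256": sha256_of(dataset),
        "model_sha256": sha256_of(model),
        "hyperparameters": params,
    }
    out.write_text(json.dumps(manifest, indent=2))

# Hypothetical usage inside a submission pipeline:
# record_run(Path("train.csv"), Path("model.pkl"), {"learning_rate": 0.1})
```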

2.1.11. Configuration Debt in AI/ML Systems

Configuration debt results from poorly managed, undocumented, or hardcoded settings that compromise system portability, reproducibility, and maintainability [3,5,62]. AI/ML systems often face evolving hyperparameters, environment drift, and inconsistent deployment setups [4,63,64]. Best practices for mitigating configuration debt include configuration-as-code methodologies, automated validation of settings, schema-based configuration management, and standardized environment versioning [5,30].
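The following Python sketch illustrates the configuration-as-code pattern in its simplest form: settings live in a version-controlled JSON file, defaults are explicit, and types are validated at load time. The keys, default values, and file name are assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical experiment configuration, kept under version control instead of
# being hardcoded and scattered across notebooks.
DEFAULTS = {"random_seed": 42, "learning_rate": 0.01, "epochs": 10}
REQUIRED_TYPES = {"random_seed": int, "learning_rate": float, "epochs": int}

def load_config(path: str) -> dict:
    """Load a JSON config file, fill in defaults, and fail fast on type drift."""
    config = {**DEFAULTS, **json.loads(Path(path).read_text())}
    for key, expected in REQUIRED_TYPES.items():
        if not isinstance(config[key], expected):
            raise TypeError(f"config key '{key}' must be of type {expected.__name__}")
    return config

# config = load_config("experiment.json")   # e.g., a file containing {"learning_rate": 0.05}
```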

2.1.12. Code Debt in AI/ML Systems

Code debt in AI/ML systems originates from unstructured, prototype-focused development approaches, reliance on Jupyter notebooks, and lack of code modularization and standardization [4,65,66,67,68]. This leads to brittle codebases, poor reusability, and high maintenance costs. To address code debt, developers should adopt modular coding practices, enforce code review processes, document experimental pipelines systematically, and integrate continuous integration/continuous deployment (CI/CD) practices into AI projects [19,61,69].

2.1.13. Process Debt in AI/ML Systems

Process debt in AI/ML systems often arises from unstructured development workflows, insufficient lifecycle governance, and weak coordination between AI experts and software engineers [70,71,72]. These shortcomings frequently lead to issues such as inconsistent versioning of data and models, manual and error-prone handovers, and limited traceability across the development pipeline. Addressing process debt involves embracing MLOps principles [38], applying structured AI development lifecycles like CRISP-ML(Q) [49], integrating monitoring and automation tools, and fostering close interdisciplinary collaboration throughout the AI engineering process.

2.1.14. Documentation Debt in AI/ML Systems

Documentation debt arises when AI/ML systems suffer from incomplete, inconsistent, or outdated documentation practices [71,73]. Fragmented documentation reduces maintainability, hinders collaboration, and complicates system reproducibility [74,75]. The lack of structured model cards, pipeline documentation, and data provenance records exacerbates the issue [74]. To address documentation debt, standardized documentation frameworks, integration of documentation into development workflows, and regular audits are essential [73,74].
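As a minimal sketch of structured, machine-readable documentation, the example below records a few model card fields next to a submitted model; the field names and values are illustrative and far simpler than the full model card templates discussed in the literature.

```python
from dataclasses import asdict, dataclass, field
import json

@dataclass
class ModelCard:
    """Minimal, machine-readable documentation stored alongside a submitted model."""
    model_name: str
    intended_use: str
    training_data: str
    evaluation_metric: str
    known_limitations: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

card = ModelCard(
    model_name="baseline-gradient-boosting",        # hypothetical submission
    intended_use="leaderboard baseline for a demo competition task",
    training_data="official training split, version 2024-02",
    evaluation_metric="macro F1",
    known_limitations=["not calibrated", "sensitive to class imbalance"],
)
print(card.to_json())
```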

2.1.15. People and Social Debt in AI/ML Systems

People and social debt in AI/ML systems arises when ineffective communication, unclear roles, or weak knowledge sharing hinder collaboration among teams and stakeholders [76,77,78]. These issues can lead to misaligned expectations, inconsistent coding practices, and a lack of shared understanding about system goals—factors that often contribute to the silent accumulation of technical debt. Recent empirical studies have also highlighted how communication breakdowns and organizational silos significantly hinder ML collaboration and contribute to hidden technical debt [71]. Mitigating this form of debt requires strengthening interdisciplinary collaboration, clarifying responsibilities, adopting structured communication protocols, and ensuring that social and technical aspects of development are aligned throughout the lifecycle [77,78].

2.1.16. Ethics Debt in AI/ML Systems

Ethics debt in AI systems arises when foundational ethical principles—such as fairness, transparency, and accountability—are not systematically integrated throughout the development lifecycle [79,80]. Overlooking these dimensions can lead to biased or discriminatory outcomes, unintended societal consequences, and exposure to legal or regulatory risks. Proactively managing ethics debt involves embedding responsible AI practices from the design phase onward, adopting ethics-by-design methodologies, and establishing continuous auditing mechanisms to monitor compliance and impact [26,81,82].

2.1.17. Self-Admitted Technical Debt (SATD) in AI/ML Systems

Self-Admitted Technical Debt (SATD) refers to cases where developers explicitly acknowledge technical limitations, incomplete implementations, or intentional workarounds in comments, documentation, or tracking systems [63,83,84]. In AI systems, SATD commonly appears in the form of simplified evaluation procedures, incomplete data preprocessing, or experimental code segments that were never fully integrated [29]. Addressing SATD effectively involves using automated tools for early detection, incorporating structured refactoring routines, and embedding SATD management into the broader governance practices of AI development projects.
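Automated SATD detection can begin with something as simple as scanning source comments for debt markers; the Python sketch below searches a code tree for common markers (TODO, FIXME, HACK, XXX) and is only a first approximation of the dedicated detection tools discussed in the literature.

```python
import re
from pathlib import Path

# Comment markers commonly used to self-admit technical debt.
SATD_PATTERN = re.compile(r"#\s*(TODO|FIXME|HACK|XXX)\b.*", re.IGNORECASE)

def scan_for_satd(root: str = ".") -> list:
    """Return (file, line number, comment) for every self-admitted debt marker found."""
    findings = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            match = SATD_PATTERN.search(line)
            if match:
                findings.append((str(path), lineno, match.group(0).strip()))
    return findings

if __name__ == "__main__":
    for file, lineno, comment in scan_for_satd():
        print(f"{file}:{lineno}: {comment}")
```

Tracking the count of such markers across successive competition editions gives organizers a crude but transparent indicator of how much acknowledged debt is being carried forward.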

2.1.18. Defect Debt in AI/ML Systems

Defect debt in AI/ML systems arises from unresolved anomalies, latent bugs, or hidden defects that accumulate over time, affecting system reliability and performance [7,29,30]. Due to model non-determinism and dependency complexity, defect diagnosis in AI systems is challenging. Addressing defect debt requires systematic defect tracking, rigorous validation and testing pipelines, continuous monitoring of model behavior, and collaborative defect resolution across development and data engineering teams [7,30].
The multifaceted nature of technical debt in AI/ML systems underscores the need for structured management approaches that span data quality, model integrity, algorithmic soundness, architectural coherence, infrastructural reliability, and ethical and social responsibility. Recognizing and addressing these various forms of debt enables developers, platform organizers, and researchers to mitigate long-term risks, support sustainable AI development, and enhance system maintainability and trust. The following subsections discuss how emerging practices in software engineering for AI (SE4AI), explainable artificial intelligence (XAI), and AI/ML competition platforms intersect with and influence the management of technical debt.

2.2. Software Engineering for AI (SE4AI) and Software Engineering for Machine Learning (SE4ML)

The development of AI-based systems and ML-based software applications presents unique challenges that extend beyond the scope of traditional software engineering. Unlike conventional systems that follow explicit programmatic rules, AI systems rely on data-driven inference, leading to increased uncertainty, evolving requirements, and lifecycle complexity [1]. These differences have motivated the emergence of Software Engineering for AI (SE4AI) and Software Engineering for Machine Learning (SE4ML) as dedicated subfields aimed at adapting and extending classical engineering practices to the AI/ML context.
SE4AI and SE4ML propose structured methodologies, best practices, and toolsets to address the specific needs of AI development [85], including robustness, maintainability, explainability, and ethical considerations. Several key differences distinguish AI/ML system engineering from traditional software development:
  • Requirements Engineering: In AI/ML systems, requirements are not fully specified up-front but evolve through iterative learning cycles based on data and model feedback [86]. Techniques such as incremental requirement refinement and data-centric requirement analysis are essential.
  • Software Design and Architecture: Managing complex data pipelines, model versioning, deployment workflows, and continuous retraining requires specialized design patterns tailored for AI systems [23]. To prevent the accumulation of technical debt, AI/ML architectures should prioritize modularity, scalability, and the ability to adapt across evolving data and model requirements.
  • Testing and Quality Assurance: Traditional software testing approaches fall short in the face of non-deterministic AI behavior. Ensuring system reliability demands the use of AI-specific testing techniques, including data validation, model evaluation, metamorphic testing, and uncertainty quantification [23].
  • Deployment and Maintenance: Addressing real-world phenomena such as data drift, model degradation, and evolving user feedback requires continuous monitoring, automated retraining pipelines, and adaptive update mechanisms. MLOps frameworks offer the foundational infrastructure to support these ongoing lifecycle management needs efficiently [87].
  • Ethical and Social Considerations: The development of AI systems raises important ethical concerns, including algorithmic bias, lack of transparency in decision-making, and fairness in outcomes. Responsible AI frameworks advocate for embedding ethical safeguards throughout the design, training, and deployment phases to ensure systems align with broader societal values [79].
  • Training and Skill Development: Contemporary SE4AI curricula emphasize the need for interdisciplinary competencies that combine technical software engineering skills with principles from AI governance, ethics, and human-centered design [88].
By incorporating these evolving engineering practices, SE4AI and SE4ML aim to improve the robustness, scalability, and long-term maintainability of AI/ML systems, while also mitigating forms of technical debt that are unique to data-driven and adaptive applications.

2.3. Explainable Artificial Intelligence (XAI) in AI-Based Systems

Explainable Artificial Intelligence (XAI) aims to enhance the transparency, interpretability, and trustworthiness of artificial intelligence models, addressing critical concerns related to fairness, accountability, and societal impact [89]. As AI systems grow in complexity, particularly with the widespread adoption of deep learning and reinforcement learning [90], the opacity of model decision-making increases, making it difficult for stakeholders to understand, validate, or contest outcomes [91].
The lack of interpretability presents substantial challenges, especially in high-stakes domains such as healthcare, finance, and autonomous systems, where explainability is essential for regulatory compliance, ethical accountability, and public trust [92]. XAI seeks to bridge this gap by developing techniques that provide human-understandable explanations without compromising model performance.
XAI methods are broadly categorized into two groups [93]:
  • Intrinsic Interpretability: Designing models that are inherently understandable, such as decision trees, linear models, and rule-based systems. These models trade some performance for enhanced transparency.
  • Post Hoc Explainability: Applying interpretability techniques after model training, using methods such as feature attribution (e.g., SHAP, LIME), surrogate models, counterfactual explanations, and visualization tools to provide insights into model behavior.
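For a concrete, dependency-light example of a post hoc, model-agnostic technique, the sketch below uses permutation importance from scikit-learn instead of SHAP or LIME; the data and model are synthetic placeholders, and the approach is only one of several attribution methods mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Post hoc attribution: shuffle each feature in turn and measure the drop
# in held-out accuracy caused by breaking its relationship with the target.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=1)
for idx in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {idx}: mean importance {result.importances_mean[idx]:.3f}")
```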
Within reinforcement learning (RL), the subfield of Explainable Reinforcement Learning (XRL) has emerged to clarify how agents learn policies and make decisions in dynamic, often adversarial environments [94]. Still, a fundamental trade-off persists, as enhancing model interpretability can come at the cost of reduced predictive performance or increased system complexity.
Ultimately, XAI contributes not only to enhancing user trust and regulatory compliance but also plays a crucial role in reducing technical debt. Systems that lack transparency are harder to validate, debug, maintain, and audit, thereby increasing long-term maintenance burdens. Embedding explainability principles early in AI system design thus represents a proactive strategy for sustainable AI development [91,92,95].

2.4. AI/ML Competition Platforms

AI/ML competition platforms have become critical infrastructures for promoting innovation, education, and research on the sustainable development of AI and ML systems. These platforms offer controlled, reproducible, and benchmark-driven environments where researchers, students, and practitioners are invited to design, submit, and evaluate AI/ML solutions across structured tasks. Through their design, they enhance transparency, foster collaboration, and provide valuable mechanisms for comparative evaluation [9,96,97]. As such, they serve as a bridge between theoretical knowledge and hands-on implementation, especially in the context of software engineering for AI (SE4AI).

2.4.1. Technical Infrastructure and Platform Design

The infrastructure and architecture of AI competition platforms vary widely depending on their purpose, scope, and targeted domain. Platforms such as CodaLab and AICrowd offer open-source frameworks for organizing scientific challenges, supporting multi-phase competition flows, automated evaluation mechanisms, and seamless integration with cloud-based execution environments [97,98]. These platforms often enable the configuration of scoring systems, hidden test sets, and submission tracking, thereby supporting a wide range of AI-related tasks including supervised learning, reinforcement learning, and even multi-agent coordination.
In contrast, domain-specific platforms such as the AI World Cup focus on real-time simulation-based scenarios, such as robot soccer, where agents interact in dynamic and adversarial environments using deep reinforcement learning [9]. To ensure fairness and reproducibility, most modern platforms rely on containerized execution environments, often based on Docker or Kubernetes technologies. These environments isolate each participant’s submission and standardize hardware and software configurations, minimizing variability and system-level bias.
Platforms typically support both result-based and code-based submissions. The former allows for quick leaderboard rankings through metrics evaluation, while the latter is increasingly favored for its emphasis on transparency, reproducibility, and post-competition validation. This technical design facilitates integration with academic curricula and industrial pipelines, making these platforms suitable for both research and pedagogical use.

2.4.2. AI Competitions as a Catalyst for Research and Innovation

AI competitions such as Kaggle, General Video Game Artificial Intelligence (GVGAI), and the AI World Cup have served as effective testbeds for developing general-purpose, high-performing AI agents across diverse domains, including time series forecasting, natural language processing, robotics, and multi-agent systems [9,99,100]. On platforms like Kaggle, participants must tackle real-world data challenges—such as noise, inconsistency, and high entropy—by applying best practices in SE4ML, including modular architectures, rigorous cross-validation, reproducible pipelines, and model-focused debugging strategies [6]. These environments foster interdisciplinary innovation by integrating concepts from fields like game design, human–computer interaction, and AI ethics. Notably, game jams have been employed as experimental spaces for AI prototyping, encouraging creativity, rapid iteration, and flexible problem-solving approaches [101,102].
Several domain-specific initiatives further exemplify how AI-based competitions can act as catalysts for both research and education. The RLGame ecosystem has been explored as a platform for integrating AI and software engineering practices in computing education, helping students gain hands-on experience while reflecting on technical trade-offs, including documentation debt and maintainability [14,103]. Similarly, the Geometry Friends platform, used in collaborative AI competitions, encourages strategic coordination among agents and developers, while simultaneously exposing participants to constraints that surface architectural and planning debt [104]. The General Board Game (GBG) framework also serves as a research-grounded environment for developing general AI agents across diverse games, facilitating comparative learning and reproducibility [8].
Another notable case is the RoboCup Standard Platform League’s Drop-in Player Competition [105], which promotes large-scale ad hoc teamwork among heterogeneous robotic agents without prior coordination. By requiring each robot to dynamically adapt its strategy to unknown teammates, this competition has offered a rich environment for studying real-time decision-making, agent modeling, and coordination strategies across multiple organizations over several years.
In addition, platforms like Project Malmo, built on the Minecraft environment, provide flexible and immersive AI experimentation spaces that support goal-driven agent design, real-time decision-making, and curriculum-learning approaches [106]. These environments simulate open-world challenges that are directly relevant to the development of robust, explainable, and adaptable AI systems.
Beyond performance optimization, the structured and competitive nature of AI challenges encourages participants to consider architecture robustness, experiment versioning, and model reproducibility—core dimensions in technical debt management.
On platforms like Kaggle, the public leaderboard serves as a feedback mechanism that lets participants track their progress in near real time while also warning them about the dangers of overfitting or over-optimization [100]. These dynamics help develop an engineering discipline in line with the principles of software engineering for AI (SE4AI) and reflect pressures encountered in the real world when deploying AI systems.

2.4.3. Educational Benefits and Skill Development

AI competitions and game jams have become integral to education in data science, machine learning, and the engineering of AI systems. Their structured and time-constrained format encourages the development of essential skills such as agile thinking, iterative design, and collaborative problem solving—competencies that are vital in both academic and industry settings [107]. For students, participation in these events provides hands-on experience with realistic AI challenges, while reinforcing sound software engineering practices in complex and often uncertain environments.
A particularly promising direction within this landscape is the use of Serious Slow Game Jams, which shift the focus from rapid prototyping to reflective design, interdisciplinary collaboration, and early identification of technical risks [108]. In these events, participants work closely with domain experts, test critical assumptions early, and iteratively improve their solutions. Not only does this process foster innovation, it also raises awareness of how initial design choices can influence system maintainability and reliability over time.
In parallel, initiatives such as DSLE [10] and AI4Science [95] are embedding structured competition frameworks into formal curricula, where contests are used not merely as evaluation mechanisms but as immersive educational tools. These platforms support the teaching of modular code design, pipeline validation, version control, and responsible AI deployment—reinforcing best practices aligned with both SE4AI and SE4ML.

2.4.4. Challenges and Considerations in AI Competitions

Despite their many advantages, AI competition platforms also face important challenges. A key concern is ensuring fairness and transparency throughout the evaluation process. Bias in datasets, inconsistencies in evaluation metrics, and limited mechanisms for reproducibility can distort results and inadvertently favor participants with greater computational resources or prior experience [96]. The widespread use of public leaderboards, while offering timely feedback, can sometimes encourage “leaderboard chasing” behaviors—where participants optimize narrowly for leaderboard position at the expense of model generalizability. These practices can lead to both technical and ethical debt over time.
Another persistent issue relates to resource accessibility. On platforms like Kaggle, access to advanced computing features such as Graphics Processing Unit (GPU) acceleration is often restricted to paid users or those with higher platform rankings, creating entry barriers for newcomers and participants from under-resourced institutions [6]. Such disparities risk reinforcing existing inequities in AI education and participation. Furthermore, the competitive nature of these platforms—often characterized by tight deadlines and performance pressure—can incentivize shortcuts and self-admitted technical debt (SATD), ultimately undermining the maintainability and long-term robustness of the developed systems [80,82].
Beyond these common limitations, some AI competition environments may also lack structured mechanisms for reflecting on technical decision-making. For instance, early versions of game-based environments such as RLGame exposed students to AI development under time pressure, but often without explicit prompts to document or justify architectural trade-offs—an issue that, if unaddressed, leads to design and documentation debt [14]. Similarly, in the Geometry Friends collaborative platform, participants often encounter strategic coordination challenges that bring to the surface fragile or inflexible planning components, particularly when agent behavior is hard-coded or tightly coupled [104].
The GBG framework highlights another layer of challenge: while it encourages experimentation with diverse agents across games, it also requires careful modularization and consistent evaluation practices to avoid reinforcing non-generalizable design patterns and evaluation bias [8]. In all these cases, the absence of integrated support for version control, modular pipelines, or structured reflection may hinder participants from recognizing the long-term consequences of their design decisions.
These issues highlight the need for AI competitions to be designed with inclusive infrastructure, blind evaluation phases, and opportunities for reflection during development. Embedding clear criteria for fairness, explainability, and sustainability within competition guidelines can lead to not only better-performing systems but also more ethically aligned and responsible AI practices. Platforms that emphasize reproducibility, transparent scoring mechanisms, and awareness of technical debt offer participants a more valuable and reflective learning experience [9,10,16].

2.4.5. Future Directions

Emerging trends in AI competitions point toward greater emphasis on accountability, transparency, and education. Competitions focused on explainable AI (XAI) are gaining traction, challenging participants to prioritize model interpretability alongside accuracy [92]. Likewise, tournaments on multi-agent reinforcement learning are expanding, offering valuable testbeds for cooperative and adversarial AI systems that mimic real-world complexity [9].
Platforms such as the Grand Challenges in Medical AI are applying these models to sensitive domains like healthcare and public welfare, underscoring the social responsibility of AI research [96]. Concurrently, universities and industry partners are increasingly embedding competitions into curricular [71,109] and extracurricular training programs. These efforts aim to institutionalize good practices in technical debt management, ensure long-term model sustainability, and promote interdisciplinary skills that are vital for the next generation of AI professionals.
By embracing these future directions, AI competition platforms are well positioned to become inclusive, pedagogically rich, and technically rigorous environments that advance the field of responsible artificial intelligence.
In order to systematically capture and analyze the diverse manifestations of technical debt in AI/ML competition platforms, a structured methodology was required. The next section outlines our scoping review process and the steps taken to develop a robust typology of debt types and their corresponding assessment instruments.

3. Materials and Methods

Our methodology was designed to systematically address the two research questions defined in Section 1 (Research Questions). Specifically, the scoping review process aims to answer RQ1 by identifying and categorizing the various types of technical debt manifesting in AI-based systems and competition platforms. Building on these findings, RQ2 is addressed through the development of a stakeholder-oriented questionnaire that enables the quantification and structured evaluation of technical debt across identified categories.

3.1. Scoping Review Framework

This study follows the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to ensure methodological rigor and transparency [110]. The review process was conducted in multiple phases: identification, screening, eligibility assessment, and final inclusion. Figure 1 presents the PRISMA flow diagram summarizing the study selection process.
In addition to database-driven systematic search, supplementary search strategies such as backward and forward snowballing and manual exploration of citation networks were employed [111,112] to capture additional relevant studies that may not have been indexed through standard queries. All studies, regardless of their identification pathway, were subjected to the same eligibility criteria and thematic analysis framework. The scoping review method was chosen to capture recurring patterns of technical debt across varied AI/ML competition settings, in the absence of an existing domain-specific typology.

3.2. Search Strategy and Data Sources

An initial comprehensive search was conducted across five major scholarly databases: Google Scholar, ACM Digital Library, IEEE Xplore, Scopus, and Springer.
These databases were selected based on their broad coverage of AI, machine learning, software engineering, and technical debt research, ensuring a representative corpus.
Search queries combined keywords and logical operators to retrieve relevant studies addressing technical debt within AI-based systems or AI/ML competition platforms. The main search terms included:
  • “Technical Debt”
  • “Artificial Intelligence”
  • “Machine Learning”
  • “Software Engineering”
  • “AI-Based Systems”
Forward and backward snowballing was systematically applied to key publications to further enhance the corpus. Table 1 summarizes the search strings and the number of results retrieved per database.

3.3. Study Selection and Screening

A systematic screening and selection process was followed, inspired by the PRISMA-ScR guidelines [110]. This scoping review was prospectively registered with the Open Science Framework (OSF) at https://osf.io/bjfcq (accessed on 18 June 2025). After initial retrieval, duplicate entries were removed. Remaining records were first screened by titles and abstracts based on predefined inclusion and exclusion criteria. Full-text articles of potentially relevant studies were then retrieved and assessed for final eligibility.

3.3.1. Inclusion Criteria

Studies were selected based on the following criteria:
  • Studies published between January 2012 and February 2024.
  • Peer-reviewed journal articles and conference papers.
  • Studies explicitly addressing technical debt in AI-based systems or AI/ML competition platforms.
  • Studies providing empirical, theoretical, or practical insights into AI-related technical debt management.

3.3.2. Exclusion Criteria

  • Non-English publications.
  • Publications prior to 2012.
  • Non-peer-reviewed articles, editorials, or non-scientific sources.
  • Studies focused exclusively on general software engineering technical debt without AI-specific considerations.
This multi-phase screening ensured that only high-quality and directly relevant studies were included in the final corpus.
The PRISMA flow diagram (Figure 1) summarizes the process, presenting the number of records identified, screened, excluded, and included at each stage, along with reasons for exclusions where applicable.

3.3.3. PRISMA-Based Selection Process

To ensure a comprehensive and systematic identification of relevant studies, a rigorous multi-phase study selection process was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines.
Identification: An extensive search was performed across five databases (Google Scholar, ACM Digital Library, IEEE Xplore, Scopus, and Springer) using a predefined set of keywords related to technical debt in AI-based systems and competition platforms. Additionally, manual searches and snowballing techniques were employed to capture relevant studies not retrieved through database queries.
Screening: Duplicate records were first removed. The remaining studies were screened by evaluating their titles and abstracts based on predefined inclusion and exclusion criteria. This initial filtering aimed to eliminate studies clearly unrelated to technical debt in AI or AI-based competitions.
Eligibility: Full-text articles of potentially relevant studies were retrieved and assessed thoroughly against the eligibility criteria. Only studies that explicitly addressed technical debt in AI contexts were retained for inclusion.
Inclusion: The final inclusion of studies was determined through consensus between the researchers. Each article was independently reviewed by the first author, who annotated key observations related to technical debt manifestations and mitigation strategies. The second author evaluated these annotations for accuracy and consistency. Any discrepancies between reviewers were resolved through discussion, ensuring methodological rigor and minimizing selection bias.
Visualization: The PRISMA flow diagram (Figure 1) provides a visual summary of the study identification, screening, eligibility assessment, and inclusion process, including the number of records at each stage and reasons for exclusion where applicable.

3.4. Supplementary Search Strategy

To enhance comprehensiveness beyond the database-driven retrieval, additional studies were identified through supplementary search strategies. Specifically, backward and forward snowballing techniques were applied to key primary studies. Manual citation network exploration and expert-driven selection were also performed to capture the emerging or under-indexed literature, particularly in niche domains such as educational platforms, game jams, and reinforcement learning competitions.
Through these supplementary methods, an additional 28 peer-reviewed publications were included, augmenting the robustness of the scoping review findings. All supplementary records were manually screened to ensure uniqueness and to avoid duplication, and they were subjected to the same eligibility criteria and thematic classification process as the database-derived records. Although the exact number of duplicates identified was not recorded systematically, our manual logs suggest that approximately 8–10 overlapping records were identified and excluded during this process.

3.5. Data Extraction and Classification

The final corpus consisted of 100 peer-reviewed publications: 72 studies selected through the systematic PRISMA-based search and 28 studies incorporated via supplementary strategies (see Table 2).
Data extraction focused on the following aspects:
  • Technical debt types addressed (e.g., data debt, model debt, algorithm debt, etc.).
  • Research methodologies employed (e.g., case studies, empirical analysis, theoretical frameworks).
  • Identified challenges and proposed mitigation strategies.
  • Application contexts relevant to AI-based competition platforms.
A thematic analysis was conducted to classify the extracted studies by drawing on existing technical debt frameworks in AI and ML systems [20,30,31], while also extending and consolidating them into a unified typology. This enriched categorization not only integrates prior classifications but also introduces new debt types observed during the review. It served as the foundation for the stakeholder-oriented assessment questionnaire presented in the following section.
To ensure the consistency and reliability of the classification process, a subset of 30 papers (representing 30% of the total sample) was independently annotated by both authors with respect to the two most prominent technical debt types discussed in each study. The level of agreement was evaluated using Cohen’s kappa, a widely used measure for inter-rater reliability on categorical data. The resulting score of κ = 0.857 is considered almost perfect, indicating a very high level of agreement [113]. (See Table A1 in Appendix B for the full agreement matrix.) Minor disagreements were resolved through discussion before finalizing the thematic synthesis.
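For reference, Cohen's kappa relates the observed agreement p_o between the two annotators to the agreement p_e expected by chance:

\kappa = \frac{p_o - p_e}{1 - p_e}

Values above 0.80 are conventionally interpreted as almost perfect agreement, the band into which the reported κ = 0.857 falls.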
Each study was assigned one or more technical debt categories based on its primary thematic content and stated contributions. The assignment was informed by and aligned with our unified typology and was based on a close reading of abstracts, objectives, and methodological sections. This ensured a consistent and meaningful categorization across the entire corpus.
Table 2. Studies included in the scoping review.
No | Title | Author(s) | Year | Technical Debt Type
1 | Algorithm Debt: Challenges and Future Paths [114] | EIO Simon, M Vidoni, FH Fard | 2023 | Algorithm
2 | Machine Learning Algorithms, Real-World Applications and Research Directions [115] | Sarker | 2021 | Algorithm
3 | Toward understanding deep learning framework bugs [116] | Chen, J., Liang, Y., Shen, Q., Jiang, J. & Li, S. | 2023 | Algorithm
4 | Understanding software-2.0: A study of machine learning library usage and evolution [117] | Dilhara, M., Ketkar, A. & Dig, D. | 2021 | Algorithm
5 | A survey on deep reinforcement learning architectures, applications and emerging trends [118] | Balhara et al. | 2022 | Architectural
6 | Adapting Software Architectures to Machine Learning Challenges [20] | Serban, A. & Visser, J. | 2022 | Architectural
7 | An Empirical Study of Software Architecture for Machine Learning [119] | Serban, A. & Visser, J. | 2021 | Architectural
8 | Architecting the Future of Software Engineering [120] | Carleton, A., Shull, F. & Harper, E. | 2022 | Architectural
9 | Architectural Decisions in AI-Based Systems: An Ontological View [121] | Franch, X. | 2022 | Architectural
10 | Architecture Decisions in AI-based Systems Development: An Empirical Study [122] | Zhang, B., Liu, T., Liang, P., Wang, C., Shahin, M. & Yu, J. | 2023 | Architectural
11 | Engineering AI Systems: A Research Agenda [123] | Bosch, J., Olsson, H. H. & Crnkovic, I. | 2021 | Architectural
12 | Machine Learning Architecture and Design Patterns [124] | Washizaki et al. | 2020 | Architectural
13 | Software Architecture for ML-based Systems: What Exists and What Lies Ahead [125] | Muccini, H. & Vaidhyanathan, K. | 2021 | Architectural
14 | Software Engineering for AI-Based Systems: A Survey [1] | Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A. & Wagner, S. | 2022 | Architectural
15 | Comprehending the Use of Intelligent Techniques to Support Technical Debt Management [57] | Albuquerque, D., Guimaraes, E., Tonin, G., Perkusich, M., Almeida, H. & Perkusich, A. | 2022 | Build
16 | Searching for Build Debt Experiences Managing Technical Debt at Google [126] | Morgenthaler et al. | 2012 | Build
17 | Better Code, Better Sharing: On the Need of Analyzing Jupyter Notebooks [67] | Wang, J., Li, L. & Zeller, A. | 2020 | Code
18 | Characterizing TD and Antipatterns in AI-Based Systems: A Systematic Mapping Study [30] | Bogner, J., Verdecchia, R. & Gerostathopoulos, I. | 2021 | Code
19 | Code and Architectural Debt in Artificial Intelligence Systems [19] | G Recupito, F Pecorelli, G Catolino et al. | 2024 | Code
20 | Code Smells for Machine Learning Applications [127] | Zhang, Haiyin, Cruz, Luis & Deursen, Arie Van | 2022 | Code
21 | Code Smells in Machine Learning Systems [66] | Gesi, J., Liu, S., Li, J., Ahmed, I., Nagappan, N., Lo, D. & Bao, L. | 2022 | Code
22 | How does machine learning change software development practices? [86] | Wan, Z., Xia, X., Lo, D. & Murphy, G. C. | 2019 | Code
23 | Software Engineering for Machine Learning: A Case Study [21] | Amershi et al. | 2019 | Code
24 | Studying the Machine Learning Lifecycle and Improving Code Quality of Machine Learning Applications [69] | Haakman, M. P. A. | 2020 | Code
25 | The prevalence of code smells in machine learning projects [65] | Van Oort, B., Cruz, L., Aniche, M. & Van Deursen, A. | 2021 | Code
26 | A Software Engineering Perspective on Engineering Machine Learning Systems: State of the Art and Challenges [62] | Giray | 2021 | Code
27 | An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems [4] | Tang, Y., Khatchadourian, R., Bagherzadeh, M., Singh, R., Stewart, A. & Raja, A. | 2021 | Configuration
28 | Challenges in Deploying Machine Learning: A Survey of Case Studies [64] | Paleyes, Urma, Lawrence | 2022 | Configuration
29 | Software engineering challenges for machine learning applications: A literature review [3] | Kumeno, F. | 2019 | Configuration
30 | Software Engineering Challenges of Deep Learning [60] | Arpteg, A., Brinne, B., Crnkovic-Friis, L. & Bosch, J. | 2018 | Configuration
31 | Data collection and quality challenges in deep learning: a data-centric AI perspective [41] | Whang et al. | 2023 | Data
32 | Data Lifecycle Challenges in Production Machine Learning: A Survey [40] | Polyzotis, N., Roy, S., Whang, S. E. & Zinkevich, M. | 2018 | Data
33 | Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems [27] | Foidl, H., Felderer, M. & Ramler, R. | 2022 | Data
34 | Data Validation for Machine Learning [28] | Polyzotis, N., Zinkevich, M., Roy, S., Breck, E. & Whang, S. | 2019 | Data
35 | Data Validation Process in Machine Learning Pipeline [43] | Vadavalasa | 2021 | Data
36 | Risk-Based Data Validation in Machine Learning-Based Software Systems [44] | Foidl, Felderer | 2019 | Data
37 | Software Quality for AI: Where We Are Now? [128] | Lenarduzzi, V., Lomio, F., Moreschini, S., Taibi, D. & Tamburri, D. A. | 2021 | Data
38 | Technical Debt in Data-Intensive Software Systems [129] | Foidl, H., Felderer, M. & Biffl, S. | 2019 | Data
39 | Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure [42] | Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O. & Mitchell, M. | 2021 | Data
40 | Debugging Machine Learning Pipelines [130] | Lourenço, Freire, Shasha | 2019 | Defect
41 | Is using deep learning frameworks free? Characterizing and Measuring Technical Debt in Deep Learning Applications [29] | Liu, Jiakun, Huang, Qiao, Xia, Xin & Shang, Weiyi | 2020 | Defect
42 | Machine Learning: The High-Interest Credit Card of Technical Debt [131] | Sculley et al. | 2014 | Defect
43 | Technical Debt in AI-Enabled Systems: On the Prevalence, Severity, Impact, and Management Strategies for Code and Architecture [19] | Recupito, Gilberto et al. | 2024 | Defect
44 | Technical Debt Payment and Prevention Through the Lenses of Software Architects [132] | Pérez, B., Castellanos, C., Correal, D., Rios, N., Freire, S., Spínola, R. & Izurieta, C. | 2021 | Defect
45 | Studying Software Engineering Patterns for Designing ML Systems [38] | Washizaki, H., Uchida, H., Khomh, F. & Guéhéneuc, Y. G. | 2019 | Design
46 | Common problems with Creating Machine Learning Pipelines from Existing Code [133] | O’Leary, K. & Uchida, M. | 2020 | Design
47 | Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository [37] | Heiland, L., Hauser, M. & Bogner, J. | 2023 | Design
48 | Software-Engineering Design Patterns for Machine Learning Applications [31] | Washizaki, H., Khomh, F., Guéhéneuc, Y. G., Takeuchi, H., Natori, N., Doi, T. & Okuda, S. | 2022 | Design
49 | Correlating Automated and Human Evaluation of Code Documentation Generation Quality [134] | Hu et al. | 2022 | Documentation
50 | Maintainability Challenges in ML: A Systematic Literature Review [75] | Shivashankar, K. & Martini, A. | 2022 | Documentation
51 | Software Documentation is not Enough! Requirements for the Documentation of AI [135] | Königstorfer, F. & Thalmann, S. | 2021 | Documentation
52 | Understanding Implementation Challenges in Machine Learning Documentation [74] | Chang, J. & Custis, C. | 2022 | Documentation
53 | “This is Just a Prototype”: How Ethics Are Ignored in Software Startup-Like Environments [79] | Vakkuri, V., Kemell, K. K., Jantunen, M. & Abrahamsson, P. | 2020 | Ethics
54Explainable Deep Reinforcement Learning: State of the Art and Challenges[90]George A. Vouros2022Ethics
55Managing bias in AI[136]Roselli, D., Matthews, J. & Talagala, N.2019Ethics
56Patterns and Anti-Patterns, Principles and Pitfalls: Accountability and Transparency[81]Matthews, J.2020Ethics
57Principles alone cannot guarantee ethical AI[26]Mittelstadt, B.2019Ethics
58Who pays for ethical debt in AI?[82]Petrozzino, C.2021Ethics
59AI Competitions as Infrastructures Examining Power Relations on Kaggle and Grand Challenge in AI-Drive[6]Luitse, Blanke, Poell2023Infrastructure
60Infrastructure for Usable Machine Learning the Stanford DAWN Project[50]Bailis et al.2017Infrastructure
61Practices and Infrastructures for Machine Learning Systems: An Interview Study in Finnish Organizations[137]Muiruri, D., Lwakatare, L. E., Nurminen, J. K. & Mikkonen, T.2022Infrastructure
62A Meta-Summary of Challenges in Building Products with ML Components—Collecting Experiences from 4758[138]Nahar et al.2023Model
63A Taxonomy of Software Engineering Challenges for Machine Learning Systems—An Empirical Investigation[47]Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H. & Crnkovic, I.2019Model
64Clones in Deep Learning Code: What, where, and why?[139]Jebnoun, H., Rahman, M. S., Khomh, F. & Muse, B. A.2022Model
65Empirical Analysis of Hidden Technical Debt Patterns in Machine Learning Software[140]Alahdab, M. & Çalıklı, G.2019Model
66Hidden Technical Debt in Machine Learning Systems[5]Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D. & Dennison, D.2015Model
67Machine Learning Model Development from a Software Engineering Perspective a Systematic Literature Review[141]Lorenzoni et al.2021Model
68Quality issues in Machine Learning Software Systems[52]Côté, P. O., Nikanjam, A., Bouchoucha, R., Basta, I., Abidi, M. & Khomh, F.2023Model
69Synergy Between Machine/Deep Learning and Software Engineering: How Far Are We?[142]Wang, S., Huang, L., Ge, J., Zhang, T., Feng, H., Li, M. & Ng, V.2020Model
70The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction[7]Breck, E., Cai, S., Nielsen, E., Salib, M. & Sculley, D.2017Model
71Towards CRISP-ML(Q): A ML Process Model with Quality Assurance Methodology[49]Studer, S., Bui, T. B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S. & Müller, K. R.2021Model
72Towards Guidelines for Assessing Qualities of Machine Learning Systems[143]Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I. & Aoyama, M.2020Model
73Understanding Development Process of Machine Learning Systems: Challenges and Solutions[70]de Souza Nascimento, E., Ahmed, I., Oliveira, E., Palheta, M. P., Steinmacher, I. & Conte, T.2019Model
74What Is Really Different in Engineering AI-Enabled Systems?[46]Ozkaya, I.2020Model
75Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process[71]Nahar, N., Zhou, S., Lewis, G. & Kästner, C.2022People
76How Do Engineers Perceive Difficulties in Engineering of Machine-Learning Systems?—Questionnaire Survey[78]Ishikawa, F. & Yoshioka, N.2019People
77What is Social Debt in Software Engineering?[76]Tamburri, D. A., Kruchten, P., Lago, P. & van Vliet, H.2013People
78Exploring the Impact of Code Clones on Deep Learning Software[72]Mo, R., Zhang, Y., Wang, Y., Zhang, S., Xiong, P., Li, Z. & Zhao, Y.2023Process
79Studying Software Engineering Patterns for Designing ML Systems[38]Washizaki, H., Uchida, H., Khomh, F. & Guéhéneuc, Y. G.2019Process
80It Takes Three to Tango: Requirement, Outcome/data, and AI Driven Development[144]Bosch, J., Olsson, H. H. & Crnkovic, I.2018Requirements
81MLife a Lite Framework for Machine Learning Lifecycle Initialization[145]Yang et al.2021Requirements
82Requirements Engineering Challenges in Building AI-based Complex Systems[146]Belani, H., Vukovic, M. & Car, Ž.2019Requirements
83Requirements Engineering for Artificial Intelligence Systems: A Systematic Mapping Study[34]Ahmad, K., Abdelrazek, M., Arora, C., Bano, M. & Grundy, J.2023Requirements
84Requirements Engineering for Machine Learning: Perspectives from Data Scientists[35]Vogelsang, A. & Borg, M.2019Requirements
8523 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software[63]OBrien, D., Biswas, S., Imtiaz, S., Abdalkareem, R., Shihab, E. & Rajan, H.2022Self-Admitted (SATD)
86A Large-Scale Empirical Study on Self-Admitted Technical Debt[84]Bavota, G. & Russo, B.2016Self-Admitted (SATD)
87An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software[83]Bhatia, A., Khomh, F., Adams, B. & Hassan, A. E.2023Self-Admitted (SATD)
88Automating Change-level Self-Admitted Technical Debt Determination[147]Yan, M., Xia, X., Shihab, E., Lo, D., Yin, J. & Yang, X.2018Self-Admitted (SATD)
89Self-Admitted Technical Debt in R: Detection and Causes[148]Rishab Sharma, Ramin Shahbazi, Fatemeh H. Fard, Zadia Codabux, Melina Vidoni2022Self-Admitted (SATD)
90Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?[149]Mastropaolo, A., Di Penta, M. & Bavota, G.2023Self-Admitted (SATD)
91A Systematic Mapping Study on Testing of Machine Learning Programs[150]Sherin, S. & Iqbal, M. Z.2019Test
92Machine Learning Testing: Survey, Landscapes and Horizons[51]Zhang, J. M., Harman, M., Ma, L. & Liu, Y.2020Test
93On Testing Machine Learning Programs[55]Braiek, H. B. & Khomh, F.2020Test
94Testing Machine Learning based Systems: A Systematic Mapping[151]Riccio, V., Jahangirova, G., Stocco, A., Humbatova, N., Weiss, M. & Tonella, P.2020Test
95“We Have No Idea How Models Will Behave in Production until Production”: How Engineers Operationalize Machine Learning[152]Shankar, S., Garcia, R., Hellerstein, J. M. & Parameswaran, A.2024Versioning
96On Challenges in Machine Learning Model Management[48]Schelter, S., Biessmann, F., Januschowski, T., Salinas, D., Seufert, S. & Szarvas, G.2015Versioning
97On the Challenges of Migrating to Machine Learning Life Cycle Management Platforms[58]Njomou, A. T., Fokaefs, M., Silatchom Kamga, D. F. & Adams, B.2022Versioning
98Software Engineering Challenges of Deep Learning[60]Arpteg, A., Brinne, B., Crnkovic-Friis, L. & Bosch, J.2018Versioning
99The Story in the Notebook: Exploratory Data Science using a Literate Programming Tool[61]Kery, M. B., Radensky, M., Arya, M., John, B. E. & Myers, B. A.2018Versioning
100Versioning for End-to-End Machine Learning Pipelines[59]Van Der Weide, T., Papadopoulos, D., Smirnov, O., Zielinski, M. & Van Kasteren, T.2017Versioning

3.6. Summary of Materials and Methods

Through a systematic scoping review aligned with the PRISMA-ScR guidelines, this study synthesized findings from a corpus of 100 peer-reviewed publications, comprising both database-sourced and supplementary-identified studies. The structured methodology employed—from comprehensive search strategies and rigorous multi-phase screening to thematic classification and data extraction—ensures a high level of transparency, reproducibility, and methodological rigor. The insights gathered from this extensive review form the empirical foundation for the development of a unified typology of technical debt types specific to AI-based systems and competition platforms. Moreover, they guide the design of a stakeholder-oriented assessment framework and questionnaire, enabling the quantitative evaluation of technical debt manifestations across diverse AI contexts. The next section presents the main findings of this review, organized according to the identified technical debt categories.

4. Results

This section presents the findings of our scoping review, addressing the two research questions defined in the Research Questions subsection. First, we map the landscape of technical debt types as observed in AI-based competition platforms (RQ1). Then, we summarize the application and preliminary evaluation of a stakeholder-oriented questionnaire for quantifying technical debt manifestations (RQ2).

4.1. Overview

Our analysis reveals that all 18 recognized technical debt categories are relevant to AI/ML competition environments, each introducing distinct risks and operational challenges. The results synthesize empirical examples from the literature and contextualize them within real-world competition scenarios to highlight their practical significance. The distribution of publications across the 18 debt types is summarized in Table 3. In the following paragraphs, we discuss the most and least represented categories.

4.2. Mapping of Technical Debt Categories

The scoping review identified 18 distinct technical debt types applicable to AI-based systems. Figure 2 shows the distribution of reviewed studies across these categories.
The analysis reveals a noticeable disparity in research attention across technical debt categories. Model Debt, Architectural Debt, Data Debt, Configuration Debt, and Code Debt emerge as focal points in the literature, often linked to concerns around maintainability, performance degradation, and system fragility. These areas are frequently discussed in the context of model reproducibility, pipeline consistency, and configuration correctness, particularly within competitive or time-constrained development environments.
In contrast, Infrastructure Debt, People Debt, Build Debt, and Process Debt are markedly underrepresented. Despite their operational relevance—especially in educational or collaborative AI platforms—they are seldom addressed explicitly. This uneven distribution not only reflects current research priorities but also underscores overlooked dimensions that warrant deeper empirical study. Future investigations, particularly in settings involving novice developers or interdisciplinary teams, could shed light on how these underexplored debt types affect project outcomes and platform usability.

4.3. Findings per Technical Debt Type

Although Architectural Debt and Design Debt share overlapping characteristics and are therefore discussed in close succession in this subsection, they are retained as distinct categories within the overall technical debt typology.
The following subsections summarize the characteristics of each technical debt category, including definitions, empirical examples drawn from AI competition platforms, primary responsible stakeholders, and severity scores based on their observed impact. These scores were derived qualitatively, based on how often each debt type was encountered in the literature and how serious the reported consequences were in terms of platform architecture, maintainability, or operational reliability. A score of 5 indicates recurring and high-impact issues, while lower values reflect more localized or less frequently reported forms of debt.
Each debt type is presented in a structured format to facilitate stakeholder-specific interpretation and prioritization.

4.3.1. Requirements Debt

Definition: Problems related to the elicitation, specification, or management of requirements in AI competition platforms, often due to the evolving and data-driven nature of AI systems.
Problem: Requirements in AI competitions are frequently underspecified or ambiguous. Traditional requirements engineering (RE) methods fall short in expressing goals such as explainability, fairness, or dataset constraints.
Example: In a deep learning competition for medical imaging, vague requirements such as “maximize Area under the Curve (AUC)” may lead teams to overfit on artifacts, as organizers fail to define distributional shifts or minimum documentation standards.
Stakeholder: Organizer
Severity Score: 5—Critical impact on fairness, reproducibility, and outcome validity.

4.3.2. Architectural Debt

Definition: Suboptimal architectural decisions that reduce modularity, scalability, and flexibility in AI competition platforms.
Problem: Ad hoc coupling between game engines, scoring logic, and submission systems leads to rigid infrastructures. Changes (e.g., enabling live agents) often require system-wide rewrites.
Example: A platform initially designed for static submissions attempts to support online reinforcement learning agents but fails due to hardcoded interactions between components.
Stakeholder: Organizer
Severity Score: 5—High risk to adaptability, extensibility, and innovation in platform evolution.

4.3.3. Design Debt

Definition: Poor application of software design principles, patterns, or abstractions in AI/ML platforms, hindering maintainability and reuse.
Problem: Rapid integration of ML components often leads to code duplication, unclear responsibilities, and violation of SOLID principles, especially in data preprocessing or model orchestration layers.
Example: A competition platform adds ad hoc wrappers for each new model type instead of generalizing through a strategy pattern, leading to inconsistent behavior and hard-to-maintain code.
Stakeholder: Organizer
Severity Score: 3—Impacts long-term evolvability and codebase clarity.

4.3.4. Data Debt

Definition: Liabilities stemming from incomplete, undocumented, biased, or unstable datasets used in training and evaluation.
Problem: Competitions may provide datasets without metadata, versioning, or consistency between training and test sets, leading to training-serving skew, bias, and poor reproducibility.
Example: A competition provides sunny-weather training data but evaluates submissions on mixed-weather test sets without documentation. Models fail dramatically, undermining the leaderboard.
Stakeholder: Organizer
Severity Score: 5—Critical for fairness, scientific validity, and real-world generalizability.
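Where dataset statistics can be shared, a lightweight pre-launch consistency check between training and evaluation splits can surface this kind of skew before participants are affected. The following is a minimal sketch in Python/pandas, with illustrative column names and an assumed drift threshold; it is not drawn from any specific platform.

```python
# Minimal sketch: detect schema and distribution drift between a training
# split and an evaluation split before a competition goes live.
# Column names and the drift threshold are illustrative assumptions.
import pandas as pd

def check_consistency(train: pd.DataFrame, test: pd.DataFrame, threshold: float = 0.25):
    issues = []
    # Schema check: evaluation data must not introduce or drop columns silently.
    if set(train.columns) != set(test.columns):
        issues.append(f"Schema mismatch: {set(train.columns) ^ set(test.columns)}")
    # Simple distribution check on shared numeric columns.
    for col in train.select_dtypes("number").columns.intersection(test.columns):
        t_mean, e_mean = train[col].mean(), test[col].mean()
        denom = abs(t_mean) if t_mean != 0 else 1.0
        if abs(t_mean - e_mean) / denom > threshold:
            issues.append(f"Possible drift in '{col}': train mean {t_mean:.3f}, test mean {e_mean:.3f}")
    return issues

# Example usage with toy frames (e.g., sunny-weather training vs. mixed-weather test data):
train = pd.DataFrame({"brightness": [0.9, 0.8, 0.85], "label": [0, 1, 0]})
test = pd.DataFrame({"brightness": [0.3, 0.2, 0.25], "label": [1, 0, 1]})
for issue in check_consistency(train, test):
    print(issue)
```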

4.3.5. Algorithm Debt

Definition: Issues resulting from poorly chosen or implemented learning algorithms, including lack of robustness, generalization, or explainability.
Problem: Participants under deadline pressure may implement experimental or fragile variants that overfit or break under real-world conditions.
Example: Custom Q-learning variants win a leaderboard but later prove unstable due to missing regularization and evaluation under limited scenarios.
Stakeholder: Participant
Severity Score: 4—Undermines reproducibility and scientific insight.

4.3.6. Model Debt

Definition: Debt incurred from insufficient model validation, lack of retraining protocols, and inadequate lifecycle management.
Problem: Emphasis on leaderboard performance often leads to overfitting, missing metadata, and models unfit for deployment or reuse.
Example: A high-scoring fraud detection model performs poorly when evaluated on a new dataset due to outdated assumptions and no version control.
Stakeholder: Organizer
Severity Score: 4—Delays or prevents downstream integration and transparency.

4.3.7. Infrastructure Debt

Definition: Technical liabilities arising from unstable, non-reproducible, or poorly scalable infrastructure supporting AI competitions, including compute environments, storage systems, and evaluation servers.
Problem: Inconsistent environments and poor documentation of submission pipelines can result in crashes, unfair evaluations, and erosion of participant trust.
Example: In a medical imaging competition, mismatches between training Docker environments and evaluation containers caused model failures and leaderboard noise.
Stakeholder: Organizer
Severity Score: 5—Directly impacts platform credibility, fairness, and reproducibility.

4.3.8. Test Debt

Definition: Accumulation of inadequate or missing tests, poor validation pipelines, and inability to assess robustness or fairness of submitted models.
Problem: Testing focuses narrowly on metric-based evaluation, neglecting edge cases, fairness, and generalization properties.
Example: A fairness-sensitive competition lacks bias checks in test data, allowing discriminatory models to top the leaderboard undetected.
Stakeholder: Both (Organizer and Participant)
Severity Score: 4—Undermines reliability, trust, and real-world applicability.
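A per-group evaluation run alongside the headline metric is one inexpensive safeguard against such blind spots. The sketch below is a hedged illustration in Python; the group labels, threshold, and toy data are assumptions rather than any platform's actual test suite.

```python
# Minimal sketch: compare a headline metric per demographic group to flag
# fairness gaps that an aggregate leaderboard score would hide.
# The group labels and the gap threshold are illustrative assumptions.
from collections import defaultdict

def groupwise_accuracy(y_true, y_pred, groups):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def fairness_gap(y_true, y_pred, groups, max_gap=0.10):
    scores = groupwise_accuracy(y_true, y_pred, groups)
    gap = max(scores.values()) - min(scores.values())
    return scores, gap, gap > max_gap  # True means the submission should be flagged

scores, gap, flagged = fairness_gap(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 1, 0, 1, 0],
    groups=["A", "A", "A", "B", "B", "B"],
)
print(scores, f"gap={gap:.2f}", "flagged" if flagged else "ok")
```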

4.3.9. Build Debt

Definition: Technical liabilities arising from fragile, inconsistent, or undocumented build pipelines, affecting submission reproducibility and evaluation reliability.
Problem: Mismatches in libraries, missing version control in Docker images, and ad hoc packaging scripts lead to evaluation failures and participant frustration.
Example: In a vision challenge, CUDA version mismatches between participant images and evaluation servers (e.g., participants building against CUDA 11.1 while the evaluation server supported only CUDA 10.2) caused widespread container crashes.
Stakeholder: Organizer
Severity Score: 4—Major barrier to reproducibility and fair evaluation.
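One low-cost mitigation is a startup check that compares the runtime environment against a version manifest published by the organizers. The sketch below is an illustrative Python example with a hypothetical manifest; real platforms may enforce the same constraint through pinned container images or CI checks instead.

```python
# Minimal sketch: fail fast if the submission's runtime libraries diverge from
# the versions declared by the evaluation environment.
# The manifest contents are hypothetical examples.
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {"numpy": "1.26.4", "torch": "2.2.2"}  # organizer-published manifest (example values)

def check_environment(expected: dict[str, str]) -> list[str]:
    problems = []
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package} is not installed (expected {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{package}=={installed} differs from expected {wanted}")
    return problems

if __name__ == "__main__":
    for problem in check_environment(EXPECTED):
        print("BUILD DEBT WARNING:", problem)
```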

4.3.10. Versioning Debt

Definition: Inadequate tracking of dataset, model, and code versions, leading to reproducibility failures and conflicting results.
Problem: Lack of strict version control in multi-stage competitions creates confusion, mistrust in leaderboard shifts, and difficulty in reproducing past results.
Example: A baseline model changes between competition rounds without declared dataset versioning, causing participant frustration and leaderboard inconsistencies.
Stakeholder: Organizer
Severity Score: 4—High risk for reproducibility, auditability, and scientific integrity.
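A simple countermeasure is to publish a content hash for every dataset release and to record it with each evaluation run, so that any silent change between rounds becomes detectable. The sketch below illustrates the idea in Python; the file paths and manifest layout are assumptions, not a prescribed standard.

```python
# Minimal sketch: fingerprint a dataset release so leaderboard results can be
# tied to an exact data version. Paths and manifest layout are illustrative.
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(root: str) -> str:
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def record_run(dataset_root: str, model_tag: str, out_file: str = "run_manifest.json"):
    manifest = {"dataset_sha256": dataset_fingerprint(dataset_root), "model": model_tag}
    Path(out_file).write_text(json.dumps(manifest, indent=2))
    return manifest

# Example (hypothetical paths): record_run("data/round2", "baseline-v1.3")
```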

4.3.11. Configuration Debt

Definition: Poor management of model and environment configuration settings, leading to silent failures, runtime errors, and reproducibility issues.
Problem: Lack of validation and standardization of configurations (e.g., hyperparameters, preprocessing options) creates fragile, unpredictable behaviors.
Example: Typographical errors in Docker environment variables led to silent dropout deactivation, reducing model robustness without detection.
Stakeholder: Both (Organizer and Participant)
Severity Score: 4—Silent but dangerous; compromises performance, correctness, and reproducibility.
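Much of this debt can be avoided by validating configuration values against an explicit schema instead of reading raw environment variables. The sketch below shows one possible approach in plain Python; the setting names and prefix are hypothetical, and the same idea can be implemented with dedicated schema or configuration libraries.

```python
# Minimal sketch: reject unknown or malformed configuration keys instead of
# silently ignoring them (e.g., a typo such as DROPUOT_RATE never taking effect).
# Setting names, prefix, and expected types are illustrative assumptions.
import os

ALLOWED = {"DROPOUT_RATE": float, "BATCH_SIZE": int, "SEED": int}

def load_config(prefix: str = "COMP_") -> dict:
    config, errors = {}, []
    for key, raw in os.environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        if name not in ALLOWED:
            errors.append(f"Unknown setting '{key}' (possible typo?)")
            continue
        try:
            config[name] = ALLOWED[name](raw)
        except ValueError:
            errors.append(f"Setting '{key}' has invalid value '{raw}'")
    if errors:
        raise ValueError("Configuration debt detected:\n" + "\n".join(errors))
    return config

# Example: COMP_DROPUOT_RATE=0.5 would raise an error instead of silently disabling dropout.
```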

4.3.12. Code Debt

Definition: Poor coding practices, lack of modularity, hardcoded paths, and fragmented experimentation logic leading to maintainability and reproducibility challenges.
Problem: Rapid prototyping and time pressure lead to unstructured, undocumented, and non-reusable codebases, especially in notebook-based workflows.
Example: A winning solution submitted as a monolithic Jupyter notebook cannot be reused or generalized due to embedded hardcoded assumptions.
Stakeholder: Both (Organizer and Participant)
Severity Score: 4—Major impact on long-term usability, reproducibility, and knowledge transfer.

4.3.13. Process Debt

Definition: Deficiencies in lifecycle workflows, coordination practices, and process quality assurance mechanisms in AI competition platforms.
Problem: Lack of standardized submission protocols, versioned communications, and reproducibility-oriented lifecycle models lead to confusion, errors, and friction.
Example: A sudden schema change during a multi-phase competition without clear communication causes participant disqualification due to outdated pipelines.
Stakeholder: Organizer
Severity Score: 4—High risk to platform credibility, efficiency, and fairness.

4.3.14. Documentation Debt

Definition: Insufficient, inconsistent, or outdated documentation covering datasets, models, evaluation procedures, and platform infrastructure.
Problem: Missing or unclear documentation hampers reproducibility, transparency, and post-competition knowledge transfer.
Example: Lack of documented preprocessing steps in a winning solution prevents organizers from validating or integrating the model into production systems.
Stakeholder: Organizer
Severity Score: 4—Compromises transparency, onboarding, and reuse.

4.3.15. People—Social Debt

Definition: Socio-technical misalignments, communication breakdowns, and lack of cross-disciplinary collaboration practices within or across teams.
Problem: Misunderstandings between data scientists, engineers, and organizers create delays, mistrust, and systemic inefficiencies.
Example: A team fails to deploy their solution because of coordination gaps between model developers and DevOps engineers during the evaluation phase.
Stakeholder: Both (Organizer and Participant)
Severity Score: 3—Significant hidden risk affecting collaboration, productivity, and system robustness.

4.3.16. Ethics Debt

Definition: Liabilities stemming from neglecting fairness, transparency, explainability, and accountability in AI competition design and evaluation.
Problem: When fairness auditing and explainability are not explicitly required, competitions may reward models that perform well on metrics but encode unintended biases.
Example: In a healthcare competition, biased training data and the absence of fairness checks led a winning model to systematically underperform on underrepresented groups.
Stakeholder: Both (Organizer and Participant)
Severity Score: 5—Critical ethical and societal risks, particularly in sensitive application domains.

4.3.17. Self-Admitted Technical Debt (SATD)

Definition: Developer-recognized shortcuts or compromises explicitly documented via comments, TODOs, or deferred improvements in codebases.
Problem: SATD left unresolved in competition submissions can propagate hidden defects and technical liabilities into production environments.
Example: A winning solution includes TODO comments about normalization errors that were never addressed, compromising reproducibility in downstream evaluations.
Stakeholder: Participant
Severity Score: 4—Hidden but impactful; undermines maintainability and trust.
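A pre-submission scan for common debt markers is a cheap way to surface SATD before it propagates downstream. The following Python sketch is illustrative only; the SATD detectors discussed in the reviewed literature rely on more sophisticated classification than simple keyword matching.

```python
# Minimal sketch: list self-admitted technical debt markers in a submission
# before packaging it. The marker keywords are a common but non-exhaustive set.
import re
from pathlib import Path

MARKERS = re.compile(r"#\s*(TODO|FIXME|HACK|XXX)\b(.*)", re.IGNORECASE)

def scan_satd(source_dir: str) -> list[tuple[str, int, str]]:
    findings = []
    for path in Path(source_dir).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            match = MARKERS.search(line)
            if match:
                findings.append((str(path), lineno, match.group(0).strip()))
    return findings

# Example (hypothetical directory): for f in scan_satd("submission/"): print(f)
```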

4.3.18. Defect Debt

Definition: Accumulation of unresolved bugs, faulty logic, or data leakage vulnerabilities within competition systems or submitted models.
Problem: Inadequate validation allows latent defects to affect scoring accuracy, leaderboard fairness, and participant trust.
Example: A submission exploits a silent bug in the evaluation API to manipulate leaderboard rankings, discovered only after post-competition audits.
Stakeholder: Organizer
Severity Score: 4—Directly compromises trust, fairness, and scientific validity.
To enhance clarity and support practical understanding, Table 4 provides a consolidated overview of the 19 technical debt types addressed in this study (the 18 categories identified in the review plus the newly introduced Accessibility Debt). Each row maps a debt type to the primary stakeholder(s) it affects, outlines its typical impact on AI-based competition platforms, and suggests representative mitigation strategies. To further assist readers in navigating the assessment tool, the table also includes references to the corresponding questionnaire items. This synthesis facilitates a smooth conceptual transition from the thematic findings to the stakeholder-oriented evaluation instrument presented in Section 5.

4.4. Observed Patterns and Gaps

The thematic analysis of 100 peer-reviewed studies revealed several recurring patterns regarding the manifestation, co-occurrence, and research treatment of technical debt types in AI/ML competition platforms.

4.4.1. Recurring Co-Occurrences

Among the analyzed studies, several high-impact debt types consistently appeared together, suggesting deeper systemic vulnerabilities rather than isolated weaknesses. Notably, Model Debt, Infrastructure Debt, and Data Debt often co-occur, typically stemming from weak lifecycle governance, outdated or inconsistent infrastructure configurations, and reliance on poorly curated datasets. When present together, these debts can amplify one another, leading to brittle AI models that perform well in constrained testing but collapse under real-world variation.
Another recurring pairing involves Algorithm Debt and Test Debt. In these cases, the absence of rigorous evaluation protocols allows algorithmic shortcuts or overfitted strategies to succeed in leaderboard rankings, masking their inability to generalize beyond the competition dataset. These patterns indicate that technical debt rarely exists in isolation and that cumulative effects may be more damaging than previously understood, especially in fast-paced competitive settings.
Such patterns pose critical risks in competition settings, where minor weaknesses can disproportionately affect outcomes.

4.4.2. Underrepresented Technical Debt Categories

The analysis also highlights areas that are comparatively underexplored in the literature. These include
  • Process Debt and People (Social) Debt: Despite their operational impact, these categories receive limited formal treatment. Their manifestations—such as unstructured workflows, poor coordination, or unclear stakeholder roles—are often implicit, making them difficult to quantify yet highly influential in competition outcomes.
  • Build Debt and Infrastructure Debt: While highly relevant in reproducibility and deployment contexts, these are seldom addressed explicitly, particularly in the context of educational or open-source competition platforms.
This imbalance suggests a research gap in understanding the full spectrum of debt types, especially those that emerge from socio-technical dynamics and operational misalignments.

4.4.3. Educational and Human-Centered Contexts

Competitions used in academic settings, such as game jams or course-based AI challenges, show recurring debt patterns linked to Documentation, Test, and Process Debt. These environments emphasize rapid experimentation and learning but often lack structured engineering practices, versioning protocols, and reproducibility checks.
As a result, while they serve as valuable training grounds, they also risk normalizing short-term, debt-prone behaviors unless complemented by clear guidance, infrastructure support, and reflective evaluation mechanisms.

4.4.4. Research and Practical Implications

These findings call for
  • Greater empirical focus on underrepresented debt types, especially social and process-oriented debt.
  • Design of competition platforms and guidelines that explicitly address debt-prone areas (e.g., versioning enforcement, configuration validation).
  • Inclusion of metrics and documentation requirements that encourage sustainable and reproducible AI practices, even in time-constrained or educational settings.
This analysis directly supports the design of our stakeholder-oriented questionnaire and provides a foundational understanding of the technical debt landscape in AI/ML competitions.

4.5. Stakeholder Roles and Technical Debt Responsibility

To support the practical application of our findings, we mapped each technical debt type to the primary stakeholder responsible for its prevention, management, or resolution within AI/ML competition platforms. This mapping is based on a synthesis of the reviewed literature and real-world examples encountered in competition settings.
Table 5 provides an overview of the primary stakeholder roles associated with each of the 18 technical debt types. The classification distinguishes between
  • Competition Organizers: Those responsible for designing and maintaining the platform, defining evaluation protocols, providing infrastructure, and ensuring transparency and fairness.
  • Participants: Teams or individuals who submit AI models or agents to the competition and are accountable for code quality, reproducibility, and ethical compliance.
  • Both: Debt types where responsibility is shared between organizers and participants due to their interdependent nature.
For clarity and conciseness, the term Organizer is used here as an umbrella role that encompasses several functionally distinct sub-roles frequently encountered in competition platforms. These include, among others, the Data Provider, who supplies and curates the competition datasets; the Platform Provider, responsible for maintaining the underlying infrastructure and interface; and the Evaluation Designer, who defines performance metrics and scoring protocols. These sub-roles, while analytically distinct, often overlap in practice and are collectively represented as Organizer to streamline the stakeholder mapping.
Similarly, the term Participant denotes any individual or team submitting AI models to the competition, and may include distinct roles such as model developers, data pre-processors, or post-evaluation analyzers, whose responsibilities vary depending on the competition’s scope and structure.
This mapping enables competition stakeholders to better allocate responsibility, prioritize mitigation actions, and reflect on the structural sources of technical debt in their systems. It also serves as a foundation for customizing the assessment questionnaire introduced in Section 5. A visual representation of stakeholder-specific impact levels per debt type is provided in Figure 3.
Figure 3. Heatmap illustrating the relative impact of each debt type on organizers and participants. The matrix complements Table 5 by visually emphasizing shared and high-risk responsibilities.

4.6. Early Academic Deployment of the Questionnaire

At the time of writing, the questionnaire has been distributed more broadly to academic communities, yielding over 70 preliminary responses. Although the data have not yet been formally analyzed, this initial uptake indicates community interest and provides a promising foundation for future quantitative validation of the instrument.
Further analysis of these responses, including score distribution, stakeholder-specific patterns, and inter-debt correlations, is planned as part of future work. The structure and scoring logic of the questionnaire are described in detail in Section 5.

4.7. Illustrative Use Case: Applying the Questionnaire to a Hypothetical Platform

To demonstrate the practical utility of the stakeholder-oriented questionnaire, we present a hypothetical use case involving an AI/ML competition platform designed for reinforcement learning agents. An organizer wishes to evaluate the platform’s technical debt profile before launching a new edition of the competition.
By completing the questionnaire, the organizer identifies several areas of concern:
  • In the Infrastructure Debt category (Q33–Q34), both items receive a NO response, indicating that the submission environment lacks container-based isolation and resource monitoring. Given their severity weight of 5, this yields a subtotal of 10 in this category.
  • For Documentation Debt (Q26–Q28), two items are answered NO and one I Don’t Know, with weights 4, 3, and 2, respectively, resulting in a subtotal of 9.
  • In Accessibility Debt (Q58–Q60), all three items receive NO responses. With assigned severity weights of 3, 2, and 2, the subtotal is 7.
  • Conversely, in categories such as Model Debt (Q35–Q36) and Configuration Debt (Q7–Q9), most responses are YES or Not Applicable, contributing minimally or neutrally to the total debt score.
The platform’s preliminary technical debt score is calculated by summing the weighted debt contributions across all 19 categories. In this scenario, the subtotal scores suggest disproportionately higher debt in infrastructure, documentation, and accessibility.
Based on these observations, the organizer prioritizes the following mitigation actions:
  • Containerizing the submission pipeline and introducing system-level resource controls.
  • Updating the documentation portal with clear API usage instructions and dataset versioning details.
  • Integrating an accessibility audit checklist into the UI design process, including keyboard navigation and WCAG-compliant color contrast.
While this scenario is hypothetical, it reflects common challenges reported in the literature and showcases how the questionnaire can guide organizers in systematically identifying and addressing technical debt prior to competition deployment. The ability to quantify debt by category and severity also supports evidence-based prioritization.

5. Questionnaire and Quantification Method Approach

5.1. Overview and Purpose

To complement the qualitative typology of technical debt types introduced in Section 4, this section presents a structured instrument for the quantitative assessment of technical debt in AI/ML competition platforms. Building upon the 18-category typology and stakeholder mapping previously established, the proposed questionnaire offers a systematic, replicable means for assessing debt visibility and severity from both organizer and participant perspectives.
While the questionnaire quantifies technical debt as perceived by stakeholders, it is not meant to replace observational or statistical analysis, but rather to complement them. It functions as a lightweight, stakeholder-centric diagnostic tool, grounded in the literature-based typology presented in Section 4. Its purpose is twofold: (i) to foster awareness among developers and organizers of the specific types of technical debt most relevant to their role, and (ii) to provide a structured self-assessment mechanism that enables prioritization and reflection, especially in contexts where formal measurement may be difficult to implement.
Beyond its use in empirical studies, the questionnaire also fills an important practical and educational gap. For example, in university-based competitions or academic settings, it enables students to engage early with software quality principles and the long-term implications of technical debt. In professional environments, it allows teams to reflect on technical decisions, structure mentoring and onboarding processes, and build debt-awareness into early development phases. This rationale underlies its presentation in this paper not merely as an auxiliary tool, but as an actionable interface between the conceptual framework and real-world practice.
The assessment framework is structured around the 18 types of technical debt identified in response to RQ1 and assigns a corresponding set of evaluation questions to each stakeholder. The goal is to provide a practical means for AI competition platforms to detect and quantify areas where debt may be accumulating, and to enable targeted intervention. The next subsection details the structure of the questionnaire, followed by examples of its application in Section 5.3.
The questionnaire is intended as a dynamic and iterative tool. Based on feedback received during preliminary rounds, we refined the instrument prior to this submission, reducing the number of items from 68 to 60 while preserving full typological coverage. This updated version, now reflected in Appendix A, improves clarity, reduces redundancy, and lowers cognitive load, without compromising the representativeness of the 18 (plus 1) technical debt categories. This revision reflects the instrument's adaptability and our commitment to practical usability in real-world contexts.

5.2. Structure and Scoring Logic

The questionnaire consists of 60 closed-form questions, each mapped to one of the 18 (plus one) technical debt types and assigned to a corresponding stakeholder (Organizer or Participant). Each question uses a closed response format with four possible answers:
  • YES—Best practice applied (reduces debt).
  • NO—Best practice not applied (adds debt).
  • I Don’t Know/I Don’t Answer—Indicates uncertainty (adds debt).
  • Not Applicable—Neutral, scored as zero.
Each item is internally assigned a severity weight from 1 to 5, based on its perceived importance in mitigating the relevant technical debt, as derived from our literature synthesis and thematic analysis (see Section 4). While respondents are not informed of the exact weights, the scoring logic operates as follows:
  • YES: reduces the total debt score, yields a negative value equal to the weight (−w).
  • NO or I Don’t Know/I Don’t Answer: increases the score, yields a positive value equal to the weight (+w).
  • Not Applicable: scored as zero (neutral).
The cumulative score for each debt category is calculated by summing the individual item scores. This aggregated value reflects the perceived severity of the technical debt:
  • A total score ≤0 indicates low or no observable technical debt.
  • A score between 1 and 10 suggests moderate technical debt.
  • A score above 10 indicates significant technical debt accumulation, signaling a need for focused mitigation.
A positive total score reflects divergence from best practices, while a negative or near-zero score suggests alignment. This scoring system enables comparative analysis across stakeholder groups and debt types and supports longitudinal tracking of improvements over time.
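For readers who prefer an operational view, the scoring rule can be expressed as a short computation. The sketch below is a minimal Python illustration with hypothetical item identifiers and weights; the questionnaire in Appendix A remains the authoritative definition of items and weights.

```python
# Minimal sketch of the category-level scoring logic described above.
# Item identifiers and weights are hypothetical and used for illustration only.

WEIGHTS = {"Q33": 5, "Q34": 5}  # e.g., two items of a single debt category

def item_score(response: str, weight: int) -> int:
    """YES reduces debt (-w); NO or I Don't Know adds debt (+w); Not Applicable is neutral."""
    if response == "YES":
        return -weight
    if response in ("NO", "I DON'T KNOW"):
        return weight
    return 0  # Not Applicable

def category_score(responses: dict[str, str]) -> int:
    return sum(item_score(answer, WEIGHTS[q]) for q, answer in responses.items())

def interpret(score: int) -> str:
    if score <= 0:
        return "low or no observable technical debt"
    if score <= 10:
        return "moderate technical debt"
    return "significant technical debt accumulation"

score = category_score({"Q33": "NO", "Q34": "NO"})
print(score, "->", interpret(score))  # 10 -> moderate technical debt
```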
For certain categories, such as Accessibility Debt, responses are currently solicited only from organizers, given their direct control over platform design and usability. In future implementations, feedback from participants may be integrated to provide a more comprehensive assessment.
These category-level severity scores, as summarized in Figure 4, were derived by aggregating question-level weights and triangulating them with insights from the literature. While these scores are not intended to be absolute measures, they serve as relative indicators to support stakeholder prioritization, benchmarking, and longitudinal tracking. The scoring logic is designed to be adaptable and may be refined in future iterations based on empirical feedback or expanded debt taxonomies.

5.3. Examples of Use

The following examples in Table 6 and Table 7 illustrate how the questionnaire can be used by different stakeholders to identify technical debt in specific categories:
These examples demonstrate how even a small number of overlooked or uncertain practices can increase a stakeholder’s technical debt exposure. The scoring framework rewards transparency and alignment with sound engineering practices.

5.4. Full Set of Questions

The complete questionnaire, including all 60 questions across the 18 debt categories (plus Accessibility Debt), is included in Appendix A. Each question entry includes
  • The associated technical debt category,
  • The stakeholder role (Organizer or Participant),
  • The severity weight (1–5),
  • A brief justification, and
  • A realistic AI competition-specific scenario.
This structure allows for use in both self-assessment and instructional settings, offering a concrete and flexible framework for evaluating technical debt in complex AI systems.

5.5. Accessibility Debt

Accessibility Debt, as introduced in this study, refers to the barriers participants face due to a lack of immediate usability and inclusive design in competition platforms. These barriers include unintuitive interfaces, fragmented onboarding processes, insufficient documentation, lack of multilingual support, and limited responsiveness to novice needs [74,154,155]. Such issues can silently discourage engagement, disproportionately affecting underrepresented participants, and ultimately compromising competition fairness and diversity [9,24].
Unlike traditional usability or infrastructure concerns—typically addressed within broader software quality evaluations—Accessibility Debt is conceptualized here as a socio-technical liability capturing both experiential and structural design limitations specific to AI/ML competition platforms. While it overlaps with categories such as Infrastructure Debt (e.g., platform readiness or technical constraints) [50,137], Documentation Debt (e.g., missing or ambiguous instructions) [134,135], and Process Debt (e.g., opaque submission workflows or limited user guidance) [38,72], its focus is distinct: it addresses how barriers in accessibility accumulate into measurable forms of technical debt that hinder participation, particularly for newcomers or non-technical users.
The decision to introduce Accessibility Debt as a standalone category arose from recurring themes identified in the literature that were not sufficiently represented in existing typologies. Several studies allude to accessibility concerns—such as the absence of onboarding support, multilingual limitations, or high entry barriers—but do not explicitly categorize them under a cohesive technical debt label [8,18]. Our thematic synthesis consolidated these concerns into a unified and operationalizable category that emphasizes the importance of inclusive and usable design in competition infrastructures [9,103].
To enable empirical assessment, a dedicated subset of questionnaire items was developed to measure the presence and severity of Accessibility Debt. These items are scored from the organizer’s perspective, as the relevant design decisions fall under their control. Nevertheless, we acknowledge that participant feedback could offer valuable complementary insights and may be incorporated in future iterations of the assessment framework.

6. Discussion and Implications

As discussed in Section 5, the proposed stakeholder-oriented questionnaire not only quantifies technical debt across the 18 categories but also operationalizes a new category—Accessibility Debt—which emerged as a critical concern in both the literature and our pilot observations. The following section discusses the broader implications of our typology and assessment methodology.

6.1. Synthesis of Key Insights

This study examined how technical debt manifests and is managed in AI/ML competition platforms, addressing both conceptual gaps and practical needs. Through a scoping review of 100 publications and a structured stakeholder-oriented framework, we identified 18 recurring debt types that affect the sustainability, transparency, and fairness of AI systems in competitive settings.
The results show that technical debt in such platforms is multifaceted, encompassing not only traditional categories like code or test debt, but also AI-specific concerns such as model lifecycle management, data drift, fairness auditing, and reproducibility. Furthermore, socio-organizational debt types—such as People Debt, Process Debt, and the newly introduced Accessibility Debt—are often overlooked yet have significant impact on inclusion, usability, and educational value.
As summarized in Figure 4, categories such as Model Debt, Data Debt, and Ethics Debt were perceived as among the most severe, whereas areas such as Requirements Debt or Build Debt were more context-dependent. Moreover, the co-occurrence of high-severity debt types (e.g., infrastructure + model + data) suggests structural risk clusters that merit focused intervention (see Section 4.4). Representative examples of such interdependencies are further discussed in Section 6.3 and illustrated in Appendix B (Table A2), where indicative cause–effect relationships are mapped across debt types.

6.2. Practical Implications

The proposed framework has implications for three primary stakeholder groups:
  • Competition Organizers: Organizers should proactively assess and mitigate debt types under their responsibility, including infrastructure, configuration, documentation, and accessibility-related aspects. Tools like the stakeholder-specific questionnaire can support pre-competition audits, platform design iterations, and post-competition reflection. As shown in Figure 4, certain debt types (e.g., Data and Infrastructure Debt) have high organizer-side impact, highlighting the importance of early planning, communication, and versioning strategies.
  • Competition Participants: Participants can use the questionnaire to self-assess their submissions and development workflows, focusing on areas like code hygiene, test coverage, and algorithmic transparency. Identifying hidden debt (e.g., SATD, configuration or versioning issues) before submission enhances reproducibility and ethical compliance.
  • Educators and Instructors: Academic institutions can embed the questionnaire in AI engineering curricula to teach responsible development practices. Early feedback from the pilot deployment (Section 4.6) suggests that students gained awareness of platform-level responsibilities and software-engineering-for-AI challenges, beyond mere model performance optimization.

6.3. Research Implications and Gaps

This work surfaces important research opportunities:
  • Underexplored categories such as People Debt, Process Debt, and Accessibility Debt merit dedicated empirical investigation, especially in educational and open-source platforms.
  • Quantitative evaluation of the questionnaire’s validity and sensitivity will be a priority for future work, once the collected responses are analyzed.
  • Interdependencies between debt types (e.g., between data, infrastructure, and model) should be formally modeled to better understand risk propagation. For example, poor documentation may directly hinder onboarding and inclusivity, thereby amplifying Accessibility Debt. Similarly, weak test practices can increase the likelihood of Defect Debt, while Build and Configuration Debt often co-occur in poorly modularized pipelines. These preliminary observations, drawn from the thematic analysis, support our view that certain debt types may be clustered and co-managed. While a formal dependency model is beyond the scope of this work, illustrative relationships are presented in Appendix B (Table A2) to support future investigations into structured cause-effect mappings among debts.
  • Expansion of stakeholder roles beyond organizers and participants (e.g., reviewers, maintainers, platform developers) could reveal additional systemic concerns.
The proposed typology and questionnaire offer a starting point for the development of AI-specific technical debt taxonomies and management strategies.

6.4. Threats to Validity

As with all scoping reviews, several limitations apply:
  • Selection Bias: Although the PRISMA-ScR process was followed (Section 3), some relevant but non-indexed studies may have been missed.
  • Publication Bias: The review relies on peer-reviewed and indexed literature, which may underrepresent studies with inconclusive, negative, or incomplete results. Additionally, the exclusion of non-English publications may limit insights from other linguistic and regional contexts. While this is common in scoping reviews, it remains a relevant limitation.
  • Subjectivity in Categorization: The thematic classification of technical debt types was grounded in the literature and expert interpretation, but alternative taxonomies may be proposed in future work.
  • Limited Evaluation Data: The pilot study in Section 4.6 involved a relatively small number of participants in an educational setting, and the broader deployment (currently over 70 responses) has not yet been statistically analyzed.
  • Context-Specific Observations: Some examples and interpretations are tailored to academic or gamified platforms, and may require adaptation for industrial or large-scale AI competitions.
  • Limited Access to Commercial Platforms: The lack of access to proprietary codebases and internal infrastructures from commercial competition providers (e.g., Kaggle, DrivenData) restricts our ability to assess organizer-side technical debt in those environments. Nonetheless, the questionnaire remains applicable from the participant perspective, paving the way for future investigations into how such competitions influence development workflows and debt accumulation on the submission side.
These limitations notwithstanding, the study provides a grounded, reproducible, and practically oriented framework for understanding technical debt in AI competitions.

6.5. Future Work

We intend to expand this work through several key directions. First, we aim to conduct a quantitative analysis of questionnaire responses from current and future deployments to evaluate the tool’s robustness, identify stakeholder-specific debt patterns, and explore interdependencies between debt types. Second, we plan to develop platform-specific adaptations of the framework, particularly for widely used environments such as Kaggle, RLGame, and GBG Framework. These adaptations will consider domain-specific characteristics, including competition duration, evaluation metrics, and reproducibility mechanisms.
Third, we intend to extend the stakeholder scope by including additional roles such as reviewers, mentors, and platform maintainers, whose contributions and liabilities may affect platform sustainability. We also anticipate refining the existing typology, particularly in light of emerging AI trends such as generative models and reinforcement learning pipelines. Potential extensions may include Prompt Engineering Debt or Explainability Debt, which are increasingly relevant in complex, high-impact AI applications.
Our findings offer a structured starting point for promoting stakeholder-aligned, measurable, and ethically grounded technical debt management in AI competition settings.

7. Conclusions

This study addressed the growing need to systematically assess and manage technical debt in AI/ML competition platforms—dynamic environments where code quality, data integrity, and infrastructure sustainability are often compromised due to time pressure, complexity, or limited engineering practices.
By conducting a structured scoping review of 100 publications, we introduced a unified typology of 18 technical debt types and proposed a novel additional category, Accessibility Debt, which captures challenges related to the immediacy and ease of use of a technological platform. This category is particularly relevant in the context of AI competition platforms, where user interaction speed and operational usability are critical for effective engagement. These debt types were analyzed in terms of their definitions, stakeholder responsibilities, severity, and empirical manifestations in real-world platforms.
To support operational decision-making, we developed a stakeholder-oriented questionnaire for the structured evaluation of technical debt. The tool enables both organizers and participants to assess risks across multiple categories, providing actionable feedback through a simple scoring scheme. Initial dissemination within senior undergraduate and postgraduate courses has yielded over 70 responses, to be used for further analysis.
The study contributes both a conceptual foundation and a practical framework to the discourse on software engineering for AI. It emphasizes that technical debt in AI competitions is not merely a maintenance issue, but a multidimensional design, governance, and equity concern.
By making technical debt visible to and measurable by stakeholders, this work promotes more transparent, inclusive, and sustainable development practices in the evolving landscape of AI research competitions.

Broader Applicability and Design Implications

While the framework developed in this study was designed specifically for AI/ML competition platforms, several of the identified technical debt categories—such as infrastructure, data, configuration, and documentation debt—are relevant to a wider spectrum of AI-based systems. Nonetheless, certain aspects, including People Debt and Accessibility Debt, are more closely aligned with the unique characteristics of competition environments and may not directly generalize to enterprise, industrial, or safety-critical AI settings.
This limitation stems from the deliberate contextual scope of our study, rather than from a flaw or rigidity in the structure of the typology itself. In fact, the insights gained through this study may contribute to the informed design of future competition platforms, helping organizers to prioritize maintainability, developer onboarding, inclusivity, and long-term sustainability. Moreover, the categorization of ethically and operationally significant debt types, such as Ethics Debt, Versioning Debt, and Documentation Debt, may provide a structured foundation for informing platform design practices and supporting alignment with emerging AI governance principles and policy frameworks, particularly those concerning auditability, explainability, and responsible system governance.

Author Contributions

Conceptualization, D.S. and D.K.; methodology, D.S.; validation, D.S. and D.K.; formal analysis, D.S.; investigation, D.S.; data curation, D.S.; writing—original draft preparation, D.S.; writing—review and editing, D.S. and D.K.; visualization, D.S.; supervision, D.K.; project administration, D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this scoping review are derived from publicly available sources retrieved through a comprehensive search across major scientific databases, as described in the Methodology section. All included studies, extracted information, and bibliographic references are accessible through the repositories cited in the References list.

Acknowledgments

The authors acknowledge the use of ChatGPT-4 (OpenAI) for grammar and language polishing. The AI tool was not involved in any scientific content development. The authors reviewed all outputs and bear full responsibility for the final text. The questionnaire was initially developed with 68 questions and revised to 60 based on user feedback and expert review prior to submission.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
CPU: Central Processing Unit
DSLE: Data Science Lab Environment
GBG: General Board Game
GPU: Graphics Processing Unit
GVGAI: General Video Game Artificial Intelligence
ML: Machine Learning
PRISMA-ScR: Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews
RL: Reinforcement Learning
RQ: Research Question
SATD: Self-Admitted Technical Debt
TD: Technical Debt
XAI: Explainable Artificial Intelligence

Appendix A. Full Questionnaire (Updated Version—60 Questions)

Appendix A.1. Algorithm Debt—Question 1

1.
Question 1: Have you checked if the framework you are using has technical debt or may introduce glitches or incompatibility in your application?
Stakeholder: Organizers and Participants
Score: 3
Justification: Identifying and understanding the technical debt within the framework is essential. It can affect the application’s performance, scalability, and even the user experience. Glitches and incompatibilities can lead to a poor reputation and user frustration.
Example: If the platform is using an outdated version of TensorFlow, it might miss out on new optimizations that could speed up model training. If the chosen framework has a history of memory leaks, it could affect the platform’s ability to scale and handle multiple concurrent games.
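To make the framework check in Question 1 concrete, the following minimal Python sketch compares installed package versions against required minimums and reports anything missing or outdated. The package names and version thresholds are hypothetical placeholders, not recommendations for any particular platform.

# Illustrative sketch: flag outdated framework dependencies before a competition run.
# The package names and minimum versions below are hypothetical placeholders.
from importlib.metadata import version, PackageNotFoundError

REQUIRED_MINIMUMS = {
    "tensorflow": "2.12.0",  # hypothetical minimum assumed to include needed optimizations
    "numpy": "1.24.0",
}

def parse(v: str):
    """Convert a dotted version string into a comparable tuple of integers."""
    return tuple(int(part) for part in v.split(".")[:3] if part.isdigit())

def check_dependencies(minimums: dict) -> list:
    """Return human-readable warnings for missing or outdated packages."""
    warnings = []
    for package, minimum in minimums.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            warnings.append(f"{package} is not installed")
            continue
        if parse(installed) < parse(minimum):
            warnings.append(f"{package} {installed} is older than the required {minimum}")
    return warnings

if __name__ == "__main__":
    for warning in check_dependencies(REQUIRED_MINIMUMS):
        print("WARNING:", warning)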

Appendix A.2. Architectural Debt—Questions 2 to 3

2.
Question 2: Have you effectively separated concerns and ensured that code reuse does not lead to tightly coupled components?
Stakeholder: Organizers
Score: 5
Justification: Poor separation of concerns can lead to a tangled system that is hard to debug and evolve, significantly increasing technical debt.
Example: Using interfaces or abstract classes to define contracts between components, so they can be easily swapped or modified without affecting others.
3.
Question 3: Have you designed the environment for prototyping ML models to prevent the need to re-implement from scratch for production?
Stakeholder: Organizers
Score: 4
Justification: The need to re-implement models for production can lead to redundant work and increased technical debt.
Example: A model developed in a research setting that requires significant refactoring to be deployed in production.
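As a concrete illustration of Questions 2 and 3, the following minimal Python sketch shows how an abstract agent interface can decouple the platform loop from a specific implementation, so that a research prototype can later be replaced by a production model without re-implementation. The class and method names are hypothetical and do not correspond to any specific platform API.

# Illustrative sketch: a common agent interface so a research prototype can be
# promoted to production without re-implementation.
from abc import ABC, abstractmethod
import random

class Agent(ABC):
    """Contract shared by prototype and production agents."""

    @abstractmethod
    def act(self, observation: list) -> int:
        """Return an action index for the given observation."""

class RandomPrototypeAgent(Agent):
    """Quick research prototype: picks a random action."""
    def __init__(self, n_actions: int):
        self.n_actions = n_actions

    def act(self, observation: list) -> int:
        return random.randrange(self.n_actions)

def run_episode(agent: Agent, observations: list) -> list:
    """Platform-side loop that depends only on the Agent contract."""
    return [agent.act(obs) for obs in observations]

if __name__ == "__main__":
    # Swapping in a trained production agent later requires no change here,
    # as long as it implements the same Agent interface.
    actions = run_episode(RandomPrototypeAgent(n_actions=4), observations=[[0.1], [0.2]])
    print(actions)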

Appendix A.3. Build Debt—Questions 4 to 5

4.
Question 4: Have you checked your app for bad dependencies or suboptimal dependencies of internal and external artificial intelligence models?
Stakeholder: Organizers and Participants
Score: 4
Justification: Ensuring that your app is free from bad or suboptimal dependencies is crucial for maintaining the integrity, performance, and security of your application. Bad or suboptimal dependencies can contribute to building debt by causing conflicts, reducing performance, and making the building process more fragile.
Example: A reinforcement learning platform might rely on a specific machine learning library for neural network computations. If this library is not kept up-to-date or is known to have performance issues, it could hinder the platform’s ability to scale or adapt to new challenges.
5.
Question 5: Have you tested the ability to build the app on different platforms?
Stakeholder: Organizers
Score: 3
Justification: Cross-platform compatibility is important for reaching a wider audience and ensuring that your app can operate on various devices and operating systems. Inability to build consistently across different platforms can be indicative of Build Debt, as it suggests a lack of portability and potential issues with platform-specific dependencies.
Example: Ensure that the platform can be deployed on both Windows and Linux systems, which might require different sets of dependencies and configurations.

Appendix A.4. Code Debt—Question 6

6.
Question 6: Have you identified and refactored low-quality, complex, and duplicated code sections, including dead codepaths and centralized scattered code, while ensuring clear component and code APIs?
Stakeholder: Organizers
Score: 5
Justification: Dead codepaths, low-quality code, and excessive complexity negatively affect performance, maintainability, and scalability. Removing unused or redundant code improves resource efficiency and reduces the risk of bugs. Detecting poorly written or overly complex logic early supports cleaner design, easier debugging, and long-term maintainability. Simplifying complex sections and consolidating duplicated code enhances readability, encourages reuse, and promotes consistency across the platform.
Example: If the codebase contains duplicate functions for player score calculation, refactoring them into a single, well-defined function improves clarity and reduces maintenance overhead. Similarly, breaking down a large, hard-to-follow function into smaller components can improve readability. Decoupling tightly bound logic and defining clear interfaces also helps simplify debugging and feature extension. Another example would be restructuring an API to follow RESTful conventions, improving usability and integration.

Appendix A.5. Configuration Debt—Questions 7 to 9

7.
Question 7: Is it easy to specify a configuration as a small change from a previous configuration?
Stakeholder: Organizers
Score: 3
Justification: In RL, quick experimentation is crucial. Being able to specify configuration changes easily allows rapid iteration and model improvement.
Example: A platform allows incremental changes to the learning rate of an RL agent by modifying a single line in a YAML file, facilitating quick experimentation.
8.
Question 8: Do you have poorly documented or undocumented configuration files?
Stakeholder: Organizers
Score: 4
Justification: Proper documentation facilitates understanding and maintenance of configurations.
Example: A platform’s configuration file lacks comments explaining the purpose of each parameter, leading to confusion among developers.
9.
Question 9: Have the configuration files been thoroughly reviewed and tested?
Stakeholder: Organizers
Score: 4
Justification: Reviewing and testing configuration files is vital for catching errors early, optimizing performance, maintaining consistency, and ensuring that the system behaves as expected under different conditions.
Example: A platform undergoes a peer review process for configuration changes, followed by automated tests to validate the new settings.
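The following minimal Python sketch illustrates Questions 7 and 9 together: an experiment configuration is derived as a small override of a base configuration and then validated before use. It assumes the PyYAML package, and the configuration keys and value ranges are hypothetical.

# Illustrative sketch: derive a new configuration as a small change to a base
# configuration, then validate it before use (cf. Questions 7 and 9).
import copy
import yaml

BASE_CONFIG_YAML = """
agent:
  learning_rate: 0.001
  gamma: 0.99
training:
  episodes: 500
"""

def apply_override(config: dict, dotted_key: str, value):
    """Override a single nested key given in dotted form, e.g. 'agent.learning_rate'."""
    new_config = copy.deepcopy(config)
    node = new_config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node[key]
    node[leaf] = value
    return new_config

def validate(config: dict):
    """Minimal sanity checks; a real platform would test far more."""
    assert 0 < config["agent"]["learning_rate"] < 1, "learning rate out of range"
    assert 0 < config["agent"]["gamma"] <= 1, "discount factor out of range"
    assert config["training"]["episodes"] > 0, "episode count must be positive"

if __name__ == "__main__":
    base = yaml.safe_load(BASE_CONFIG_YAML)
    experiment = apply_override(base, "agent.learning_rate", 0.0005)
    validate(experiment)
    print(experiment)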

Appendix A.6. Data Debt—Questions 10 to 15

10.
Question 10: Have you checked for spurious data?
Stakeholder: Participants
Score: 4
Justification: Spurious data can introduce noise and lead to overfitting, which increases technical debt due to the need for additional debugging and retraining.
Example: Identifying and removing outliers from player score datasets that do not align with expected patterns.
11.
Question 11: Have you checked your data for accuracy?
Stakeholder: Participants
Score: 5
Justification: Inaccurate data can mislead the training process, resulting in models that perform poorly in real-world scenarios, thus accumulating technical debt.
Example: Verifying the correctness of reward signals in the game environment to ensure they reflect the intended game mechanics.
12.
Question 12: Have you checked your data for completeness?
Stakeholder: Participants
Score: 4
Justification: Incomplete data can result in underfitting and poor generalization, leading to technical debt when the model fails to perform as expected.
Example: Ensuring that the dataset includes a wide range of scenarios that a player might encounter in the game.
13.
Question 13: Have you checked your data for trustworthiness?
Stakeholder: Participants
Score: 4
Justification: Untrustworthy data can stem from biased or manipulated sources, increasing technical debt by causing the model to learn incorrect behaviors.
Example: Assessing the reliability of data sources, such as player feedback, to confirm they are not influenced by external incentives.
14.
Question 14: Have you performed testing on the input features?
Stakeholder: Participants
Score: 3
Justification: Testing input features is essential to ensure they are predictive and relevant, reducing technical debt by preventing the inclusion of irrelevant or redundant features.
Example: Conducting feature selection to determine the most significant inputs for predicting player engagement.
15.
Question 15: Have you checked your data for data relevance?
Stakeholder: Participants
Score: 3
Justification: Irrelevant data can lead to a model that does not adapt well to the task, increasing technical debt through unnecessary complexity and maintenance.
Example: Filtering out gameplay data that does not contribute to the learning objective, such as background art elements.
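The following minimal Python sketch illustrates the kind of automated checks behind Questions 10 to 15, flagging out-of-range score values and incomplete records. The field names and the valid score range are hypothetical and would be set per competition.

# Illustrative sketch: lightweight data-quality checks for spurious values and
# completeness (cf. Questions 10-15).
REQUIRED_FIELDS = {"player_id", "score", "episode"}

def find_out_of_range(scores, low=0.0, high=100.0):
    """Return score values outside the plausible range (possible spurious data)."""
    return [s for s in scores if not (low <= s <= high)]

def find_incomplete(records):
    """Return records missing a required field or containing None values."""
    return [
        r for r in records
        if not REQUIRED_FIELDS.issubset(r) or any(r.get(f) is None for f in REQUIRED_FIELDS)
    ]

if __name__ == "__main__":
    records = [
        {"player_id": 1, "score": 12.0, "episode": 1},
        {"player_id": 2, "score": 9000.0, "episode": 1},  # likely spurious
        {"player_id": 3, "score": None, "episode": 2},    # incomplete
    ]
    valid_scores = [r["score"] for r in records if r["score"] is not None]
    print("out of range:", find_out_of_range(valid_scores))
    print("incomplete:", find_incomplete(records))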

Appendix A.7. Design Debt—Questions 16 to 18

16.
Question 16: Pipeline Jungle—Is it possible to maintain a single controllable, straightforward pipeline of ML components?
Stakeholder: Organizers and Participants
Score: 3
Justification: A convoluted pipeline can be difficult to maintain and upgrade, contributing to technical debt over time.
Examples: Multiple data preprocessing steps scattered across the pipeline make it hard to track changes and fix bugs. Related checks include whether you have designed separate modules/services for data collection and data preparation, and whether you have checked for improper reuse of complete AI components or pipelines. A minimal consolidated-pipeline sketch is provided at the end of this subsection.
17.
Question 17: Does your system include glue code?
Stakeholder: Organizers and Participants
Score: 2
Justification: Glue code is often a quick fix that becomes permanent, increasing technical debt as the system scales.
Example: Temporary scripts that become a permanent part of the workflow, complicating future updates.
18.
Question 18: Have you avoided reusing a slightly modified complete model (correction cascades)?
Stakeholder: Participants
Score: 3
Justification: Correction cascades can create a maintenance nightmare, adding to the technical debt each time the base model is updated.
Example: A small change in the base model requiring adjustments in all derived models.
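As a sketch of the single, controllable pipeline asked about in Question 16, the following Python example consolidates preprocessing and modeling into one declared pipeline object, so that training and serving always apply the same steps in the same order. It assumes scikit-learn and NumPy, and the feature data shown is synthetic.

# Illustrative sketch: one declared pipeline replaces ad hoc preprocessing spread
# across scripts (cf. Question 16).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=200)),
])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))             # synthetic match features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic win/loss labels
    pipeline.fit(X, y)                        # every training or serving call goes
    print("training accuracy:", pipeline.score(X, y))  # through exactly the same steps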

Appendix A.8. Defect Debt—Questions 19 to 23

19.
Question 19: Have you checked that there is no error in the training data collection that would cause a significant training dataset to be lost or delayed?
Stakeholder: Participants
Score: 5
Justification: Ensuring error-free training data collection is paramount. Errors in training data can introduce biases and inaccuracies that compound over time, leading to significant technical debt.
Example: If a racing game’s training data incorrectly labels off-track excursions as successful maneuvers, the model may learn to drive off-track, requiring extensive retraining and data cleansing later.
20.
Question 20: Have you made the right choice in the hyperparameter values?
Stakeholder: Participants
Score: 4
Justification: Choosing the right hyperparameters is essential for model performance and efficiency. Incorrect hyperparameter values can lead to suboptimal performance or slow convergence, impacting the overall effectiveness of the model.
Example: Incorrect learning rates can cause a model to converge too slowly or not at all, impacting the speed of iteration and potentially leading to a backlog of updates.
21.
Question 21: Have you made sure that there is no degradation in view prediction quality due to data changes, different code paths, etc.?
Stakeholder: Participants
Score: 4
Justification: Ensuring that changes in data or code paths do not degrade view prediction quality is critical in RL games. Even minor degradation in view prediction quality can affect the player’s experience and the game’s overall performance.
Example: In a strategy game where unit strengths may change over time, a model that cannot adapt to these changes will find its strategies becoming obsolete.
22.
Question 22: Have you quality inspected and validated the model for adequacy before releasing it to production?
Stakeholder: Participants
Score: 5
Justification: Quality inspection and validation of the model before releasing it to production are essential in RL games to ensure that the model performs adequately and meets the desired performance criteria. Releasing an inadequately validated model can lead to a poor player experience, a form of technical debt that is often expensive to address post-release and can damage the game’s reputation.
Example: A model that has not been validated for a shooter game might misclassify in-game objects, leading to frustrating gameplay and the need for urgent patches.
23.
Question 23: Have you implemented mechanisms for rapid adaptation and regular updates to maintain the model’s efficiency and relevance in response to changes in data, features, modeling, or infrastructure?
Stakeholder: Organizers and Participants
Score: 4
Justification: Implementing mechanisms for rapid adaptation is essential in the fast-paced environment of games, where data and features can change frequently. Without this, the model may quickly become outdated, accumulating technical debt.
Example: In a multiplayer online battle arena (MOBA) game, new characters and abilities are introduced regularly; without rapid adaptation, the model’s strategies could become ineffective.

Appendix A.9. Documentation Debt—Questions 24 to 28

24.
Question 24: Is Requirement Documentation available?
Stakeholder: Organizers
Score: 5
Justification: Requirement documentation is crucial for understanding the goals and objectives of the project, as well as the functionalities expected from the system. Without it, development might lack direction or focus, leading to potential misunderstandings and misalignments.
Example: If the platform’s matchmaking algorithm requirements are not well-documented, developers might implement incorrect or suboptimal features that could require significant rework.
25.
Question 25: Is Technical Documentation available?
Stakeholder: Organizers
Score: 5
Justification: Technical documentation provides insight into the system’s architecture, components, and functionalities from a technical perspective. It aids developers in understanding how different parts of the system interact and how to implement or modify them effectively.
Example: If documentation is lacking, integrating a new game into the platform may take significantly more time and increase the risk of mistakes during the process.
26.
Question 26: Is End-user Documentation available?
Stakeholder: Organizers and Participants
Score: 5
Justification: Clear end-user documentation—including tutorials and help guides—plays a key role in enhancing user satisfaction and minimizing the need for ongoing support.
Example: When features are not well-documented, users may become frustrated or confused, resulting in more support requests and potentially harming the platform’s reputation.
27.
Question 27: Is the documentation clear?
Stakeholder: Organizers and Participants
Score: 5
Justification: Clear documentation is key to effective communication. When instructions are easy to follow and free from ambiguity, both users and developers can interact with the system more efficiently, improving overall usability.
Example: If the process for submitting competition scores is not clearly explained, participants may interpret it differently, leading to inconsistent data and compromising the accuracy of the leaderboard.
28.
Question 28: Is the documentation up to date?
Stakeholder: Organizers and Participants
Score: 5
Justification: Maintaining up-to-date documentation is essential, as outdated information can misguide both developers and users. This may result in the use of obsolete features or incorrect platform versions, creating unnecessary setbacks.
Example: If recent updates to competition rules are not reflected in the documentation, participants may unknowingly follow outdated instructions, potentially causing confusion or disqualification.

Appendix A.10. Ethics Debt—Questions 29 to 30

29.
Question 29: Are you familiar with the implementation guidelines, the process for submitting clarification requests, and how to address conflicting interpretations of complex or ambiguous concepts?
Stakeholder: Organizers and Participants
Score: 5
Justification: Knowing the implementation rules is crucial for ensuring that the competition is fair and that all participants have a clear understanding of what is expected. This knowledge helps maintain the integrity of the competition and prevents ethical breaches.
Example: In the RangL competition platform, clear implementation rules ensure that participants can optimize strategies within ethical boundaries.
30.
Question 30: Do you know the consequences of non-compliance?
Stakeholder: Organizers and Participants
Score: 5
Justification: Awareness of the consequences of non-compliance is essential to deter unethical behavior and ensure adherence to the competition’s rules. It also helps in maintaining a level playing field for all participants.
Example: In gaming competitions, non-compliance with rules can lead to disqualification or legal action, as seen in cases where anti-competitive collusion resulted in fines and penalties; this is especially relevant for Game Jams.

Appendix A.11. Infrastructure Debt—Questions 31 to 34

31.
Question 31: Are there mechanisms in place for automated monitoring and alerting of infrastructure performance metrics (e.g., Central Processing Unit (CPU) usage, memory utilization, network throughput)?
Stakeholder: Organizers
Score: 4
Justification:
-
Maintainability: Automated monitoring can reduce technical debt by making it easier to maintain the system.
-
Future-proofing: Early detection of performance issues can prevent the accumulation of technical debt related to system degradation.
Example: Implementing comprehensive monitoring from the start avoids the need for costly refactoring of the monitoring system later on. A competition’s infrastructure might include an automated monitoring system that continuously tracks CPU and GPU usage, memory utilization, and network throughput; if any metric crosses a predefined threshold, such as CPU usage exceeding 90%, the system triggers an alert to the technical team, prompting immediate investigation and intervention to prevent performance degradation during the competition (a minimal monitoring sketch is provided at the end of this subsection).
32.
Question 32: Has provision been made in the infrastructure for sufficient computing resources available?
Stakeholder: Organizers and Participants
Score: 3
Justification:
-
Scalability: While important, over-provisioning resources can lead to unnecessary complexity and costs, contributing to technical debt.
-
Cost Management: Balancing resources with actual needs can minimize expenses and reduce the risk of investing in technologies that may become obsolete.
Example: Investing in scalable cloud services can prevent over-commitment to a particular infrastructure setup, reducing long-term technical debt.
33.
Question 33: Have you developed a robust data pipeline for easy experimentation with AI algorithms?
Stakeholder: Organizers
Score: 5
Justification: Providing the necessary infrastructure to support efficient experimentation is essential in AI competition platforms. A well-designed data pipeline enables participants to iterate quickly, test diverse algorithms, and fine-tune their models—ultimately improving submission quality and raising the overall level of competition. Sound infrastructure management plays a critical role in the platform’s success.
Example: In an image recognition challenge, a participant aiming to compare multiple machine learning models can benefit from a reliable data pipeline. It allows them to preprocess large datasets with ease, train and evaluate various architectures and hyperparameter settings, and streamline the experimentation process for faster optimization.
34.
Question 34: Have you automated pipelines for model training, deployment, and integration?
Stakeholder: Organizers and Participants
Score: 4
Justification: Automated pipelines are essential for managing AI models on the competition platform. They streamline processes for both organizers and participants by reducing manual effort, minimizing errors, and ensuring consistent model deployment and integration. This enhances the platform’s scalability and efficiency, enabling seamless management of numerous submissions and models.
Example: Imagine an organizer hosting a competition where participants need to deploy their trained models to make predictions on new data. With automated pipelines in place, participants can simply upload their model artifacts to the platform, and the system automatically handles the deployment process, integrating the models into the competition framework for evaluation without manual intervention. This streamlines the submission process and accelerates the deployment of new models on the platform.
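A minimal Python sketch of the automated monitoring and alerting described in Question 31 is given below. It assumes the psutil package; the thresholds and the alert channel (here simply a printed message) are hypothetical and would be replaced by the platform’s real alerting mechanism.

# Illustrative sketch of automated infrastructure monitoring with threshold alerts
# (cf. Question 31).
import time
import psutil

THRESHOLDS = {"cpu_percent": 90.0, "memory_percent": 85.0}

def sample_metrics() -> dict:
    """Collect a snapshot of basic infrastructure metrics."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
    }

def check_and_alert(metrics: dict):
    """Emit an alert for every metric that crosses its configured threshold."""
    for name, value in metrics.items():
        if value > THRESHOLDS[name]:
            # In a real deployment this would page the technical team
            # (e-mail, chat webhook, incident tool) instead of printing.
            print(f"ALERT: {name} at {value:.1f}% exceeds {THRESHOLDS[name]:.1f}%")

if __name__ == "__main__":
    for _ in range(3):  # a real monitor would run indefinitely as a service
        check_and_alert(sample_metrics())
        time.sleep(5)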

Appendix A.12. Model Debt—Questions 35 to 37

35.
Question 35: Are you detecting direct feedback loops or hidden feedback loops?
Stakeholder: Organizers and Participants
Score: 4
Justification: Feedback loops can significantly distort the learning process, leading to suboptimal or even harmful behaviors in the model. Detecting them early is essential to prevent the amplification of biases and errors.
Example: In a game where an AI agent is trained to collect rewards, a direct feedback loop might occur if the agent learns to manipulate the game environment to generate rewards without performing the intended task. A hidden feedback loop could arise if the agent’s actions inadvertently change the environment in a way that affects the agent’s future state evaluations, leading to unintended strategies.
36.
Question 36: Is model quality validated before serving?
Stakeholder: Participants
Score: 5
Justification: Validation ensures that the model performs as expected on unseen data and in real-world scenarios. It is a safeguard against deploying models that might perform well in training but fail in practice.
Example: Before deploying a model trained to play chess, it should be validated against a diverse set of opponents and scenarios to ensure it does not just exploit patterns seen during training but can generalize its strategy to new games (a minimal validation-gate sketch is provided at the end of this subsection).
37.
Question 37: Does the model allow debugging by observing the step-by-step computation of training or inference on a single example?
Stakeholder: Participants
Score: 3
Justification: The ability to debug a model at a granular level is important for understanding and improving its decision-making process. However, it might be less critical than the other two questions if the model is performing well overall.
Example: If an AI agent makes an unexpected move in a game, being able to step through the computation process can help identify why that decision was made, whether it was due to a flaw in the model or an unforeseen aspect of the game environment.
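The following minimal Python sketch illustrates the validation gate implied by Question 36: a candidate model is promoted to serving only if it clears a minimum score on a held-out set of scenarios. The evaluation function, the toy model, and the promotion threshold are all hypothetical.

# Illustrative sketch of a validation gate before serving (cf. Question 36).
from typing import Callable, Sequence

MIN_ACCEPTABLE_SCORE = 0.75  # hypothetical promotion threshold

def validate_before_serving(
    model: Callable[[object], int],
    scenarios: Sequence[tuple],
) -> bool:
    """Return True only if the model matches the expected outcome often enough."""
    hits = sum(1 for state, expected in scenarios if model(state) == expected)
    score = hits / len(scenarios)
    print(f"validation score: {score:.2f}")
    return score >= MIN_ACCEPTABLE_SCORE

if __name__ == "__main__":
    # Toy "model" and held-out scenarios standing in for a real agent and test suite.
    toy_model = lambda state: state % 2
    held_out = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 0)]
    if validate_before_serving(toy_model, held_out):
        print("model promoted to production")
    else:
        print("model rejected: keep the previous stable version")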

Appendix A.13. People—Social Debt—Questions 38 to 39

38.
Question 38: Is there a system in place to ensure project continuity through team member overlap and retention of the original development team’s knowledge?
Stakeholder: Organizers
Score: 5
Justification: Ensuring project continuity through team member overlap and knowledge retention is paramount. This prevents the loss of expertise and maintains the quality and integrity of the platform.
Example: If a key developer leaves, their replacement can quickly get up to speed if there’s comprehensive documentation and a system for knowledge transfer.
39.
Question 39: Has a project support community been created?
Stakeholder: Organizers
Score: 3
Justification: Having a support community is still highly important. It fosters collaboration, user engagement, and can lead to community-driven development and troubleshooting.
Example: A forum where users can discuss strategies, report bugs, and suggest features can greatly reduce the burden on the core development team and help prioritize tasks based on user feedback.

Appendix A.14. Process Debt—Questions 40 to 42

40.
Question 40: Have you correctly described your data handling procedures?
Stakeholder: Organizers
Score: 4
Justification: Data handling procedures are critical as they directly impact the quality of training data and subsequently affect the performance of the RL agent. Understanding how data is collected, preprocessed, and fed into the RL model is essential for ensuring the agent’s effectiveness.
Example: Ensuring that the data collection process is unbiased and that the replay buffer is managed effectively to prevent overfitting or underutilization of data (a minimal replay-buffer sketch is provided at the end of this subsection).
41.
Question 41: Have you correctly described your model development processes?
Stakeholder: Organizers
Score: 4
Justification: Model development processes outline how the RL algorithm is designed and trained. This includes aspects such as the choice of algorithm, network architecture, hyperparameters, and training methodology. Understanding these processes is fundamental for reproducibility and ensuring the model’s reliability and performance.
Example: Documenting the transition from a model-free to a model-based approach, detailing the algorithms used, and the rationale behind choosing specific hyperparameters.
42.
Question 42: Have you correctly described the deployment processes of your model?
Stakeholder: Organizers
Score: 4
Justification: Deployment processes are crucial as they determine how the trained RL model is integrated into the game environment for real-world interaction. Understanding deployment procedures ensures smooth transition from development to production, minimizing potential errors and ensuring the model functions as intended in the game environment.
Example: Describing the process of integrating the trained RL model into the live game environment, including any safety checks and performance monitoring systems.
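As a concrete illustration of the replay-buffer management mentioned in Question 40, the following minimal Python sketch implements a bounded buffer with uniform sampling, so that old experience is evicted rather than accumulating indefinitely. The capacity and batch size are hypothetical.

# Illustrative sketch of bounded replay-buffer management (cf. Question 40).
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        # A deque with maxlen automatically discards the oldest transitions.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        """Uniformly sample a batch; raise an error if too few transitions exist."""
        if len(self.buffer) < batch_size:
            raise ValueError("not enough transitions collected yet")
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=100)
    for step in range(250):                    # more steps than capacity
        buffer.add(step, step % 4, 1.0, step + 1, False)
    print(len(buffer), "transitions kept")     # 100: the oldest 150 were evicted
    print(len(buffer.sample(32)), "sampled for a training batch")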

Appendix A.15. Requirements Debt—Questions 43 to 45

43.
Question 43: Have you thoroughly defined the objectives, scope, stakeholder needs, expectations, decision goals, and insights of the AI system to ensure alignment with business objectives and user expectations?
Stakeholder: Organizers
Score: 5
Justification: Clearly defining the objectives, scope, stakeholder needs, and expectations is fundamental to the success of any AI system. If these aspects are not well-defined, it can lead to requirement debt, where the system does not meet the necessary criteria for success due to ambiguous or incomplete requirements.
Example: If the competition platform’s goal is not aligned with the stakeholders’ expectations, it may result in a system that is technically sound but fails to engage users or meet business objectives.
44.
Question 44: Have you thoroughly addressed the technical aspects of the AI system, including the selection of appropriate AI techniques, algorithms, and models to achieve desired functionality and performance, as well as specifying quality attributes, trade-offs, metrics, and indicators to measure and evaluate system performance effectively?
Stakeholder: Organizers
Score: 5
Justification: Neglecting technical aspects such as AI techniques and models can result in a system that doesn’t meet performance expectations, leading to requirement debt. Technical requirements are essential for system functionality.
Example: Choosing an overly complex model for a simple game could result in longer training times and difficulty in interpreting the model’s decisions, making it harder to debug and improve.
45.
Question 45: Have you monitored and retrained the AI system with new data as needed?
Stakeholder: Organizers and Participants
Score: 4
Justification: Continuously monitoring and retraining the AI system with new data is important for maintaining its relevance and performance, which can prevent the accumulation of technical debt due to model staleness or performance degradation.
Example: In a game competition platform, if new strategies or game updates are introduced, the AI must be retrained to understand these changes. Failing to do so can result in a system that performs poorly and requires significant refactoring later on.

Appendix A.16. Self-Admitted Technical Debt (SATD)—Questions 46 to 47

46.
Question 46: Do you systematically record and track self-admitted technical debt (SATD) comments in the code of Artificial Intelligence (AI/ML/RL) models, using backlog management, issue tracking, or other technical debt management tools?
Stakeholder: Organizers
Score:
Justification: In AI competition platforms, unmanaged SATD can accumulate rapidly, especially in model training scripts, pipelines, or evaluation routines. Tracking such comments enables platform maintainers to prioritize refactoring tasks, reduce entropy, and support long-term platform evolution.
Example: In a reinforcement learning competition hosted by a university using OpenAI Gym, a developer might leave a comment like # TODO: replace hardcoded reward shaping in the agent code. Without logging this SATD item into a task board, the issue may persist unnoticed, leading to hidden model biases or unfair comparisons across participants.
47.
Question 47: Do you regularly plan improvements for areas of the code marked as SATD, such as gradual model refactoring, pipeline component redesign, or documentation of experimental setups?
Stakeholder: Organizers
Score: 5
Justification: Scheduling targeted improvements for self-admitted debt is essential to maintain clarity, reproducibility, and modularity in AI/ML/RL competitions. It also encourages sustainable practices and transparency for participants.
Example: On a Kaggle-like competition platform, the organizers may identify SATD in the evaluation pipeline—e.g., # FIXME: assumes deterministic policy output. By incorporating periodic reviews of such comments into the sprint planning, the organizers ensure the platform evolves with reduced hidden debt and clearer guidelines for future participants.
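The following minimal Python sketch illustrates the SATD tracking discussed in Questions 46 and 47: it scans a source tree for TODO/FIXME/HACK comments so that each finding can be copied into the issue tracker instead of lingering silently in the code. The markers and the scanned directory are hypothetical choices.

# Illustrative sketch: scan a source tree for self-admitted technical debt markers
# (cf. Questions 46 and 47).
import re
from pathlib import Path

SATD_PATTERN = re.compile(r"#\s*(TODO|FIXME|HACK)\b(.*)", re.IGNORECASE)

def scan_for_satd(root: str) -> list:
    """Return (file, line number, comment text) triples for every SATD marker found."""
    findings = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            match = SATD_PATTERN.search(line)
            if match:
                findings.append((str(path), lineno, match.group(0).strip()))
    return findings

if __name__ == "__main__":
    for file, lineno, comment in scan_for_satd("."):
        # Each finding would become a ticket or backlog item in practice.
        print(f"{file}:{lineno}: {comment}")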

Appendix A.17. Test Debt—Questions 48 to 54

48.
Question 48: Have the hyperparameters been properly tuned and validated to ensure optimal performance within the game environment?
Stakeholder: Participants
Score: 5
Justification: Hyperparameter tuning directly affects the RL agent’s performance in the game environment. Well-chosen hyperparameters are vital for the agent to learn efficiently, adapt to the game dynamics, and make effective in-game decisions, which in turn enhances the gaming experience.
Example: Tuning the learning rate can significantly affect the convergence speed and stability of the training process.
49.
Question 49: Has reproducibility of agent training and environment dynamics been tested to ensure consistency?
Stakeholder: Organizers and Participants
Score: 4
Justification: Reproducibility ensures consistency and reliability in training the agent, which is crucial for fair gameplay.
Example: The use of fixed random seeds to ensure that results can be replicated across different runs (a reproducibility sketch is provided at the end of this subsection).
50.
Question 50: Is there a fully automated test regularly running to validate the entire pipeline, ensuring data and code move through each stage successfully and resulting in a well-performing model?
Stakeholder: Organizers
Score: 5
Justification: Having a fully automated test ensures the integrity of the entire pipeline, which is essential for maintaining the quality of the game.
Example: Automated regression tests can catch issues early before they affect the model’s performance.
51.
Question 51: Do the data invariants hold for the inputs in the game environment?
Stakeholder: Organizers and Participants
Score: 3
Justification: Data invariants are important for maintaining the integrity of the game environment and ensuring fair gameplay.
Example: Checking that the positions of players in a sports game do not suddenly jump to unrealistic values.
52.
Question 52: Are there mechanisms in place to ensure that training and serving are not skewed in the game?
Stakeholder: Organizers and Participants
Score: 4
Justification: Skewed training and serving can lead to unfair advantages or disadvantages for players.
Example: Using the same feature engineering pipeline for both training and serving can help avoid discrepancies.
53.
Question 53: Are the models numerically stable for effective gameplay?
Stakeholder: Organizers and Participants
Score: 5
Justification: Numerically stable models ensure reliable and consistent behavior during gameplay.
Example: The use of gradient clipping in training to prevent exploding gradients.
54.
Question 54: Have you verified that the model’s prediction quality has not regressed over time?
Stakeholder: Organizers and Participants
Score: 5
Justification: Prediction quality directly impacts the agent’s decisions and ultimately the gameplay experience. Therefore, maintaining its quality is crucial.
Example: Tracking the accuracy of the model’s predictions against a validation set over multiple seasons of a game.
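The following minimal Python sketch combines two of the test-debt mitigations above: fixed random seeds for reproducible training runs (Question 49) and gradient clipping for numerical stability (Question 53). It assumes PyTorch, and the tiny model and synthetic data are placeholders.

# Illustrative sketch: reproducible seeding plus gradient clipping
# (cf. Questions 49 and 53).
import random
import numpy as np
import torch

def set_global_seeds(seed: int = 42):
    """Seed every source of randomness used in the experiment."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

if __name__ == "__main__":
    set_global_seeds(42)

    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    inputs = torch.randn(32, 4)
    targets = torch.randint(0, 2, (32,))

    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Clip gradients so a single noisy batch cannot destabilize training.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

    print("final loss:", float(loss))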

Appendix A.18. Versioning Debt—Questions 55 to 57

55.
Question 55: Have you put a proper version control system in place for models, training data, and test data?
Stakeholder: Organizers
Score: 5
Justification: Version control is crucial in any software development project, including RL game applications, to keep track of changes, revert to previous states if necessary, and collaborate effectively. Without version control, it can be challenging to manage changes, leading to potential errors and difficulties in reproducing results.
Example: If a new model version causes a regression in performance, a version control system allows developers to quickly revert to a previous, stable version (a minimal versioning-manifest sketch is provided at the end of this subsection).
56.
Question 56: Have you used the appropriate policy for marking the versions of your software components?
Stakeholder: Organizers
Score: 3
Justification: Proper versioning allows for clear communication about changes and updates to software components. It helps users understand the significance of updates (major, minor, or patch) and ensures compatibility across different versions. While essential, it may not directly impact the RL game’s functionality as much as version control itself.
Example: If a major update is released without proper version marking, it could break compatibility with existing systems that rely on the platform, leading to significant technical debt.
57.
Question 57: Do you maintain a consistent data structure for game state representation throughout iterations, ensuring compatibility between different versions of the RL game?
Stakeholder: Organizers and Participants
Score: 4
Justification: Ensuring consistency in the data structure for representing the game state is crucial in RL game development. Changes in the data structure could affect the performance of RL algorithms, training stability, and overall gameplay experience. Keeping track of these changes and ensuring compatibility between different versions of the game state representation can help maintain the integrity and effectiveness of the RL algorithms employed in the game.
Example: When changes are introduced to the game state representation without preserving consistency, previously trained models may no longer function as intended and may require retraining or adaptation to remain effective.
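Relating to Questions 55 to 57, the following minimal Python sketch records content hashes of model and dataset artifacts in a small manifest, so that any run can be traced back to the exact versions it used. The file paths and the manifest name are hypothetical; a real platform would typically combine Git with a dedicated data/model versioning tool rather than a hand-rolled manifest.

# Illustrative sketch: a content-hash manifest tying a run to exact artifact versions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(artifact_paths, manifest_path="artifact_manifest.json"):
    """Write a timestamped manifest mapping each artifact to its content hash."""
    manifest = {
        "created": datetime.now(timezone.utc).isoformat(),
        "artifacts": {str(p): file_sha256(Path(p)) for p in artifact_paths},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

if __name__ == "__main__":
    # Hypothetical artifacts produced by a competition run.
    print(write_manifest(["model.pt", "train_data.csv", "test_data.csv"]))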

Appendix A.19. Accessibility Debt—Questions 58 to 60

58.
Question 58: Have you conducted usability testing to identify and address potential barriers in the platform setup process?
Stakeholder: Organizers
Score: 5
Justification: Conducting usability testing is critical to uncover and address accessibility issues that can impede participants’ engagement. By identifying these barriers early, organizers can make necessary adjustments to enhance the platform’s usability and ensure a smooth user experience.
Example: If usability testing reveals that participants struggle with the initial setup due to unclear instructions, organizers can simplify the process and provide more intuitive guidance, thereby reducing Accessibility Debt and improving participant retention.
59.
Question 59: Have you implemented adaptive user interfaces that tailor the setup experience based on participants’ skill levels and preferences?
Stakeholder: Organizers
Score: 4
Justification: Implementing adaptive user interfaces can greatly improve usability by customizing the setup experience according to each participant’s skill level and preferences. A personalized approach minimizes user frustration and enhances engagement by delivering guidance and support that align with individual needs and levels of expertise.
Example: A novice user might benefit from a simplified setup path with additional explanations and contextual help, whereas a more experienced participant could be offered a faster, less guided version. This differentiation fosters a more positive onboarding experience and helps reduce drop-off due to unnecessary complexity.
60.
Question 60: Have you implemented feedback mechanisms that allow participants to report accessibility issues and suggest improvements?
Stakeholder: Organizers
Score: 3
Justification: The implementation of feedback mechanisms by the organizers enables continuous improvement of the platform, based on the direct input of users. By proactively responding to participant requests and suggestions, they are able to resolve issues more effectively and enhance the overall usability of the platform.
Example: If participants can easily report setup difficulties or suggest enhancements, organizers can quickly implement changes, thereby minimizing Accessibility Debt and fostering a more user-friendly environment.

Appendix B

Table A1. Inter-rater agreement matrix: technical debt labeling for 30 papers.
No. | Title | Rater 1 | Rater 2
1 | Machine Learning Algorithms, Real-World Applications and Research Directions | Algorithm | Algorithm
2 | Adapting Software Architectures to Machine Learning Challenges | Architectural | Architectural
3 | Architecture Decisions in AI-based Systems Development: An Empirical Study | Architectural | Architectural
4 | Machine Learning Architecture and Design Patterns | Architectural | Architectural
5 | Searching for Build Debt: Experiences Managing Technical Debt at Google | Build | Build
6 | Code and Architectural Debt in Artificial Intelligence Systems | Code | Code
7 | Code Smells for Machine Learning Applications | Code | Code
8 | The prevalence of code smells in machine learning projects | Code | Code
9 | A software engineering perspective on engineering machine learning systems: State of the art and challenges | Configuration | Process
10 | Challenges in Deploying Machine Learning: A Survey of Case Studies | Configuration | Infrastructure
11 | Data collection and quality challenges in deep learning: A data-centric AI perspective | Data | Data
12 | Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems | Data | Data
13 | Debugging machine learning pipelines | Defect | Defect
14 | Common problems with Creating Machine Learning Pipelines from Existing Code | Design | Code
15 | Understanding Implementation Challenges in Machine Learning Documentation | Documentation | Documentation
16 | Patterns and Anti-Patterns, Principles and Pitfalls: Accountability and Transparency | Ethics | Ethics
17 | Infrastructure for Usable Machine Learning: The Stanford DAWN Project | Infrastructure | Infrastructure
18 | A Meta-Summary of Challenges in Building Products with ML Components—Collecting Experiences from 4758 Practitioners | Model | Model
19 | Machine Learning Model Development from a Software Engineering Perspective: A Systematic Literature Review | Model | Model
20 | Quality issues in Machine Learning Software Systems | Model | Model
21 | Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process | People | People
22 | Studying Software Engineering Patterns for Designing ML Systems | Process | Architectural
23 | MLife: A Lite Framework for Machine Learning Lifecycle Initialization | Requirements | Requirements
24 | Requirements Engineering for Artificial Intelligence Systems: A Systematic Mapping Study | Requirements | Requirements
25 | 23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software | SATD | SATD
26 | Self-Admitted Technical Debt in R: Detection and Causes | SATD | SATD
27 | Machine Learning Testing: Survey, Landscapes and Horizons | Test | Test
28 | On Testing Machine Learning Programs | Test | Test
29 | On the Challenges of Migrating to Machine Learning Life Cycle Management Platforms | Versioning | Versioning
30 | Versioning for End-to-End Machine Learning Pipelines | Versioning | Versioning
Table A2. Cause–effect relationships between selected technical debt types.
Source Debt Type | Affected Debt Type(s) | Relationship Type | Explanation
Documentation Debt | Accessibility Debt | Causal | Lack of clear or multilingual documentation hinders access for non-native or novice users.
Infrastructure Debt | Model Debt, Data Debt | Enabling Constraint | Poor infrastructure restricts the deployment, scalability, and integrity of models and data.
Process Debt | Versioning Debt | Triggering | Opaque or informal processes often lead to versioning inconsistencies and traceability loss.
Test Debt | Defect Debt | Amplifying | Insufficient testing increases the likelihood of undetected bugs or model failures.
Configuration Debt | Reproducibility Issues | Blocking | Non-reproducible configurations block validation and result reuse by participants.
People Debt | Ethics Debt, Documentation Debt | Reinforcing | Knowledge silos and turnover weaken compliance practices and reduce documentation quality.
Design Debt | Maintainability, Defect Debt | Structural | Poor design decisions propagate technical issues and raise maintenance overhead.
Code Debt | Test Debt | Indirect | Unstructured or entangled code often reduces test coverage or inhibits testability.

References

  1. Martínez-Fernández, S.; Bogner, J.; Franch, X.; Oriol, M.; Siebert, J.; Trendowicz, A.; Vollmer, A.M.; Wagner, S. Software Engineering for AI-Based Systems: A Survey. ACM Trans. Softw. Eng. Methodol. 2022, 31, 1–59. [Google Scholar] [CrossRef]
  2. Felderer, M.; Ramler, R. Quality Assurance for AI-based Systems: Overview and Challenges. arXiv 2021, arXiv:2102.05351. [Google Scholar]
  3. Kumeno, F. Software engineering challenges for machine learning applications: A literature review. Intell. Decis. Technol. 2020, 13, 463–476. [Google Scholar] [CrossRef]
  4. Tang, Y.; Khatchadourian, R.; Bagherzadeh, M.; Singh, R.; Stewart, A.; Raja, A. An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 25–28 May 2021; pp. 238–250. [Google Scholar]
  5. Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden technical debt in machine learning systems. Adv. Neural Inf. Process. Syst. 2015, 28, 2503–2511. [Google Scholar]
  6. Luitse, D.M.R.; Blanke, T.; Poell, T. Ai Competitions As Infrastructures: Examining Power Relations on Kaggle and Grand Challenge in Ai-Driven Medical Imaging. AoIR Sel. Pap. Internet Res. 2022. [Google Scholar]
  7. Breck, E.; Cai, S.; Nielsen, E.; Salib, M.; Sculley, D. The ML test score: A rubric for ML production readiness and technical debt reduction. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 1123–1132. [Google Scholar] [CrossRef]
  8. Konen, W. General board game playing for education and research in generic AI game learning. In Proceedings of the 2019 IEEE Conference on Games (CoG), London, UK, 20–23 August 2019; pp. 1–8. [Google Scholar] [CrossRef]
  9. Hong, C.; Jeong, I.; Vecchietti, L.F.; Har, D.; Kim, J.H. AI World Cup: Robot-Soccer-Based Competitions. IEEE Trans. Games 2021, 13, 330–341. [Google Scholar] [CrossRef]
  10. Attanasio, G.; Giobergia, F.; Pasini, A.; Ventura, F.; Baralis, E.; Cagliero, L.; Garza, P.; Apiletti, D.; Cerquitelli, T.; Chiusano, S. DSLE: A Smart Platform for Designing Data Science Competitions. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; pp. 133–142. [Google Scholar] [CrossRef]
  11. Zobernig, V.; Saldanha, R.A.; He, J.; van der Sar, E.; van Doorn, J.; Hua, J.-C.; Mason, L.R.; Czechowski, A.; Indjic, D.; Kosmala, T.; et al. RangL: A Reinforcement Learning Competition Platform. SSRN Electron. J. 2022, 1–10. [Google Scholar] [CrossRef]
  12. Stephenson, M.; Piette, E.; Soemers, D.J.N.J.; Browne, C. Ludii as a competition platform. In Proceedings of the 2019 IEEE Conference on Games (CoG), London, UK, 20–23 August 2019; pp. 1–8. [Google Scholar] [CrossRef]
  13. Kempka, M.; Wydmuch, M.; Runc, G.; Toczek, J.; Jaskowski, W. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Proceedings of the 2016 IEEE Conference on Computational Intelligence and Games (CIG), Santorini, Greece, 20–23 September 2016; pp. 1–8. [Google Scholar] [CrossRef]
  14. Kalles, D. Artificial intelligence meets software engineering in computing education. In Proceedings of the 9th Hellenic Conference on Artificial Intelligence, Thessaloniki, Greece, 18–20 May 2016; pp. 1–5. [Google Scholar] [CrossRef]
  15. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
  16. Togelius, J. How to Run a Successful Game-Based AI Competition. IEEE Trans. Comput. Intell. AI Games 2016, 8, 95–100. [Google Scholar] [CrossRef]
  17. Kästner, C.; Kang, E. Teaching software engineering for AI-enabled systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering Education and Training, Seoul, Republic of Korea, 5–11 October 2020; pp. 45–48. [Google Scholar] [CrossRef]
  18. Chesani, F.; Galassi, A.; Mello, P.; Trisolini, G. A game-based competition as instrument for teaching artificial intelligence. In Proceedings of the AI* IA 2017 Advances in Artificial Intelligence: XVIth International Conference of the Italian Association for Artificial Intelligence, Bari, Italy, 14–17 November 2017; pp. 72–84. [Google Scholar] [CrossRef]
  19. Recupito, G.; Pecorelli, F.; Catolino, G.; Lenarduzzi, V.; Taibi, D.; Di Nucci, D.; Palomba, F. Technical Debt in AI-Enabled Systems: On the Prevalence, Severity, Impact, and Management Strategies for Code and Architecture. J. Syst. Softw. 2024, 216, 112151. [Google Scholar] [CrossRef]
  20. Serban, A.; Visser, J. Adapting Software Architectures to Machine Learning Challenges. In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 15–18 March 2022; pp. 152–163. [Google Scholar] [CrossRef]
  21. Amershi, S.; Begel, A.; Bird, C.; DeLine, R.; Gall, H.; Kamar, E.; Nagappan, N.; Nushi, B.; Zimmermann, T. Software Engineering for Machine Learning: A Case Study. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 25–31 May 2019; pp. 291–300. [Google Scholar] [CrossRef]
  22. Sklavenitis, D.; Kalles, D. Measuring Technical Debt in AI-Based Competition Platforms. In Proceedings of the 13th Hellenic Conference on Artificial Intelligence (SETN 2024), Athens, Greece, 11–13 September 2024. [Google Scholar] [CrossRef]
  23. Isbell, C.; Littman, M.L.; Norvig, P. Software Engineering of Machine Learning Systems. Commun. ACM 2023, 66, 35–37. [Google Scholar] [CrossRef]
  24. Kolek, L.; Mochocki, M.; Gemrot, J. Review of Educational Benefits of Game Jams: Participant and Industry Perspective. Homo Ludens 2023, 1, 115–140. [Google Scholar] [CrossRef]
  25. Meriläinen, M.; Aurava, R.; Kultima, A.; Stenros, J. Game jams for learning and teaching: A review. Int. J. Game Based Learn. 2020, 10, 54–71. [Google Scholar] [CrossRef]
  26. Mittelstadt, B. Principles alone cannot guarantee ethical AI. Nat. Mach. Intell. 2019, 1, 501–507. [Google Scholar] [CrossRef]
  27. Foidl, H.; Felderer, M.; Ramler, R. Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-Based Systems; Association for Computing Machinery: New York, NY, USA, 2022; Volume 1. [Google Scholar] [CrossRef]
  28. Polyzotis, N.; Zinkevich, M.; Roy, S.; Breck, E.; Whang, S. Data Validation for Machine Learning. SysML 2019, 1, 334–347. [Google Scholar]
  29. Liu, J.; Huang, Q.; Xia, X.; Shihab, E.; Lo, D.; Li, S. Is using deep learning frameworks free? Characterizing technical debt in deep learning frameworks. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Society, Seoul, Republic of Korea, 5–11 October 2020; pp. 1–10. [Google Scholar] [CrossRef]
  30. Bogner, J.; Verdecchia, R.; Gerostathopoulos, I. Characterizing Technical Debt and Antipatterns in AI-Based Systems: A Systematic Mapping Study. In Proceedings of the 2021 IEEE/ACM International Conference on Technical Debt (TechDebt), Madrid, Spain, 19–21 May 2021; pp. 64–73. [Google Scholar] [CrossRef]
  31. Washizaki, H.; Khomh, F.; Gueheneuc, Y.G.; Takeuchi, H.; Natori, N.; Doi, T.; Okuda, S. Software-Engineering Design Patterns for Machine Learning Applications. Comput. Long. Beach. Calif. 2022, 55, 30–39. [Google Scholar] [CrossRef]
  32. Li, Z.; Avgeriou, P.; Liang, P. A systematic mapping study on technical debt and its management. J. Syst. Softw. 2015, 101, 193–220. [Google Scholar] [CrossRef]
  33. Rios, N.; Neto, M.G.d.M.; Spínola, R.O. A tertiary study on technical debt: Types, management strategies, research trends, and base information for practitioners. Inf. Softw. Technol. 2018, 102, 117–145. [Google Scholar] [CrossRef]
  34. Ahmad, K.; Abdelrazek, M.; Arora, C.; Bano, M.; Grundy, J. Requirements engineering for artificial intelligence systems: A systematic mapping study. Inf. Softw. Technol. 2023, 158, 107176. [Google Scholar] [CrossRef]
  35. Vogelsang, A.; Borg, M. Requirements engineering for machine learning: Perspectives from data scientists. In Proceedings of the 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), Jeju, Republic of Korea, 23–27 September 2019; pp. 245–251. [Google Scholar] [CrossRef]
  36. Warnett, S.J.; Zdun, U. Architectural Design Decisions for Machine Learning Deployment. In Proceedings of the 2022 IEEE 19th International Conference on Software Architecture (ICSA), Honolulu, HI, USA, 12–15 March 2022; pp. 90–100. [Google Scholar] [CrossRef]
  37. Heiland, L.; Hauser, M.; Bogner, J. Design Patterns for AI-based Systems: A Multivocal Literature Review and Pattern Repository. arXiv 2023, arXiv:2303.13173. [Google Scholar]
  38. Washizaki, H.; Uchida, H.; Khomh, F.; Guéhéneuc, Y.G. Studying Software Engineering Patterns for Designing Machine Learning Systems. In Proceedings of the 2019 10th International Workshop on Empirical Software Engineering in Practice (IWESEP), Tokyo, Japan, 13–14 December 2019; pp. 49–54. [Google Scholar] [CrossRef]
  39. Menzies, T. The Five Laws of SE for AI. IEEE Softw. 2020, 37, 81–85. [Google Scholar] [CrossRef]
  40. Polyzotis, N.; Roy, S.; Whang, S.E.; Zinkevich, M. Data lifecycle challenges in production machine learning: A survey. SIGMOD Rec. 2018, 47, 17–28. [Google Scholar] [CrossRef]
  41. Whang, S.E.; Roh, Y.; Song, H.; Lee, J.G. Data collection and quality challenges in deep learning: A data-centric AI perspective. VLDB J. 2023, 32, 791–813. [Google Scholar] [CrossRef]
  42. Hutchinson, B.; Smart, A.; Hanna, A.; Denton, E.; Greer, C.; Kjartansson, O.; Barnes, P.; Mitchell, M. Towards accountability for machine learning datasets: Practices from software engineering and infrastructure. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 560–575. [Google Scholar] [CrossRef]
  43. Vadavalasa, R.M. Data Validation Process in Machine Learning Pipeline. Int. J. Sci. Res. Dev. 2021, 8, 449–452. [Google Scholar]
  44. Foidl, H.; Felderer, M. Risk-based data validation in machine learning-based software systems. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, Tallinn, Estonia, 3–10 March 2021; pp. 13–18. [Google Scholar] [CrossRef]
  45. Chen, Z.; Wu, M.; Chan, A.; Li, X.; Ong, Y.S. Survey on AI Sustainability: Emerging Trends on Learning Algorithms and Research Challenges [Review Article]. IEEE Comput. Intell. Mag. 2023, 18, 60–77. [Google Scholar] [CrossRef]
  46. Ozkaya, I. What Is Really Different in Engineering AI-Enabled Systems? IEEE Softw. 2020, 37, 3–6. [Google Scholar] [CrossRef]
  47. Lwakatare, L.E.; Raj, A.; Bosch, J.; Olsson, H.H.; Crnkovic, I. A taxonomy of software engineering challenges for machine learning systems: An empirical investigation. Lect. Notes Bus. Inf. Process. 2019, 355, 227–243. [Google Scholar] [CrossRef]
  48. Schelter, S.; Biessmann, F.; Januschowski, T.; Salinas, D.; Seufert, S.; Szarvas, G. On Challenges in Machine Learning Model Management. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 2018, 5–13. Available online: http://sites.computer.org/debull/A18dec/p5.pdf (accessed on 18 June 2025).
  49. Studer, S.; Bui, T.B.; Drescher, C.; Hanuschkin, A.; Winkler, L.; Peters, S.; Müller, K.R. Towards CRISP-ML (Q): A Machine Learning Process Model with Quality Assurance Methodology. Mach. Learn. Knowl. Extr. 2021, 3, 392–413. [Google Scholar] [CrossRef]
  50. Bailis, P.; Olukotun, K.; Re, C.; Zaharia, M. Infrastructure for Usable Machine Learning: The Stanford DAWN Project. arXiv 2017, arXiv:1705.07538. [Google Scholar]
  51. Zhang, J.M.; Harman, M.; Ma, L.; Liu, Y. Machine Learning Testing: Survey, Landscapes and Horizons. IEEE Trans. Softw. Eng. 2022, 48, 1–36. [Google Scholar] [CrossRef]
  52. Côté, P.O.; Nikanjam, A.; Bouchoucha, R.; Basta, I.; Abidi, M.; Khomh, F. Quality Issues in Machine Learning Software Systems. Empir. Softw. Eng. 2024, 29, 149. [Google Scholar] [CrossRef]
  53. Murphy, C.; Kaiser, G.; Arias, M. A Framework for Quality Assurance of Machine Learning Applications. pp. 1–10. 2020. Available online: https://www.researchgate.net/publication/228687118_A_Framework_for_Quality_Assurance_of_Machine_Learning_Applications (accessed on 18 June 2025).
  54. Barr, E.T.; Harman, M.; McMinn, P.; Shahbaz, M.; Yoo, S. The oracle problem in software testing: A survey. IEEE Trans. Softw. Eng. 2015, 41, 507–525. [Google Scholar] [CrossRef]
  55. Braiek, H.B.; Khomh, F. On testing machine learning programs. J. Syst. Softw. 2020, 164, 110542. [Google Scholar] [CrossRef]
  56. Golendukhina, V.; Lenarduzzi, V.; Felderer, M. What is Software Quality for AI Engineers? Towards a Thinning of the Fog. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, Pittsburgh, Pennsylvania, 16–24 May 2022; pp. 1–9. [Google Scholar] [CrossRef]
  57. Albuquerque, D.; Guimaraes, E.; Tonin, G.; Perkusich, M.; Almeida, H.; Perkusich, A. Comprehending the Use of Intelligent Techniques to Support Technical Debt Management. In Proceedings of the International Conference on Technical Debt, Pittsburgh, Pennsylvania, 16–18 May 2022; pp. 21–30. [Google Scholar] [CrossRef]
  58. Njomou, A.T.; Fokaefs, M.; Silatchom Kamga, D.F.; Adams, B. On the Challenges of Migrating to Machine Learning Life Cycle Management Platforms. In Proceedings of the 32nd Annual International Conference on Computer Science and Software Engineering, Toronto, ON, Canada, 15–17 November 2022; pp. 42–51. [Google Scholar]
  59. Van Der Weide, T.; Papadopoulos, D.; Smirnov, O.; Zielinski, M.; Van Kasteren, T. Versioning for end-to-end machine learning pipelines. In Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning, Chicago, IL, USA, 14–19 May 2017; pp. 1–9. [Google Scholar] [CrossRef]
  60. Arpteg, A.; Brinne, B.; Crnkovic-Friis, L.; Bosch, J. Software engineering challenges of deep learning. In Proceedings of the 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Prague, Czech Republic, 29–31 August 2018; pp. 50–59. [Google Scholar] [CrossRef]
  61. Kery, M.B.; Radensky, M.; Arya, M.; John, B.E.; Myers, B.A. The story in the notebook: Exploratory data science using a literate programming tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal QC, Canada, 21–26 April 2018; pp. 1–11. [Google Scholar] [CrossRef]
  62. Giray, G. A software engineering perspective on engineering machine learning systems: State of the art and challenges. J. Syst. Softw. 2021, 180, 111031. [Google Scholar] [CrossRef]
  63. OBrien, D.; Biswas, S.; Imtiaz, S.; Abdalkareem, R.; Shihab, E.; Rajan, H. 23 Shades of Self-Admitted Technical Debt: An Empirical Study on Machine Learning Software; Association for Computing Machinery: New York, NY, USA, 2022; Volume 1. [Google Scholar] [CrossRef]
  64. Paleyes, A.; Urma, R.G.; Lawrence, N.D. Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Comput. Surv. 2022, 55, 1–29. [Google Scholar] [CrossRef]
  65. Van Oort, B.; Cruz, L.; Aniche, M.; Van Deursen, A. The prevalence of code smells in machine learning projects. In Proceedings of the 2021 IEEE/ACM 1st Workshop on AI Engineering–Software Engineering for AI (WAIN), Madrid, Spain, 30–31 May 2021; pp. 35–42. [Google Scholar] [CrossRef]
  66. Gesi, J.; Liu, S.; Li, J.; Ahmed, I.; Nagappan, N.; Lo, D.; de Almeida, E.S.; Kochhar, P.S.; Bao, L. Code Smells in Machine Learning Systems. arXiv 2022, arXiv:2203.00803. [Google Scholar]
  67. Wang, J.; Li, L.; Zeller, A. Better Code, Better Sharing: On the Need of Analyzing Jupyter Notebooks. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, Seoul, Republic of Korea, 27 June 2020–19 July 2020; pp. 53–56. [Google Scholar] [CrossRef]
  68. Pimentel, J.F.; Murta, L.; Braganholo, V.; Freire, J. A large-scale study about quality and reproducibility of jupyter notebooks. In Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 25–31 May 2019; pp. 507–517. [Google Scholar] [CrossRef]
  69. Haakman, M.P.A. Studying the Machine Learning Lifecycle and Improving Code Quality of Machine Learning Applications. 2020. Available online: https://repository.tudelft.nl/islandora/object/uuid%3A38ff4e9a-222a-4987-998c-ac9d87880907 (accessed on 18 June 2025).
  70. De Souza Nascimento, E.; Ahmed, I.; Oliveira, E.; Palheta, M.P.; Steinmacher, I.; Conte, T. Understanding Development Process of Machine Learning Systems: Challenges and Solutions. In Proceedings of the 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Porto de Galinhas, Brazil, 19–20 September 2019; pp. 1–6. [Google Scholar] [CrossRef]
  71. Nahar, N.; Zhou, S.; Lewis, G.; Kastner, C. Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 21–29 May 2022; pp. 413–425. [Google Scholar] [CrossRef]
  72. Mo, R.; Zhang, Y.; Wang, Y.; Zhang, S.; Xiong, P.; Li, Z.; Zhao, Y. Exploring the Impact of Code Clones on Deep Learning Software. ACM Trans. Softw. Eng. Methodol. 2023, 32, 1–34. [Google Scholar] [CrossRef]
  73. Rios, N.; Mendes, L.; Cerdeiral, C.; Magalhães, A.P.F.; Perez, B.; Correal, D.; Astudillo, H.; Seaman, C.; Izurieta, C.; Santos, G.; et al. Hearing the Voice of Software Practitioners on Causes, Effects, and Practices to Deal with Documentation Debt. In Requirements Engineering: Foundation for Software Quality: 26th International Working Conference; Springer: Cham, Switzerland, 2020; pp. 55–70. [Google Scholar] [CrossRef]
  74. Chang, J.; Custis, C. Understanding Implementation Challenges in Machine Learning Documentation. In Proceedings of the 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Gran Canaria, Spain, 31 August–2 September 2022; pp. 1–8. [Google Scholar] [CrossRef]
  75. Shivashankar, K.; Martini, A. Maintainability Challenges in ML: A Systematic Literature Review. In Proceedings of the 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Gran Canaria, Spain, 31 August–2 September 2022; pp. 60–67. [Google Scholar] [CrossRef]
  76. Tamburri, D.A.; Kruchten, P.; Lago, P.; Van Vliet, H. What is social debt in software engineering? In Proceedings of the 2013 6th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), San Francisco, CA, USA, 25 May 2013; pp. 93–96. [Google Scholar] [CrossRef]
  77. Mailach, A.; Siegmund, N. Socio-Technical Anti-Patterns in Building ML-Enabled Software: Insights from Leaders on the Forefront. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 690–702. [Google Scholar] [CrossRef]
  78. Ishikawa, F.; Yoshioka, N. How Do Engineers Perceive Difficulties in Engineering of Machine-Learning Systems?—Questionnaire Survey. In Proceedings of the 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), Montreal, QC, Canada, 28 May 2019; pp. 2–9. [Google Scholar] [CrossRef]
  79. Vakkuri, V.; Kemell, K.K.; Jantunen, M.; Abrahamsson, P. “This is Just a Prototype”: How Ethics Are Ignored in Software Startup-Like Environments. In Proceedings of the 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), Montreal, QC, Canada, 28 May 2019; pp. 195–210. [Google Scholar] [CrossRef]
  80. Hagendorff, T. The Ethics of AI Ethics: An Evaluation of Guidelines. Minds Mach. 2020, 30, 99–120. [Google Scholar] [CrossRef]
  81. Matthews, J. Patterns and antipatterns, principles, and pitfalls: Accountability and transparency in artificial intelligence. AI Mag. 2020, 41, 81–89. [Google Scholar] [CrossRef]
  82. Petrozzino, C. Who pays for ethical debt in AI? AI Ethics 2021, 1, 205–208. [Google Scholar] [CrossRef]
  83. Bhatia, A.; Khomh, F.; Adams, B.; Hassan, A.E. An Empirical Study of Self-Admitted Technical Debt in Machine Learning Software. arXiv 2023, arXiv:2311.12019. [Google Scholar]
  84. Bavota, G.; Russo, B. A large-scale empirical study on self-admitted technical debt. In Proceedings of the 13th International Conference on Mining Software Repositories, Austin, TX, USA, 14–15 May 2016; pp. 315–326. [Google Scholar] [CrossRef]
  85. Nascimento, E.; Nguyen-Duc, A.; Sundbø, I.; Conte, T. Software engineering for artificial intelligence and machine learning software: A systematic literature review. arXiv 2020, arXiv:2011.03751. [Google Scholar]
  86. Wan, Z.; Xia, X.; Lo, D.; Murphy, G.C. How does machine learning change software development practices? IEEE Trans. Softw. Eng. 2021, 47, 1857–1871. [Google Scholar] [CrossRef]
  87. Serban, A.; Van Der Blom, K.; Hoos, H.; Visser, J. Adoption and effects of software engineering best practices in machine learning. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Bari, Italy, 5–9 October 2020; p. 12. [Google Scholar] [CrossRef]
  88. Abdellatif, A.; Ghiasi, G.; Costa, D.E.; Shihab, E.; Tajmel, T. SE4AI: A Training Program Considering Technical, Social, and Professional Aspects of AI-Based Software Systems. IEEE Softw. 2024, 41, 44–51. [Google Scholar] [CrossRef]
  89. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  90. Vouros, G.A. Explainable Deep Reinforcement Learning: State of the Art and Challenges. ACM Comput. Surv. 2022, 55, 39. [Google Scholar] [CrossRef]
  91. Speith, T. A Review of Taxonomies of Explainable Artificial Intelligence (XAI) Methods. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 2239–2250. [Google Scholar] [CrossRef]
  92. Dwivedi, R.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core Ideas, Techniques, and Solutions. ACM Comput. Surv. 2023, 55, 1–33. [Google Scholar] [CrossRef]
  93. Du, M.; Liu, N.; Hu, X. Techniques for interpretable machine learning. Commun. ACM 2020, 63, 68–77. [Google Scholar] [CrossRef]
  94. Puiutta, E.; Veith, E.M.S.P. Explainable Reinforcement Learning: A Survey. In Machine Learning and Knowledge Extraction. CD-MAKE 2020. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; pp. 77–95. [Google Scholar] [CrossRef]
  95. Mizrahi, M. AI4Science and the context distinction. AI Ethics 2025, 1–6. [Google Scholar] [CrossRef]
  96. Reinke, A.; Tizabi, M.D.; Eisenmann, M.; Maier-Hein, L. Common Pitfalls and Recommendations for Grand Challenges in Medical Artificial Intelligence. Eur. Urol. Focus 2021, 7, 710–712. [Google Scholar] [CrossRef]
  97. Pavao, A.; Guyon, I.; Letournel, A.-C.; Baró, X.; Escalante, H.; Escalera, S.; Thomas, T.; Xu, Z. CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges. 2022. Available online: https://hal.inria.fr/hal-03629462/document (accessed on 18 June 2025).
  98. Guss, W.H.; Castro, M.Y.; Devlin, S.; Houghton, B.; Kuno, N.S.; Loomis, C.; Milani, S.; Mohanty, S.; Nakata, K.; Salakhutdinov, R.; et al. The MineRL 2020 Competition on Sample Efficient Reinforcement Learning using Human Priors. arXiv 2021, arXiv:2101.11071. [Google Scholar]
  99. Perez-Liebana, D.; Samothrakis, S.; Togelius, J.; Lucas, S.M.; Schaul, T. General video game AI: Competition, challenges, and opportunities. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 4335–4337. [Google Scholar] [CrossRef]
  100. Bojer, C.S.; Meldgaard, J.P. Kaggle forecasting competitions: An overlooked learning opportunity. Int. J. Forecast. 2021, 37, 587–603. [Google Scholar] [CrossRef]
  101. Kultima, A. Game Jam Natives?: The rise of the game jam era in game development cultures. In Proceedings of the 6th Annual International Conference on Game Jams, Hackathons, and Game Creation Events, Montreal, QC, Canada, 2 August 2021; pp. 22–28. [Google Scholar] [CrossRef]
  102. Koskinen, E. Pizza and coffee make a game jam—Learnings from organizing an online game development event. In Proceedings of the 6th Annual International Conference on Game Jams, Hackathons, and Game Creation Events, Montreal, QC, Canada, 2 August 2021; pp. 74–77. [Google Scholar] [CrossRef]
  103. Giagtzoglou, K.; Kalles, D. A gaming ecosystem as a tool for research and education in artificial intelligence. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, Patras, Greece, 9–12 July 2018; p. 2. [Google Scholar] [CrossRef]
  104. Salta, A.; Prada, R.; Melo, F.S. A Game AI Competition to Foster Collaborative AI Research and Development. IEEE Trans. Games 2021, 13, 398–409. [Google Scholar] [CrossRef]
  105. Genter, K.; Laue, T.; Stone, P. Three years of the robocup standard platform league drop-in player competition: Creating and maintaining a large scale ad hoc teamwork robotics competition (JAAMAS extended abstract). Proc. Int. Jt. Conf. Auton. Agents Multiagent Syst. AAMAS 2017, 1, 520–521. [Google Scholar] [CrossRef]
  106. Johnson, M.; Hofmann, K.; Hutton, T.; Bignell, D. The malmo platform for artificial intelligence experimentation. Ijcai Int. Jt. Conf. Artif. Intell. 2016, 16, 4246–4247. [Google Scholar]
  107. Aurava, R.; Meriläinen, M.; Kankainen, V.; Stenros, J. Game jams in general formal education. Int. J. Child Comput. Interact. 2021, 28, 100274. [Google Scholar] [CrossRef]
  108. Abbott, D.; Chatzifoti, O.; Ferguson, J.; Louchart, S.; Stals, S. Serious “Slow” Game Jam—A Game Jam Model for Serious Game Design. In Proceedings of the 7th International Conference on Game Jams, Hackathons and Game Creation Events, Virtual, 30 August 2023; pp. 28–36. [Google Scholar] [CrossRef]
  109. Kim, K.J.; Cho, S.B. Game AI competitions: An open platform for computational intelligence education. IEEE Comput. Intell. Mag. 2013, 8, 64–68. [Google Scholar] [CrossRef]
  110. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef]
  111. Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. ACM Int. Conf. Proceeding Ser. 2014, 38, 1–10. [Google Scholar] [CrossRef]
  112. Jalali, S.; Wohlin, C. Systematic Literature Studies: Database Searches vs. Backward Snowballing. In Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Lund, Sweden, 20–21 September 2012; pp. 29–38. [Google Scholar] [CrossRef]
  113. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]
  114. Simon, E.I.O.; Vidoni, M.; Fard, F.H. Algorithm Debt: Challenges and Future Paths. In Proceedings of the 2023 IEEE/ACM 2nd International Conference on AI Engineering—Software Engineering for AI (CAIN), Melbourne, Australia, 15–16 May 2023; pp. 90–91. [Google Scholar] [CrossRef]
  115. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160. [Google Scholar] [CrossRef] [PubMed]
  116. Chen, J.; Liang, Y.; Shen, Q.; Jiang, J.; Li, S. Toward Understanding Deep Learning Framework Bugs. ACM Trans. Softw. Eng. Methodol. 2023, 32, 1–31. [Google Scholar] [CrossRef]
  117. Dilhara, M.; Ketkar, A.; Dig, D. Understanding Software-2.0: A Study of Machine Learning Library Usage and Evolution. ACM Trans. Softw. Eng. Methodol. 2021, 30, 1–42. [Google Scholar] [CrossRef]
  118. Balhara, S.; Gupta, N.; Alkhayyat, A.; Bharti, I.; Malik, R.Q.; Mahmood, S.N.; Abedi, F. A survey on deep reinforcement learning architectures, applications and emerging trends. IET Commun. 2022, 16, 1–16. [Google Scholar] [CrossRef]
  119. Serban, A.; Visser, J. An Empirical Study of Software Architecture for Machine Learning. arXiv 2021, arXiv:2105.12422. [Google Scholar]
  120. Carleton, A.; Shull, F.; Harper, E. Architecting the Future of Software Engineering. Comput. Long. Beach. Calif. 2022, 55, 89–93. [Google Scholar] [CrossRef]
  121. Franch, X.; Martínez-Fernández, S.; Ayala, C.P.; Gómez, C. Architectural Decisions in AI-Based Systems: An Ontological View. Commun. Comput. Inf. Sci. 2022, 1621, 18–27. [Google Scholar] [CrossRef]
  122. Zhang, B.; Liu, T.; Liang, P.; Wang, C.; Shahin, M.; Yu, J. Architecture Decisions in AI-based Systems Development: An Empirical Study. In Proceedings of the 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Taipa, Macao, 21–24 March 2023; pp. 616–626. [Google Scholar] [CrossRef]
  123. Bosch, J.; Olsson, H.H.; Crnkovic, I. Engineering AI systems: A Research Agenda. In Artificial Intelligence Paradigms for Smart Cyber-Physical Systems; Mahmood, Z., Ed.; IGI Global: Hershey, PA, USA, 2021; pp. 1–19. [Google Scholar] [CrossRef]
  124. Washizaki, H.; Uchida, H.; Khomh, F.; Guéhéneuc, Y.-G. Machine Learning Architecture and Design Patterns. IEEE Softw. 2020, 37, 8. Available online: http://www.washi.cs.waseda.ac.jp/wp-content/uploads/2019/12/IEEE_Software_19__ML_Patterns.pdf (accessed on 18 June 2025).
  125. Muccini, H.; Vaidhyanathan, K. Software architecture for ML-based Systems: What exists and what lies ahead. In Proceedings of the 2021 IEEE/ACM 1st Workshop on AI Engineering—Software Engineering for AI (WAIN), Madrid, Spain, 30–31 May 2021; pp. 121–128. [Google Scholar] [CrossRef]
  126. Morgenthaler, J.D.; Gridnev, M.; Sauciuc, R.; Bhansali, S. Searching for build debt: Experiences managing technical debt at Google. In Proceedings of the 2012 Third International Workshop on Managing Technical Debt (MTD), Zurich, Switzerland, 5 June 2012; pp. 1–6. [Google Scholar] [CrossRef]
  127. Zhang, H.; Cruz, L.; van Deursen, A. Code Smells for Machine Learning Applications. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, Pittsburgh, PA, USA, 16–17 May 2022; pp. 217–228. [Google Scholar] [CrossRef]
  128. Lenarduzzi, V.; Lomio, F.; Moreschini, S.; Taibi, D.; Tamburri, D.A. Software Quality for AI: Where We Are Now? In Lecture Notes in Business Information Processing; Springer International Publishing: Cham, Switzerland, 2021; Volume 404, pp. 43–53. ISBN 9783030658533. [Google Scholar]
  129. Foidl, H.; Felderer, M.; Biffl, S. Technical Debt in Data-Intensive Software Systems. In Proceedings of the 2019 45th Euromicro Conference on Software Engineering and Advanced Applications, Thessaloniki, Greece, 28–30 August 2019; pp. 338–341. [Google Scholar] [CrossRef]
  130. Lourenço, R.; Freire, J.; Shasha, D. Debugging machine learning pipelines. In Proceedings of the 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Kallithea, Greece, 28–30 August 2019; p. 10. [Google Scholar] [CrossRef]
  131. Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M. Machine Learning: The High-Interest Credit Card of Technical Debt. NIPS 2014 Work. Softw. Eng. Mach. Learn. 2014, 8, 1–9. [Google Scholar]
  132. Pérez, B.; Castellanos, C.; Correal, D.; Rios, N.; Freire, S.; Spínola, R.; Seaman, C.; Izurieta, C. Technical debt payment and prevention through the lenses of software architects. Inf. Softw. Technol. 2021, 140, 106692. [Google Scholar] [CrossRef]
  133. O’Leary, K.; Uchida, M. Common Problems with Creating Machine Learning Pipelines from Existing Code. In Proceedings of the Third Conference on Machine Learning and Systems, Bangalore, India, 25–28 October 2023; pp. 1387–1395. [Google Scholar]
  134. Hu, X.; Chen, Q.; Wang, H.; Xia, X.; Lo, D.; Zimmermann, T. Correlating Automated and Human Evaluation of Code Documentation Generation Quality. ACM Trans. Softw. Eng. Methodol. 2022, 31, 1–28. [Google Scholar] [CrossRef]
  135. Königstorfer, F.; Thalmann, S. Software documentation is not enough! Requirements for the documentation of AI. Digit. Policy Regul. Gov. 2021, 23, 475–488. [Google Scholar] [CrossRef]
  136. Roselli, D.; Matthews, J.; Talagala, N. Managing bias in AI. In Proceedings of the Companion Proceedings of the 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 539–544. [Google Scholar] [CrossRef]
  137. Muiruri, D.; Lwakatare, L.E.; Nurminen, J.K.; Mikkonen, T. Practices and Infrastructures for Machine Learning Systems: An Interview Study in Finnish Organizations. Comput. Long. Beach. Calif. 2022, 55, 18–29. [Google Scholar] [CrossRef]
  138. Nahar, N.; Zhang, H.; Lewis, G.; Zhou, S.; Kastner, C. A Meta-Summary of Challenges in Building Products with ML Components—Collecting Experiences from 4758+ Practitioners. In Proceedings of the 2023 IEEE/ACM 2nd International Conference on AI Engineering—Software Engineering for AI (CAIN), Melbourne, Australia, 15–16 May 2023; pp. 171–183. [Google Scholar] [CrossRef]
  139. Jebnoun, H.; Rahman, M.S.; Khomh, F.; Muse, B.A. Clones in deep learning code: What, where, and why? Empir. Softw. Eng. 2022, 27, 84. [Google Scholar] [CrossRef]
  140. Alahdab, M.; Çalıklı, G. Empirical Analysis of Hidden Technical Debt Patterns in Machine Learning Software. Lect. Notes Comput. Sci. 2019, 11915, 195–202. [Google Scholar] [CrossRef]
  141. Lorenzoni, G.; Alencar, P.; Nascimento, N.; Cowan, D. Machine Learning Model Development from a Software Engineering Perspective: A Systematic Literature Review. arXiv 2021, arXiv:2102.07574. [Google Scholar]
  142. Wang, S.; Huang, L.; Ge, J.; Zhang, T.; Feng, H.; Li, M.; Zhang, H.; Ng, V. Synergy between Machine/Deep Learning and Software Engineering: How Far Are We? arXiv 2020, arXiv:2008.05515. [Google Scholar]
  143. Siebert, J.; Joeckel, L.; Heidrich, J.; Nakamichi, K.; Ohashi, K.; Namba, I.; Yamamoto, R.; Aoyama, M. Towards guidelines for assessing qualities of machine learning systems. Commun. Comput. Inf. Sci. 2020, 1266, 17–31. [Google Scholar] [CrossRef]
  144. Bosch, J.; Olsson, H.H.; Crnkovic, I. It takes three to tango: Requirement, outcome/data, and AI driven development. CEUR Workshop Proc. 2018, 2305, 177–192. [Google Scholar]
  145. Yang, C.; Wang, W.; Zhang, Y.; Zhang, Z.; Shen, L.; Li, Y.; See, J. MLife: A lite framework for machine learning lifecycle initialization. Mach. Learn. 2021, 110, 2993–3013. [Google Scholar] [CrossRef]
  146. Belani, H.; Vukovic, M.; Car, Z. Requirements engineering challenges in building ai-based complex systems. In Proceedings of the 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), Jeju, Republic of Korea, 23–27 September 2019; pp. 252–255. [Google Scholar] [CrossRef]
  147. Yan, M.; Xia, X.; Shihab, E.; Lo, D.; Yin, J.; Yang, X. Automating Change-Level Self-Admitted Technical Debt Determination. IEEE Trans. Softw. Eng. 2019, 45, 1211–1229. [Google Scholar] [CrossRef]
  148. Sharma, R.; Shahbazi, R.; Fard, F.H.; Codabux, Z.; Vidoni, M. Self-admitted technical debt in R: Detection and causes. Autom. Softw. Eng. 2022, 29, 53. [Google Scholar] [CrossRef]
  149. Mastropaolo, A.; Di Penta, M.; Bavota, G. Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We? In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 585–597. [Google Scholar] [CrossRef]
  150. Sherin, S.; Khan, M.U.; Iqbal, M.Z. A Systematic Mapping Study on Testing of Machine Learning Programs. arXiv 2019, arXiv:1907.09427. [Google Scholar]
  151. Riccio, V.; Jahangirova, G.; Stocco, A.; Humbatova, N.; Weiss, M.; Tonella, P. Testing machine learning based systems: A systematic mapping. Empir. Softw. Eng. 2020, 25, 5193–5254. [Google Scholar] [CrossRef]
  152. Shankar, S.; Garcia, R.; Hellerstein, J.M.; Parameswaran, A.G. “We Have No Idea How Models will Behave in Production until Production”: How Engineers Operationalize Machine Learning. Proc. ACM Hum. Comput. Interact. 2024, 8, 1–34. [Google Scholar] [CrossRef]
  153. Wan, C.; Liu, S.; Hoffmann, H.; Maire, M.; Lu, S. Are machine learning cloud APIs used correctly? In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 25–28 May 2021; pp. 125–137. [Google Scholar] [CrossRef]
  154. Falk, J.; Mose Biskjaer, M.; Halskov, K.; Kultima, A. How organisers understand and promote participants’ creativity in Game Jams. In Proceedings of the 6th Annual International Conference on Game Jams, Hackathons, and Game Creation Events, Montreal, QC, Canada, 2 August 2021; pp. 12–21. [Google Scholar] [CrossRef]
  155. Scott, M.J.; Ghinea, G. Promoting Game Accessibility: Experiencing an Induction on Inclusive Design Practice at the Global Games Jam. arXiv 2013, arXiv:1305.4359. [Google Scholar]
Figure 1. PRISMA flow diagram of the systematic search process. The 72 included studies reflect the database-driven selection. An additional 28 studies were incorporated through supplementary methods (see Section 3.4), resulting in a total of 100 studies analyzed.
Figure 2. Distribution of studies by technical debt type in AI-based systems.
Figure 3. Stakeholder impact matrix for technical debt types.
Figure 4. Perceived severity of technical debt types in AI/ML systems. Bar chart summarizing average severity scores (1–5) assigned to each technical debt category based on expert synthesis and thematic literature analysis.
Table 1. Search queries and retrieved records.
Data Source | Date Range | Search String | Results
Google Scholar | 2012–2024 | (“Technical Debt” AND (“Artificial Intelligence” OR “AI” OR “Machine Learning” OR “ML”) AND “Software Engineering”) OR (“Technical Debt” AND “AI-Based Systems”) | 214
ACM | 2012–2024 | “Technical Debt” AND (“Artificial Intelligence” OR “AI” OR “Machine Learning” OR “ML”) AND (“Software Engineering” OR “SE”) OR (“Technical Debt” AND “AI-Based System *”) ¹ | 413
IEEE Xplore | 2012–2024 | (“Technical Debt” AND (“Artificial Intelligence” OR “AI” OR “Machine Learning” OR “ML”) AND (“Software Engineering” OR “SE”)) OR (“Technical Debt” AND “AI-Based Systems”) | 49
Scopus | 2012–2024 | TITLE-ABS-KEY (“Technical Debt” AND (“Artificial Intelligence” OR “AI” OR “Machine Learning” OR “ML”) AND (“Software Engineering” OR “SE”)) OR TITLE-ABS-KEY (“Technical Debt” AND “AI-Based Systems”) | 46
Springer | 2012–2024 | (“Technical Debt” AND (“Artificial Intelligence” OR “AI” OR “Machine Learning” OR “ML”) AND (“Software Engineering” OR “SE”)) OR (“Technical Debt” AND “AI-Based Systems”) | 148
¹ The asterisk (*) is used as a wildcard character to include plural and variant forms of the search term (e.g., “AI-Based Systems”).
Table 3. Distribution of identified technical debt types across reviewed papers.
Technical Debt Type | # Papers | Papers
Algorithm | 4 | [114,115,116,117]
Architectural | 10 | [1,20,118,119,120,121,122,123,124,125]
Build | 2 | [57,126]
Code | 10 | [19,21,30,62,65,66,67,69,86,127]
Configuration | 4 | [3,4,60,64]
Data | 9 | [27,28,40,41,42,43,44,128,129]
Defect | 5 | [19,29,130,131,132]
Design | 4 | [31,37,38,133]
Documentation | 4 | [74,75,134,135]
Ethics | 6 | [26,79,82,90,136,153]
Infrastructure | 3 | [6,50,137]
Model | 13 | [5,7,46,47,49,52,70,138,139,140,141,142,143]
People | 3 | [71,76,78]
Process | 2 | [38,72]
Requirements | 5 | [34,35,144,145,146]
Self-Admitted (SATD) | 6 | [63,83,84,147,148,149]
Test | 4 | [51,55,150,151]
Versioning | 6 | [48,58,59,60,61,152]
Table 4. Summary of technical debt types with stakeholder impact and suggested mitigation strategies.
No. | Technical Debt Type | Primary Stakeholder(s) | Key Impact on Stakeholders | Suggested Mitigation | Code Ref. in Questionnaire
1 | Algorithm | Participant | Sub-optimal or unvalidated algorithmic choices may reduce reproducibility, increase complexity, and affect performance consistency. | Encourage use of baseline models, include validation protocols, and require algorithmic documentation. | Q1
2 | Architectural | Organizer | Poor modularization and ad hoc integration decisions lead to tight coupling, limited scalability, and long-term maintainability issues in the platform infrastructure. | Apply early refactoring, adopt architectural patterns (e.g., MVC), and document integration points clearly. | Q2–Q3
3 | Build | Organizer | Fragile or undocumented build scripts hinder platform portability, onboarding, and collaboration between contributors or instructors. | Use standardized build tools (e.g., Docker, Maven), document the build process, and automate CI/CD pipelines. | Q4–Q5
4 | Code | Organizer/Participant | Unreadable, duplicated, or overly complex code increases onboarding time, inhibits reuse, and introduces hidden bugs in competition solutions. | Promote coding standards, enforce linters, and use peer review and code refactoring practices. | Q6
5 | Configuration | Organizer/Participant | Hard-coded or undocumented configuration settings reduce reproducibility, cause deployment failures, and hinder experiment replication. | Use centralized configuration files, version configuration artifacts, and document parameter effects. | Q7–Q9
6 | Data | Organizer | Poor-quality, biased, or evolving datasets affect model training validity, generalization, and fair scoring across submissions. | Apply data versioning, include data validation checks, and provide metadata with provenance and bias analysis. | Q10–Q15
7 | Design | Organizer | Poor architectural decisions lead to rigid systems that are hard to extend with new tasks, metrics, or pipelines. | Encourage modular design, document architecture decisions, and use design patterns suited for AI/ML platforms. | Q16–Q18
8 | Defect | Organizer | Unresolved or recurring bugs in competition submissions affect scoring fairness, participant confidence, and usability of reference implementations. | Implement test-driven development, integrate automated testing frameworks, and provide reproducible bug reports. | Q19–Q23
9 | Documentation | Organizer | Incomplete or outdated documentation causes misunderstandings, onboarding delays, and misuse of platform functionalities. | Maintain up-to-date documentation, provide code-level comments, and supply example workflows for both organizers and participants. | Q24–Q28
10 | Ethics | Organizer/Participant | May raise concerns regarding fairness, bias, or lack of transparency in evaluation processes or training data disclosure. | Integrate fairness audits, ethics checklists, and stakeholder feedback loops. | Q29–Q30
11 | Infrastructure | Organizer | Can cause instability or performance issues due to outdated or insufficient hosting and computing infrastructure. | Use scalable cloud services and regularly monitor infrastructure health. | Q31–Q34
12 | Model | Organizer | Leads to reduced reproducibility, maintainability, or performance when models are undocumented, overfitted, or opaque. | Document model architecture, training routines, and evaluation protocols clearly. | Q35–Q37
13 | People | Organizer/Participant | Knowledge silos, turnover, or miscommunication can reduce platform consistency and maintainability. | Establish onboarding documents, cross-training, and transparent collaboration norms. | Q38–Q39
14 | Process | Organizer | Unstructured or ad hoc processes can cause delays, confusion, or lack of traceability across platform tasks. | Adopt reproducible workflows with defined pipelines, task ownership, and versioning. | Q40–Q42
15 | Requirements | Organizer | Missing or unclear requirements may result in mismatched expectations between organizers and participants. | Define clear and testable requirements early, using templates or user stories. | Q43–Q45
16 | SATD | Participant | Unaddressed TODO or FIXME comments in codebases may indicate areas of known debt left unresolved. | Systematically review SATD comments and integrate them into refactoring plans. | Q46–Q47
17 | Test | Organizer/Participant | Lack of automated or manual testing increases the risk of defects, regressions, and scalability issues. | Develop and maintain testing protocols, including unit and system-level tests. | Q48–Q54
18 | Versioning | Organizer | Failure to track platform versions, data updates, or model submissions can undermine reproducibility and auditability. | Introduce version control systems with changelogs and reproducibility tags. | Q55–Q57
19 | Accessibility | Organizer | Lack of accessibility features may prevent equal participation, particularly for users with visual, cognitive, or language-related limitations. This can result in reduced inclusiveness, participation drop-off, and limited feedback from diverse user groups. | Provide accessible documentation (e.g., plain language, multilingual support), enforce UI design standards (e.g., contrast ratios, keyboard navigation), and validate platform usability through accessibility audits or participant surveys. | Q58–Q60
Table 5. Technical debt types and primary stakeholders.
Technical Debt Type | Primary Responsible Stakeholder
Algorithm | Participant
Architectural | Organizer
Build | Organizer
Code | Both (Organizer and Participant)
Configuration | Both (Organizer and Participant)
Data | Organizer
Defect | Organizer
Design | Organizer
Documentation | Organizer
Ethics | Both (Organizer and Participant)
Infrastructure | Organizer
Model | Organizer
People | Both (Organizer and Participant)
Process | Organizer
Requirements | Organizer
Self-Admitted (SATD) | Participant
Test | Both (Organizer and Participant)
Versioning | Organizer
Table 6. Example—organizer perspective: accessibility debt.
Question | Score | Answer | Calculated Score
Have you conducted usability testing to identify and address potential barriers? | 5 | YES | −5
Have you integrated adaptive user interfaces based on participants’ skill levels? | 4 | NO | 4
Have you implemented feedback mechanisms to report accessibility issues? | 3 | I Don’t Know/I Don’t Answer | 3
Overall Rating |  |  | 2
Table 7. Example—participant perspective: model debt.
Question | Score | Answer | Calculated Score
Are you detecting direct or hidden feedback loops in your model? | 4 | NO | 4
Is model quality validated before serving? | 5 | YES | −5
Does the model allow debugging by observing step-by-step inference on a single example? | 3 | I Don’t Know/I Don’t Answer | 3
Overall Rating |  |  | 2
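The worked examples in Tables 6 and 7 follow a simple signed-sum rule: each question carries a severity weight, a YES answer subtracts that weight (the corresponding risk is considered addressed), whereas a NO or an “I Don’t Know/I Don’t Answer” response adds it, and the overall category rating is the sum of these signed contributions. The short Python sketch below only illustrates that rule as read off the two tables; the function names (item_score, category_rating) and data layout are illustrative and not part of the published framework.

```python
# Minimal sketch (not the authors' implementation) of the scoring rule
# implied by Tables 6 and 7: YES subtracts the item weight, any other
# answer (NO, "I Don't Know/I Don't Answer") adds it.

def item_score(weight: int, answer: str) -> int:
    """Signed contribution of a single questionnaire item."""
    return -weight if answer.strip().upper() == "YES" else weight

def category_rating(items: list[tuple[int, str]]) -> int:
    """Overall rating for one technical debt category (sum of signed scores)."""
    return sum(item_score(weight, answer) for weight, answer in items)

if __name__ == "__main__":
    # Table 6: organizer perspective, accessibility debt
    accessibility = [(5, "YES"), (4, "NO"), (3, "I Don't Know/I Don't Answer")]
    # Table 7: participant perspective, model debt
    model = [(4, "NO"), (5, "YES"), (3, "I Don't Know/I Don't Answer")]

    print(category_rating(accessibility))  # 2, matching the overall rating in Table 6
    print(category_rating(model))          # 2, matching the overall rating in Table 7
```

A positive rating therefore signals residual debt in that category, while a rating at or below zero indicates that the assessed practices currently outweigh the identified risks.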
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
