Code Smells Thresholds Optimization: Defect Prediction as a Case Study

Mashiach, Tom; Katz, Gilad; Kalech, Meir

doi:10.3390/a19050412


by Tom Mashiach *, Gilad Katz and Meir Kalech *

The Stein Faculty of Computer and Information Science, Ben-Gurion University of the Negev, Beersheba 84105, Israel

* Authors to whom correspondence should be addressed.

Algorithms 2026, 19(5), 412; https://doi.org/10.3390/a19050412

Submission received: 1 April 2026 / Revised: 8 May 2026 / Accepted: 10 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue Algorithms and Machine Learning in Software Engineering)


Abstract

In software engineering, detecting and managing code smells is pivotal for maintaining software quality and reducing the risk of defects. Code smells signify potential issues in code that, while not problematic in themselves, may indicate deeper design flaws or future complications. Traditional code smell detection methods, which compare code metrics against fixed or statistically derived thresholds, may not always identify the code smells most relevant to a specific software practice. Addressing this gap, this research introduces a methodology that utilizes a neural threshold generator, trained via a cooperative critic, to dynamically generate threshold values for detecting code smells in software components. Although the critic is conceptually related to the discriminator in a Generative Adversarial Network (GAN), its training objective is aligned with, rather than adversarial to, that of the generator. By integrating relevant code metrics, the proposed model generates customized thresholds for each software component. It then uses these thresholds to identify code smells, which serve as input features to train a defect prediction model. A key feature of our approach is a cooperative-critic feedback mechanism that iteratively refines the thresholds based on the defect prediction outcomes. Our current evaluation focuses on a set of 11 class-level code smells defined by single or AND-connected conditions. The approach achieved better defect prediction performance, measured by F1-score, AUC-ROC, and AUC-PRC, than a defect prediction model that uses the traditional thresholds. Our study underscores the effectiveness of generating context-specific thresholds through neural networks, suggesting a promising avenue for exploring related software practices.

Keywords:

code smells; thresholds; defect prediction

1. Introduction

In the realm of software engineering, code smells have garnered significant attention due to their potential implications on software quality and maintainability. Code smells, as defined by notable figures such as Fowler and Brown [1,2], refer to patterns within code that may indicate deeper issues or poor design choices. While these patterns are not inherently problematic, they often signal underlying deficiencies that could lead to more significant problems, such as increased technical debt or heightened risk of defects.

The correlation between code smells and the likelihood of defects has been substantiated by various studies, suggesting that code smells can serve as reliable indicators for predicting software anomalies [3,4,5,6,7]. This relationship underscores the necessity of incorporating code smell analysis into software defect prediction models, as it offers a sophisticated perspective through which software quality can be assessed and improved. By leveraging code smell detection, developers and researchers can identify potential trouble spots within code more effectively, guiding targeted interventions and mitigating the risk of future defects.

To identify code smells in software components, researchers have developed conditional rules that apply metrics and thresholds to software components [8,9,10,11,12]. For instance, a class is considered to be a ‘Large Class’ if its LOC (Lines of Code) exceeds a predefined threshold. These thresholds are usually constant or based on the distribution of metrics across all components. Some studies, like the one by Fontana et al. [13], have proposed methods to derive different threshold levels from the metric distribution to improve code smell detection accuracy. Similarly, Liu et al. [14] have worked on customizing thresholds to enhance code smell identification. Palomba et al. [15] diverged from the focus on threshold adjustments by developing a parameter to measure the severity of code smells, with the objective of improving defect prediction without altering existing thresholds.

The motivation behind these research efforts to adjust thresholds is rooted in the observation that the constant thresholds previously established might not be optimal in various software practices. Despite these efforts, a significant gap remains: the literature has not yet explored customizing thresholds for individual software components. Tailoring thresholds could transform code smell detection, resulting in more accurate identification and enhancing the performance of related software practices. For instance, defect prediction models that use code smells as features could see improved performance due to the increased accuracy of code smell detection from these personalized thresholds.

Building on this identified gap, our research introduces a model that customizes code smell thresholds for each software component, with the aim of improving a given software practice by finding the thresholds most effective for that practice. Given the established value of code smells as indicators for defect prediction, this paper uses defect prediction as a case study and shows how tailored thresholds can improve the performance of defect prediction models.

Our methodology introduces a model that generates specific thresholds for detecting code smells, aimed at enhancing the accuracy of defect prediction. For each sample, the model determines unique thresholds based on the relevant metrics, which then inform the identification of present code smells. This information serves as input for a defect prediction model. Following this, we use a cooperative-critic feedback mechanism to fine-tune the thresholds. Although this mechanism is conceptually related to the discriminator in a Generative Adversarial Network (GAN), the generator and critic are trained with aligned rather than adversarial objectives. This mechanism leverages the output from the defect prediction model to iteratively optimize the thresholds, ensuring the continuous improvement of the model and its effectiveness in predicting software defects.

We evaluate our model on 98 software projects. The current evaluation focuses on 11 class-level code smells whose detection rules are defined by a single condition or by AND-connected conditions; extensions to more complex smell definitions and to method-level smells are left to future work. We compare our model’s performance against a baseline model that employs traditional thresholds for code smell detection, which directly isolates the effect of our tailored thresholds on defect prediction. Across F1-score, AUC-ROC, and AUC-PRC, tailoring thresholds improves defect prediction performance compared with traditional thresholds.

2. Background and Related Works

This section explores the literature relevant to our research, highlighting four core themes: the background of code smells, related works on their utilization in defect prediction models, related works on the methodologies employed for optimizing code smells thresholds, and an overview of training schemes for models that contain non-differentiable components.

2.1. Code Smells

Code smells are patterns in source code that signal potential design and implementation issues. While not direct indicators of defects, these patterns are symptomatic of suboptimal practices that increase the code’s technical debt and complexity, potentially paving the way for future defects. For example, code duplication across various classes may not signify existing defects, but it complicates the software and makes it harder to maintain and develop, thereby increasing the risk of defect introduction. Leading experts in the domain, including Fowler et al. and Brown [1,2], have established core definitions for code smells, identifying them as signs of less-than-optimal software development practices that can be improved with refactoring techniques. However, their work outlined only the conceptual nature of code smells without offering detection methodologies.

Subsequent studies have developed formal detection methods based on these conceptual descriptions, employing code metrics (such as line counts) and establishing conditional rules with threshold sets to determine the presence of code smells in software components, such as classes or methods [16,17,18,19]. The thresholds typically are empirical constants or statistically derived from project-specific metric distributions. Suryanarayana et al. introduced a distinct category of code smells, termed design smells, in their research [20]. These design smells specifically target and identify deviations from the essential object-oriented design principles, namely Abstraction, Encapsulation, Modularization, and Hierarchy.

2.2. Utilization of Code Smells for Defect Prediction

The field of defect prediction is a crucial area of focus in software engineering, aimed at identifying the software components that are the most likely to exhibit defects. This identification helps developers prioritize their efforts, optimize resource allocation, and potentially mitigate the time spent on debugging processes [21]. The interplay between code smells and defect prediction has garnered attention, with Piotrowski and Madeyski [3] offering a comprehensive review of studies exploring this relationship. Their analysis underscores a positive correlation between code smells and defects, reinforcing the utility of code smells in bolstering defect prediction models.

Building on the insights provided by Piotrowski and Madeyski, additional research has examined various code smells and their implications for defect prediction. Ma et al. [4] integrated code smell detection into defect prediction models and reported a significant improvement in recall. Similarly, Taba et al. [5] introduced antipattern metrics based on the historical analysis of code smells, demonstrating that they predict defects more accurately than traditional metrics. Sotto-Mayor et al. [7] investigated the correlation between design code smells and defect prediction. Design code smells, a distinct type of code smell rooted in the foundational design principles of object-oriented programming, had not been considered in prior research in this context, and their work shows that these design-specific smells have predictive value for software defects. In another study, Sotto-Mayor et al. [6] explored code smells as predictive features for cross-project defect prediction and showed that they can substantially improve defect prediction models trained across multiple software projects.

Beyond feature engineering based on code smells or traditional metrics, a parallel line of research applies deep learning directly to defect prediction. Wang et al. [22] used deep belief networks to learn semantic features from abstract syntax trees, and Li et al. [23] used convolutional networks over token sequences. Hoang et al. [24] introduced an end-to-end deep learning model for just-in-time defect prediction. More recently, pretrained transformer-based program representations such as CodeBERT [25] have been applied to a range of software engineering tasks, including bug prediction. These approaches are complementary to smell-based prediction: they learn features directly from source code rather than from hand-crafted or rule-based indicators. Our focus in this work is instead on improving the quality of the smell-based features themselves by learning better detection thresholds; combining per-sample learned thresholds with learned code representations is a natural direction for future work.

It is important to note that all the studies mentioned above utilized predefined thresholds to identify code smells. These thresholds can be constant or distribution-based. However, unlike the approaches taken in these studies, our current research employs sample-based thresholds for code smell detection. This distinction highlights the innovative approach of our study, where we tailor thresholds to the specific context of each sample, offering a sophisticated method for code smell detection that enhances defect prediction.

2.3. Optimize Code Smells Thresholds

Research on optimizing code smell thresholds has been fairly limited. Although many studies are dedicated to setting thresholds for various software metrics, code smell threshold optimization is less explored, and only a few recent studies address it. Metric thresholds are crucial in the transition from mere quantification to informed decision-making in software engineering: they define the point at which a metric indicates a potential issue, thus guiding actions and decisions. A substantial body of research explores approaches to establishing and using metric thresholds, with the objective of making metrics useful for guiding decisions within the software development life cycle rather than just for measuring [8,9,10,11,12].

The problem of setting decision thresholds appropriately is not unique to software engineering and has been studied extensively in the broader machine learning literature. Sheng and Ling [26] proposed cost-sensitive threshold selection, Zou et al. [27] investigated threshold selection under severe class imbalance, and Pleiss et al. [28] explored instance-conditional decision boundaries in the context of fairness-aware classification. A recurring observation across this literature is that a single global threshold is rarely optimal across heterogeneous inputs, which motivates moving towards thresholds that adapt to context. Our work can be viewed as an instantiation of this broader trend, applied to code smell detection, in which the heterogeneity of software components makes a single global threshold particularly ill-suited.

One study addressing code smell thresholds is the work of Fontana et al. [13]. The researchers introduced a data-driven methodology that uses benchmarks to establish threshold values specifically for code metrics. This study applies the derived thresholds to enhance code smell detection rules. In their approach, Fontana et al. derive three distinct threshold levels for each metric, based on the quartiles of the metric’s distribution. This method allows for the identification of code smells with thresholds that naturally align with the inherent distribution of the data, providing a more data-driven and precise mechanism for code smell detection.
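As a rough illustration of distribution-derived thresholds, the sketch below computes three threshold levels from the quartiles of a metric's values. It is a simplification and not Fontana et al.'s actual procedure: the function name `quartile_thresholds`, the level names, and the sample LOC values are all assumptions made for illustration.

```python
import statistics


def quartile_thresholds(values):
    """Return three threshold levels taken from the quartiles of a metric."""
    # With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"low": q1, "medium": q2, "high": q3}


# Hypothetical LOC values for the classes of one benchmark project.
loc_values = [40, 85, 120, 150, 210, 300, 450, 620, 900]
levels = quartile_thresholds(loc_values)

# Classes whose LOC exceeds the strictest level would be flagged.
flagged = [v for v in loc_values if v > levels["high"]]
```

Every class in the project is compared against the same three levels, which is the property the per-sample approach in this paper relaxes.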

Another work that improves code smell detection is the study by Liu et al. [14]. The paper addresses the need for customizing thresholds to suit the specific characteristics of software applications, engineers’ working schedules, and their individual software quality requirements. The authors utilize genetic algorithms to identify optimal settings, showcasing an innovative approach to threshold customization. Their methodology actively engages engineers in manually evaluating potential code smells and making informed decisions regarding the presence or absence of code smells for accurate ground-truth labeling. However, this study focuses only on five specific code smells, and the authors acknowledge that their approach might not be scalable to encompass a broader array of code smells.

Neither Fontana et al. nor Liu et al. aimed to refine defect prediction models based on code smells. The study by Palomba et al. [15] examines the relationship between code smells and bug-proneness, proposing a bug prediction model for classes affected by code smells. The authors introduce the “code smell intensity” metric, which assesses the severity of a code smell from its associated metrics, notably using the distance from the established thresholds for that code smell. These thresholds are derived from the metrics’ statistical distribution in a large dataset, represented as a quantile function. However, the paper does not attempt to compute optimal thresholds to enhance bug prediction. In summary, Fontana et al. and Liu et al. advanced code smell detection through data-driven and context-specific threshold determination but did not leverage these thresholds for defect prediction, whereas Palomba et al. studied the impact of code smells on bug-proneness but did not investigate threshold optimization for higher prediction accuracy.

A parallel line of work forgoes thresholded rules entirely and instead trains machine learning classifiers to identify code smells directly from labeled examples. Arcelli Fontana et al. [29] conducted a large-scale comparison of classifiers for smell detection, and Di Nucci et al. [30] critically examined the robustness of these learned detectors. Our approach is complementary to, rather than competing with, this line of work: rule-based thresholded smell detection remains the dominant practice in widely deployed tooling such as SonarQube, PMD, DesigniteJava, Organic, and JDeodorant, and our contribution is precisely to improve the thresholds used by such tools on a per-sample basis. Combining per-sample learned thresholds with machine-learning-based smell classifiers is an interesting direction for future work.

Based on the discussions presented, our work distinguishes itself from existing research in two key ways. First, it addresses the gap described above by dynamically setting tailored thresholds for each sample, taking into account its context, and applying these thresholds to detect code smells, which in turn improves the effectiveness of defect prediction models. Second, our approach is designed to be extensible to a broader range of code smells, regardless of how their detection rules are implemented.

2.4. Learning Through Non-Differentiable Operators

A recurring challenge in deep learning is the presence of a non-differentiable operator within an otherwise differentiable pipeline, which blocks end-to-end gradient flow. In our setting, the rule-based Code Smells Calculator (detailed in Section 4) plays this role: it consumes thresholds together with metric values and outputs binary smell indicators via comparison-and-logic operations that are not differentiable. The broader literature has converged on three distinct families of approaches to this general problem, and it is useful to position our design relative to all three.

The first family replaces the non-differentiable operator with a continuous relaxation, restoring gradient flow at the cost of altering the operator’s semantics. The Gumbel-Softmax distribution [31] is the canonical instantiation and has been used, among other places, in GAN-style generators of discrete structures. The second family preserves the discrete operator and supplies gradients via policy-gradient methods borrowed from reinforcement learning; SeqGAN [32] is the most widely cited example, training a discrete-sequence generator against an adversarial discriminator using REINFORCE. Training-stability variants of adversarial training, such as Wasserstein GAN and its gradient-penalty form [33,34], address a different failure mode (adversarial instability) rather than non-differentiability per se, but frequently appear alongside the first two families in the same application contexts. The third family leaves the non-differentiable operator untouched and introduces an auxiliary differentiable network whose role is to produce a learning signal that can be back-propagated in place of the blocked gradient. Actor-critic methods in reinforcement learning [35,36] are the prototypical instance of this pattern, and cooperative-training frameworks such as CoopNets [37] apply closely related ideas to generative modeling with aligned rather than adversarial objectives.

Each family imposes a characteristic trade-off. Relaxation-based approaches require giving up the exact semantics of the original operator, which in our case would mean abandoning the rule-based smell definitions that make the output directly compatible with established tooling. Policy-gradient approaches preserve the operator but introduce high-variance gradient estimators that tend to destabilize training. Adversarial-stability modifications, such as WGAN, leave the non-differentiability of an internal operator unresolved on their own. The auxiliary-critic family, to which our approach belongs, preserves the discrete operator and avoids the high variance of policy-gradient estimators, at the cost of requiring an additional learned component whose training must be coordinated with the rest of the pipeline. The mechanism introduced in Section 4 follows this third route: we leave the discrete Code Smells Calculator untouched and train an auxiliary critic network whose role is to produce a differentiable, informative learning signal for the Thresholds Generator. As detailed there, the generator and critic are trained with aligned rather than adversarial objectives, placing the resulting architecture conceptually closer to cooperative-training frameworks and actor-critic reinforcement learning than to standard GANs.
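To make the gradient-blocking issue concrete, the sketch below contrasts a hard threshold comparison with a sigmoid relaxation of the same comparison, estimating the derivative with respect to the threshold by finite differences. It is illustrative only: the function names, the temperature, and the metric and threshold values are assumptions, and the approach in this paper keeps the hard rule rather than relaxing it.

```python
import math


def hard_exceeds(metric, threshold):
    """Exact rule: 1.0 if the metric exceeds the threshold, else 0.0."""
    return 1.0 if metric > threshold else 0.0


def soft_exceeds(metric, threshold, temperature=10.0):
    """Sigmoid relaxation of the rule; smooth in the threshold."""
    return 1.0 / (1.0 + math.exp(-(metric - threshold) / temperature))


def finite_diff(fn, metric, threshold, eps=1e-3):
    """Central finite-difference estimate of d fn / d threshold."""
    return (fn(metric, threshold + eps) - fn(metric, threshold - eps)) / (2 * eps)


# LOC = 523 against a threshold of 500 (values chosen for illustration).
hard_grad = finite_diff(hard_exceeds, 523, 500)  # zero: the step is flat here
soft_grad = finite_diff(soft_exceeds, 523, 500)  # negative: raising the threshold lowers the score
```

The hard rule is flat almost everywhere, so it gives the threshold no learning signal. The relaxed rule does, but it no longer returns the exact binary smell indicator.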

3. Problem Description

This section outlines the problem of optimizing code smells’ thresholds to enhance defect prediction. We begin by defining defect prediction and its methodologies (from data gathering to classifier development and evaluation), and discuss the use of code smells as features in that task (Section 3.1). Next, we provide a detailed definition and formulation of optimizing the code smells’ thresholds (Section 3.2).

3.1. Defect Prediction

Defect prediction is pivotal in improving software quality, focusing on identifying potential code flaws early in the development process. It relies on analyzing historical defect data to assess the likelihood of defects in new software instances. Using statistical and machine learning methodologies, this approach enables a precise prediction of future defects, facilitating targeted efforts in quality assurance. The methodology of defect prediction in the literature typically encompasses the collection and preprocessing of software metrics, the selection of relevant features based on historical data, and the application of machine learning algorithms to train classification models, so that they will finally be able to predict future defects. This process involves training models on past project data, validating them on separate datasets, and fine-tuning parameters to improve prediction outcomes [38,39].

Several literature reviews have analyzed and categorized the software metrics used to predict software defects. Radjenovic et al. [40], for example, analyzed different software metrics applied in software fault prediction. Their review, which covers 106 research articles from 1991 to 2011, divides software metrics into three categories, based on their goal and time. First, traditional metrics aim to measure the size and complexity of code. Second, object-oriented metrics aim to capture object-oriented properties such as cohesion, coupling, and inheritance. Third, process metrics measure the quality of the development process, such as the number of changes and the number of bugs. Another group of features is known as ‘bad code smells’. Code smells are patterns in software code that indicate potential problems or poor design choices, which may cause problems in the future. Bad code smells are not bugs, but they can make codebase maintenance and evolution harder, signaling the need for refactoring. Code smells have been shown both to be positively correlated with software defects and to positively influence the performance of defect prediction models when used as features [3,6,7]. In the next section, we examine the definition of code smells in more detail and discuss the importance of establishing thresholds for their detection. We then formally define the optimization of code smells’ thresholds as a method to enhance defect prediction.

3.2. Code Smells Thresholds Optimization

Code smells were first identified in foundational works in the field, which described them as signs of less-than-ideal software development practices and as candidates for improvement through refactoring [1,2]. Although these contributions outlined the conceptual framework for code smells, they did not offer any concrete methods for detection. This gap was addressed in later research, which proposed formal methods for detecting code smells using code metrics, such as the number of code lines [16,17,18,19]. These methods apply conditional rules and thresholds to ascertain the presence of code smells in software components, such as classes and methods. For example, a classic code smell is the God Class, characterized by a class that is excessively large and centralizes numerous responsibilities. To detect a God Class, one can apply a specific criterion: checking whether the class’s total lines of code exceed 500 and whether its tight class cohesion (TCC), a metric for evaluating the closeness of a class’s public functions, is below the average TCC for all classes in the project.

The thresholds for detecting code smells are established through empirical data and statistical analysis. Empirical thresholds are often derived from established best practices and historical data, setting a baseline for acceptable code characteristics, such as maximum method length or acceptable levels of class coupling. Another way to set the threshold is to consider the context of a specific project by analyzing its code metrics to establish norms and outliers within that specific environment. This methodology dictates that thresholds remain consistent across all samples or within the same contextual framework. However, the effectiveness of these thresholds in accurately identifying code smells that are predictive of defects is still a topic of debate. Consequently, this paper proposes to establish optimal thresholds tailored to each sample, to improve the correlation between detected code smells and defect prediction.

An important methodological point about evaluating threshold quality is that, unlike defects, code smells do not have a canonical ground truth. Smell definitions are inherently heuristic, and multiple empirical studies have documented substantial disagreement about when a given smell is present. Fernandes et al. [41] and Paiva et al. [42] reported low agreement across widely used smell-detection tools, and Hozano et al. [43] found comparable levels of disagreement among human developers themselves. Consequently, assessing the quality of learned thresholds by comparing the resulting smell labels against any fixed reference set would effectively measure agreement with one heuristic rather than with any absolute ground truth. We therefore adopt a task-driven evaluation: thresholds are optimized for, and evaluated through, a downstream objective (defect prediction) whose outcomes are observable and unambiguous. This is methodologically parallel to the way learned representations, such as word embeddings, are typically evaluated on downstream tasks rather than against intrinsic semantic ground truth.

Research objective. Let us formally define the research objective of this paper. Let $CS = \{cs_{1}, cs_{2}, \dots, cs_{z}\}$ be the set of all code smells being considered. A code smell is defined by a function $f_{cs}$ that applies to specific metrics and their corresponding thresholds. Let $M = \{m_{1}, m_{2}, \dots, m_{n}\}$ be the set of all metric values, and let $M_{cs} \subseteq M$ be the ordered set of the metrics considered for a specific code smell $cs$. For the given code smell $cs$ there is the ordered set of thresholds $T_{cs} = \{t_{1}, t_{2}, \dots, t_{|M_{cs}|}\}$ corresponding to the metric set $M_{cs}$. The function $f_{cs}$ can be formally represented as follows:

$$f_{cs}(M_{cs}, T_{cs}) = \mathit{cond}_{1}(m_{1}, t_{1}) \; \mathit{op} \; \mathit{cond}_{2}(m_{2}, t_{2}) \; \mathit{op} \; \dots \; \mathit{op} \; \mathit{cond}_{|M_{cs}|}(m_{|M_{cs}|}, t_{|M_{cs}|})$$

where:

The expression $\mathit{cond}_{i}(m_{i}, t_{i})$ represents a condition that compares a metric $m_{i}$ with its corresponding threshold $t_{i}$ using relational operators such as $<$, $>$, or $==$.
The symbol $\mathit{op}$ denotes a logical operator used to combine the conditions, where ∧ stands for AND, ∨ stands for OR, and ¬ stands for NOT.

For instance, the function representing the code smell God Class is defined as follows:

$$f_{\mathit{godclass}} = (M_{\mathit{LOC}} > 500) \land (M_{\mathit{TCC}} < \mathit{AVG}_{\mathit{TCC}})$$

This means that a class sample is considered a God Class if it contains more than 500 lines of code and its Tight Class Cohesion (TCC) value is lower than the average of all classes’ TCC in the project. The function $f_{cs}$ is dependent on the metric values of a specific sample. In this research, a sample is defined as a software component within a particular project (e.g., class, function). This sample is characterized by unique metrics and specific code smells. Mathematically, we represent the set of all such samples as $S = \{s_{11}, s_{12}, \dots, s_{ij}, \dots\}$, where $s_{ij}$ represents the $j^{th}$ software component in the $i^{th}$ project. For each sample $s_{ij}$, let $M_{ij} = \{m_{ij_{1}}, m_{ij_{2}}, \dots, m_{ij_{n}}\}$ be the set of metric values specific to $s_{ij}$.
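The rule form $f_{cs}$ can be sketched in code as follows. This is a minimal illustration restricted to AND-connected conditions. The names `Condition` and `evaluate_smell`, and the example metric and threshold values, are assumptions introduced here and are not part of the paper.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Condition:
    """One cond_i(m_i, t_i): compares a metric value with its threshold."""
    compare: Callable[[float, float], bool]

    def holds(self, metric: float, threshold: float) -> bool:
        return self.compare(metric, threshold)


def evaluate_smell(conditions: Sequence[Condition],
                   metrics: Sequence[float],
                   thresholds: Sequence[float]) -> bool:
    """Evaluate f_cs(M_cs, T_cs) for AND-connected conditions."""
    if not (len(conditions) == len(metrics) == len(thresholds)):
        raise ValueError("M_cs, T_cs and the condition list must align")
    return all(c.holds(m, t) for c, m, t in zip(conditions, metrics, thresholds))


# God Class: LOC > t_1 AND TCC < t_2.
GOD_CLASS = (Condition(operator.gt), Condition(operator.lt))

# Hypothetical class with 640 LOC and TCC 0.30; project-average TCC 0.45.
is_god_class = evaluate_smell(GOD_CLASS, metrics=(640, 0.30), thresholds=(500, 0.45))
```

Keeping the thresholds as an explicit argument, rather than constants inside the rule, is what allows them to be supplied per sample.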

Contrary to common practice in the literature, where the set of thresholds for each code smell is either fixed and independent of the sample or held constant across each project regardless of the characteristics of individual software components, our approach diverges significantly. We posit that the set of thresholds for each code smell is neither constant nor solely context-dependent, but sample-dependent. To formalize this, let $M_{ij}^{cs} \subseteq M_{ij}$ be the ordered set of metrics of sample $s_{ij}$ relevant to code smell $cs$, and let $T_{ij}^{cs}$ represent the set of thresholds specific to code smell $cs$ and sample $s_{ij}$. For example, consider two samples, $s_{11}$ and $s_{21}$, in two different projects. Sample $s_{11}$ has 523 lines of code with a TCC of $0.42$, while $s_{21}$ has 567 lines of code with a TCC of $0.51$. The average TCC for Project 1 is $0.52$, and for Project 2 it is $0.54$. Based on the thresholds used in the literature, both samples are identified as having the God Class code smell. However, applying our method yields thresholds that differ from the fixed ones. Specifically, for $s_{11}$, the ordered set of thresholds is $\{578, 0.48\}$, and for $s_{21}$, it is $\{534, 0.63\}$. Consequently, $s_{21}$ exhibits the God Class code smell ($567 > 534$ and $0.51 < 0.63$), whereas $s_{11}$ does not (its 523 lines of code do not exceed the threshold of 578).
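The God Class example above can be checked with a few lines of Python. This is a minimal sketch, not the authors’ implementation: the function name `is_god_class` and the dictionary layout are illustrative assumptions, while the numeric values are the ones quoted in the text.

```python
def is_god_class(loc, tcc, loc_threshold, tcc_threshold):
    """God Class rule: LOC above its threshold AND TCC below its threshold."""
    return loc > loc_threshold and tcc < tcc_threshold


# Samples from the running example: lines of code, TCC, and project-average TCC.
samples = {
    "s11": {"loc": 523, "tcc": 0.42, "avg_tcc": 0.52},
    "s21": {"loc": 567, "tcc": 0.51, "avg_tcc": 0.54},
}

# Per-sample thresholds quoted in the text, ordered as (LOC, TCC).
per_sample_thresholds = {"s11": (578, 0.48), "s21": (534, 0.63)}

# Literature-style rule: fixed LOC threshold of 500, TCC compared to the project average.
fixed = {
    name: is_god_class(s["loc"], s["tcc"], 500, s["avg_tcc"])
    for name, s in samples.items()
}

# Sample-dependent rule: both thresholds come from the per-sample set.
per_sample = {
    name: is_god_class(s["loc"], s["tcc"], *per_sample_thresholds[name])
    for name, s in samples.items()
}

print(fixed)       # {'s11': True, 's21': True}
print(per_sample)  # {'s11': False, 's21': True}
```

The same detection function is used in both cases; only the thresholds passed to it change, which is the point of the sample-dependent formulation.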

Framed in machine learning terms, the shift from a global threshold to a sample-dependent threshold corresponds to producing the parameters of the detection rule conditional on the input—an instance-conditional parameterization. This is a well-established paradigm in deep learning. Hypernetworks [44] generate the parameters of one network conditional on input to another; Mixture-of-Experts architectures [45] route different inputs to different parameter subsets; and attention mechanisms produce input-dependent weightings over a fixed parameter bank. Our Thresholds Generator, introduced formally in Section 4, can be viewed through the same lens: it produces the parameters (thresholds) of the rule-based smell detectors conditional on the sample’s own metric profile. Per-sample thresholds are therefore not an ad hoc departure from the global-threshold tradition but an instance of a broader and well-studied modeling pattern.

The set of all functions that define each code smell within the set $CS$ is represented by $F_{CS} = \{f_{cs_{1}}, f_{cs_{2}}, \dots, f_{cs_{z}}\}$. The defect prediction problem, from a mathematical standpoint, involves finding a mapping between a set of features and the presence of defects. In our study, these features are the code smells identified through $F_{CS}$. Formally, the prediction model can be articulated as discovering a function $D$ that maps the set of identified code smells for a given software component onto a prediction of defect presence, denoted as $D : F_{CS} \to \{0, 1\}$. A perfect prediction model will classify each new sample to its correct class. However, in reality, prediction models are not perfect, and as a result, some samples may be misclassified. The performance of classification models can be evaluated using different measurements such as accuracy, recall, and AUC. Given a prediction model $D$ and a set of samples $S$, we denote the performance of $D$ by the function $\rho : (D, S) \to \mathbb{R}$. In this work, we focus on finding optimal thresholds to enhance the performance of our defect prediction model. The essence of our approach lies in fine-tuning the threshold values $T_{ij}^{cs}$ for each code smell and each sample $s_{ij}$, specifically to improve the model’s predictive performance. This optimization is expressed as maximizing the performance function $\rho$ over the thresholds:

$$\arg\max_{\{T_{ij}^{cs}\}} \; \rho\left(D\left(\left\{f_{cs}(M_{ij}^{cs}, T_{ij}^{cs}) \mid cs \in CS,\ M_{ij}^{cs} \subseteq M_{ij}\right\}\right), S\right)$$

where:

$M_{ij}^{cs}$ is the ordered set of metrics relevant to code smell $cs$, and
$T_{ij}^{cs}$ is the ordered set of thresholds relevant to code smell $cs$, optimized specifically for sample $s_{ij}$.

4. Methodology

In this section, the architecture of the proposed model, designed to optimize code smells’ thresholds to improve defect prediction, is elucidated. Initially, an overview of the entire model is presented (Section 4.1), followed by a comprehensive description of each component within the proposed architecture and an explanation of the model’s training process (Section 4.2).

4.1. Overview

Our proposed architecture, presented in Figure 1, is designed to generate sample-specific thresholds for code smell detection in order to improve downstream defect prediction. The architecture consists of four components: Thresholds Generator, Code Smells Calculator, Classifier, and Cooperative Critic. The Thresholds Generator receives well-known metrics, derived from the analyzed code, and produces thresholds for the various code smells contained in the Code Smells Calculator. Based on these thresholds, the Code Smells Calculator evaluates the analyzed code and determines for each code smell whether it applies or not (i.e., binary output). The output of the Code Smells Calculator is provided as input to the Classifier, which determines whether the code has bugs.

The main obstacle to using deep learning for thresholds generation, i.e., code smell thresholds optimization, is the fact that while the Thresholds Generator and the Classifier are differentiable (given that they are implemented using neural networks), the Code Smells Calculator is not. This setup prevents us from training the Thresholds Generator based on the Classifier’s performance, because we are unable to update the Thresholds Generator based on the Classifier’s output.

To overcome this problem, we introduce the fourth component of our architecture: the Cooperative Critic. Although this component is structurally similar to the discriminator in a Generative Adversarial Network (GAN), we adopt the name Cooperative Critic to emphasize that its training objective is aligned with, rather than adversarial to, that of the Thresholds Generator. The Cooperative Critic connects the Classifier and the Thresholds Generator, thus enabling us to train the latter.

It is important to make the nature of this training signal explicit. Minimizing the Cooperative Critic’s loss and minimizing the Thresholds Generator’s loss push in the same direction, both favoring a Classifier whose loss distribution the critic can confidently recognize as ‘real’. This places our architecture conceptually closer to cooperative-training frameworks such as CoopNets [37], in which the two networks are trained with aligned objectives, and to actor-critic methods [35,36], in which a differentiable critic supplies gradients to a module that is not directly reachable by end-to-end backpropagation. The broader problem of obtaining useful learning signals through non-differentiable operators is also addressed by straight-through and surrogate-gradient estimators [46,47]. The Cooperative Critic does perform a discrimination task—separating the Classifier’s true per-sample loss from a noised counterpart—but its training objective is fundamentally cooperative, and we therefore avoid the term Discriminator throughout the remainder of the paper to prevent confusion with the standard GAN setting.

4.2. Detailed Architecture

Building upon the overview above, this section describes the design and function of each component of the proposed model, and how the components work together to optimize code smell thresholds for defect prediction.

4.2.1. Thresholds Generator (TG)

The goal of the TG is to produce sample-specific code smell-based thresholds. These thresholds will influence the performance of the functions used by the Code Smells Calculator. The choice of a neural network for the TG is not stylistic but architectural: the training signal for the TG reaches it through the Cooperative Critic (Section 4.2.4), which requires the threshold-generating module to be differentiable so that gradients can be propagated back to it. Non-differentiable alternatives such as decision trees or rule-based selectors cannot participate in this gradient-based scheme. Within the space of differentiable models, the choice of a moderately sized multilayer perceptron for structured tabular inputs is consistent with recent evidence that neural networks are competitive with or outperform classical approaches on tabular learning tasks [48].

The TG, a neural network component, accepts as input the numeric metrics associated with the set of considered code smells for each sample. The numeric code metrics are normalized to a range between 0 and 1. The TG comprises four layers: the input layer, two hidden layers, and the output layer. Each hidden layer consists of 256 neurons, which allows the network to capture the complex relationships in the data. The hidden layers use the ReLU activation function, selected for its ability to maintain gradient flow, which mitigates the vanishing gradient problem and enhances the network’s learning capability. The output is a set of n thresholds, where n represents the total number of thresholds across all the considered code smells. Employing a sigmoid activation function on the output layer ensures that each threshold value is constrained between 0 and 1, making it comparable with the normalized input range.

To optimize the network’s weights and biases during training, we employ the ADAM optimizer with a learning rate of 0.0002. This optimizer is well-regarded for its performance in various neural network applications and contributes significantly to the robustness and efficiency of our model’s training process. It is important to note that both the input metrics and the resulting thresholds are specific to each sample, ensuring that the thresholds generation process is tailored to the unique characteristics of each sample.
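The TG architecture described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors’ code: the class name `ThresholdsGenerator` and the sizes `n_metrics=12` and `n_thresholds=15` are hypothetical placeholders, since the actual input and output dimensions depend on the metrics and code smells in Tables 1 and 3.

```python
import torch
import torch.nn as nn


class ThresholdsGenerator(nn.Module):
    """Two hidden layers of 256 ReLU units and a sigmoid output in [0, 1]."""

    def __init__(self, n_metrics, n_thresholds, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_metrics, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_thresholds), nn.Sigmoid(),
        )

    def forward(self, metrics):
        # metrics: (batch, n_metrics), already min-max normalized to [0, 1]
        return self.net(metrics)


# Hypothetical sizes, chosen only for illustration.
tg = ThresholdsGenerator(n_metrics=12, n_thresholds=15)
optimizer = torch.optim.Adam(tg.parameters(), lr=0.0002)

# One row of thresholds per sample: the output is sample-specific.
thresholds = tg(torch.rand(4, 12))
print(thresholds.shape)  # torch.Size([4, 15])
```

Because the output passes through a sigmoid, each generated threshold lives on the same $[0, 1]$ scale as the normalized metrics it is compared against.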

4.2.2. Code Smells Calculator (CSC)

The goal of this component is to detect code smells within each software sample. The CSC applies predefined code smell detection functions to each sample, utilizing the thresholds generated by the TG and the normalized metrics as inputs. The sensitivity of each detection function is governed by the generated thresholds, meaning that the TG is guiding the detection process. For every sample processed, the specialized functions are applied to determine the existence of code smells. The final output of this component is a binary array, where each entry signifies whether a specific code smell is present (1) or absent (0) in the sample. This representation serves as the foundational input for the defect prediction task, performed by the Classifier. It should be emphasized that the CSC is inherently non-differentiable, because it is designed to support all kinds of detection methods regardless of their implementation. While this setup ensures flexibility and future support for all detection methods, the inability to pass gradients back to the TG poses difficulties. More specifically, we are unable to directly tune the TG’s performance based on the performance of subsequent components. We address these challenges in Section 4.2.4.
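A rule-based calculator of this kind can be sketched in plain Python. The rule table below is a hypothetical stand-in (the smell names, metric names, and threshold indices are assumptions for illustration, not the definitions in Table 1); it only shows how AND-connected conditions over normalized metrics and generated thresholds produce a binary vector.

```python
import operator

# Hypothetical rule table: each smell is a list of AND-connected conditions,
# written as (metric name, comparison, index into the threshold vector).
RULES = {
    "god_class": [("loc", operator.gt, 0), ("tcc", operator.lt, 1)],
    "single_condition_example": [("nom", operator.gt, 2)],
}


def detect_smells(metrics, thresholds, rules=RULES):
    """Return a binary list with one entry per smell (1 = present, 0 = absent)."""
    return [
        int(all(compare(metrics[name], thresholds[idx])
                for name, compare, idx in conditions))
        for conditions in rules.values()
    ]


metrics = {"loc": 0.70, "tcc": 0.20, "nom": 0.10}  # normalized to [0, 1]
thresholds = [0.60, 0.35, 0.40]                    # as if produced by the TG

print(detect_smells(metrics, thresholds))  # [1, 0]
```

The hard comparisons make the output piecewise constant in the thresholds, which is why no useful gradient can flow from this step back to the TG.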

4.2.3. Classifier

The goal of the Classifier is to predict the likelihood of defects in software samples, utilizing the binary vector output from the Code Smells Calculator, which signifies the presence or absence of specific code smells. The architecture of the neural network includes an input layer that matches the number of code smells produced by the Code Smells Calculator, one hidden layer with 256 neurons, and an output layer that uses a sigmoid activation function to provide a probability indicating the likelihood of a defect. For the hidden layer, the ReLU activation function is employed to introduce non-linearity. For our loss function, we employ Binary Cross-Entropy (BCE), a standard choice for binary classification tasks. The BCE loss is defined as:

$$L(y, \hat{y}) = -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right) \tag{1}$$

where y is the true label (bugged or not bugged), and $\hat{y}$ is the predicted probability. This function quantifies the difference between the predicted and the actual labels, guiding the model to minimize this discrepancy during training. The Classifier’s parameters are optimized with the ADAM optimizer, known for its effective and efficient convergence in various neural network applications, using a learning rate of 0.001. From the Classifier’s predictions, we compute the BCE loss for each sample, i.e., without reduction. From this set of losses, we derive a new set of ‘Fake Losses’ by randomly selecting a value within $\pm 0.2$ of the actual loss. These two sets of losses will be utilized as part of the input for the next component in our model.
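The per-sample losses and their ‘Fake Losses’ can be illustrated with the standard library alone. This is a sketch under assumptions: the helper names `bce` and `fake_losses` are illustrative, the clipping constant `eps` is added only for numerical safety, and the noise is assumed to be drawn uniformly, as Section 4.2.4 describes.

```python
import math
import random


def bce(y, y_hat, eps=1e-12):
    """Per-sample Binary Cross-Entropy, as in Equation (1)."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # avoid log(0); not part of Equation (1)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def fake_losses(losses, width=0.2, rng=random):
    """Shift each real loss by a value drawn uniformly from [-width, width]."""
    return [loss + rng.uniform(-width, width) for loss in losses]


labels = [1, 0, 1]
probs = [0.9, 0.2, 0.4]

real = [bce(y, p) for y, p in zip(labels, probs)]  # no reduction: one loss per sample
fake = fake_losses(real, rng=random.Random(0))     # seeded only for repeatability

print([round(v, 4) for v in real])  # [0.1054, 0.2231, 0.9163]
```

Note that a noised loss can fall below zero when the real loss is close to zero; the text does not say whether such values are clipped, so the sketch leaves them as they are.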

4.2.4. Cooperative Critic

The goal of the Cooperative Critic is to enable the TG component to learn and improve its performance so that it can better support the Classifier. Because the two components cannot be connected directly, the Cooperative Critic serves as the missing link. The Cooperative Critic’s input consists of three elements: (a) the output of the TG (identical to the one sent to the Code Smells Calculator); (b) the value of the Classifier’s loss function (a scalar value), as presented in Equation (1); (c) a noised version of the Classifier’s loss function value, created by randomly sampling a value in the $[-0.2, 0.2]$ range and adding it to the loss value. The Cooperative Critic is then required to determine which of the two scalars is the one produced by the Classifier. The Cooperative Critic’s loss function, presented in Equation (2), and its resulting gradients are used to update both itself and the TG (specific details are provided in Section 4.2.5). As discussed in Section 4.1, the goals of the Thresholds Generator and the Cooperative Critic are aligned rather than adversarial; we do not use the Cooperative Critic’s loss to update the Classifier.

$$L_{D} = -\frac{1}{N} \sum_{i=1}^{N} \left[\log(\hat{y}_{original, i}) + \log(1 - \hat{y}_{noised, i})\right] \tag{2}$$
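Equation (2) can be evaluated directly for a small batch. In this sketch we assume that $\hat{y}_{original, i}$ and $\hat{y}_{noised, i}$ are the critic’s sigmoid outputs for the real and the noised loss of sample $i$, respectively; the text does not define them explicitly, and the function name `critic_loss` is illustrative.

```python
import math


def critic_loss(p_original, p_noised):
    """Equation (2): mean of -[log(p_original) + log(1 - p_noised)] over the batch."""
    n = len(p_original)
    total = sum(math.log(po) + math.log(1 - pn)
                for po, pn in zip(p_original, p_noised))
    return -total / n


# A critic that separates real from noised losses well ...
confident = critic_loss([0.9, 0.9], [0.1, 0.1])
# ... versus one that is at chance on both.
chance = critic_loss([0.5, 0.5], [0.5, 0.5])

print(round(confident, 4))  # 0.2107
print(round(chance, 4))     # 1.3863
```

The loss is small when real losses are scored near 1 and noised losses near 0, and it rises to $2 \log 2 \approx 1.386$ when the critic cannot tell them apart.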

Why this signal is informative. The Cooperative Critic is asked to separate a true per-sample BCE loss value from a version perturbed by uniform noise on $[-0.2, 0.2]$. When the Classifier is accurate and confident on most samples, its per-sample losses are tightly concentrated near zero, while the noised versions form a wider and offset distribution; the two are then easily separable. When the Classifier is inaccurate or uncertain, its per-sample losses are more broadly distributed, the noised counterpart overlaps them substantially, and the discrimination task approaches chance. Minimizing the Cooperative Critic’s loss on ‘real’ inputs is therefore monotonically related to driving the Classifier’s loss distribution toward zero, which is precisely the signal we want to propagate back to the TG. This style of reasoning has a well-established counterpart in the noise-contrastive estimation and density-ratio-via-classification literature [49,50], where a binary classifier trained to separate one distribution from a reference distribution is used to characterize the former.

In our setting, the same idea is used not to estimate a density ratio explicitly, but to provide a differentiable scalar signal that is correlated with Classifier quality. If the TG produces effective thresholds, the Classifier’s performance improves, and the loss distribution becomes easier to recognize; the Cooperative Critic thus synchronizes the operation of the TG and the Classifier, subsequently improving the latter’s performance. Finally, by not using the Cooperative Critic’s loss to update the Classifier, we avoid the risk of the Classifier aiming for easily recognizable rather than genuinely effective classifications, thus protecting the architecture’s performance.

Why $\pm 0.2$. The width of the noise window governs the difficulty of the Cooperative Critic’s task, which in turn governs how informative its gradient is as a signal about the Classifier’s loss distribution. Two limiting cases frame the choice. If the window is very small, real and noised losses are nearly indistinguishable regardless of Classifier quality, and the Cooperative Critic’s gradient carries almost no information about whether the Classifier has improved. If the window is very large, real and noised losses are separable at essentially any stage of training, the Cooperative Critic reaches saturation early, and its gradient again becomes uninformative. Between these extremes lies a range in which the separability of real from noised is a function of the concentration of the Classifier’s loss distribution, which is exactly the signal we want the Cooperative Critic to expose to the TG. The value $\pm 0.2$ sits inside this range given that the Classifier uses Binary Cross-Entropy on probabilistic outputs, whose per-sample loss values in practice span roughly the interval $[0, 3]$ and concentrate near zero for confident-correct predictions. At this scale, $\pm 0.2$ is large enough to matter on easy samples (where real losses near zero are clearly distinguishable from noised losses around $0.1$–$0.2$) and small enough to remain non-trivial on harder samples (where real losses are larger and the noised distribution overlaps meaningfully with them). This kind of moderate-scale additive perturbation is the standard design knob in noise-contrastive training schemes, where the noise distribution’s spread is tuned relative to the data distribution’s own spread rather than to a universal constant.

We implement the Cooperative Critic using a simple architecture, consisting of an input layer, a hidden layer with 256 neurons, and an output layer with a single neuron. The hidden layer employs ReLU activation, while the output layer uses a sigmoid. As with the other components in our architecture, we use the ADAM optimizer, here with a learning rate of 0.001.

4.2.5. Training Process

The training of our architecture consists of three phases. All phases are performed sequentially for each mini-batch.

Updating the Classifier’s parameters. The training process begins with the TG receiving a batch of input samples. For each sample in the batch, this component of our architecture produces a set of thresholds and sends them to the Code Smells Calculator. The CSC output is the input of the Classifier, which predicts the existence of bugs in the original code. The Classifier is then updated using the BCE loss function presented in Equation (1).
Updating the Cooperative Critic’s parameters. Upon obtaining the Classifier’s loss function value, we create the latter’s noised version and provide both to the Cooperative Critic. The component is tasked with determining which of the values is ‘real’. The Cooperative Critic’s loss function is presented in Equation (2).
Updating the Thresholds Generator’s parameters. The TG’s loss function combines those of the Classifier and the Cooperative Critic: $TG_{Loss} = BCE(D(l, t), 1) \times (1 + BCE(C(cs), y))$, where $BCE$ is the Binary Cross-Entropy function, $D(l, t)$ denotes the Cooperative Critic’s output, $C(cs)$ is the Classifier’s output, and y is the true label of the analyzed sample (a value of 1 indicates the code contains bugs). As explained above, this loss function ensures that as the Classifier’s loss decreases, so does the Thresholds Generator’s loss, prompting the Thresholds Generator to optimize its parameters to produce thresholds that lead to more accurate bug predictions.
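The three phases above can be sketched as a single PyTorch training step. This is an illustrative reading of the procedure, not the authors’ implementation. Several details are assumptions and are marked as such in the comments: the layer sizes, the stand-in `code_smells` rules, scoring the real and noised losses in two separate critic passes, and conditioning both passes on the same generated thresholds.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def mlp(n_in, n_out, n_hidden_layers, width=256):
    """ReLU hidden layers followed by a sigmoid output layer."""
    layers, size = [], n_in
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(size, width), nn.ReLU()]
        size = width
    return nn.Sequential(*layers, nn.Linear(size, n_out), nn.Sigmoid())


N_METRICS = N_THRESHOLDS = 4  # hypothetical: one threshold per metric
N_SMELLS = 2                  # hypothetical: two AND-connected smells

tg = mlp(N_METRICS, N_THRESHOLDS, n_hidden_layers=2)
clf = mlp(N_SMELLS, 1, n_hidden_layers=1)
critic = mlp(N_THRESHOLDS + 1, 1, n_hidden_layers=1)  # thresholds + one loss scalar
opt_tg = torch.optim.Adam(tg.parameters(), lr=2e-4)
opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
bce = nn.BCELoss(reduction="none")  # per-sample losses


def code_smells(metrics, thresholds):
    """Stand-in CSC (assumed rules): hard comparisons, so no gradient reaches the TG."""
    hit = (metrics > thresholds).float()
    return torch.stack([hit[:, 0] * hit[:, 1], hit[:, 2] * hit[:, 3]], dim=1).detach()


def train_step(metrics, labels):
    thresholds = tg(metrics)

    # Phase 1: update the Classifier with its BCE loss (Equation (1)).
    clf_loss = bce(clf(code_smells(metrics, thresholds)).squeeze(1), labels)
    opt_clf.zero_grad()
    clf_loss.mean().backward()
    opt_clf.step()

    # Real per-sample losses and noised counterparts (uniform noise in [-0.2, 0.2]).
    real = clf_loss.detach().unsqueeze(1)
    noised = real + (torch.rand_like(real) * 0.4 - 0.2)

    # Phase 2: update the Cooperative Critic (Equation (2), written with BCE terms).
    fixed_t = thresholds.detach()
    p_real = critic(torch.cat([fixed_t, real], dim=1))
    p_noised = critic(torch.cat([fixed_t, noised], dim=1))
    critic_loss = (bce(p_real, torch.ones_like(p_real))
                   + bce(p_noised, torch.zeros_like(p_noised))).mean()
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # Phase 3: update the TG; gradients reach it through the critic's threshold input.
    p = critic(torch.cat([thresholds, real], dim=1))
    tg_loss = (bce(p, torch.ones_like(p)) * (1 + real)).mean()
    opt_tg.zero_grad()
    tg_loss.backward()
    opt_tg.step()
    return clf_loss.mean().item(), critic_loss.item(), tg_loss.item()


metrics = torch.rand(32, N_METRICS)          # random stand-in for normalized metrics
labels = torch.randint(0, 2, (32,)).float()  # random stand-in for defect labels
losses = train_step(metrics, labels)
```

The key point of the sketch is Phase 3: the Classifier’s loss enters only as a detached number, and the TG is updated solely through the differentiable path from its thresholds into the critic.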

Two design choices in the TG’s loss function warrant explicit discussion. First, the additive constant in $(1 + BCE(C(cs), y))$ plays a gradient-preservation role: without it, when the Classifier becomes confident and correct on a sample (i.e., $BCE(C(cs), y) \to 0$), the entire TG loss for that sample collapses to zero regardless of the cooperative-critic term, and the TG receives no learning signal from that sample, even in cases where the thresholds could still be improved. The constant guarantees a non-vanishing baseline, so that the critic signal continues to drive learning throughout the training trajectory.

Second, the two terms are combined multiplicatively rather than by an additive weighted sum. They play qualitatively different roles: the cooperative-critic term $BCE(D(l, t), 1)$ acts as a quality gate on the generated thresholds, while $(1 + BCE(C(cs), y))$ acts as a per-sample difficulty weight that is larger on samples where the Classifier is still making errors. The multiplicative form scales the critic signal up on harder samples, yielding larger gradients there, while keeping it at baseline strength on easy ones. An additive combination, by contrast, would allow the TG to become insensitive to the critic whenever the Classifier happens to be inaccurate, because the sum would already be large regardless of the critic’s verdict. The multiplicative, difficulty-weighted pattern is in the same family as focal loss [51] and online hard example mining [52], both of which have empirically validated the benefit of multiplying a base loss by a per-sample difficulty modulator.
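The effect of the additive constant and of the multiplicative form can be seen with a few numbers. The sketch below evaluates the per-sample TG loss for an assumed critic output of 0.6; the function name `tg_loss` and the `baseline` parameter (set to 0 to remove the constant) are illustrative devices, not part of the original formulation.

```python
import math


def tg_loss(critic_out, clf_loss, baseline=1.0):
    """Per-sample BCE(D(l, t), 1) * (baseline + BCE(C(cs), y))."""
    return -math.log(critic_out) * (baseline + clf_loss)


# Easy sample: the Classifier's loss is (numerically) zero.
easy_with_constant = tg_loss(0.6, 0.0)
easy_without_constant = tg_loss(0.6, 0.0, baseline=0.0)

# Hard sample: the Classifier's loss is 1.5.
hard_with_constant = tg_loss(0.6, 1.5)

print(round(easy_with_constant, 4))  # 0.5108
print(easy_without_constant == 0)    # True
print(round(hard_with_constant, 4))  # 1.2771
```

Without the constant, the easy sample contributes nothing to the TG’s update; with it, the critic term survives at baseline strength and is scaled by 2.5 on the hard sample.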

5. Evaluation

This section outlines the experiments that were conducted to assess our threshold-learning method, which is used for predicting software defects. We detail the specific code smells selected for this study, describe the dataset, outline the preprocessing steps taken to prepare the data, and elaborate on the training details. Additionally, we define the evaluation metrics used to assess the performance of our approach, compare it against established baselines, and discuss the training process of our model. Following the explanation of the experimental setup, we will present and discuss the results obtained from our evaluation.

5.1. Experimental Setup

5.1.1. Selected Code Smells

Code smells can manifest at either the class level or the method level. To demonstrate the efficacy of our approach to learning code smell thresholds, in this study we focus exclusively on class-level code smells. The code smells considered are well documented in the literature, and the work of Sotto-Mayor et al. [7] demonstrates their value as significant features for defect prediction tasks. Based on this, we selected those code smells defined by a single condition or by multiple conditions linked by AND operators. This selective approach allows us to concentrate on code smells with specific structural attributes, enabling a more focused and relevant analysis in our research context. The foundational definitions of the code smells we investigate are informed by the standards established in the Organic [53] and DesigniteJava [54] tools. While we did not directly use these tools, their definitions have guided the criteria for code smell identification in our study. Table 1 presents the 11 code smells considered in this work, including their descriptions and the mathematical formulas for their detection as defined in Section 3.2. Our learning approach is not restricted to these 11 code smells and can be extended to others, including those with more complex conditions and method-level smells. We concentrate on these 11 because they are simple to apply and suffice to demonstrate the effectiveness of our methodology.

To support direct reproducibility of the fixed-threshold baseline used in our experiments (Section 5.1.6), Table 2 reports the numeric value used for each threshold appearing in the detection formulas of Table 1. These values follow the literature-standard definitions established in the Organic [53] and DesigniteJava [54] tools, from which our code smell definitions are derived.

5.1.2. Data Description

The dataset employed in this study originates from the research conducted by Sotto-Mayor et al. [7], encompassing data from 98 distinct projects written in Java, each with five versions. From this data, we collected numeric data detailing various metrics relevant to each class within every version of these projects. In our analysis, we focused on metrics data that are pertinent to the selected code smells.

Table 3 details all the relevant metrics, providing a description and abbreviation for each. The train–test split was based on the versions of each project; from the five versions available, four were allocated for training purposes, while the most recent version was reserved for testing. This temporal, within-project configuration follows the standard within-project defect prediction (WPDP) paradigm [55] and matches the specific split used by Sotto-Mayor et al. [7], from whom our dataset is derived. Under WPDP, a project’s past versions are used to predict defects in its future versions, so learning project-specific patterns is the intended behavior rather than an artifact of leakage; the strict temporal ordering precludes information from the held-out version reaching training.

A detailed summary of all projects, including the number of classes in both the training and testing sets, along with the percentage of defects present in each set, is provided at https://zenodo.org/records/15294379 (accessed on 28 April 2025).

5.1.3. Preprocessing Steps

In Section 5.1.2, we mentioned the use of metric data for each class. Numeric metrics provide quantifiable data, such as lines of code or number of fields, with values that can exist on various scales. To standardize the numeric metrics, we applied min-max normalization to each metric separately within each version of each project. This normalization is essential not only for improving the neural network’s ability to interpret these values but also for making metric values comparable in context across different projects and versions. The formula for min-max normalization is as follows:

$$m' = \frac{m - \min(m_{v,p})}{\max(m_{v,p}) - \min(m_{v,p})}$$

where:

$m'$ is the normalized metric value.
$m$ is the original metric value.
$\min(m_{v,p})$ is the minimum metric value in version v of project p.
$\max(m_{v,p})$ is the maximum metric value in version v of project p.

This formula normalizes each metric value m within the specific context of its version v and project p.

An examination of the data reveals a class imbalance, with the defect class being significantly underrepresented in both the training and testing sets. This is a typical scenario in datasets used for defect prediction tasks. To counteract this imbalance in the training set and enhance the model’s learning process, we employed the Synthetic Minority Over-sampling Technique (SMOTE) as implemented in imblearn.over_sampling (imbalanced-learn version 0.13.1, Python version 3.14.0). This approach generates synthetic samples for the minority defect class in the feature space, avoiding mere duplication of samples and enriching the training data while leaving the test set unchanged. By applying SMOTE solely to the training set, we aim to achieve a balanced distribution of classes, enabling the neural network to learn from a more proportionate representation of both classes. Ensuring a balanced dataset is essential for preventing the model from developing a bias toward the more common non-defect class, thus improving its predictive accuracy on the test set, which remains unaltered to provide an authentic evaluation of the model’s performance.
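The per-version min-max normalization described earlier in this subsection can be sketched with the standard library. The row layout, the function name `min_max_by_version`, and the handling of a constant metric (mapped to 0.0 when the maximum equals the minimum) are assumptions for illustration; the text does not say how that degenerate case is treated.

```python
from collections import defaultdict


def min_max_by_version(rows, metric):
    """Min-max normalize one metric within each (project, version) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["project"], row["version"])].append(row[metric])

    normalized = []
    for row in rows:
        values = groups[(row["project"], row["version"])]
        lo, hi = min(values), max(values)
        # Assumption: a metric that is constant within a version maps to 0.0.
        normalized.append(0.0 if hi == lo else (row[metric] - lo) / (hi - lo))
    return normalized


rows = [
    {"project": "p1", "version": 1, "loc": 100},
    {"project": "p1", "version": 1, "loc": 300},
    {"project": "p1", "version": 1, "loc": 500},
    {"project": "p2", "version": 1, "loc": 50},
    {"project": "p2", "version": 1, "loc": 150},
]

print(min_max_by_version(rows, "loc"))  # [0.0, 0.5, 1.0, 0.0, 1.0]
```

A class with 150 lines is the largest in the second project and maps to 1.0, while a 300-line class in the first project maps to 0.5; this is the sense in which the normalized values are contextual.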

5.1.4. Training Details

The entire model is implemented using PyTorch version 2.9, chosen for its dynamic computation capabilities and robust support for neural network training processes.

To assess robustness with respect to initialization and to support reproducibility, we trained and evaluated the model under three independent random seeds for the random number generator in PyTorch version 2.9 and scikit-learn version 1.7; the results reported in Section 5.2 are averaged across these three runs. The dataset was split into training and validation sets using scikit-learn’s train_test_split function, with 10% allocated to validation and the remainder to training, ensuring a random yet consistent distribution of data. The validation split plays a narrow role in our pipeline: it is used exclusively as the early-stopping criterion described below, and does not participate in hyperparameter selection or in any form of model selection evaluated on the same data. All hyperparameters are fixed ahead of time (see Section 4.2.5 and Section 6.1 on the random-search strategy used to select them), and final performance is reported on the held-out v5 test set that is strictly future-in-time relative to both the training and validation data.

We chose a batch size of 8192 to optimize computational efficiency and learning effectiveness. Early stopping monitored the generator loss on the validation set: training halted if this loss did not improve for 20 epochs, which limits overfitting and supports the model’s generalizability. As an additional safeguard, training is capped at a maximum of 200 epochs; in combination with the early-stopping criterion, typical runs terminate within 40–80 epochs.
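The interaction between the 20-epoch patience and the 200-epoch cap can be sketched as follows. The function name `run_with_early_stopping` and the synthetic loss sequences are illustrative assumptions; in the actual pipeline the monitored value would be computed on the validation split after each epoch.

```python
def run_with_early_stopping(loss_per_epoch, patience=20, max_epochs=200):
    """Return (last epoch run, best loss seen). Epochs are 1-indexed."""
    best, since_best, epoch = float("inf"), 0, 0
    for epoch, loss in enumerate(loss_per_epoch[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0       # improvement: reset the counter
        else:
            since_best += 1
            if since_best >= patience:       # 20 epochs without improvement
                break
    return epoch, best


# Loss improves for 30 epochs, then plateaus at a worse value.
plateau = [1.0 / e for e in range(1, 31)] + [0.05] * 300
# Loss keeps improving; only the epoch cap can stop training.
always_improving = [1.0 / e for e in range(1, 501)]

print(run_with_early_stopping(plateau)[0])           # 50
print(run_with_early_stopping(always_improving)[0])  # 200
```

In the first case training stops 20 epochs after the last improvement; in the second, the 200-epoch cap is the binding constraint.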

5.1.5. Evaluation Metrics

In evaluating the performance of our model, which is designed to generate optimal thresholds for detecting code smells, our focus is on assessing a defect prediction classifier. This Classifier takes as input the code smells calculated from the generated thresholds. We aim to determine the impact of these thresholds on the Classifier’s performance in defect prediction. Considering the imbalanced nature of our dataset, where defects constitute a minority class, selecting appropriate evaluation metrics is vital to understanding the model’s predictive performance. Hence, we emphasize the F1-score, AUC-ROC, and AUC-PRC metrics, which together offer a comprehensive evaluation in this scenario. The F1-score is the harmonic mean of precision and recall, where precision is the ratio of true positives to the sum of true positives and false positives, and recall is the ratio of true positives to the sum of true positives and false negatives:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{3}$$

The AUC-ROC (Area Under the ROC curve) assesses the classifier’s ability to distinguish between the classes at various classification thresholds. The ROC curve, plotting the true positive rate (TPR) against the false positive rate (FPR) across different classification thresholds, offers insight into the classifier’s discriminative capacity. The AUC value, within a range of 0 to 1, reflects the classifier’s effectiveness in distinguishing between the classes, with values closer to 1 indicating higher discrimination ability:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}$$

Similarly, the AUC-PRC (Area Under the Precision-Recall Curve) is particularly valuable for our imbalanced dataset. This metric focuses on the precision-recall trade-off, providing insights into the model’s performance in the context of a class imbalance. The Precision-Recall curve plots precision against recall for different classification threshold values, and the area under this curve represents the model’s effectiveness in identifying rare defect instances amidst a large number of non-defect instances. These metrics are calculated using the scikit-learn library (version 1.7) in Python 3.14.0, which provides a standardized toolkit for evaluating machine learning models.
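The point-wise quantities in Equation (3) and the TPR/FPR definitions can be computed from a confusion matrix as shown below. The counts are made up for illustration and the function name `classification_metrics` is an assumption; the zero-denominator guards are a convenience added here rather than something specified in the text. The two AUC values additionally require sweeping the classification threshold and are not reproduced in this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall (= TPR), F1, and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "tpr": recall, "fpr": fpr}


# Hypothetical imbalanced test set: 50 defective classes out of 1000.
m = classification_metrics(tp=30, fp=20, tn=930, fn=20)

print(round(m["precision"], 3), round(m["recall"], 3), round(m["f1"], 3))  # 0.6 0.6 0.6
print(round(m["fpr"], 4))  # 0.0211
```

With these counts, 95% of classes are non-defective, so a low FPR coexists with a precision of only 0.6; this is why precision-recall based metrics are emphasized for imbalanced defect data.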

5.1.6. Baseline Comparisons

To evaluate the impact of our optimized thresholds on defect classification, we trained a baseline model using the same Classifier and code smells but with traditional thresholds from the literature. We then compared both models on the same test set, isolating the effect of threshold adjustment. For a comprehensive evaluation, we calculated all the metrics outlined in Section 5.1.5 for each version in the test set for both the baseline and proposed models. Subsequently, a t-test was conducted to statistically compare the performance metrics of the two models, providing insights into the efficacy of our proposed approach in the context of defect classification.

The choice of literature-standard fixed thresholds as our baseline is deliberate. Our contribution concerns the threshold-setting step specifically, so the appropriate comparison holds every other element of the pipeline constant—code smell definitions, Classifier architecture, and feature set—and varies only the thresholds. The experiment therefore functions as a controlled ablation of the threshold-setting component: it directly isolates the effect of replacing fixed, literature-established thresholds with per-sample thresholds produced by our Thresholds Generator. Alternative threshold-setting schemes proposed in the literature, such as the distribution- and quartile-based approach of Fontana et al. [13], also assign thresholds at the project level rather than the sample level, and therefore remain on the same side of the global/per-sample divide as the baseline we use. A direct head-to-head comparison against per-project quartile thresholds is a natural next step and is noted as future work in Section 7.

5.2. Results

In this section, we present the comparative results of defect prediction performance using two different sets of thresholds for code smells, as outlined in Section 5.1.6. The first set consists of our optimally generated thresholds, while the second set comprises the original, baseline thresholds traditionally used in the field. The analysis aims to demonstrate the impact of these differing thresholds on the effectiveness of defect prediction, highlighting how our proposed thresholds enhance performance in terms of the F1-score, AUC-PRC, and AUC-ROC metrics. The findings are depicted in Figure 2, which illustrates the comparative performance metrics for defect prediction using the two sets of thresholds. Additionally, this figure provides the t-test p-values, offering a clear and concise quantification of the performance disparities. The comparative analysis shows that our optimized thresholds yield a higher F1-score than traditional baselines. Since the F1-score balances precision and recall, this improvement indicates our model more effectively avoids both false positives and false negatives. Such balance is essential in defect prediction, enhancing both the accuracy and practical reliability of the model in real-world software development.

Additionally, the superior AUC-ROC and AUC-PRC scores of our optimized thresholds further emphasize their robustness in defect prediction. The AUC-ROC score, representing the model’s ability to distinguish between the classes across all possible thresholds, highlights our model’s effectiveness in identifying defects without being swayed by the imbalance in class distribution. A higher AUC-ROC indicates that our model can reliably separate defective cases from non-defective ones. Furthermore, the AUC-PRC, or the area under the precision-recall curve, is especially informative in our context of imbalanced data, where positive instances (defects) are less common. A higher AUC-PRC score signifies that our model not only predicts defects accurately but does so with a high degree of confidence, maintaining high precision even as recall varies. This is crucial in software defect prediction, where the precision of the positive predictions (defects) often matters more than the overall accuracy.

Two further observations about the statistical status of these results are worth making explicit. First, with N = 98 projects, the central limit theorem ensures that the t-statistic is approximately normally distributed even under moderate departures from normality of the underlying per-project metric distributions; project-level independence is satisfied, as the 98 projects are distinct codebases developed by different teams. More importantly for interpreting the reported tests, the p-values in Figure 2 range from p = 6.00 × 10⁻⁶ (F1) to p = 1.53 × 10⁻²² (AUC-PRC), the latter corresponding to roughly ten standard deviations of separation between the two means under the null. This is so far from conventional significance thresholds that even severe violations of the t-test’s nominal assumptions (of the kind that might inflate the effective p-value by many orders of magnitude) would leave the qualitative conclusion intact. Second, the practical magnitude of the effect is directly visible in the reported means. For AUC-PRC in particular, our method achieves approximately 0.28 against the baseline’s approximately 0.17, a relative improvement of roughly 65%, with comparable separation on F1 and AUC-ROC. These gains are consistent across the three independent random seeds used for training.
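The two numerical claims in this paragraph can be checked with a short calculation. The sketch below converts a p-value into the equivalent number of standard deviations under a normal approximation and computes the relative improvement from the two reported means. It assumes the reported p-values are two-sided, which the text does not state, and the helper names are hypothetical.

```python
from statistics import NormalDist


def z_from_two_sided_p(p_value: float) -> float:
    """Absolute z-score whose two-sided standard normal tail probability is p_value."""
    return -NormalDist().inv_cdf(p_value / 2.0)


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


z_auc_prc = z_from_two_sided_p(1.53e-22)  # AUC-PRC p-value reported in Figure 2
z_f1 = z_from_two_sided_p(6.00e-6)        # F1 p-value reported in Figure 2
gain = relative_improvement(0.28, 0.17)   # approximate AUC-PRC means from the text
```

Under these assumptions the AUC-PRC p-value corresponds to a z-score just below ten, in line with the "roughly ten standard deviations" stated above, and the relative gain evaluates to about 0.65.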

It is also worth making explicit that the baseline comparison itself constitutes a natural ablation of the novel components of our architecture. Removing the Thresholds Generator reduces the pipeline to a Classifier operating on code smells detected under fixed, literature-established thresholds, which is exactly the baseline we report against. Removing the cooperative critic eliminates the gradient path through which the Thresholds Generator is trained (the Code Smells Calculator being non-differentiable), leaving the Generator’s outputs effectively untrained and strictly inferior to the fixed-threshold baseline. Removing the Classifier would eliminate the defect-prediction objective altogether. The three novel components are therefore not independently removable; the architecture is minimal with respect to the task by construction, and the reported gains over the baseline are directly attributable to the Thresholds Generator and Cooperative Critic acting jointly.

Our evaluation shows that the thresholds generated by our model significantly improve defect prediction compared with traditional methods. This supports our hypothesis that tailoring code smell thresholds to each sample yields more relevant and predictive indicators of defects. As a result, our approach enhances both the accuracy and practical value of defect prediction in software systems. In conclusion, the defect prediction test case validates our threshold optimization approach: it confirms our research hypothesis and demonstrates the practical benefit of optimizing code smell thresholds for a specific software practice.

6. Threats to Validity

6.1. Internal Validity

The internal validity of our study is essential to ensure that the findings accurately reflect the relationships being explored without being influenced by external factors.

In our research, internal validity might be impacted by our methodology for hyperparameter optimization. We utilized a random search strategy to identify the hyperparameter settings that lead to the best results. Although random search is a widely acknowledged and efficient technique for hyperparameter optimization, it does not comprehensively explore the full parameter space. As a result, while the identified settings are considered optimal within the context of our study, they may not represent the absolute best configuration across all possible scenarios.
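The limitation described above can be made concrete with a minimal random search sketch. The search space, the parameter names (`learning_rate`, `hidden_units`), and the toy scoring function are all hypothetical; they do not reproduce the hyperparameters or the objective used in this study.

```python
import random
from typing import Callable, Dict, List, Tuple


def random_search(
    space: Dict[str, List[float]],
    score: Callable[[Dict[str, float]], float],
    n_trials: int,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Sample n_trials configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg: Dict[str, float] = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        current = score(cfg)
        if current > best_score:
            best_cfg, best_score = cfg, current
    return best_cfg, best_score


# Hypothetical 3 x 3 search space and toy objective (higher is better, maximum 0).
space = {"learning_rate": [1e-4, 1e-3, 1e-2], "hidden_units": [32, 64, 128]}


def toy_score(cfg: Dict[str, float]) -> float:
    return -(abs(cfg["learning_rate"] - 1e-3) * 1000 + abs(cfg["hidden_units"] - 64) / 64)


best_cfg, best_score = random_search(space, toy_score, n_trials=4)
```

With four trials over nine possible configurations, the search visits at most four of them, so the returned configuration is the best among those sampled and not necessarily the best in the space.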

A second internal validity consideration, specific to architectures of our type, is the non-convex and coupled nature of the training objective: the Thresholds Generator, the Cooperative Critic, and the Classifier are optimized jointly, and the resulting coupled system admits no general convergence guarantees. This property is shared with essentially all coupled-network architectures in current use, including standard GANs and actor-critic systems, for which formal convergence results are likewise unavailable. We therefore rely on empirical stability evidence. Training is early-stopped on validation loss to prevent divergence, and, as described in Section 4.2.5, the model was trained and evaluated under three independent random seeds with the reported results averaged across these runs. The consistency of the gains across 98 distinct projects, with p-values ranging from 10⁻⁶ to 10⁻²², is itself strong empirical evidence against an unstable optimization: an unstable training procedure would not yield gains of this magnitude and consistency across such a diverse project portfolio. We nevertheless acknowledge that three seeds do not exhaust the space of possible initializations; broader seed coverage would further strengthen the robustness claim.

A final internal validity point concerns the fairness of the comparison between our method and the fixed-threshold baseline. Both arms use an identical Classifier architecture, the same feature representation (the outputs of the Code Smells Calculator), and the same training procedure for the Classifier. Only the source of the thresholds differs—learned per sample in our method, fixed from the literature in the baseline. Any overfitting tendency introduced by the neural Classifier therefore affects both arms equally and cannot account for the observed performance gap.

6.2. External Validity

The external validity of our research concerns the generalizability and applicability of our findings across different contexts and settings. First, we did not conduct a comparative analysis with other state-of-the-art defect prediction models. The reason is that such a comparison falls outside the scope of our current research objectives. The contribution of our paper is the introduction of a novel model to establish code smell thresholds per sample, which we demonstrate with respect to defect prediction. For this reason, instead of comparing against broad prediction models from the literature, we compare our method against thresholds that are set in advance, independently of the individual sample. Furthermore, our analysis is confined to a set of 11 code smells in order to show the validity of our approach, limiting direct comparisons with models that encompass a broader array of code smells.

A second external validity consideration concerns the choice of dataset. Our study leverages data from prior research, specifically open-source Java projects from Apache, which in principle limits the generalizability of our findings. Several features of the study design nonetheless mitigate this concern. First, we adopt the dataset of Sotto-Mayor et al. [7] specifically to enable direct comparability with established prior work on smell-based defect prediction. Second, although all 98 projects come from the Apache ecosystem, they span a broad range of application domains—among them web servers, databases, scientific libraries, build tools, and machine-learning frameworks—providing substantial internal diversity even within a single ecosystem.

Third, Java-based Apache projects constitute a well-established evaluation substrate for defect-prediction research, supported by reproducibility-focused resources such as PROMISE, and our corpus of 98 projects with five versions each is considerably larger than that used in many recent studies in this area [56,57]. Fourth, the 11 code smells considered in this work are defined over language-agnostic object-oriented metrics (LOC, LCOM, TCC, FANIN, FANOUT, DIT, and related quantities) that exist in any object-oriented language, so conceptual portability to C#, C++, or Kotlin is direct—only the metric-extraction tooling would need to change. Cross-language evaluation is a concrete direction for future work. The reliance on specialized tools for metrics extraction and the defect-labeling algorithm remains a residual threat to validity, but is mitigated by our use of the validated methodology of Sotto-Mayor et al. [7], whose work provides a solid foundation for our data pipeline.

A related but distinct generalization question is cross-project defect prediction (CPDP), which evaluates whether a model trained on some projects can predict defects on entirely new, previously unseen projects. The experimental configuration used in this work is the standard within-project defect prediction (WPDP) setup described in Section 5.1.2, in which a project’s own past versions are used to predict defects in its future versions. CPDP is a strictly different task [58], and extending our per-sample Thresholds Generator to the CPDP setting is a natural follow-up study.
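The difference between the two settings can be sketched in terms of how training and test data are split. The sketch assumes five chronologically ordered versions per project, with the first four used for training and the fifth for testing. The type aliases, function names, and toy projects are hypothetical, and whether training data is pooled across projects is not specified here.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, int]       # (class identifier, defect label); placeholder type
Versions = List[List[Sample]]  # chronologically ordered versions of one project


def within_project_split(versions: Versions) -> Tuple[List[Sample], List[Sample]]:
    """WPDP: the first four versions form the training set, the fifth the test set."""
    if len(versions) != 5:
        raise ValueError("expected exactly five versions per project")
    train = [sample for version in versions[:4] for sample in version]
    test = list(versions[4])
    return train, test


def cross_project_split(
    projects: Dict[str, Versions], held_out: str
) -> Tuple[List[Sample], List[Sample]]:
    """CPDP: train on every other project, test on a project unseen during training."""
    train = [
        sample
        for name, versions in projects.items()
        if name != held_out
        for version in versions
        for sample in version
    ]
    test = [sample for version in projects[held_out] for sample in version]
    return train, test


# Hypothetical toy projects with one class per version.
projects = {
    "alpha": [[("A.v1", 0)], [("A.v2", 1)], [("A.v3", 0)], [("A.v4", 0)], [("A.v5", 1)]],
    "beta": [[("B.v1", 0)], [("B.v2", 0)], [("B.v3", 1)], [("B.v4", 0)], [("B.v5", 0)]],
}
wp_train, wp_test = within_project_split(projects["alpha"])
cp_train, cp_test = cross_project_split(projects, held_out="beta")
```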

6.3. Construct Validity

In our study, construct validity concerns how effectively our chosen approach captures the essence of code smells and their relationship with defect prediction. Our examination is focused on a distinct set of 11 class-level code smells that exclusively utilize AND operators. This selection, elaborated in Section 5.1.1, is targeted for our current scope, with the anticipation of expanding to include more code smells in future work. While our study adopts code smell definitions from the Organic [53] and DesigniteJava [54] tools, as utilized in Sotto-Mayor et al.’s analysis, this approach establishes a specific framework for our construct validity. It is important to recognize that these definitions represent just one perspective within a broader field where multiple definitions and interpretations of code smells exist. Our focus on these particular definitions highlights their influence on our interpretation of code smells, acknowledging that our research is situated within a larger context of varying perspectives on what constitutes a code smell.

A further construct validity consideration concerns the interpretability of per-sample thresholds. A global threshold offers a single auditable number per smell that practitioners can inspect directly; per-sample thresholds do not summarize to such a single number, and in that specific sense, some explanatory parsimony is lost. Two considerations mitigate this concern. First, each per-sample threshold is itself a scalar with a clear semantic meaning—for example, “for this class, the LOC cutoff used to evaluate Large Class was 578”—so the local, per-prediction interpretation remains transparent. Second, the distribution of thresholds produced by the model across samples can be examined post hoc to characterize how cutoffs vary with software-component properties, providing an aggregate-level view that complements the per-sample one. The shift introduced by our approach is therefore better understood as a move from a single global summary to a pair of complementary views—per-sample thresholds and their distribution over samples—rather than as a pure loss of interpretability.
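The two complementary views described above can be sketched as follows: a per-sample evaluation of Large Class (LOC > T) in which each class carries its own threshold, and a post hoc summary of how those thresholds are distributed. The function names and all numeric values are hypothetical, apart from the cutoff of 578 borrowed from the example in the text.

```python
import statistics
from typing import Dict, List, Tuple


def large_class_flags(samples: List[Tuple[int, float]]) -> List[bool]:
    """Apply Large Class (LOC > T) with a separate threshold T for every class."""
    return [loc > threshold for loc, threshold in samples]


def threshold_summary(thresholds: List[float]) -> Dict[str, float]:
    """Aggregate view: minimum, quartiles, and maximum of the per-sample thresholds."""
    q1, median, q3 = statistics.quantiles(thresholds, n=4)
    return {
        "min": min(thresholds),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(thresholds),
    }


# Hypothetical (LOC, generated threshold) pairs for five classes.
samples = [(620, 578.0), (300, 410.0), (900, 850.0), (450, 455.0), (1200, 700.0)]
flags = large_class_flags(samples)
summary = threshold_summary([threshold for _, threshold in samples])
```

The flags give the local, per-prediction view, while the summary gives the aggregate view that could be related to software-component properties in a fuller analysis.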

7. Conclusions and Future Work

In this study, we proposed a novel method for optimizing code smell thresholds using a neural network model trained with a cooperative-critic feedback mechanism. The model dynamically generates thresholds based on class-level metric values, which are then used to detect code smells. These smells are passed to a defect prediction model, and the prediction performance guides threshold refinement. We evaluated our approach on 98 software projects, each with five versions, using the first four for training and the fifth for testing. To benchmark our method, we trained a baseline defect prediction model using code smells identified via traditional thresholds. Results show that our learned thresholds yield significantly better defect prediction performance across F1-score, AUC-PRC, and AUC-ROC metrics. Overall, the defect prediction test case strongly supports the validity of our code smell thresholds optimization approach. This test case, evaluated on 11 class-level smells defined by single or AND-connected conditions, demonstrates that our model can tailor thresholds to the unique characteristics of each software system and improve defect prediction accordingly. Extending these benefits to a broader range of software practices that use code smells is a direction our approach is designed to support rather than one the present evaluation already demonstrates.

Several directions for future research naturally follow from the present study. First, the range of code smells considered can be extended to include more complex and method-level smells, in addition to the 11 class-level smells with AND operators considered here. Second, a per-project stability analysis (distributional plots of per-project F1, AUC-ROC, and AUC-PRC under both arms, together with per-project paired-difference statistics across the 98 projects) would complement the aggregated results reported in Figure 2 and give a fuller picture of stability across the project portfolio. Third, a direct head-to-head comparison against distribution- and quartile-based thresholding schemes, most notably the per-project approach of Fontana et al. [13], would place our per-sample method within the broader landscape of data-driven threshold-setting strategies. Fourth, extending our Thresholds Generator to the cross-project defect prediction (CPDP) setting, in which the model must generalize to projects unseen during training, would complement the within-project results reported here. Finally, exploring how this threshold-optimization approach may benefit other software engineering practices that utilize code smells, such as software reconfiguration, represents a promising avenue for further investigation.

Author Contributions

Conceptualization, T.M., G.K. and M.K.; Methodology, T.M. and G.K.; Software, T.M.; Validation, T.M.; Resources, T.M.; Data curation, T.M.; Writing—original draft, T.M.; Writing—review & editing, G.K. and M.K.; Supervision, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are available at https://zenodo.org/records/15294379, accessed on 28 April 2025. The code of this project is available at https://github.com/BGU-AiDnD/CodeSmellsThresholdsOptimization-TomMashiach, accessed on 1 May 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Becker, P.; Fowler, M.; Beck, K.; Brant, J.; Opdyke, W.; Roberts, D. Refactoring: Improving the Design of Existing Code; Addison-Wesley Professional: Boston, MA, USA, 1999.
2. Brown, W.H.; Malveau, R.C.; McCormick, H.W.S.; Mowbray, T.J. AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1998.
3. Piotrowski, P.; Madeyski, L. Software Defect Prediction Using Bad Code Smells: A Systematic Literature Review. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Berlin/Heidelberg, Germany, 2020; Volume 40, pp. 77–99.
4. Ma, W.; Chen, L.; Zhou, Y.; Xu, B. Do We Have a Chance to Fix Bugs When Refactoring Code Smells? In Proceedings of the 2016 International Conference on Software Analysis, Testing and Evolution (SATE), Kunming, China, 3–4 November 2016; pp. 24–29.
5. Taba, S.E.S.; Khomh, F.; Zou, Y.; Hassan, A.E.; Nagappan, M. Predicting Bugs Using Antipatterns. In Proceedings of the 2013 IEEE International Conference on Software Maintenance, Eindhoven, The Netherlands, 22–28 September 2013; pp. 270–279.
6. Sotto-Mayor, B.; Kalech, M. Cross-project smell-based defect prediction. Soft Comput. 2021, 25, 14171–14181.
7. Sotto-Mayor, B.; Elmishali, A.; Kalech, M.; Abreu, R. Exploring Design smells for smell-based defect prediction. Eng. Appl. Artif. Intell. 2022, 115, 105240.
8. Foucault, M.; Palyart, M.; Falleri, J.R.; Blanc, X. Computing contextual metric thresholds. In Proceedings of the 29th Annual ACM Symposium on Applied Computing, Gyeongju, Republic of Korea, 24–28 March 2014; pp. 1120–1125.
9. Shatnawi, R.; Li, W.; Swain, J.; Newman, T. Finding software metrics threshold values using ROC curves. J. Softw. Maint. Evol. Res. Pract. 2010, 22, 1–16.
10. Oliveira, P.; Valente, M.T.; Lima, F.P. Extracting relative thresholds for source code metrics. In Proceedings of the 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), Antwerp, Belgium, 3–6 February 2014; pp. 254–263.
11. Alves, T.L.; Ypma, C.; Visser, J. Deriving metric thresholds from benchmark data. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, Timișoara, Romania, 12–18 September 2010.
12. Ferreira, K.A.; Bigonha, M.A.; Bigonha, R.S.; Mendes, L.F.; Almeida, H.C. Identifying thresholds for object-oriented software metrics. J. Syst. Softw. 2012, 85, 244–257.
13. Fontana, F.A.; Ferme, V.; Zanoni, M.; Yamashita, A. Automatic metric thresholds derivation for code smell detection. In Proceedings of the IEEE/ACM 6th International Workshop on Emerging Trends in Software Metrics, Florence, Italy, 17 May 2015; pp. 44–53.
14. Liu, H.; Liu, Q.; Niu, Z.; Liu, Y. Dynamic and automatic feedback-based threshold adaptation for code smell detection. IEEE Trans. Softw. Eng. 2015, 42, 544–558.
15. Palomba, F.; Zanoni, M.; Fontana, F.A.; De Lucia, A.; Oliveto, R. Smells like teen spirit: Improving bug prediction performance using the intensity of code smells. In Proceedings of the 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), Raleigh, NC, USA, 2–7 October 2016.
16. Moha, N.; Gueheneuc, Y.G.; Duchien, L.; Le Meur, A.F. DECOR: A Method for the Specification and Detection of Code and Design Smells. IEEE Trans. Softw. Eng. 2009, 36, 20–36.
17. Marinescu, C.; Marinescu, R.; Mihancea, P.; Ratiu, D.; Wettel, R. iPlasma: An Integrated Platform for Quality Assessment of Object-Oriented Design. In Proceedings of the IEEE International Conference on Software Maintenance-Industrial & Tool Volume, Budapest, Hungary, 25–30 September 2005.
18. Danphitsanuphan, P.; Suwantada, T. Code Smell Detecting Tool and Code Smell-Structure Bug Relationship. In Proceedings of the 2012 Spring Congress on Engineering and Technology, Xi’an, China, 27–30 May 2012; pp. 1–5.
19. Tsantalis, N.; Chaikalis, T.; Chatzigeorgiou, A. JDeodorant: Identification and Removal of Type-Checking Bad Smells. In Proceedings of the 2008 European Conference on Software Maintenance and Reengineering, Athens, Greece, 1–4 April 2008.
20. Suryanarayana, G.; Samarthyam, G.; Sharma, T. Refactoring for Software Design Smells: Managing Technical Debt; Morgan Kaufmann: Burlington, MA, USA, 2014.
21. Paterson, D.; Campos, J.; Abreu, R.; Kapfhammer, G.M.; Fraser, G.; McMinn, P. An Empirical Study on the Use of Defect Prediction for Test Case Prioritization. In Proceedings of the 2019 12th IEEE Conference on Software Testing, Validation and Verification, Xi’an, China, 22–27 April 2019; pp. 346–357.
22. Wang, S.; Liu, T.; Tan, L. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 14–22 May 2016; pp. 297–308.
23. Li, J.; He, P.; Zhu, J.; Lyu, M.R. Software defect prediction via convolutional neural network. In Proceedings of the 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), Prague, Czech Republic, 25–29 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 318–328.
24. Hoang, T.; Dam, H.K.; Kamei, Y.; Lo, D.; Ubayashi, N. DeepJIT: An end-to-end deep learning framework for just-in-time defect prediction. In Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montréal, QC, Canada, 26–27 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 34–45.
25. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 1536–1547.
26. Sheng, V.S.; Ling, C.X. Thresholding for making classifiers cost-sensitive. In Proceedings of the AAAI, Boston, MA, USA, 16–20 July 2006; Volume 6, pp. 476–481.
27. Zou, Q.; Xie, S.; Lin, Z.; Wu, M.; Ju, Y. Finding the best classification threshold in imbalanced classification. Big Data Res. 2016, 5, 2–8.
28. Pleiss, G.; Raghavan, M.; Wu, F.; Kleinberg, J.; Weinberger, K.Q. On fairness and calibration. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017.
29. Arcelli Fontana, F.; Mäntylä, M.V.; Zanoni, M.; Marino, A. Comparing and experimenting machine learning techniques for code smell detection. Empir. Softw. Eng. 2016, 21, 1143–1191.
30. Di Nucci, D.; Palomba, F.; Tamburri, D.A.; Serebrenik, A.; De Lucia, A. Detecting code smells using machine learning techniques: Are we there yet? In Proceedings of the 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), Campobasso, Italy, 20–23 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 612–621.
31. Kusner, M.J.; Hernández-Lobato, J.M. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv 2016, arXiv:1611.04051.
32. Yu, L.; Zhang, W.; Wang, J.; Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
33. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; PMLR: New York, NY, USA, 2017; pp. 214–223.
34. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017.
35. Konda, V.R.; Tsitsiklis, J.N. Actor-critic algorithms. In Proceedings of the 13th International Conference on Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1999.
36. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; PMLR: New York, NY, USA, 2016; pp. 1928–1937.
37. Xie, J.; Lu, Y.; Gao, R.; Zhu, S.C.; Wu, Y.N. Cooperative training of descriptor and generator networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 27–45.
38. Lessmann, S.; Baesens, B.; Mues, C.; Pietsch, S. Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings. IEEE Trans. Softw. Eng. 2008, 34, 485–496.
39. Nagappan, N.; Ball, T. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, St. Louis, MO, USA, 15–21 May 2005; pp. 284–292.
40. Radjenović, D.; Heričko, M.; Torkar, R.; Živkovič, A. Software fault prediction metrics: A systematic literature review. Inf. Softw. Technol. 2013, 55, 1397–1418.
41. Fernandes, E.; Oliveira, J.; Vale, G.; Paiva, T.; Figueiredo, E. A review-based comparative study of bad smell detection tools. In Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering, Limerick, Ireland, 1–3 June 2016; pp. 1–12.
42. Paiva, T.; Damasceno, A.; Figueiredo, E.; Sant’Anna, C. On the evaluation of code smells and detection tools. J. Softw. Eng. Res. Dev. 2017, 5, 7.
43. Hozano, M.; Garcia, A.; Fonseca, B.; Costa, E. Are you smelling it? Investigating how similar developers detect code smells. Inf. Softw. Technol. 2018, 93, 130–146.
44. Ha, D.; Dai, A.; Le, Q.V. Hypernetworks. arXiv 2016, arXiv:1609.09106.
45. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538.
46. Bengio, Y.; Léonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432.
47. Neftci, E.O.; Mostafa, H.; Zenke, F. Surrogate gradient learning in spiking neural networks. IEEE Signal Process. Mag. 2019, 36, 51–63.
48. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 2021, 34, 18932–18943.
49. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 297–304.
50. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density Ratio Estimation in Machine Learning; Cambridge University Press: Cambridge, UK, 2012.
51. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
52. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
53. Oizumi, W.; Sousa, L.; Oliveira, A.; Garcia, A.; Agbachi, A.B.; Oliveira, R.; Lucena, C. On the identification of design problems in stinky code: Experiences and tool support. J. Braz. Comput. Soc. 2018, 24, 13.
54. Sharma, T. Designite—A Software Design Quality Assessment Tool. 2016. Available online: https://zenodo.org/records/2566832 (accessed on 28 April 2025).
55. Kamei, Y.; Shihab, E.; Adams, B.; Hassan, A.E.; Mockus, A.; Sinha, A.; Ubayashi, N. A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. 2012, 39, 757–773.
56. Nevendra, M.; Singh, P. TRGNet: A deep transfer learning approach for software defect prediction. Expert Syst. Appl. 2025, 282, 127799.
57. Anand, K.; Jena, A.K.; Das, H.; Askar, S.S.; Abouhawwash, M. Software defect prediction using wrapper-based dynamic arithmetic optimization for feature selection. Connect. Sci. 2025, 37, 2461080.
58. Zimmermann, T.; Nagappan, N.; Gall, H.; Giger, E.; Murphy, B. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Amsterdam, The Netherlands, 24–28 August 2009; pp. 91–100.

Figure 1. Architecture of the Proposed Model. This diagram illustrates the interconnected components of the model, depicting the flow from the Thresholds Generator (1) through the Code Smells Calculator (2) and Classifier (3), and concluding with the feedback mechanism involving the Cooperative Critic (4). Each component’s role is delineated, emphasizing their contribution to the optimization of code smell thresholds for enhanced defect prediction.

Figure 2. Comparative performance metrics of classification results using Generated Thresholds vs. Original Thresholds. Each bar represents a metric, illustrating the enhancements in F1-score, AUC-PRC, and AUC-ROC when employing generated thresholds compared with original thresholds. p-values are displayed above each corresponding pair of bars, indicating statistical significance. All reported p-values denote significant differences, underscoring the robustness of the improved defect prediction accuracy achieved with the optimized thresholds, as elaborated in Section 5.2.

Table 1. This table enumerates the various code smells examined in the research, offering a comprehensive overview that includes the name of each code smell, a detailed description, and the mathematical formula employed for its calculation.

Code Smell Name	Description	Calculation Formula
Lazy Class	A class does not have a single, well-defined responsibility.	LOC < T
Swiss Army Knife	An abstract class has many responsibilities and functionality.	IA == 1 ∧ IMC > T
Refused Bequest	A subclass does not use or override methods or properties that are inherited from its superclass.	OR > T
Large Class	A class has grown too large and contains too many methods or properties.	LOC > T
Class Data Should Be Private	A class's data is exposed publicly, rather than being kept private and only accessible through methods.	NOPF > T
God Class	A class has too many responsibilities, and becomes too large and complex to understand and maintain.	LOC > T1 ∧ TCC < T2
Multifaceted Abstraction	A class has more than one responsibility assigned to it.	LCOM > T1 ∧ NOF > T2 ∧ NOM > T3
Unnecessary Abstraction	A class that is actually not needed (and thus could have been avoided).	NOPM == 0 ∧ NOF < T
Broken Modularization	A class that is not cohesively encapsulating its responsibilities.	NOPM == 0 ∧ NOF > T
Hub-Like Modularization	A class has dependencies (both incoming and outgoing) with a large number of other classes.	FANIN > T1 ∧ FANOUT > T2
Deep Hierarchy	A class’s inheritance hierarchy is excessively deep.	DOI > T
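
The formulas in Table 1 can be read as Boolean predicates over a class's metric values. The following is a minimal Python sketch of that reading, not the authors' implementation: the names `SMELL_RULES` and `detect_smells` and the dictionary layout are illustrative assumptions, and the metric keys follow the abbreviations exactly as written in Table 1.

```python
from typing import Callable, Dict, Mapping

Metrics = Mapping[str, float]     # metric abbreviation -> value for one class
Thresholds = Mapping[str, float]  # threshold name (T, T1, T2, T3) -> value

# One predicate per row of Table 1 (hypothetical layout, for illustration only).
SMELL_RULES: Dict[str, Callable[[Metrics, Thresholds], bool]] = {
    "Lazy Class": lambda m, t: m["LOC"] < t["T"],
    "Swiss Army Knife": lambda m, t: m["IA"] == 1 and m["IMC"] > t["T"],
    "Refused Bequest": lambda m, t: m["OR"] > t["T"],
    "Large Class": lambda m, t: m["LOC"] > t["T"],
    "Class Data Should Be Private": lambda m, t: m["NOPF"] > t["T"],
    "God Class": lambda m, t: m["LOC"] > t["T1"] and m["TCC"] < t["T2"],
    "Multifaceted Abstraction": lambda m, t: (
        m["LCOM"] > t["T1"] and m["NOF"] > t["T2"] and m["NOM"] > t["T3"]
    ),
    "Unnecessary Abstraction": lambda m, t: m["NOPM"] == 0 and m["NOF"] < t["T"],
    "Broken Modularization": lambda m, t: m["NOPM"] == 0 and m["NOF"] > t["T"],
    "Hub-Like Modularization": lambda m, t: (
        m["FANIN"] > t["T1"] and m["FANOUT"] > t["T2"]
    ),
    "Deep Hierarchy": lambda m, t: m["DOI"] > t["T"],
}


def detect_smells(
    metrics: Metrics, thresholds: Mapping[str, Thresholds]
) -> Dict[str, int]:
    """Return a 0/1 indicator per code smell for a single class."""
    return {
        name: int(rule(metrics, thresholds[name]))
        for name, rule in SMELL_RULES.items()
    }
```

Keeping the thresholds as an argument rather than as constants means the same predicates can be evaluated with either fixed or per-class generated threshold values.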

Table 2. Numeric values of the literature-standard thresholds used in the fixed-threshold baseline, corresponding to the formulas in Table 1. The values follow the definitions established in the Organic [53] and DesigniteJava [54] tools. Where a threshold is defined relative to a project-level statistic (e.g., $\mathrm{AVG}_{TCC}$ for God Class), the statistic is computed on the training data.

Code Smell	Threshold(s)	Value(s)
Lazy Class	T (LOC)	$T = 50$
Swiss Army Knife	T (IMC)	$T = 10$
Refused Bequest	T (OR)	$T = 0.33$
Large Class	T (LOC)	$T = 500$
Class Data Should Be Private	T (NOPF)	$T = 0$
God Class	$T_{1}$ (LOC), $T_{2}$ (TCC)	$T_{1} = 500$, $T_{2} = \mathrm{AVG}_{TCC}$
Multifaceted Abstraction	$T_{1}$ (LCOM), $T_{2}$ (NOF), $T_{3}$ (NOM)	$T_{1} = 0.8$, $T_{2} = 7$, $T_{3} = 7$
Unnecessary Abstraction	T (NOF)	$T = 3$
Broken Modularization	T (NOF)	$T = 3$
Hub-Like Modularization	$T_{1}$ (FANIN), $T_{2}$ (FANOUT)	$T_{1} = 20$, $T_{2} = 20$
Deep Hierarchy	T (DOI)	$T = 6$
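
As an illustration of how a project-relative threshold in Table 2 might be resolved, the sketch below computes the God Class thresholds from training data only and then applies the rule from Table 1. It assumes that the average is an arithmetic mean over the TCC values of the training classes; the function names are hypothetical and do not come from the Organic or DesigniteJava tools.

```python
from statistics import mean
from typing import Dict, Sequence

GOD_CLASS_LOC_THRESHOLD = 500  # T1 from Table 2


def resolve_god_class_thresholds(train_tcc: Sequence[float]) -> Dict[str, float]:
    """Fix T1 to the literature value and derive T2 from the training split.

    Assumption: the project-level statistic is the arithmetic mean of TCC over
    the training classes, so no test-set information reaches the threshold.
    """
    return {"T1": GOD_CLASS_LOC_THRESHOLD, "T2": mean(train_tcc)}


def is_god_class(loc: float, tcc: float, thresholds: Dict[str, float]) -> bool:
    """Apply the Table 1 rule: LOC > T1 and TCC < T2."""
    return loc > thresholds["T1"] and tcc < thresholds["T2"]


# Usage: thresholds come from training classes and are then reused unchanged
# when labelling held-out classes.
thresholds = resolve_god_class_thresholds([0.25, 0.5, 0.75])
flagged = is_god_class(loc=620, tcc=0.3, thresholds=thresholds)
```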

Table 3. The table provides an overview of the metrics utilized in this study, detailing the full name of each metric, its abbreviation, and a description that elucidates the metric’s purpose and relevance.

Metric Name	Abbreviation	Description
Lines Of Code	LOC	The number of lines of code in a class.
Is Abstract	IA	A Boolean metric that indicates whether a class is abstract (1) or not (0).
Interface Method Declaration Count	IMC	The number of methods declared in an interface.
Override Ratio	OR	The ratio of overridden methods to overridable superclass methods.
Tight Class Cohesion	TCC	The degree of relatedness of methods within a class based on their shared access to instance variables.
Lack of Cohesion in Methods	LCOM	The lack of cohesion in methods within a class.
Number Of Fields	NOF	Total number of fields in a class.
Number Of Public Fields	NOPF	Number of public fields in a class.
Number Of Methods	NOM	Total number of methods in a class.
Number Of Public Methods	NOPM	Number of public methods in a class.
FAN-IN	FANIN	The number of other classes or components that use this class.
FAN-OUT	FANOUT	The number of distinct classes that a given class uses.
Depth Of Inheritance	DOI	The length of the inheritance path from a given class to its highest ancestor class.
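
Two of the metrics in Table 3 are ratios and can be illustrated with a short Python sketch. The TCC function uses the commonly cited definition (the fraction of method pairs that share at least one instance variable), which is assumed here rather than taken from the paper's tooling; the zero-denominator conventions and the function names are likewise assumptions made for illustration.

```python
from itertools import combinations
from typing import Mapping, Set


def override_ratio(overridden: int, overridable: int) -> float:
    """OR = overridden methods / overridable superclass methods.

    Assumed convention: 0.0 when the superclass offers nothing to override.
    """
    return overridden / overridable if overridable else 0.0


def tight_class_cohesion(method_fields: Mapping[str, Set[str]]) -> float:
    """TCC as the share of method pairs that access a common instance variable.

    `method_fields` maps each method name to the set of fields it accesses.
    Assumed convention: 0.0 for classes with fewer than two methods.
    """
    pairs = list(combinations(method_fields.values(), 2))
    if not pairs:
        return 0.0
    connected = sum(1 for a, b in pairs if a & b)
    return connected / len(pairs)
```

Under this definition, a class whose methods all touch disjoint fields has a TCC of 0, which is the low-cohesion case that the God Class rule in Table 1 looks for.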

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mashiach, T.; Katz, G.; Kalech, M. Code Smells Thresholds Optimization: Defect Prediction as a Case Study. Algorithms 2026, 19, 412. https://doi.org/10.3390/a19050412

