1. Introduction
A common aspect of many software systems is the continuous change process, in which developers conduct many source code changes to add new features, fix defects, and perform refactoring [1]. Each code change has a different impact on the software’s internal structure. If conducted inadequately, it can introduce bugs, reduce code quality, and increase software complexity. In this regard, we observe different initiatives to avoid software degradation.
Software refactoring is a well-known strategy to decrease software complexity by restructuring source code elements. Most approaches explore historical versions of repositories to identify instances of a subset of refactoring techniques [2,3,4,5]. However, these approaches adopt logical rules, heuristics, and predefined catalogs to identify refactoring techniques [6,7,8], which limits the source code changes they can detect and may fail to meet developers’ needs. Additionally, Oliveira et al. [9] surveyed 107 developers of popular Java projects on GitHub (multiple versions) to better understand the refactoring mechanics they use in practice: they asked about the outputs of seven refactoring types available in popular IDEs applied to small programs. They pointed out that refactoring mechanics must be revisited to avoid the misunderstandings identified among developers as a starting point for improvement.
Other studies investigated strategies for handling bad programming practices. Some approaches [10,11,12] applied machine learning to predict, detect, and fix defects across different codebases, such as software repositories and programming courses [13,14], technical debt [15], and patch suggestions [16]. In general, the approaches rely on information from previous bug instances (metrics and structural properties) to build machine learning models. However, refactoring techniques usually focus on improving code structures through code transformations based on known modifications [17].
In a nutshell, the literature widely shows that applying modifications to fix or improve source code is effective, and such modifications have been described both for fixing defects and for improving the internal quality of source code. However, we did not find any studies that analyze modifications together with their impact. This differs from focusing only on the effects of modifications or from identifying predefined modifications (as in [18], which uses a previously defined change pattern specification). In this work, we examine modifications with no prior classification of why they were applied (to fix or improve). In addition, we start from the principle that we do not know beforehand which modifications improve or worsen the internal quality of the source code.
We do not have a predefined catalog of modifications that can be considered recurring patterns of change, so we first identify recurrent modifications to create such a catalog. It is important to note that catalog-based refactoring detection, as widely reported in the literature, relies on known refactoring techniques and bug fixes. In contrast, our approach focuses on unknown modifications (not present in existing catalogs), which expands the range of suggested modifications beyond those catalogs. As a longer-term goal, examples of code changes found in repositories might support the evolution of the source code: creating our catalog of modifications enables us to identify their context and recommend their reuse during coding.
This paper aims to analyze applied code modifications as an alternative to overcoming the limitations of using catalog-based refactoring techniques. Instead, we learn from code transformations. The concept behind this strategy is to reuse previous instances (code examples) to identify a structural pattern in recurrent code changes and extract the sequence of changes to transform the input into the output [19].
Therefore, our objective is to evaluate source code changes to identify recurrent modifications (which we call patterns) that might represent improvements in software maintainability. Given code repositories as input, we applied a process that extracts the code mapping of each code change to identify clusters of syntactic patterns (structural code changes) and evaluate the overall impact on maintainability.
Since our focus is on identifying recurrent code modifications that represent internal improvements, the process explores a set of commits in each repository to extract the abstract syntax tree (AST) of each modification using source code differencing. Next, we apply a greedy algorithm to cluster the mappings based on the ASTs before and after the modification. Finally, we classify the patterns using a regression model and evaluate the overall impact across a subset of code metrics.
The paper is organized as follows. Section 2 summarizes the main topics that support the approach presented. Section 3 presents the process for analyzing source code commits, identifying recurrent syntactic patterns, and measuring their impact on internal quality using a learning model. Next, Section 4 presents a summary of observations after applying the proposed process to seven repositories. Section 5 discusses the threats to validity of our study when instantiating the proposed process. Section 6 discusses related work, highlighting its limitations and what differentiates this work from it. Section 7 presents the final remarks and future work.
2. Background
This section summarizes the main topics that support the approach presented in this paper. They are: Abstract Syntax Tree and Maintainability.
2.1. AST Differencing Process
Formally, the source code differencing process (or AST differencing process) aims to identify the set of operations that transform an input AST into an output AST. These operations define the modification steps, which constitute the edit-script [20].
Figure 1 shows an overview of the differencing process.
The majority of approaches divide the process into two main phases:
Mapping phase: A node t in a tree T is represented by a label, a value, and a limited number of child nodes. A modification can affect any of these elements that constitute a node. In the first phase, the process compares the ASTs of each modified file before and after the change within a commit to identify the node relationships and obtain the corresponding differences. The obtained relations identify the nodes that are unmodified in both trees and those that are modified in both trees. The process extracts the related nodes between the trees and creates the respective mappings from the modified items (depicted as colored lines in Figure 1). The mappings are the input for the next phase.
Edit-script generation phase: The process applies the mappings to identify the sequence of operations (edit-script) of the modification. The edit-script is composed of transformations such as insertions, deletions, movements, and updates of nodes from the input tree to the output (depicted as dashed lines in Figure 1).
We use the GumTree (GT) algorithm for the mapping phase [20]. GT simulates developer behavior to identify differences between the ASTs through a two-fold analysis: first, a top-down walk to detect isomorphic subtrees and, afterward, a bottom-up exploration to identify additional candidates to combine into a new mapping.
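To make the top-down step more concrete, the following Python sketch maps subtrees that are structurally identical in two small trees. It is a deliberately simplified, position-based illustration and not the GumTree implementation: the `Node` class and the `fingerprint` and `top_down_mappings` helpers are hypothetical names introduced here, and children are only compared pairwise in order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An AST node: a label (node type), a value, and child nodes."""
    label: str
    value: str = ""
    children: list["Node"] = field(default_factory=list)


def fingerprint(node: Node) -> tuple:
    """Structural fingerprint; equal fingerprints mean isomorphic subtrees."""
    return (node.label, node.value,
            tuple(fingerprint(child) for child in node.children))


def top_down_mappings(src: Node, dst: Node) -> list[tuple[Node, Node]]:
    """Map isomorphic subtrees, starting from the roots and descending."""
    mappings = []
    pending = [(src, dst)]
    while pending:
        a, b = pending.pop()
        if fingerprint(a) == fingerprint(b):
            # The whole subtree is unchanged, so map it as one unit.
            mappings.append((a, b))
        elif a.label == b.label and len(a.children) == len(b.children):
            # Same kind of node but different content: compare children pairwise.
            pending.extend(zip(a.children, b.children))
    return mappings
```

In this sketch, a changed literal inside a `return` statement leaves the sibling statements mapped while the modified subtree stays unmapped, which is the part a later bottom-up step would have to resolve.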
While generating the edit-script, many applications [20,21,22,23] use the algorithm proposed by Chawathe [24], which allows for the differencing of any hierarchical structures using a strategy based on breadth-first search.
2.2. Source Code Internal Quality
The internal quality of source code has an intrinsic relationship with the maintenance phase, which is crucial in the software life cycle due to its duration and costs [25]. In this sense, it is essential to consider how developers assess aspects of software maintainability that can improve the internal quality of source code and reduce costs. Software maintainability is a property that assesses how difficult it is to maintain the source code to implement a new feature or fix a bug in a software system [26].
In this context, software metrics are powerful tools for measuring and monitoring source code attributes and software quality. The metrics enable monitoring of software properties, such as the complexity and size of elements, as well as the identification or prediction of defect occurrences in source code [27].
To evaluate the internal quality of the analyzed object-oriented systems, this study adopts the Chidamber and Kemerer (CK) metrics suite. The CK metrics represent one of the earliest and most influential sets of object-oriented design metrics, originally proposed to quantify fundamental structural properties of software systems and to support the assessment of maintainability-related attributes [28]. It is important to note that the paper focuses on structural maintainability (the code’s internal “health”) rather than functional maintainability.
The suite comprises the following metrics: Weighted Methods per Class (WMC), Coupling Between Objects (CBO), Lack of Cohesion of Methods (LCOM), Depth of Inheritance Tree (DIT), Number of Children (NOC), and Response For a Class (RFC). Together, these metrics capture complementary aspects of object-oriented design, including class-level complexity, coupling, cohesion, inheritance structure, reuse potential, and interaction complexity. Such properties have long been recognized as relevant indicators of internal software quality, particularly with respect to program comprehension, testing effort, and software evolution [29]. We are aware of constraints related to the metrics. The metrics serve as surrogate measures: the direct variable (the actual hours a developer spends on a task) is impossible to measure at scale across thousands of commits in a repository, but structural maintainability can be used to identify whether the source code is easier or more difficult to modify.
The empirical relevance of CK metrics has been extensively investigated in the literature. Prior studies report significant associations between these metrics and fault-proneness, maintenance effort, and change proneness in both industrial and open-source systems [30,31,32,33]. More recent work has further explored their role in the analysis of design degradation, code smells, and defect prediction, reinforcing their applicability in empirical assessments of software quality [34,35,36].
Although the Maintainability Index (MI) has been widely adopted as a high-level indicator of software maintainability, it is not employed in this study. The MI is a composite, tool-dependent metric that aggregates heterogeneous properties, such as size, complexity, and documentation, into a single value. As a consequence, variations in the index hinder the identification of specific structural factors responsible for observed effects.
In the context of an experimental design focused on internal quality attributes, the use of disaggregated structural metrics is more appropriate, as they support reproducibility, allow interpretation, and enable more precise analysis of cause-and-effect relationships [29,37].
In this work, CK metrics are computed at the class level and used as independent variables that represent internal structural characteristics of the source code. The metrics are analyzed both individually and in combination to support comparative and longitudinal analyses across system versions. While these metrics do not provide a complete characterization of software quality, they serve as well-established structural proxies that enable systematic, reproducible, and empirically grounded analysis of internal design properties.
3. The Process of Identifying Code Modifications
In this section, we describe the extraction and classification of recurrent source code modifications (candidates for patterns). First, in the literature, studies on catalog-based refactoring are limited to searching for problem “patterns” to be solved with solution “patterns” in the catalog. Thus, the search for patterns consists of a pattern-matching process. Our approach goes beyond that: it explicitly extracts and generalizes systematic, recurrent source code modification patterns. So, we create our catalog by identifying recurrent modifications during an evolutionary process.
The process takes as input the mappings of code changes obtained from a repository by comparing subsequent commits. Next, it clusters the mappings into groups of common syntactic patterns. Finally, it classifies the obtained code patterns based on their impact on maintainability (the implementation is available at https://github.com/leandroungari/learning-code-extractor, accessed on 24 April 2026).
Figure 2 shows the framework of the learning process. In the following, we describe each phase in detail.
3.1. Code Mappings’ Extraction
In the first phase, the process’s input is a sequence of versions (represented by commits) from each code repository. Given two subsequent versions, we identify the files that were modified between them. A relevant factor in the choice of the repository is the use of the Git version control system (https://git-scm.com/). We use the JGit library (https://projects.eclipse.org/projects/technology.jgit, last accessed on 24 April 2026) to manipulate the branches and commits of repositories.
We extract the mappings of each code change from the modified files of each commit using the GT source code differencing algorithm. We build our implementation upon GumTree Spoon AST Diff (https://github.com/SpoonLabs/gumtree-spoon-ast-diff, last accessed on 24 April 2026), which traverses each file to identify the mappings and groups them into four categories: insert, delete, move, and update mappings. This allows us to improve the comparison of common patterns. Each mapping is composed of the syntactic elements involved before and after the applied modification. This ordered (before, after) pair represents the syntactic structure of a given modification, which is later used to identify syntactic modification patterns. Files and modifications with no impact on internal quality are not colored.
Additionally, we optimized the extraction process to obtain a more precise set of mappings, including converting delete/insert mappings to update mappings, joining related mappings, and composing hierarchical mappings. Also, modifications and files with a negative impact on internal quality are depicted in red, and those with a positive impact on internal quality are depicted in green. We describe the main optimizations briefly:
Convert delete/insert to update: We identified pairs of mappings composed of a deletion followed by an insertion of the same element with some differences. This case represents a misidentified mapping, in which GT could not relate the code elements before and after the modification. The solution is to convert both mappings into a single update mapping.
Join related mappings: We identified some code changes that affect different nonsequential sibling subtrees. In this scenario, GT generates multiple mappings for a single modification rather than a single mapping that groups all the code elements involved. To solve this case, we analyze dependencies (related variables) and hierarchies (parent nodes) among closely related mappings. We merge the mappings that satisfy at least one of these checks into a single mapping.
Composition of mappings: We identified some mappings at too fine a level of tree-node granularity. For instance, consider a modification to a variable in the conditional expression of a for-loop. If the mapping represents just the variable or expression, it does not capture the semantics of the modification. To adjust these mappings, we define mapping levels based on the AST (class, interface, method, constructor, attribute, statement, and annotation), ensuring minimal semantic changes in the code.
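The first optimization can be sketched as a single pass over the list of mappings. The Python code below is a minimal illustration under stated assumptions, not the authors' implementation: the `Mapping` record, its fields, and the rule that "the same element" means the same node type under the same parent are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    kind: str                      # "insert", "delete", "move", or "update"
    node_type: str                 # syntactic role, e.g. "Invocation"
    parent_id: int                 # hypothetical id of the enclosing node
    before: Optional[str] = None   # source text before the change
    after: Optional[str] = None    # source text after the change


def merge_delete_insert(mappings: list[Mapping]) -> list[Mapping]:
    """Replace each delete immediately followed by a matching insert
    with a single update mapping; keep every other mapping as is."""
    result = []
    i = 0
    while i < len(mappings):
        cur = mappings[i]
        nxt = mappings[i + 1] if i + 1 < len(mappings) else None
        if (nxt is not None
                and cur.kind == "delete" and nxt.kind == "insert"
                and cur.node_type == nxt.node_type      # assumed matching rule
                and cur.parent_id == nxt.parent_id):
            result.append(Mapping("update", cur.node_type, cur.parent_id,
                                  cur.before, nxt.after))
            i += 2  # both mappings were consumed
        else:
            result.append(cur)
            i += 1
    return result
```

A stricter or looser matching rule (for example, comparing textual similarity of the two elements) would slot into the same condition without changing the rest of the pass.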
3.2. Identifying Syntactic Code Modification Patterns
Once the set of mappings is obtained, the process clusters “similar” mappings and identifies the existing code patterns. Figure 3a shows an example of compatible code changes that define the same code pattern. In these modifications, the developer swaps the positions of the variable and the literal to prevent a possible null pointer exception when calling equals(). In the second example (Figure 3b), the two code changes are incompatible because the second one involves a different literal after the modification.
The process applies a greedy strategy that compares the mappings pairwise according to the syntactic model. Specifically, each mapping consists of the syntactic elements involved before and after the applied modification. At this point, we need to identify the modifications that occurred in each (before, after) pair and then group them.
To do so, we first take the ASTs of the before and after elements involved in each mapping and obtain the syntactic model of each one; combined, they form the syntactic pattern of the modification. The syntactic pattern combines the ASTs in both versions (before and after the modification), disregarding variable and literal symbol names and replacing them with references.
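The idea of replacing names with references can be illustrated at the token level. The Python sketch below is a stand-in that works on source text rather than on ASTs, so it is only an approximation of the abstraction described above; the `abstract` function, the `$var`/`$lit` reference format, and the rule that an identifier followed by `(` is a method name are assumptions made for this example (keywords are not treated specially).

```python
import re

# String literals, integer literals, identifiers, or any other single symbol.
TOKEN = re.compile(r'"[^"]*"|\d+|[A-Za-z_]\w*|\S')


def abstract(before: str, after: str) -> tuple[str, str]:
    """Return the (before, after) pair with variable and literal names
    replaced by references shared across both versions."""
    refs: dict[str, str] = {}

    def normalize(code: str) -> str:
        tokens = TOKEN.findall(code)
        out = []
        for i, tok in enumerate(tokens):
            is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
            if tok[0] == '"' or tok[0].isdigit():
                kind = "lit"
            elif (tok[0].isalpha() or tok[0] == "_") and not is_call:
                kind = "var"
            else:
                out.append(tok)  # punctuation, operators, method names
                continue
            # The first occurrence of a name allocates the next reference.
            count = sum(1 for r in refs.values() if r.startswith(f"${kind}"))
            out.append(refs.setdefault(tok, f"${kind}{count}"))
        return " ".join(out)

    return normalize(before), normalize(after)
```

Under these assumptions, `name.equals("admin")` changed to `"admin".equals(name)` and `user.equals("x")` changed to `"x".equals(user)` yield the same abstract pair, whereas a change that introduces a different literal after the modification allocates a new reference and therefore yields a different pair.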
Next, modifications are grouped by compatible syntactic patterns. An empty set of groupings is initially defined to store the identified modification patterns. For each mapping, the process checks whether there is a group of modifications whose syntactic pattern is compatible. The compatibility check is performed by walking through all nodes of both ASTs and verifying the equality of the syntactic functions of each node. If all are equal, the ASTs are compatible. If a matching cluster is found, the mapping is added to the set of mappings linked to the pattern. Otherwise, a new cluster is created, whose pattern is defined by the syntactic pattern present in the mapping. This process adopts a greedy strategy, resulting in a first (and unique) matching cluster for each syntactic pattern. This approach is presented in Algorithm 1, which maintains the set of clusters, generates the AST before and after each modification, and verifies compatibility as described above.
| Algorithm 1 Algorithm of the extraction of code patterns |
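function ExtractionCodePattern(mappings)
    groups ← ∅    ▷ set of groupings
    for all mapping m in mappings do
        before ← AST of m before the modification
        after ← AST of m after the modification
        found ← false
        for all group g in groups do
            if (before, after) is compatible with the pattern of g then
                add m to g
                found ← true
                break
            end if
        end for
        if not found then    ▷ did not find a compatible group
            add to groups a new group with pattern (before, after) containing m
        end if
    end for
    return groups
end function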
|
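The greedy grouping can be sketched in a few lines of Python. This is an illustrative sketch under assumptions rather than the authors' code: trees are represented as nested tuples of the form `(syntactic_role, child, ...)` with names already abstracted away, and `Cluster`, `compatible`, and `extract_code_patterns` are hypothetical names.

```python
from dataclasses import dataclass, field

Tree = tuple  # (syntactic_role, child, child, ...), names already abstracted


@dataclass
class Cluster:
    pattern: tuple[Tree, Tree]                     # (before, after) pattern
    mappings: list = field(default_factory=list)   # mappings linked to it


def compatible(a: Tree, b: Tree) -> bool:
    """Walk both trees and compare the syntactic role of every node."""
    if a[0] != b[0] or len(a) != len(b):
        return False
    return all(compatible(x, y) for x, y in zip(a[1:], b[1:]))


def extract_code_patterns(mappings: list[tuple[Tree, Tree]]) -> list[Cluster]:
    clusters: list[Cluster] = []
    for before, after in mappings:
        for cluster in clusters:
            p_before, p_after = cluster.pattern
            if compatible(before, p_before) and compatible(after, p_after):
                cluster.mappings.append((before, after))
                break                      # greedy: the first match wins
        else:
            # No compatible cluster was found, so this mapping starts one.
            clusters.append(Cluster((before, after), [(before, after)]))
    return clusters
```

Because the first compatible cluster always wins, the result depends on the order in which mappings are visited only when compatibility is not transitive; with the strict equality check used here, each mapping has exactly one possible cluster.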
3.3. Classifying Code Patterns
First, we analyze each code modification pattern to assess its impact on maintainability. Given a set of code modification patterns and their respective occurrences (mappings), it is necessary to calculate the impact of the change on the internal quality of the modified source code. To do so, the calculation compares the metrics WMC, DIT, CBO, and LCOM of each file before and after the modification, using a mapping-based approach to obtain the variation in maintainability. Algorithm 2 presents this calculation.
If the value of one of the metrics WMC, DIT, CBO, or LCOM increases, that metric has become worse, and the measure decreases by one point. In the opposite case, the measure increases by one point (the metric has become better). If the value remains the same, the metric has no impact. With four metrics, the process obtains a measure in the range [−4, +4] for each file, where −4 defines a “degradation” of code and +4 defines an “improvement” of code.
Since the occurrence of two or more similar mappings defines a code modification pattern (CMP), we compute the CMP average by averaging the impact of all individual mappings within a group. We know that the average impact may be imprecise because several code modification patterns can be combined within the same files, which can alter the overall measure of these files’ maintainability. This is a threat to our analysis, but we need to keep in mind that we are looking at historical file modifications, and a set of changes may have been applied together (and consequently identified as a modification pattern).
| Algorithm 2 Variation of Maintainability Algorithm |
Input:
    pair of commits (before and after the modification)
    set of metrics
    metric value for class c in the SV version, given that: … if …
function VariationOfMaintainability
    mc = 0
    for each class c do
        for each metric m do
            if the value of m for c is higher after the modification than before then
                mc = mc − 1
            else
                if the value of m for c is lower after the modification than before then
                    mc = mc + 1
                end if
            end if
        end for
    end for
    return mc
end function
|
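The scoring rule can be written directly from the description above. The Python sketch below is illustrative and rests on assumptions: metric values are supplied as plain dictionaries keyed by class name, only classes present in both versions are compared, and the function and parameter names are hypothetical.

```python
METRICS = ("WMC", "DIT", "CBO", "LCOM")


def variation_of_maintainability(before: dict, after: dict,
                                 metrics=METRICS) -> int:
    """Sum, over classes and metrics, -1 for each metric that increased
    (got worse) and +1 for each metric that decreased (got better)."""
    mc = 0
    # Assumption: only classes that exist in both versions are compared.
    for cls in before.keys() & after.keys():
        for metric in metrics:
            old, new = before[cls][metric], after[cls][metric]
            if new > old:
                mc -= 1
            elif new < old:
                mc += 1
    return mc
```

For a single class whose WMC and LCOM decrease, whose CBO increases, and whose DIT is unchanged, the result is +1 − 1 + 1 + 0 = 1, a slight improvement.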
To address interference among code modification patterns, we use machine learning (ML) algorithms to compute the effect of each code modification pattern on maintainability variation. The ML algorithm considers each modified file along with the selected commits. Each entry of the model represents a source file and is composed of an indicator of the presence, absence, or removal of each code pattern, followed by the average variation value. We obtain the weights of each code pattern in each source file by training and testing the model. To establish the real impact of each code pattern, we calculate the weighted average using the obtained weights. In the end, if the result is positive, we classify the pattern as an improvement; otherwise, it is a degradation.
3.4. Putting the Process to Work: Decisions Made and Their Impact
The process depicted in Figure 2 is described at a high level, so its instantiation must be carefully planned. In this section, we discuss three decisions made to implement our approach as presented. The first decision was to use Git repositories. Git is a widely used distributed version control system in the software development industry. We decided to use Git because it does not introduce any constraints and also allows other researchers and practitioners to replicate our proposed approach.
The second decision was about metrics. As mentioned, we used Weighted Methods per Class (WMC), Coupling Between Objects (CBO), Lack of Cohesion of Methods (LCOM), Depth of Inheritance Tree (DIT), Number of Children (NOC), and Response For a Class (RFC). We focused on metrics that reveal the internal quality of the source code. Another possible metric is the Maintainability Index, a composite metric (calculated from Lines of Code, Cyclomatic Complexity, and Halstead Volume) that provides a single “health score” for a file. However, the composition is abstract, making direct engineering intervention infeasible.
The third decision is about the machine learning algorithm. To address interference among code modification patterns, we build a regression model to determine how each code modification pattern affects maintainability variation. The model considers each modified file along with the selected commits, as mentioned in Section 3.3. We benchmarked XGBoost [38] (version 3.1.0), LightGBM [39] (version 4.2.0), and Random Forest [40] (version 4.7-1.2) on a small subset. LightGBM presented the worst results. We did not delve deeper into this comparative analysis, but LightGBM appeared to be more prone to overfitting on the dataset. Random Forest performed well on the subset, but it can become very memory-intensive (with many large trees). XGBoost, which relies on gradient boosting to assess the influence of elements across the analyzed hierarchy, showed better overall performance than the other two methods. So, we applied XGBoost in the evaluation of the approach, presented next.
4. Evaluation
The evaluation aims to investigate the main code patterns regarding the most significant types, their impact on code quality, and the applicability of the code changes. In the experiment, six popular GitHub Java projects were selected: Gson, Truth, JUnit 4, JUnit 5, CheckStyle, and FindBugs. Additionally, a seventh repository, Refactoring Toy Example, was included because it was built to study software refactoring and has been applied in refactoring experiments. The selection considered the length of the version history, the number of developers, and a mix of repositories from different repository families.
The experiment was performed on a distributed cluster running CentOS Linux release 7.7, with 2x Intel Xeon CPU E5-2690@2.90 GHz/E5430@2.66 GHz and 32 GB RAM. This is a modest computer configuration with low computing power, and its selection is based on the possibility of other researchers replicating this study without requiring high processing power. Each analysis was executed through individual jobs submitted to the cluster, and the most relevant code patterns from each repository were selected to facilitate the investigation of results. The selection considers the code patterns with the highest number of instances and a relevant impact on maintainability.
4.1. Overview of Repositories
The results presented in this section were obtained by applying the learning process, which involved evaluating 12,065 commits, as shown in Table 1. The first step, Extraction and Analysis of Mappings, aimed to identify modifications across each repository’s history and to obtain the mappings corresponding to these changes. During this process, 132,819 mapping instances were identified, considering all analyzed repositories. The “Mappings” column in Table 1 lists the specific mappings for each repository.
In the Identification of Syntactic Patterns stage, the mappings were grouped based on their structural similarity, resulting in 2405 syntactic patterns with two or more occurrences. The “Clusterings” column in Table 1 describes the number of patterns identified in each repository.
In the Classification of Syntactic Patterns stage, the patterns were analyzed regarding the average behavior of the files in which they occurred. To assess the impact more accurately, a regression model was implemented to isolate the individual impact of each pattern, controlling for influences from other modifications. The model was trained with 80% of the pattern instances, using the remainder (20%) for testing. The error rate was calculated using the least-squares method, yielding absolute and percentage errors over the range of the maintainability variation measure. The error rate was below 8% across all repositories, with those containing at least 1000 commits showing even lower rates. Only the Refactoring Toy Example repository had an error greater than 7%, while the others remained close to or below 2%, as detailed in the last column of Table 1.
4.2. Code Patterns’ Impact on Maintainability
To analyze the code patterns’ impact on maintainability, the most significant patterns in each repository were identified, and all code patterns with only one occurrence and without considerable impact were discarded.
All repositories presented fewer relevant code patterns than the total number, and most repositories presented fewer than ten code pattern instances, except for the JUnit4 repository. A code pattern is considered relevant if its average weight of occurrence in files differs from zero. The total number of relevant patterns in the analyzed repositories is 74, representing 3.02% of the total. The reduced number of relevant patterns is due to the impact of other simultaneous modifications in the same files. On average, each file contained approximately 7.91 modifications, with the Refactoring Toy Example repository at the low end and the Gson repository at the high end. Therefore, the patterns identified were generally not responsible for the most significant impact on file maintainability.
Regarding the impact on maintainability, some patterns presented instances that all had the same value, while others presented a range of values. Despite the variation, all patterns showed stable behavior in terms of tendency to improve or degrade, so no pattern showed both aspects.
Figure 4 presents box plots that illustrate the impact of each code pattern on maintainability. In descriptive statistics, a box plot is a chart used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles and median. Box plots show the five-number summary of a dataset: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
Taking one pattern in Figure 4 as an example, the first quartile, minimum value, and median coincide (the orange line at the bottom of the box); the third quartile forms the top of the box; and the maximum value is depicted in the upper region outside the box. This means that the pattern has a meaningful number of occurrences, most of them with no impact on internal quality (first quartile, minimum value, and median are equal to zero), but the third quartile and maximum value are positive (1 and 2, respectively), which means that when the pattern has an influence, the impact is positive. Three other patterns show the opposite behavior: when they have an influence, the effect is negative, reaching −4, −3, and −2 of impact on internal quality, respectively. The JUnit 5 and FindBugs repositories were not included due to the reduced number of patterns (<4) identified in each repository.
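The reading of such a box plot can be reproduced numerically. The Python sketch below computes a five-number summary with the standard library; the sample values are hypothetical and were chosen only so that the summary matches the shape described above (minimum, first quartile, and median at zero, third quartile at 1, maximum at 2). The choice of the inclusive quantile method is also an assumption, since plotting tools differ in how they interpolate quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical impact values for the occurrences of one pattern.
impacts = [0, 0, 0, 0, 0, 1, 1, 2]
print(five_number_summary(impacts))
```

With this sample, most occurrences have no impact and the remaining ones are positive, which corresponds to a box that starts at zero and extends upward only.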
4.3. JUnit4 Repository
This section presents an analysis focusing on maintainability, along with a sample of modifications that represent the identified code modification patterns in the JUnit4 repository, which has the highest number of commits. This case study overviews the repository project and analyzes maintainability behavior across its examined history. Subsequently, we will analyze the main syntactic patterns concerning the type of modification and the resulting impact.
The JUnit4 repository is an open-source project developed by the JUnit Team that implements a unit testing framework for the Java programming language. This repository was chosen due to its recurring presence in related studies and its long development history, remaining active even with the availability of more recent framework versions.
Concerning the variation in maintainability over the repository’s history, as illustrated in Figure 5, a moderate fluctuation between improvements and degradations in the average behavior of each commit is observed. This figure represents both the variation in maintainability and the number of changes over time, using overlapping axes to visualize the number of modifications and their impact on the internal quality of the source code. First, one may note the initial values on the horizontal axis, representing commits; the quantity of modifications (in red) varies between 15 and 170 but has no impact on the average maintainability variation (in blue). In contrast, commits between 650 and 750 show few modifications, with maintainability varying from 1 to −2. Also, commit 1803, with almost 700 modifications, improved maintainability (positively, but close to zero). Well-defined trends of increasing or decreasing maintainability can be observed in specific intervals of commits. For example, one may observe the commit interval around 1500: there is an increase in modifications (red lines) and a decrease in maintainability (blue lines). Also, a few commits with more than 100 modifications exist in the same interval. On the other hand, most commits registered fewer than 150 changes each.
Analyzing the modification rate, it is possible to observe in the history well-defined periods of increase and decrease, accompanied by a limited number of commits with a high rate in isolation. Additionally, concerning the syntactic patterns, the clustering process identified 708 instances.
Regarding JUnit4, by executing the regression model, 44 relevant syntactic patterns were identified, accounting for approximately 6.21% of the initial set. For some patterns, such as 11255 and 1663, no variation in their impact on maintainability was observed—11255 always impacts negatively (−4), and 1663 always impacts positively (+3). The syntactic pattern 2931 shows little variation—it shows a positive impact between 2 and 3 across its multiple occurrences. Despite including more stable patterns, many still show median effects close to (or equal to) zero.
Figure 6 illustrates the behavior of the occurrences of each syntactic pattern in the resulting set—the syntactic patterns 11255, 2931, and 1663 are highlighted in the figure.
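The per-pattern behavior described above can be summarized with a short sketch. The occurrence values below are illustrative only: they mirror the ranges reported in the text (−4 for 11255, +3 for 1663, 2 to 3 for 2931) and are not the actual JUnit4 data. The helper name `summarize` is hypothetical.

```python
from statistics import median

# Illustrative impact values per pattern occurrence (assumed, not the study data).
occurrences = {
    11255: [-4, -4, -4],
    1663: [3, 3, 3, 3],
    2931: [2, 3, 3, 2, 3],
}

def summarize(values):
    """Return (min, median, max, is_stable) for one pattern's impact values."""
    lo, hi = min(values), max(values)
    # A pattern is called "stable" here when every occurrence has the same impact.
    return lo, median(values), hi, lo == hi

summary = {pid: summarize(vals) for pid, vals in occurrences.items()}

# Share of relevant patterns reported for JUnit4: 44 out of 708 clustered instances.
relevant_share = 44 / 708 * 100

for pid, (lo, med, hi, stable) in summary.items():
    print(pid, lo, med, hi, "stable" if stable else "varies")
print(f"{relevant_share:.2f}%")
```

A summary of this kind makes it easy to separate patterns with a constant impact from those whose impact varies across occurrences.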
The relationships among the selected code patterns were also examined. The evaluation used the regression model’s metrics to identify the impact of each pattern pair. The results, presented in Figure 7, reveal that most patterns exhibit low interdependence. Fewer than 10 patterns showed a significant correlation (a correlation coefficient greater than 0.5) with another syntactic pattern. The final set of code patterns includes simple modifications involving variables, parameters, method calls, and basic refactoring, such as type replacement in method parameters and extract-method refactoring. Some syntactic patterns are described in greater detail below.
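As an illustration of the pairwise check, the following minimal sketch flags pattern pairs whose correlation coefficient exceeds 0.5. The text does not state how the coefficient was computed, so the use of Pearson correlation over per-commit occurrence counts is an assumption, and the pattern names and counts are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def correlated_pairs(series, threshold=0.5):
    """Return pattern pairs whose correlation coefficient exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if r > threshold:  # the text mentions "greater than 0.5"
            pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical occurrence counts per commit for three patterns.
series = {
    "A": [1, 0, 2, 0, 3],
    "B": [2, 0, 4, 0, 6],  # proportional to A
    "C": [0, 1, 0, 1, 0],  # moves opposite to A and B
}
pairs = correlated_pairs(series)
print(pairs)
```

Only the pair A and B is reported here; the pairs involving C have negative coefficients and fall below the threshold.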
In pattern #12129, the key modification replaces a call to the class constructor by changing the parameter type, as seen in Listing 1, which shows the method signature before and after the modification. This change resulted in a noticeable reduction in maintainability metrics for the affected files. However, the variation was partly driven by additional modifications to those same files. The structural modifications [41] are depicted in Figure 8, and the behavior in the sequence diagrams is depicted in Figure 9.
Listing 1. Modification represented by the syntactic pattern #12129.
For pattern #1163, the primary change is a complete revision of the method signature, including the method name and parameters, as depicted in Listing 2. Specific commands were adjusted to accommodate a different data type passed via the parameter (previously, the data was generated internally). Before the modification, the entire Plan object was received as a parameter; afterward, only its “description” attribute (of class Description) is passed. Also, a single Test object is returned instead of a list of Test objects. The structural modifications [41] are depicted in Figure 10, and the behavior in the sequence diagrams is depicted in Figure 11. The impact on maintainability was significant (always 3), showing marked improvements, mainly due to the removal of commands and methods from the modified files.
For pattern #11700, the modification updates a call to a method that was moved between classes during a move-method refactoring. This adjustment led to a slight improvement in maintainability, reflected in reductions in the WMC and LCOM metrics, although the CBO metric increased. A review of the occurrences showed that this variation was only partially due to this specific modification, with additional changes contributing to the overall outcome. The pattern is illustrated in Listing 3; its structural modification [41] is depicted in Figure 12, and the behavior in the sequence diagrams is depicted in Figure 13.
Listing 2. Modification represented by the syntactic pattern #1163.
Listing 3. Modification represented by the syntactic pattern #11700.
In the case of pattern #2563, the modification replaces a call to a method whose parameter is obtained from another method call with a chained method call in an assignment operation. Regarding maintainability, the modification was classified as a great improvement; however, this result stems from the removal of some complex methods from the classes in which the pattern occurs. Listing 4 shows the pattern code; its structural modification [41] is depicted in Figure 14, and the behavior in the sequence diagrams is depicted in Figure 15.
Listing 4. Modification represented by the syntactic pattern #2563.
5. Threats to Validity and Discussion
First, potential threats to validity are considered following the guidelines of [42]. Several factors that may influence the validity of the research findings are discussed below. Additionally, relevant observations made during the development and execution of the learning process are examined to propose directions for future work.
5.1. Internal Validity
Internal validity relates to the degree to which the observed results can be attributed to the proposed methodology. A primary threat is the use of the CK (Chidamber & Kemerer) metrics suite as a proxy for maintainability. We acknowledge that software maintainability is a complex, multi-dimensional attribute influenced by human factors, such as developer experience and documentation quality, which are not captured by automated source code analysis. To mitigate this, we focused on metric variation rather than absolute values. This allows us to observe how specific modification patterns structurally impact the code, providing a clear signal of design evolution independent of the baseline quality.
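The idea of measuring variation rather than absolute values can be illustrated with a minimal sketch. The metric selection (WMC, CBO, LCOM), the file values, and the function name `metric_variation` are assumptions made for this example; they do not reproduce the study's formula or data.

```python
# CK metrics used in the sketch; the selection and the values are illustrative.
METRICS = ("WMC", "CBO", "LCOM")

def metric_variation(before, after):
    """Per-metric difference (after - before) for one file."""
    return {m: after[m] - before[m] for m in METRICS}

# Two hypothetical files with very different baselines...
small_before = {"WMC": 5, "CBO": 2, "LCOM": 1}
small_after = {"WMC": 4, "CBO": 3, "LCOM": 0}
large_before = {"WMC": 50, "CBO": 20, "LCOM": 30}
large_after = {"WMC": 49, "CBO": 21, "LCOM": 29}

# ...show the same variation when the same kind of change is applied.
delta_small = metric_variation(small_before, small_after)
delta_large = metric_variation(large_before, large_after)
print(delta_small)
print(delta_large)
```

Both files yield the same variation even though their absolute metric values differ by an order of magnitude, which is the property the mitigation relies on.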
5.2. Construct Validity
Construct validity refers to how well a measure or indicator reflects the theoretical concept it aims to assess. The first threat arises from obtaining code changes from repositories. The extraction process identifies the code changes in each commit. However, most commits contain many modifications per file, which made it challenging to separate code changes and their respective relations. This phenomenon, known as tangled code changes, directly impacts our data collection process. It depends on how the repository is organized: developers might use single-purpose commits throughout the development process, ensuring that each commit includes only related changes. Among the analyzed repositories, CheckStyle had the most organized commits, with higher independence and a lower number of code changes per commit (approximately 4.4). Other repositories yielded better results than CheckStyle, even with poorer commit organization, suggesting that this threat to validity is minor.
The second threat concerns the types of code changes in code patterns. The set of code change patterns is mainly composed of three types of modifications. The first type comprises operations on the identifiers of code entities (classes, methods, and attributes), such as renames and moves. The second type comprises modifications made to implement new features. These modifications affect different parts of the same file or even different files, replacing variable types or changing method signatures. In both cases, the changes also affect references to types, classes, and methods. The last recurrent type is related to specific software refactorings. In general, the observed types are simple refactorings such as move class/field/method, rename class/method/field/variable, and push-down/pull-up method/field. In a few cases, we observe combinations of those refactorings that form a composed refactoring (e.g., extract-method and extract-class). In summary, the code changes are simple in structure and are applied repeatedly throughout the project. More complex code changes, on the other hand, did not register a relevant number of occurrences. In this sense, we observe that the larger a modification, the more it depends on the local source code and, consequently, the lower its repetitiveness. We also observe that, although software refactoring should improve internal quality, this happens only occasionally.
The third threat concerns the identification process, which compares pairs of files across consecutive versions. This strategy identifies a modification precisely within a file, but not its full extent. In some cases, a modification may move part of the related source code to an additional (new) file, and the differencing process identifies two separate mappings (an insertion and a deletion) instead of a move mapping. To handle this case, we iterate over the insertion and deletion mappings to verify whether each insertion-deletion pair refers to the same code element and can be combined into a single mapping. Another mapping problem is the propagation of a modification across multiple files. A single modification can affect multiple files due to their dependencies, so identifying changes by a pair of files may lead to incomplete analysis and classification of the modification’s extent. Moreover, the current implementation only considers source code changes, not other files. The result is an incomplete identification and an imprecise evaluation of the impact on maintainability, which affects the classification between improvement and degradation.
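The combination of insertion and deletion mappings into a move mapping can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the `Mapping` structure, the `merge_moves` function, and the rule that two mappings refer to the same code element when their element identifiers are equal are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    kind: str         # "insert", "delete", or "move"
    element: str      # identifier of the code element, e.g. a method signature
    file: str         # file where the mapping was found
    target: str = ""  # destination file, only set for "move"

def merge_moves(mappings):
    """Combine each deletion/insertion pair on the same element into one move."""
    pending = {}  # element -> indices of insertions not yet matched
    for i, m in enumerate(mappings):
        if m.kind == "insert":
            pending.setdefault(m.element, []).append(i)
    consumed = set()
    result = []
    for m in mappings:
        if m.kind == "delete" and pending.get(m.element):
            j = pending[m.element].pop(0)
            consumed.add(j)
            result.append(Mapping("move", m.element, m.file, mappings[j].file))
        elif m.kind != "insert":
            result.append(m)
    # Insertions without a matching deletion are kept unchanged.
    result.extend(m for i, m in enumerate(mappings)
                  if m.kind == "insert" and i not in consumed)
    return result

# Hypothetical mappings: one method moved to a new file, one method added.
mappings = [
    Mapping("delete", "parse(String)", "Parser.java"),
    Mapping("insert", "parse(String)", "ParserUtil.java"),
    Mapping("insert", "reset()", "Parser.java"),
]
merged = merge_moves(mappings)
print(merged)
```

In this example the deletion and insertion of `parse(String)` collapse into a single move from `Parser.java` to `ParserUtil.java`, while the unrelated insertion of `reset()` is preserved.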
In addition, the thresholds used to classify “improvement” or “degradation” patterns are based on established heuristics in the literature. However, what constitutes a “significant” change in CBO (Coupling) or WMC (Complexity) can be subjective. We addressed this by evaluating the clustering error rate to validate that the identified syntactic patterns correspond to meaningful structural changes rather than random fluctuations.
5.3. External Validity
External validity indicates how the study results apply to other contexts or populations, that is, how far they can be generalized. We used seven repositories to identify modifications throughout each repository’s history and obtain the mappings corresponding to these changes. One repository (Refactoring Toy Example) was included because it is used in similar studies. The other six repositories are representative and allow adequate analysis of the results (commits ranging from 1400 to 2700), with an error rate of less than 5% (see Table 1).
We used open-source repositories in the Java language because it is widely used [43], and many programmers could directly benefit from the results. Tarwani and Chug [44] claim that a dataset obtained from Java repositories makes a study directly applicable to other object-oriented languages, which can be locally valid. However, programming languages have their own characteristics, especially in how they implement the concepts of their paradigms. For example, inheritance between classes is an essential concept in object-oriented programming, and every language provides it. However, multiple inheritance (a class inheriting characteristics from more than one class) is not allowed in Java. The characteristics of object-oriented programming languages guide the implementation of object-oriented concepts, and this is reflected in the internal structure of the source code and its modifications. We use source code modifications to obtain the modification patterns. Therefore, the modification patterns observed would be generalizable to other programming languages, as claimed by Tarwani and Chug [44], but different modification patterns might arise in source code from other object-oriented programming languages. In short, we did not find all possible modification patterns, but we expect the patterns we found to also occur in other contexts.
Despite the possibility of generalization discussed above for the Java language, we consider that generalization remains limited due to the other constraints outlined under construct validity (Section 5.2).
5.4. Reliability
Reliability concerns the repeatability of a study, that is, the ability to produce the same results. To mitigate the threat to reliability, we specified the process of our approach and repeated it across seven repositories. These repetitions support its repeatability and, consequently, its reliability.
6. Related Works
Mining or learning from software repositories is widely explored. However, the studies pursue multiple goals, such as refactoring identification [5], bug fixing, and even bug location [45], and the goal determines what must be observed in the source code. It is common in the literature to use approaches that start from a well-defined problem (that is, a well-defined focus). For example, code duplication negatively affects the internal quality of the source code, leading to bug propagation and significantly increasing maintenance costs, and researchers have made efforts to address this problem. Bano and Anzari [46] compare deep learning strategies applied to software engineering with methods such as ASTNN [47] and FA-AST [48], which detect similar code pairs. Additionally, Zang et al. [49] have shown that modeling the relationship between code tokens to capture their long-range dependencies is crucial for comprehensively capturing the information in a code fragment. However, they did not focus on internal quality attributes, on how clones affect internal quality, or on whether modifications are made beyond those inherent in clone removal.
Metrics
Tingting et al. [50] extracted design relationships between quality attributes and architecture tactics, using Maintainability, Usability, Reliability, Performance, Compatibility, Security, and Portability as quality attributes. They focused on architectural tactics.
Regarding the evolution of the source code, to improve its internal quality, studies have investigated refactoring, bug detection, and fixing. Fowler [51] presents a catalog of the most relevant refactoring techniques. Some studies [2, 3, 4, 5] propose techniques to identify refactoring opportunities in code repositories. However, Al Dallal [8] highlights several limitations of these techniques. Also, they do not identify modifications and their impact on the internal quality of the source code.
In the context of corrections (another approach that starts from a well-defined problem), Osman et al. [10] and Zhong and Meng [12] extract bugs and their fixes from software repositories. Furthermore, Negara et al. [52] and Molderez et al. [53] explore patterns in recurrent instances of code changes.
From both evolution and corrective maintenance perspectives, Febriyanti et al. [54] hypothesized that more elegant and proficient code might be more difficult for developers to maintain. Specifically, they used Python source code to investigate the risk level of proficient code within a file. They found that most proficient code poses a low maintainability risk, but there are cases where proficient code is also risky to maintain. Their study should help developers identify scenarios in which proficient code might harm future code maintenance. Although they studied a different programming language, Febriyanti et al. [54] offer a different perspective for further investigation, as they did not delve into the quality of the source code. Nevertheless, it is reasonable to hypothesize that code proficiency influences the causal mechanisms underlying the relationship between modification patterns and internal source code quality.
Regarding the evaluation, various aspects of a software’s internal structure might be considered. Some studies propose different approaches to calculating software maintainability. Among them, the Maintainability Index (MI), proposed by Oman and Hagemeister [55], is one of the most widely used approaches in the literature and in industrial tools. However, authors have noted limitations in measuring the MI, particularly in object-oriented systems [27, 56]. Dubey and Rana [57] highlight an inverse correlation between CK [28] metrics and complexity levels. Molnar and Motogna [53] propose a formula to measure the variation in software maintainability using a subset of CK metrics. Other papers [58, 59] have noted the use of metrics beyond CK metrics, suggesting further work on these metrics.
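For reference, a commonly cited three-metric form of the Maintainability Index combines Halstead volume, cyclomatic complexity, and lines of code. The sketch below uses that form; whether the cited works or the tooling in this study use exactly this variant is an assumption, and the input values are hypothetical.

```python
from math import log

def maintainability_index(halstead_volume, cyclomatic_complexity, loc):
    """Three-metric MI: 171 - 5.2 ln(V) - 0.23 G - 16.2 ln(LOC)."""
    return (171
            - 5.2 * log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * log(loc))

# Hypothetical measurements of one file before and after a modification.
mi_before = maintainability_index(halstead_volume=1200.0,
                                  cyclomatic_complexity=14, loc=180)
mi_after = maintainability_index(halstead_volume=900.0,
                                 cyclomatic_complexity=10, loc=150)

# A positive variation indicates an improvement under this index.
mi_variation = mi_after - mi_before
print(round(mi_before, 2), round(mi_after, 2), round(mi_variation, 2))
```

Because the index depends only on size and procedural complexity, it does not see coupling or cohesion, which is consistent with the limitations noted for object-oriented systems.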
Unlike previous related work, we focus on identifying source code modifications independent of a prior catalog and on evaluating their impact on internal source code quality across multiple occurrences. The core novelty of our research is the integrated focus on recurrent modification patterns and their quantifiable impact on internal source code quality metrics. In this sense, our contribution moves beyond simple change detection or bad-smell detection to establish a causal link between recurrent modifications and their internal quality outcomes.
In contrast to prior research focused on static code clones or general bug-fix patterns, this study introduces a novel methodology to explicitly extract and generalize systematic, recurrent source code modification patterns. Furthermore, we provide a quantitative assessment of how these specific patterns directly impact internal quality attributes, such as variations in the Maintainability Index. By integrating the evolutionary process with its measurable quality outcomes, this work represents a significant advancement over the existing literature.
7. Final Remarks and Further Works
This paper presents a framework for identifying and classifying code changes within repositories into patterns of improvement or degradation regarding internal quality. By employing object-oriented metrics, the process quantifies the impact of these modifications on maintainability. Our analysis focuses on learning these patterns and evaluating clustering rates; the results demonstrate an acceptable error margin, validating the hypothesis that the identified Syntactic Patterns effectively represent recurrent modifications. Nonetheless, certain constraints persist, as discussed in the previous sections and summarized below.
The evaluation indicates that the primary code patterns consist of straightforward modifications involving basic operations, such as moving and renaming elements, adding new features, and performing some refactoring. These modifications involve identifiers, class types, method and attribute names, and their corresponding references. In most cases, these patterns have minimal to no impact on the maintainability of the related files.
Other modifications that manipulate the class structure by inserting or deleting elements are the main driver of variation in maintainability. We note a significant prevalence of such modifications in the files containing instances of code patterns, which affects the individual evaluation of those patterns. In addition, many patterns involve modifications across multiple files, and the identification process only partially detects the corresponding source code. Shallow source code modifications have minimal impact. However, the class and sequence diagrams presented for Listings 1 to 4 show modifications that go beyond the local code change. One may observe that the recurrent source code modifications expressed by Syntactic Patterns in fact represent structural modifications that have impacted the internal source code quality. We have presented results from seven repositories (see Table 1), and the results obtained from trained models are statistically significant, except for one repository (Refactoring Toy Example) that was included to compare our approach against mining refactoring techniques in the literature.
Regarding practical applicability, this work is a first step toward that goal. In real-world development environments, it could help identify informative patterns that reveal a skewed distribution, with index values concentrated at one end, and support a mechanism that identifies the tendency (positive or negative) of a modification and indicates it to the programmer within an IDE.
In addition, as further work, this study can be replicated using different repositories to strengthen the findings. Further investigation could focus on process adjustments, such as using alternative metrics to classify code patterns. Other metrics focused on internal structure might be worth investigating, as they might help identify informative patterns with a skewed distribution, for example, patterns whose minimum score, first quartile, and median coincide (same value). The metrics we initially consider are Attribute Hiding Factor, Shadowing/Static Hiding, Method Overriding, and Method Inheritance Factor. Such metrics might help address the partial detection of modifications across multiple files, which limits the classification precision. Further research could also explore how and why these patterns affect software quality.
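The criterion of coincident minimum, first quartile, and median can be checked with a short sketch. The quartile method (inclusive interpolation), the function names, and the sample values are assumptions made for illustration.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)

def is_concentrated_low(values):
    """True when minimum, first quartile, and median coincide."""
    lo, q1, med, _, _ = five_number_summary(values)
    return lo == q1 == med

# Hypothetical impact values for two patterns.
skewed = [0, 0, 0, 0, 0, 1, 4]   # most occurrences share the lowest value
spread = [-2, -1, 0, 1, 3]       # occurrences spread across the range

print(is_concentrated_low(skewed))
print(is_concentrated_low(spread))
```

A symmetric check on the median, third quartile, and maximum would detect patterns concentrated at the upper end of the distribution.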