Communication

Influence of Performance Metrics Emphasis in Hyperparameter Tuning for Aircraft Skin Defect Detection: An Early Inspection of Weighted Average Objectives

by
Christian Kurniawan
*,†,
Nutchanon Suvittawat
and
De Wen Soh
Information Systems Technology and Design, Singapore University of Technology and Design, Singapore 487372, Singapore
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Technologies 2026, 14(1), 75; https://doi.org/10.3390/technologies14010075
Submission received: 10 December 2025 / Revised: 17 January 2026 / Accepted: 19 January 2026 / Published: 22 January 2026
(This article belongs to the Special Issue Aviation Science and Technology Applications)

Abstract

To address the limitations of traditional aircraft skin inspection, the aviation industry and academia have increasingly been exploring the integration of computer vision technologies into the defect detection process. These implementations of computer vision technologies rely on the performance of underlying neural network models, whose effectiveness is highly influenced by their hyperparameter configuration. To obtain optimum hyperparameters, an optimization procedure is often employed to optimize a certain combination of the model’s performance metrics. However, in the aircraft skin defect detection domain, studies to inspect the effect of different emphases in the performance metrics considered in this objective function are still not widely available. In this paper, we present our early observations regarding the influence of different performance metrics’ emphases during the hyperparameter tuning process on the overall performance of a computer vision model employed for aircraft skin defect detection. In this preliminary inspection, we consider the utilization of YOLOv12 and the Bayesian Optimization approach for the defect detection model and hyperparameter optimizer, respectively. We highlight the possible performance degradation of the model after a hyperparameter tuning procedure when the weight factor distribution of the performance metrics is not carefully considered. We note several weight factors of interest that could serve as initial possible “safe spots” for further exploration.

1. Introduction

To meet operational safety standards and regulatory airworthiness requirements, aircraft undergo regular maintenance and inspections. Within the framework of modern aviation maintenance, the inspection processes often employ Structural Health Monitoring (SHM) techniques that continuously monitor the aircraft structure via embedded sensors [1,2] and Non-Destructive Testing (NDT), which uses non-destructive diagnostic tools to detect defects [3,4]. These NDT inspections may employ multiple tools, such as microwave, X-ray, or ultrasonic inspection, to identify internal flaws [5]. However, despite the availability of these internal sensing modalities, surface-level inspection remains the most frequent and critical task in NDT, particularly in aircraft maintenance, repair, and overhaul (MRO).
In the aviation industry, General Visual Inspection (GVI) can be seen as the first step in MRO surface-level inspection [6]. As this critical inspection is traditionally performed manually, the process often becomes one of the most labor-intensive aspects of aviation operations, is prone to human error, and can lead to costly delays [7,8,9,10]. To reduce this labor-intensive burden, surface-level inspection is increasingly being automated by integrating Deep Learning (DL) technologies [11].
These implementations of DL technologies rely on the performance of the underlying Neural Network (NN) models, whose effectiveness is highly influenced by their hyperparameter configuration [12,13]. For a given application’s dataset, hyperparameters can be set to their default values, or manually configured based on the existing literature, experience, and experiments. Alternatively, optimization techniques can be implemented to automate the tuning procedure. A critical yet underexplored aspect of adopting optimization techniques during the hyperparameter tuning procedure is the tradeoff between different performance metrics emphases within the objective function. In multi-metric tasks such as the application of computer vision models for aircraft skin defect detection, one must balance Precision, Recall, and mean Average Precision (mAP). Preliminary evidence suggests that employing an objective function with a more balanced emphasis on multiple performance metrics, such as the weighted average of the metrics, yields a more robust overall performance than single-metric focuses [14]. However, the landscape of influence of the different emphases in performance metrics during the hyperparameter tuning procedure remains systematically unexplored.
In this study, we are interested in inspecting the effect of varying metric emphases in the weighted average objective function employed during the hyperparameter tuning procedure on overall computer vision performance. We specifically consider the application of computer vision technology for aircraft skin defect detection. By inspecting the influence landscape of different performance metrics across different weight distributions, we hope to identify general trends in the weights' impact on individual performance metrics. This landscape could provide a direction for choosing the weight distribution during the hyperparameter tuning procedure given a desired emphasis in the performance metrics, streamlining the process of defect detection research and reducing the time, cost, and complexity of implementing DL-driven inspection systems across diverse operational environments.
To convey our observations, we first review related work in Section 2 and describe the datasets used in this study in Section 3.1. We then describe the computer vision model we employed, together with its performance metrics, in Section 3.2 and Section 3.3, respectively. The objective function considered for the hyperparameter tuning procedure and the optimization method we utilized are described in Section 3.4 and Section 3.5, respectively. Section 3.6 describes our inspection procedure. Finally, we discuss our observations in Section 4.

2. Related Work

As previously mentioned, to ensure aircraft airworthiness and a timely return to service, NDT and SHM are heavily utilized within MRO workflows. Specifically, during the periodic MRO checks, NDT techniques are employed to detect damage without compromising the structural integrity of the aircraft. Comprehensive reviews of NDT techniques, such as visual, ultrasonic, thermographic, radiographic, electromagnetic, acoustic emission, and shearography testing, for composite materials are provided by [3,5]. These techniques are particularly relevant in modern aircraft structures, as composite airframes are being widely adopted. For instance, Ref. [15] demonstrated the use of Pulse Phase Thermography and Total Harmonic Distortion techniques to improve the efficiency of defect detection on honeycomb sandwich composite materials, which are widely used in aircraft. In another example, Ref. [16] proposed a Lightweight Magnetic Convolutional Neural Network to detect hidden corrosion during electromagnetic testing on aircraft structures. For the specific use case of structural aircraft components, Ref. [5] also briefly outlined different NDT modalities that could be selected during the inspection. In parallel, SHM extends inspection capabilities beyond periodic checks by employing embedded or externally mounted sensors to support the continuous or on-demand assessment of aircraft structural integrity. This paradigm enables a shift from schedule-based maintenance toward condition-based maintenance, offering potential reductions in operational downtime, inspection burdens, and maintenance costs [1,2]. The integration of Machine Learning (ML) techniques into SHM frameworks for aircraft inspections has also been reviewed [17].
However, while many NDT and SHM methods are effective for internal and structural flaws, the vast majority of surface-level anomalies are still identified through GVI. Because manual GVI is labor-intensive and susceptible to human fatigue, there is a critical need for automated DL solutions that can provide the same level of reliability as traditional NDT while significantly reducing inspection lead times [11]. For example, the use of Unmanned Aerial Vehicles (UAVs) equipped with computer vision technology for defect detection and fault diagnosis has demonstrated significant potential in other large-scale structures, such as wind turbine blades [18]. For the aircraft GVI application, Ref. [10] showed that the use of UAVs eases the inspection process on hard-to-access surfaces and standardizes the image-capturing process for the inspections. Following this trend, the aviation industry and academia are increasingly focused on integrating DL technologies, paired with autonomous platforms, into defect detection processes [11,19,20,21].
Nonetheless, the practical integration of DL technologies must overcome significant domain-specific hurdles, such as image acquisition issues, class imbalance, and data scarcity. Here, recent studies addressing similar problems in other application domains may offer valuable insights. For example, to improve the quality of images captured using UAVs, the stabilizing control developed by [22] to suppress oscillations due to the sloshing dynamics in liquid-carrying UAVs can be adapted to stabilize the UAV positioning required for high-precision aerial monitoring and surface inspection tasks. Alternative autonomous platforms, such as climbing robots [23], can also be considered to maintain standoff consistency, enabling stable imaging and interpretation on curved aircraft skins. Furthermore, the automation of UAV-based inspections can also be enhanced through Reinforcement Learning (RL) for UAV path self-planning that optimizes flight trajectories across different aircraft models while ensuring full coverage of the areas of interest [19].
To mitigate undesired training biases due to the imbalanced class distributions in the datasets, resampling strategies are often employed. For instance, the Adaptive Synthetic Minority Oversampling TEchnique based on Edited Nearest Neighbors (ASMOTE-ENN), proposed by [24] to deal with uneven distributions between faulty and faultless wind turbine datasets, is relevant to the imbalanced defect type distributions on aircraft skin. In the aviation field, a similar oversampling strategy has also been explored by [25] to predict aircraft fastener assembly under imbalanced dataset availability. Using the opposite strategy, undersampling of uneven datasets, such as the method explored further by [26], although often avoided in aviation to prevent the loss of critical environmental variance in the majority (healthy) class, can serve as data-cleaning to reduce model bias and noise in the decision boundary. Alternatively, the integration of multiscale attention mechanisms and dynamic loss functions into an established feature extraction and reconstruction network, such as the Global Wavelet-integrated Residual Frequency Attention Regularized Network (GWRFARN) proposed by [27], can also be employed to address the data imbalance problem. Meanwhile, to tackle the data scarcity problem, data augmentation strategies, such as the one employed by [28,29], can be utilized to increase the size of the existing data. In a different direction, strategies to mitigate the impact of limited training samples, such as the Adaptive Fused Domain-cycling Variational Generative Adversarial Network (AFDVGAN) developed by [30] that integrates smooth-regularized variational learning with a ratio-controlled domain-cycling mechanism and an adaptive data fusion strategy, can also be applied.
Regarding perception, multiple studies utilize computer vision to automate typical GVI findings and to structure the inspection pipeline. This includes the automatic detection and verification of exterior aircraft structural elements as a prerequisite for reliable anomaly-checking [6], as well as defect detection on aircraft surfaces using object detection and instance-segmentation approaches [31,32]. Specific computer vision architectures designed for aircraft inspection have also been proposed, e.g., YOLO-FDD [33] and FC-YOLO [34], to improve the detection of small and safety-critical skin defects. A collaborative framework combining UAV path planning with a lightweight detection network, such as the one proposed by [35], promotes a closed-loop system with real-world industrial applicability. A survey of current developments in aircraft skin defect detection utilizing computer vision technologies is provided by [14].
Given the increasing popularity of computer vision for GVI inspections and considering that the performance of DL architectures is highly sensitive to hyperparameter configurations, the adoption of automated optimization frameworks for hyperparameter tuning is a necessity [12,13]. Within the aeronautical domain, optimization strategies are increasingly utilized to handle complex design and inspection challenges. For instance, Ref. [36] incorporated a Genetic Algorithm (GA) into their propeller design optimizer technology, OpenVINT 5. Similarly, in aero-engine inspection, the DDSC-YOLOv5 architecture utilizes GA-based anchor optimization to better match defect geometry [37]. In the specific context of aircraft surface inspection, Refs. [21,38] demonstrated performance enhancements in computer vision-based aircraft surface inspection through explicit hyperparameter tuning.
Because the objective function in automated optimization strategies is generally computationally expensive, traditional optimization approaches, such as grid search or random search, often lack efficiency in high-dimensional search spaces. Consequently, sequential model-based optimization, such as Bayesian Optimization (BO), is frequently employed to reduce the number of function calls by iteratively updating a surrogate model to find the approximate optimum of the original computationally expensive function, as demonstrated by [39,40,41,42]. In addition to model-based methods like BO, nature-inspired metaheuristics have gained attention for their ability to navigate complex, multi-dimensional search spaces in defect detection applications. A notable example of this is the Amended Gorilla Troop Optimization (AGTO) algorithm proposed by [43] to adaptively tune CNNs for mechanical fault diagnosis. Other strategies include the Nutcracker Optimization Algorithm (NOA) and Swarm Intelligence, which balance exploration and exploitation in global search tasks, as employed by [18,19].
In addition to the optimization strategy, deciding the form of the objective function is also an essential step in the hyperparameter tuning procedure. Despite its importance, most efforts have focused on the tuning procedure itself, with less attention paid to standardizing the objective function that combines the various performance metrics, e.g., Precision, Recall, and mean Average Precision (mAP) in computer vision models. For instance, one might expect to obtain an optimum Recall score when using it as the sole objective function during the tuning process. However, it has been shown by [14] that using a weighted average objective function with a more balanced distribution of weights between Precision, Recall, and mAP resulted in a more robust overall performance than giving full emphasis to only one of the performance metrics. Nevertheless, that study only tested four extremes of the weight-distribution spectrum and did not systematically explore the distribution landscape to identify the influence of different weight distributions on the hyperparameter tuning procedure. Therefore, exploring the effect of varying these weight distributions is crucial for effective hyperparameter tuning.

3. Objectives and Settings

3.1. Experiment Datasets

Two publicly available datasets were used in this preliminary study, which, for simplicity, we will refer to as Dataset I [44] and Dataset II [45] throughout this paper. Dataset I consists of 688 training, 197 validation, and 98 test images extracted from video footage of an old Boeing 747 taken by a drone. Each of the images in Dataset I was resized to 640 × 640 pixels. The visual defects in Dataset I were annotated and classified into one of three defect categories: (1) rust, (2) missing-head, and (3) scratch. Example images for these three defect types in Dataset I can be seen in Figure 1. Meanwhile, Dataset II consists of 9651 training, 642 validation, and 429 test images of multiple defective airplanes. The defects in Dataset II were annotated and classified into one of five defect categories: (1) missing-head, (2) scratch, (3) crack, (4) dent, and (5) paint-off. Figure 2 displays example images of the five defect types found in Dataset II. Table 1 provides a summary of the number of images used in the training, validation, and testing pools, together with the defect instances in Datasets I and II.
Given the large size of Dataset II, to reduce training cost during the tuning process, the original pools of training and validation images were randomly split into 20 smaller pools of 500 training and 200 validation images each. During each training process for a tuning iteration, a pair of these training and validation pools was randomly selected and used as a surrogate for the full dataset. The objective function score to maximize for each tuning step was calculated based on the performance of the model tested on the entire 429 test images of Dataset II. This approach allowed us to reduce the computational cost of individual training processes while ensuring that all of Dataset II was utilized across the complete tuning procedure. By dynamically sampling these pools across the tuning iterations, the optimizer is exposed to the broader distribution of the full dataset, mitigating the risk of overfitting to a single non-representative subset.
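For concreteness, this pooling strategy can be sketched as follows in Python; the directory paths, file pattern, and helper name are illustrative assumptions rather than the exact tooling used in this study, and pools are drawn with possible overlap since the combined pool sizes exceed an even partition of the original splits.

```python
import random
from pathlib import Path

def build_pools(train_dir, val_dir, n_pools=20, n_train=500, n_val=200, seed=0):
    """Draw smaller surrogate training/validation pools from the full Dataset II splits.

    Paths and the file pattern are illustrative; pools may overlap because the
    combined pool sizes exceed the number of available images.
    """
    rng = random.Random(seed)
    train_imgs = sorted(Path(train_dir).glob("*.jpg"))
    val_imgs = sorted(Path(val_dir).glob("*.jpg"))
    return [
        {
            "train": rng.sample(train_imgs, n_train),
            "val": rng.sample(val_imgs, n_val),
        }
        for _ in range(n_pools)
    ]

# During each tuning iteration, one pool is selected at random as a surrogate for
# the full dataset; the objective score is still computed on the full test split.
# pools = build_pools("datasetII/train/images", "datasetII/valid/images")
# pool = random.choice(pools)
```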
Additionally, we are also interested in comparing the influence of tuning emphases when using a smaller subset of a dataset. Here, we generated two new, smaller datasets by randomly selecting 100 training, 50 validation, and 50 testing images from Dataset I and 350 training, 150 validation, and 100 testing images from Dataset II. For simplicity, throughout this paper, we will refer to these new datasets as Dataset III and Dataset IV, the subsets of Dataset I and Dataset II, respectively. These smaller subsets were utilized as computational surrogates to investigate the stability of hyperparameter optimization trends under data-constrained conditions, i.e., whether the trends in the full datasets could be recovered using a smaller data volume. While it is acknowledged that small-scale sampling introduces a potential risk of distributional shift, this approach aims to determine whether the relative influence of different weight factors remains consistent across scales. This analysis helps evaluate the feasibility of using computational surrogates to reduce the time required for optimization without sacrificing the validity of the tuning trajectory. A summary of the number of images used in the training, validation, and testing pools, together with defect instances in Datasets III and IV, is provided in Table 2.

3.2. Computer Vision Model

As mentioned, in this study, we are interested in observing the influence of performance metrics in hyperparameter tuning for a computer vision model in the aircraft skin defect detection application. Although other families of computer vision algorithms, such as instance segmentation, can be applied and have been explored for aircraft skin defect detection (e.g., [31,46]), given the constraints of our annotated datasets and limited resources, we chose to employ the object detection family of algorithms in this preliminary study.
In our current study, we specifically chose to employ a version of the You Only Look Once (YOLO) algorithm, a member of the well-known family of single-stage real-time object detectors. The original YOLO algorithm, introduced by [47] in 2015, was the first one-stage model to achieve true real-time object detection. Since its introduction, it has gained widespread popularity, leading to numerous advancements and adaptations that have further enhanced its speed, accuracy, and versatility. A comprehensive overview of the YOLO series up to YOLOv12, together with a summary of their architectures and applications, can be found in [48]. In early 2025, Ref. [49] proposed a YOLO framework that focuses on the attention layer mechanism, called YOLOv12. By adopting this attention-centric design, YOLOv12 is able to harness the performance benefits of attention mechanisms while retaining real-time detection speeds comparable to its CNN-based predecessors, outperforming previous versions of the YOLO series. YOLOv12 has also been reported to surpass algorithms that leverage the DEtection TRansformer (DETR) strategy in real-time detection settings. In our study, we employ the YOLOv12 algorithm as implemented in the Ultralytics library [50]. Detailed information on the algorithm implementation utilized in this study can be found in [51].
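As a point of reference, a single training run with the Ultralytics implementation can be invoked as in the following minimal sketch; the checkpoint name, dataset configuration file, and hyperparameter values shown here are placeholders rather than the exact settings used in this study.

```python
from ultralytics import YOLO

# Minimal sketch of one training run with the Ultralytics implementation of YOLOv12.
model = YOLO("yolo12n.pt")  # nano variant; other model scales follow the same API
results = model.train(
    data="aircraft_defects.yaml",  # hypothetical dataset config (train/val paths, class names)
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,
)
metrics = model.val()  # evaluates Precision, Recall, and mAP on the validation split
```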

3.3. Performance Metrics

We used Precision (P), Recall (R), and mean Average Precision (mAP), as described in [52], as the performance metrics of the computer vision model we considered. Here, we calculated the model's Precision and Recall based on the Intersection over Union (IoU) of the actual and predicted bounding boxes. Given a ground truth bounding box with area $B_{truth}$ pixel units that tightly encloses a defect on the image, the IoU of the predicted bounding box with area $B_{pred}$ pixel units is defined as follows:

$$\mathrm{IoU} = \frac{B_{pred} \cap B_{truth}}{B_{pred} \cup B_{truth}}. \tag{1}$$

We then defined the notion of True Positive ($TP$) as the number of instances with IoU higher than 50% and the correct defect classification. The Precision (P) can then be computed as

$$P = \frac{TP}{N_{B_{pred}}}, \tag{2}$$

where $N_{B_{pred}}$ denotes the total number of predicted bounding boxes. Here, P quantifies the ability of the model to correctly classify the detected defects. On the other hand, the Recall (R) is defined as follows:

$$R = \frac{TP}{N_{B_{truth}}}, \tag{3}$$

with $N_{B_{truth}}$ denoting the actual number of ground truth bounding boxes present in the images. Here, R signifies the ability of the model to detect the defects present in the given images. The model's mean Average Precision (mAP) is then calculated as the average of the area under the precision–recall curve across all defect types, i.e.,

$$\mathrm{mAP} = \frac{1}{N_D} \sum_{i=1}^{N_D} \int_0^1 P_i(R)\, dR. \tag{4}$$

Here, $N_D$ is the number of defect types considered for a given dataset, while $i$ is the index of the defect types. The mAP approximation implemented by [52] was employed to obtain the mAP score in this study. We denote the mAP of the models in this study as mAP50 because we calculated mAP at the 50% IoU threshold.
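To make the IoU-based counting concrete, the following minimal sketch illustrates how IoU, Precision, and Recall can be computed for axis-aligned boxes; in this study, the actual scores (including the mAP50 approximation) were taken from the Ultralytics implementation [52], so the helper functions below are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, n_pred, n_truth):
    """Precision and Recall given the number of true positives
    (predictions with IoU > 0.5 and the correct defect class)."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_truth if n_truth else 0.0
    return precision, recall

# Example: a prediction partially overlapping a ground-truth box
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))            # 25 / 175 ~ 0.143 -> not a true positive
print(precision_recall(tp=8, n_pred=10, n_truth=12))  # (0.8, 0.667)
```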

3.4. Hyperparameter Tuning Objective

The hyperparameter tuning procedure can be performed by optimizing a certain objective function that captures the desired performance metrics. Thus, using a specific performance metric, e.g., P, R, or mAP50, as the sole objective function is a possible option. However, ref. [14] showed that solely optimizing one of these metrics yields suboptimal results compared to using a weighted average objective function with a more balanced distribution of weights between the metrics. Alternatively, a scoring system that encapsulates Precision and Recall while still allowing for different emphases between the two metrics, such as the family of Fβ-scores, can also be employed. However, although mAP is calculated from Precision and Recall scores, considering it explicitly might capture nuanced differences in the shape of the precision–recall curve, which might lead to a more optimal performance. Thus, given the multifaceted metric evaluation of an object detection algorithm, we employed a linear weighted average, one of the common methods to combine the three performance metrics described in Section 3.3 into one scalar value $F$, i.e.,

$$F = \begin{bmatrix} w_P & w_R & w_{\mathrm{mAP}_{50}} \end{bmatrix} \begin{bmatrix} P \\ R \\ \mathrm{mAP}_{50} \end{bmatrix}. \tag{5}$$

Here, the vector of performance metrics $[P \;\; R \;\; \mathrm{mAP}_{50}]^\top$ is obtained from the training procedure of the computer vision model with a given set of hyperparameters. Specifically, we considered the hyperparameter space $\mathcal{H}$ that comprises 29 hyperparameters, as listed in Table 3.
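A minimal sketch of Equation (5) as a Python function is given below; the function and variable names are ours for illustration, and the metric vector would in practice come from a YOLOv12 training and evaluation run.

```python
def weighted_objective(metrics, weights):
    """Scalar objective F from Equation (5): the dot product of the weight
    vector [w_P, w_R, w_mAP50] with the metric vector [P, R, mAP50]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weight factors are constrained to sum to 1"
    return sum(w * m for w, m in zip(weights, metrics))

# Example: equal emphasis on all three metrics
print(weighted_objective([0.70, 0.72, 0.70], [1 / 3, 1 / 3, 1 / 3]))  # ~0.7067
```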

3.5. Optimization Procedure

The hyperparameter tuning procedure can then be carried out by solving an optimization problem that minimizes the negative of $F$ in Equation (5), for a given vector of weight factors $[w_P \;\; w_R \;\; w_{\mathrm{mAP}_{50}}]$, over the hyperparameter space $\mathcal{H}$, which serves as the design variables, i.e.,

$$H^* = \underset{H \in \mathcal{H}}{\arg\min} \; -F(H). \tag{6}$$

Here, $H^*$ denotes the optimum set of hyperparameters. The training process required to obtain the vector of performance metrics $[P \;\; R \;\; \mathrm{mAP}_{50}]$ is treated as a black-box function. Because this training process is generally computationally expensive, sequential model-based optimization is employed to ease the computational burden during the optimization scheme. To efficiently and effectively search the hyperparameter space while keeping the computational budget minimal, we employed the Bayesian Optimization (BO) algorithm to solve the optimization problem (6) for each given vector of weight factors. Here, the algorithm reduces the number of function calls by iteratively updating a surrogate model to find the approximate optimum of the original computationally expensive function. We specifically employed the BO algorithm as implemented by the gp_minimize function available in [54].
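The following sketch illustrates how such a tuning loop can be wired to gp_minimize from scikit-optimize [54]; only three of the 29 hyperparameters in Table 3 are included, and train_and_evaluate is a hypothetical stub standing in for the expensive black-box YOLOv12 training run.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Schematic sketch of one tuning run with scikit-optimize's gp_minimize.
space = [
    Integer(4, 32, name="batch"),
    Real(1e-10, 1e-1, prior="log-uniform", name="lr0"),
    Real(0.0, 1.0, name="momentum"),
]

weights = (1 / 3, 1 / 3, 1 / 3)  # one of the 26 weight vectors inspected in this study

def train_and_evaluate(params):
    # Placeholder returning dummy metrics so the sketch executes; in the real
    # procedure this trains YOLOv12 with `params` and evaluates on the test split.
    batch, lr0, momentum = params
    return 0.6, 0.6 + 0.05 * momentum, 0.6

def objective(params):
    p, r, map50 = train_and_evaluate(params)
    f = weights[0] * p + weights[1] * r + weights[2] * map50  # Equation (5)
    return -f  # gp_minimize minimizes, so the negative of F is returned (Equation (6))

result = gp_minimize(objective, space, n_calls=15, random_state=0)
print(result.x, -result.fun)  # approximate optimum hyperparameters and the best F found
```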

3.6. Inspection Procedure

To observe the impact of different emphases on performance metrics during the hyperparameter tuning procedure, we employed a systematic sampling strategy to map the distribution landscape of the weight factor vectors $[w_P \;\; w_R \;\; w_{\mathrm{mAP}_{50}}]$. Given that the weight factors are constrained such that $w_P + w_R + w_{\mathrm{mAP}_{50}} = 1$, the search space constitutes a 2-simplex. To adequately represent this continuous space, we utilized a coarse-to-fine grid approach involving 26 discrete samples. Here, a systematic coverage of the landscape was established by sampling 21 vectors of weight factors, iterating the weight of each performance metric from zero to one in intervals of 0.2. This grid ensured that the boundaries and the internal regions of the simplex were represented evenly. We then further refined the inspection in high-interest regions by considering five additional vectors. Three of these vectors represent slight shifts in emphasis near the center of the landscape, i.e., [0.3, 0.3, 0.4], [0.3, 0.4, 0.3], and [0.4, 0.3, 0.3]. The final two vectors serve as critical baselines: one provides perfectly equal emphasis to all metrics, [1/3, 1/3, 1/3], and one focuses solely on the balanced trade-off between Precision and Recall, [0.5, 0.5, 0.0].
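A minimal sketch of this coarse-to-fine sampling of the 2-simplex is given below; the enumeration reproduces the 21-point grid and appends the five additional vectors described above.

```python
import itertools

# Sketch of the coarse-to-fine sampling of the weight simplex described above:
# a 21-point grid in steps of 0.2 plus five additional vectors of interest.
grid = [
    (wp / 5, wr / 5, (5 - wp - wr) / 5)          # (w_P, w_R, w_mAP50), summing to 1
    for wp, wr in itertools.product(range(6), repeat=2)
    if wp + wr <= 5
]
extra = [
    (0.3, 0.3, 0.4), (0.3, 0.4, 0.3), (0.4, 0.3, 0.3),  # refinements near the center
    (1 / 3, 1 / 3, 1 / 3),                              # perfectly equal emphasis
    (0.5, 0.5, 0.0),                                    # Precision/Recall trade-off only
]
weight_vectors = grid + extra
print(len(grid), len(weight_vectors))  # -> 21 26
assert all(abs(sum(w) - 1.0) < 1e-9 for w in weight_vectors)
```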
We then observed the improvement (or degradation) of each performance metric as we altered the weight factors by subtracting the performance scores obtained before hyperparameter tuning from those obtained after the tuning process. The approximate influence landscape was created by interpolating the change in performance from the 26 samples using the triangulation-based natural neighbor interpolation method implemented in the MATLAB griddata() function with the “natural” method [55].
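For illustration, an analogous landscape interpolation can be set up in Python as sketched below; note that SciPy's griddata does not provide MATLAB's natural-neighbor method, so the linear triangulation-based method is used here only as an approximation, and the sampled performance changes are dummy values rather than the results reported in Figure 3.

```python
import numpy as np
from scipy.interpolate import griddata

# Illustrative Python analogue of the landscape construction described above.
# Because w_mAP50 = 1 - w_P - w_R, the barycentric pair (w_P, w_R) suffices.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1 / 3, 1 / 3], [0.5, 0.5]])
delta_metric = np.array([0.01, -0.05, 0.00, 0.02, 0.015])  # dummy performance changes

wp, wr = np.meshgrid(np.linspace(0, 1, 101), np.linspace(0, 1, 101))
landscape = griddata(points, delta_metric, (wp, wr), method="linear")
landscape[wp + wr > 1.0] = np.nan  # mask the region outside the 2-simplex
```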

4. Results and Discussion

Figure 3 shows the performance improvements and degradations that resulted after the use of hyperparameter tuning procedures with different emphases in the performance metrics during the tuning process. The performance improvements and degradations were calculated by subtracting the performance of the model before tuning from the performances after the hyperparameter tuning procedure with the 26 different vectors of weight factors in the objective function, shown as scatter points in Figure 3. Here, positive values show an improvement while negative values reveal a degradation in the performance, with the standard deviations of the performance changes for each dataset also being displayed. The approximate influence landscapes of different emphases in performance metrics during the tuning procedure are displayed by interpolating the improvement and degradation of the 26 performance changes.
The performance changes displayed in Figure 3 indicate that not all weight factors resulted in performance improvements, highlighting the importance of carefully choosing the emphasis distribution during the hyperparameter tuning process. Intuitively, the performance of the model should not degrade after a tuning procedure because the optimization algorithm will always keep the original hyperparameters if it cannot find a better score. However, for a given vector of weight factors employed to calculate the weighted objective function F during a tuning process, a higher score for F might be achieved despite a decrease in a particular individual metric, which then appears as a degradation in that aspect of the model's overall performance. Thus, without careful consideration of the weight factor distribution, hyperparameter tuning can actually harm the performance of the model, underscoring the importance of selecting an appropriate vector of weight factors to avoid degradation in the desired metric. Moreover, given the relatively large area of the influence landscape that resulted in performance degradation, as shown in Figure 3, further study to inspect the full influence landscape is needed.
Table 4 and Table 5 show the resulting Precision, Recall, and mAP50 of YOLOv12 in detecting defects for all four datasets before and after tuning procedures employing the 26 different vectors of weight factors in the optimization objective function. The highest values for each performance metric resulting from the tuning procedures are indicated by the gray-shaded cells. We show the performances of Datasets I and III together in Table 4 to highlight the performance differences between Dataset I and its subset, Dataset III. For a similar reason, we display the performances of Datasets II and IV together in Table 5.
Comparing the best performances across Datasets I and III, as well as Datasets II and IV, we can see that the resulting Precision scores of the original-size datasets and their smaller subsets are similar, but the Recall scores are significantly degraded, which in turn degrades their mAP50 scores. However, in this preliminary study, we only observed the performances of one subset from each original dataset. Therefore, further inspection is needed to determine whether there is a subset size whose resulting performance metrics do not differ significantly from those of the full dataset, keeping in mind that each function call during the hyperparameter tuning procedure can be computationally expensive. Finding a smaller dataset that achieves, after tuning, a performance similar to that obtained when all the data are used is especially beneficial, since this reduces the computational cost of each function call during the optimization iterations. For example, we observed that the average computational time for each function call was reduced from 0.6 h for Dataset I to 0.2 h for Dataset III, and from 6.6 h for Dataset II to 0.4 h for Dataset IV. We conducted this study using training iterations on a machine with an AMD EPYC 7543 32-Core Processor, 503.7 GB of memory, and two NVIDIA L40S GPUs.
Looking at the performances of the four datasets after tuning, we cannot identify a definitive universal pattern of weight factors that results in performance improvement across all datasets. Nevertheless, we noted five vectors of weight factors of interest, i.e., [0.3, 0.3, 0.4], [0.5, 0.5, 0.0], [0.0, 0.4, 0.6], [0.3, 0.4, 0.3], and [0.4, 0.4, 0.2].
Weight factors of [0.3, 0.3, 0.4] and [0.5, 0.5, 0.0] are, out of the 26 considered vectors, the ones that resulted in at least one metric improvement across all four datasets. Although these two distributions did not lead to the highest improvements across all four datasets, they can serve as initial “safe spots” for further exploration, as the resulting performances were not significantly different from the highest performance. Here, [0.3, 0.3, 0.4] improved three performance metrics in Dataset I and one performance metric in each of Datasets II, III, and IV, highlighting the utility of using more evenly distributed weight factors for the performance metrics. The weight factors [0.5, 0.5, 0.0] place equal emphasis on Precision and Recall only, which is analogous to the F1-score. This suggests the merit of exploring the β-landscape used in the general Fβ-score as an alternative objective. This is crucial, especially since the simple arithmetic weighted mean of the performance metrics, used as the objective to optimize during the hyperparameter tuning procedure, is insensitive to a low value in one of the metrics, leading to undesired, unbalanced performance, e.g., a high score in one metric and a significantly low or zero score in another. For example, consider three cases where P = {1, 1, 0.2} and R = {1, 0, 0.8}; the arithmetic averages of the three cases with equal weights are {1, 0.5, 0.5}, respectively, which could mislead the optimizer in the second and third cases. Comparing these arithmetic averages to the F1-scores of the three cases, {1, 0, 0.32}, respectively, the F1-score provides a more balanced signal for the optimizer to improve both Precision and Recall. To demonstrate this, a comparative optimization was conducted on Dataset I using the F1-score as the objective function. While the weighted arithmetic mean with weight factors of [0.5, 0.5, 0.0] achieved a higher Precision (0.7887) at the expense of Recall (0.67), the F1-score objective produced a more balanced profile (P = 0.7319, R = 0.7199, mAP50 = 0.7044). This confirms that the harmonic nature of the F1-score effectively penalizes extreme disparities between metrics, preventing the optimizer from neglecting one metric to maximize another. Thus, for applications where we do not want to sacrifice any of the metric performances, the family of Fβ-scores is a more appropriate objective function that can be explored further in the future.
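The numerical comparison above can be reproduced with a few lines of Python, shown here only to illustrate how the harmonic F1-score penalizes the imbalanced cases that the equal-weight arithmetic mean masks.

```python
def arithmetic_mean(p, r):
    """Equal-weight arithmetic mean of Precision and Recall (w_mAP50 = 0)."""
    return 0.5 * (p + r)

def f1(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# The three illustrative (P, R) cases discussed above.
for p, r in [(1.0, 1.0), (1.0, 0.0), (0.2, 0.8)]:
    print(f"P={p:.1f}, R={r:.1f}: mean={arithmetic_mean(p, r):.2f}, F1={f1(p, r):.2f}")
# -> mean = 1.00, 0.50, 0.50 while F1 = 1.00, 0.00, 0.32: F1 exposes the imbalance
```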
Considering only the original-size datasets, [0.3, 0.4, 0.3] and [0.0, 0.4, 0.6] showed the highest number of metric improvements in Datasets I and II, i.e., three in Dataset I and two in Dataset II. Meanwhile, the vector of weight factors [0.4, 0.4, 0.2] provided the highest improvement across all three performance metrics for Dataset I. For safety-critical applications such as aircraft defect detection, where minimizing false negatives is paramount, the weight factor distributions that led to the highest improvement in R are desirable. Here, [0.3, 0.4, 0.3] led to the highest averaged R improvement across both Datasets I and II compared to the other 25 observed weight factor vectors. The vectors of weight factors that resulted in the highest Recall improvement are [0.4, 0.4, 0.2] and [0.2, 0.4, 0.4] for Datasets I and II, respectively.

5. Limitations and Future Works

While this study provides insight into the influence of varying metric emphases during the hyperparameter tuning process, several limitations should be acknowledged. Primarily, the five weight vectors of interest that were identified are empirical observations derived from a specific experimental framework involving the YOLOv12 architecture and the BO solver with specific datasets. The generalizability and optimality of the reported results across different computer vision architectures, alternative optimization strategies, or different datasets remain to be verified. As our observation method is model-agnostic, with the computer vision architecture treated as a black-box function during the tuning procedure, alternative state-of-the-art architectures could be verified using a similar framework. Cross-checking the optimum hyperparameters obtained by BO against other optimization strategies, such as evolutionary-based and swarm intelligence algorithms, or performing multiple optimization runs from different starting points, could effectively validate the optimality of the converged performance. Alternatively, optimization algorithms with distinct convergence properties, such as the AGTO [43] method, could also be employed to ensure that the global “safe spot” trend is identified rather than local minima.
Due to the high computational cost of the tuning procedure, the current study utilized a structured sampling of 26 weight vectors on the 2-simplex, as described in Section 3.6. While the current grid provides a representative skeleton of the landscape, denser sampling strategies are required to represent the continuous space more exhaustively. Furthermore, to identify statistically significant differences between the resulting performances, tests such as the t-test or ANOVA, applied to multiple tuning iterations, are essential once a sufficient computational budget is available. Further study to reduce computational overhead could involve finding the smallest subset size of a dataset that maintains the distributional characteristics and recovers a consistent optimization trajectory of the full dataset without sacrificing tuning validity. Additionally, a detailed post hoc sensitivity analysis to rank the relative importance of individual hyperparameters remains a critical area for future study. Identifying which specific parameters are the primary drivers of performance in aircraft defect detection would allow for the development of more streamlined search spaces, further reducing the computational cost of the tuning process.
As previously discussed in Section 4, the simple arithmetic weighted average of the performance metrics presented in Equation (5) is relatively insensitive to unbalanced performance metrics. Thus, alternative forms of objective functions that penalize extreme disparities between metrics, such as the Fβ-score, may be explored further. The objective function could also be designed to maximize certain desired metrics while keeping the other metrics as hard optimization constraints. Such constrained objective functions would ensure that the optimizer cannot converge on configurations that compromise certain metrics, such as the Recall score, which encompasses safety-critical detection capabilities. Finally, the integration of different NDT modalities, such as acoustic emission or ultrasonic sensors, into an automated inspection pipeline employing deep learning technologies for a holistic view of aircraft health also warrants further study regarding hyperparameter tuning.

6. Conclusions

In this paper, we presented our early investigation of the influence of varying metric emphases during the hyperparameter tuning process on the overall performance improvement and degradation of a computer vision model employed for aircraft skin defect detection. In this preliminary study, we employed YOLOv12 as the detection architecture and the Bayesian Optimization (BO) method as the hyperparameter optimizer. We created the approximate influence landscape by interpolating 26 points of observation. We highlighted the possible performance degradation of the model after the hyperparameter tuning procedure when the distribution of weight factors for the performance metrics is not carefully considered, motivating further exploration in the future. We showed the possible merit in finding a minimum-size dataset subset capable of yielding an insignificant performance difference compared to using the whole dataset, thereby reducing the computational cost of hyperparameter tuning calibration. Although a definitive universal pattern for the vector of weight factors that yields a performance improvement across all datasets could not be determined in this preliminary observation, we noted five weight vectors of interest, i.e., [0.3, 0.3, 0.4], [0.5, 0.5, 0.0], [0.0, 0.4, 0.6], [0.3, 0.4, 0.3], and [0.4, 0.4, 0.2]. We also noted that our current method of compiling all performance metrics using a weighted arithmetic average may lead to unbalanced performance due to its insensitivity to significantly small or zero values. Thus, we suggest further exploration into the β-landscape of the Fβ-score in future studies. Additionally, unlike the current study, which considered only one optimization algorithm, i.e., BO, future work can also explore the use of different algorithms for additional cross-checking. We acknowledge the limitations of the current study in considering only one computer vision algorithm, i.e., YOLOv12; however, we hope this preliminary observation provides a foundational guideline for future research in hyperparameter tuning calibration to achieve optimal and unambiguous aircraft skin defect detection performance.

Author Contributions

C.K.: Conceptualization, Writing—original draft, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—review and editing. N.S.: Conceptualization, Writing—original draft, Methodology, Data curation, Writing—review and editing. D.W.S.: Project administration, Resources, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this research can be found at: Dataset I: https://universe.roboflow.com/sutd-4mhea/aircraft-ai-dataset (downloaded on 22 March 2024); Dataset II: https://universe.roboflow.com/innovation-hangar/innovation-hangar-v2/dataset/1 (downloaded on 24 April 2024).

Acknowledgments

During the preparation of this manuscript, the authors used Google Gemini Flash 2.5 for the purposes of grammatical error corrections. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cusati, V.; Corcione, S.; Memmolo, V. Impact of Structural Health Monitoring on Aircraft Operating Costs by Multidisciplinary Analysis. Sensors 2021, 21, 6938. [Google Scholar] [CrossRef] [PubMed]
  2. Ballarin, P.; Sala, G.; Airoldi, A. Cost-Effectiveness of Structural Health Monitoring in Aviation: A Literature Review. Sensors 2025, 25, 6146. [Google Scholar] [CrossRef]
  3. Gholizadeh, S. A Review of Non-Destructive Testing Methods of Composite Materials. Procedia Struct. Integr. 2016, 1, 50–57. [Google Scholar] [CrossRef]
  4. Wronkowicz-Katunin, A. A Brief Review on NDT&E Methods For Structural Aircraft Components. Fatigue Aircr. Struct. 2018, 2018, 73–81. [Google Scholar] [CrossRef]
  5. Rahman, M.S.U.; Hassan, O.S.; Mustapha, A.A.; Abou-Khousa, M.A.; Cantwell, W.J. Inspection of Thick Composites: A Comparative Study Between Microwaves, X-Ray Computed Tomography and Ultrasonic Testing. Nondestruct. Test. Eval. 2024, 39, 2054–2071. [Google Scholar] [CrossRef]
  6. Leiva, J.R.; Villemot, T.; Dangoumeau, G.; Bauda, M.A.; Larnier, S. Automatic Visual Detection and Verification of Exterior Aircraft Elements. In Proceedings of the 2017 IEEE International Workshop of Electronics, Control, Measurement, Signals and Their Application to Mechatronics (ECMSM), Donostia-San Sebastian, Spain, 24–26 May 2017; pp. 1–5. [Google Scholar] [CrossRef]
  7. Yasuda, Y.D.; Cappabianco, F.A.; Martins, L.E.G.; Gripp, J.A. Aircraft visual inspection: A systematic literature review. Comput. Ind. 2022, 141, 103695. [Google Scholar] [CrossRef]
  8. Sprong, J.P.; Jiang, X.; Polinder, H. Deployment of Prognostics to Optimize Aircraft Maintenance—A Literature Review. J. Int. Bus. Res. Mark. 2020, 5, 26–37. [Google Scholar] [CrossRef]
  9. Liu, Y.; Dong, J.; Li, Y.; Gong, X.; Wang, J. A UAV-Based Aircraft Surface Defect Inspection System via External Constraints and Deep Learning. IEEE Trans. Instrum. Meas. 2022, 71, 5019315. [Google Scholar] [CrossRef]
  10. Papa, U.; Ponte, S. Preliminary Design of An Unmanned Aircraft System for Aircraft General Visual Inspection. Electronics 2018, 7, 435. [Google Scholar] [CrossRef]
  11. Le Clainche, S.; Ferrer, E.; Gibson, S.; Cross, E.; Parente, A.; Vinuesa, R. Improving Aircraft Performance Using Machine Learning: A Review. Aerosp. Sci. Technol. 2023, 138, 108354. [Google Scholar] [CrossRef]
  12. Probst, P.; Boulesteix, A.L.; Bischl, B. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. J. Mach. Learn. Res. 2019, 20, 1–32. [Google Scholar]
  13. Weerts, H.J.P.; Mueller, A.; Vanschoren, J. Importance of Tuning Hyperparameters of Machine Learning Algorithms. arXiv 2020, arXiv:2007.07588. [Google Scholar] [CrossRef]
  14. Suvittawat, N.; Kurniawan, C.; Datephanyawat, J.; Tay, J.; Liu, Z.; Soh, D.W.; Ribeiro, N.A. Advances in Aircraft Skin Defect Detection Using Computer Vision: A Survey and Comparison of YOLOv9 and RT-DETR Performance. Aerospace 2025, 12, 356. [Google Scholar] [CrossRef]
  15. Liu, G.; Gao, W.; Liu, W.; Xu, J.; Li, R.; Bai, W. LFM-Chirp-Square Pulse-Compression Thermography for Debonding Defects Detection in Honeycomb Sandwich Composites Based on THD-Processing Technique. Nondestruct. Test. Eval. 2024, 39, 832–845. [Google Scholar] [CrossRef]
  16. Vu, T.P.; Luong, V.S.; Le, M. Hidden Corrosion Detection in Aircraft Structures with A Lightweight Magnetic Convolutional Neural Network. Nondestruct. Test. Eval. 2025, 40, 1797–1819. [Google Scholar] [CrossRef]
  17. Scarselli, G.; Nicassio, F. Machine Learning for Structural Health Monitoring of Aerospace Structures: A Review. Sensors 2025, 25, 6136. [Google Scholar] [CrossRef]
  18. Zhang, S.; He, Y.; Gu, Y.; He, Y.; Wang, H.; Wang, H.; Yang, R.; Chady, T.; Zhou, B. UAV Based Defect Detection and Fault Diagnosis for Static and Rotating Wind Turbine Blade: A Review. Nondestruct. Test. Eval. 2025, 40, 1691–1729. [Google Scholar] [CrossRef]
  19. Sun, Y.; Ma, O. Drone-based Automated Exterior Inspection of an Aircraft using Reinforcement Learning Technique. In Proceedings of the AIAA SCITECH 2023 Forum, Online, 23–27 January 2023; p. 0107. [Google Scholar] [CrossRef]
  20. Rodríguez, D.A.; Lozano Tafur, C.; Melo Daza, P.F.; Villalba Vidales, J.A.; Daza Rincón, J.C. Inspection of Aircrafts and Airports Using UAS: A Review. Results Eng. 2024, 22, 102330. [Google Scholar] [CrossRef]
  21. Connolly, L.; Garland, J.; O’Gorman, D.; Tobin, E.F. Deep-Learning-Based Defect Detection for Light Aircraft with Unmanned Aircraft Systems. IEEE Access 2024, 12, 83876–83886. [Google Scholar] [CrossRef]
  22. Zurita-Gil, M.A.; Ortiz-Torres, G.; Sorcia-Vázquez, F.D.J.; Rumbo-Morales, J.Y.; Gascon Avalos, J.J.; Reynoso-Romo, J.R.; Rosas-Caro, J.C.; Brizuela-Mendoza, J.A. Nonlinear Control Design for a PVTOL UAV Carrying a Liquid Payload with Active Sloshing Suppression. Technologies 2026, 14, 31. [Google Scholar] [CrossRef]
  23. Ramalingam, B.; Manuel, V.H.; Elara, M.R.; Vengadesh, A.; Lakshmanan, A.K.; Ilyas, M.; James, T.J.Y. Visual Inspection of the Aircraft Surface Using a Teleoperated Reconfigurable Climbing Robot and Enhanced Deep Learning Technique. Int. J. Aerosp. Eng. 2019, 2019, 5137139. [Google Scholar] [CrossRef]
  24. Chatterjee, S.; Byun, Y.C. Highly Imbalanced Fault classification of Wind Turbines using Data Resampling and Hybrid Ensemble Method Approach. Eng. Appl. Artif. Intell. 2023, 126, 107104. [Google Scholar] [CrossRef]
  25. Lahr, G.J.G.; Godoy, R.V.; Segreto, T.H.; Savazzi, J.O.; Ajoudani, A.; Boaventura, T.; Caurin, G.A.P. Improving Failure Prediction in Aircraft Fastener Assembly Using Synthetic Data in Imbalanced Datasets. arXiv 2025, arXiv:2505.03917. [Google Scholar] [CrossRef]
  26. Liu, B.; Tsoumakas, G. Dealing with Class Imbalance in Classifier Chains via Random Undersampling. Knowl.-Based Syst. 2020, 192, 105292. [Google Scholar] [CrossRef]
  27. Dong, Y.; Jiang, H.; Liu, Y.; Yi, Z. Global Wavelet-Integrated Residual Frequency Attention Regularized Network for Hypersonic Flight Vehicle Fault Diagnosis with Imbalanced Data. Eng. Appl. Artif. Intell. 2024, 132, 107968. [Google Scholar] [CrossRef]
  28. Zhao, J.; Yang, S.; Li, Q.; Liu, Y.; Gu, X.; Liu, W. A New Bearing Fault Diagnosis Method Based on Signal-to-Image Mapping and Convolutional Neural Network. Measurement 2021, 176, 109088. [Google Scholar] [CrossRef]
  29. Li, H.; Wang, C.; Liu, Y. Aircraft Skin Defect Detection Based on Fourier GAN Data Augmentation under Limited Samples. Measurement 2025, 245, 116657. [Google Scholar] [CrossRef]
  30. Wang, X.; Jiang, H.; Zeng, T.; Dong, Y. An Adaptive Fused Domain-Cycling Variational Generative Adversarial Network for Machine Fault Diagnosis under Data Scarcity. Inf. Fusion 2026, 126, 103616. [Google Scholar] [CrossRef]
  31. Bouarfa, S.; Doğru, A.; Arizar, R.; Aydoğan, R.; Serafico, J. Towards Automated Aircraft Maintenance Inspection. A Use Case of Detecting Aircraft Dents using Mask R-CNN. In Proceedings of the AIAA SCITECH 2020 Forum, Orlando, FL, USA, 6–10 January 2020. [Google Scholar] [CrossRef]
  32. Doğru, A.; Bouarfa, S.; Arizar, R.; Aydoğan, R. Using Convolutional Neural Networks to Automate Aircraft Maintenance Visual Inspection. Aerospace 2020, 7, 171. [Google Scholar] [CrossRef]
  33. Li, H.; Wang, C.; Liu, Y. YOLO-FDD: Efficient defect detection network of aircraft skin fastener. Signal Image Video Process. 2024, 18, 3197–3211. [Google Scholar] [CrossRef]
  34. Zhang, W.; Liu, J.; Yan, Z.; Zhao, M.; Fu, X.; Zhu, H. FC-YOLO: An aircraft skin defect detection algorithm based on multi-scale collaborative feature fusion. Meas. Sci. Technol. 2024, 35, ad6bad. [Google Scholar] [CrossRef]
  35. Xiong, J.; Li, P.; Sun, Y.; Xiang, J.; Xia, H. An Aircraft Skin Defect Detection Method with UAV Based on GB-CPP and INN-YOLO. Drones 2025, 9, 594. [Google Scholar] [CrossRef]
  36. Kurkin, E.I.; Quijada Pioquinto, J.G.; Lukyanov, O.E.; Chertykovtseva, V.O.; Nikonorov, A.V. Aircraft Propeller Design Technology Based on CST Parameterization, Deep Learning Models, and Genetic Algorithm. Technologies 2025, 13, 469. [Google Scholar] [CrossRef]
  37. Li, X.; Wang, W.; Sun, L.; Hu, B.; Zhu, L.; Zhang, J. Deep learning-based defects detection of certain aero-engine blades and vanes with DDSC-YOLOv5s. Sci. Rep. 2022, 12, 13067. [Google Scholar] [CrossRef]
  38. Pasupuleti, S.; K, R.; Pattisapu, V.M. Optimization of YOLOv8 for Defect Detection and Inspection in Aircraft Surface Maintenance using Enhanced Hyper Parameter Tuning. In Proceedings of the 2024 International Conference on Electrical Electronics and Computing Technologies (ICEECT), Greater Noida, India, 29–31 August 2024; Volume 1, pp. 1–6. [Google Scholar] [CrossRef]
  39. Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Sequential Model-Based Optimization for General Algorithm Configuration. In Learning and Intelligent Optimization, Proceedings of the 5th International Conference, LION 5, Rome, Italy, 17–21 January 2011; Coello, C.A., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 507–523. [Google Scholar] [CrossRef]
  40. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Lake Tahoe, NV, USA, 2012; Volume 25, pp. 1–9. [Google Scholar]
  41. Bischl, B.; Richter, J.; Bossek, J.; Horn, D.; Thomas, J.; Lang, M. mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions. arXiv 2018, arXiv:1703.03373. [Google Scholar]
  42. Azadi, S.; Okabe, Y.; Carvelli, V. Bayesian-Optimized 1D-CNN for Delamination Classification in CFRP Laminates Using Raw Ultrasonic Guided Waves. Compos. Sci. Technol. 2025, 264, 111101. [Google Scholar] [CrossRef]
  43. Vashishtha, G.; Chauhan, S.; Kumar, S.; Kumar, R.; Zimroz, R.; Kumar, A. Intelligent fault diagnosis of worm gearbox based on adaptive CNN using amended gorilla troop optimization with quantum gate mutation strategy. Knowl.-Based Syst. 2023, 280, 110984. [Google Scholar] [CrossRef]
  44. SUTD. Aircraft AI Dataset. 2024. Available online: https://universe.roboflow.com/sutd-4mhea/aircraft-ai-dataset (accessed on 22 March 2024).
  45. Innovation Hangar. Innovation Hangar v2 Dataset. 2023. Available online: https://universe.roboflow.com/innovation-hangar/innovation-hangar-v2/dataset/1 (accessed on 24 April 2024).
  46. Meng, D.; Boer, W.; Juan, X.; Kasule, A.N.; Hongfu, Z. Visual Inspection of Aircraft Skin: Automated Pixel-Level Defect Detection by Instance Segmentation. Chin. J. Aeronaut. 2022, 35, 254–264. [Google Scholar] [CrossRef]
  47. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
  48. Ramos, L.T.; Sappa, A.D. A Decade of You Only Look Once (YOLO) for Object Detection. arXiv 2025, arXiv:2504.18586. [Google Scholar] [CrossRef]
  49. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  50. Ultralytics. Models Supported by Ultralytics. 2025. Available online: https://docs.ultralytics.com/models/ (accessed on 15 August 2025).
  51. Ultralytics. YOLO12: Attention-Centric Object Detection. 2025. Available online: https://docs.ultralytics.com/models/yolo12/ (accessed on 15 August 2025).
  52. Ultralytics. Performance Metrics Deep Dive. 2025. Available online: https://docs.ultralytics.com/guides/yolo-performance-metrics/ (accessed on 15 August 2025).
  53. Ultralytics. Model Training with Ultralytics YOLO. 2023. Available online: https://docs.ultralytics.com/modes/train/ (accessed on 25 April 2024).
  54. Head, T.; Kumar, M.; Nahrstaedt, H.; Louppe, G.; Shcherbatyi, I. Scikit-Optimize/Scikit-Optimize. 2020. Available online: https://github.com/scikit-optimize/scikit-optimize (accessed on 1 April 2024).
  55. MathWorks. griddata: Interpolate 2-D or 3-D Scattered Data. 2025. Available online: https://www.mathworks.com/help/matlab/ref/griddata.html#bvkwume-method (accessed on 15 November 2025).
Figure 1. Example images from Dataset I with the three different defect types: (a) rust; (b) missing-head; (c) scratch.
Figure 2. Example images from Dataset II with the five different defect types: (a) missing-head; (b) scratch; (c) crack; (d) dent; (e) paint-off.
Figure 3. The approximate influence landscapes of the objective distribution during hyperparameter tuning. The landscapes were interpolated from the differences between the performance after and before the tuning procedure of the 26 observed weight factors, shown as the scatter points. Positive values indicate an improvement while negative values show degradation in the performance, with the standard deviation of the changes displayed in each plot.
Table 1. Summary of the total number of images and defect instances by type for the training, validation, and test splits of Dataset I and Dataset II.
DefectDataset I (983 Images)Dataset II (10,722 Images)
TypesTrainingValidationTestTrainingValidationTest
rust1430425183---
instancesinstancesinstances
missing-9213271628225432
headinstancesinstancesinstancesinstancesinstancesinstances
scratch1429422211180167
instancesinstancesinstancesinstancesinstancesinstances
crack---9606607386
instancesinstancesinstances
dent---8577554413
instancesinstancesinstances
paint-off---7113423
instancesinstancesinstances
number of688197989651642429
imagesimagesimagesimagesimagesimagesimages
Table 2. Summary of the total number of images and defect instances by type for the training, validation, and test splits of Dataset III and Dataset IV.

Defect Type      | Dataset III (200 Images)     | Dataset IV (600 Images)
                 | Training / Validation / Test | Training / Validation / Test
rust             | 197 / 125 / 112 instances    | - / - / -
missing-head     | 122 / 87 / 63 instances      | 18 / 12 / 10 instances
scratch          | 232 / 94 / 110 instances     | 1 / 1 / 2 instances
crack            | - / - / -                    | 308 / 142 / 107 instances
dent             | - / - / -                    | 314 / 139 / 85 instances
paint-off        | - / - / -                    | 29 / 6 / 13 instances
number of images | 100 / 50 / 50 images         | 350 / 150 / 100 images
Table 3. List of the 29 hyperparameters ¹ considered in the optimization procedure, showing the range of values for each search space H.

No. | Argument        | Values             | No. | Argument    | Values
1   | batch           | {4, 8, 12, …, 32}  | 15  | hsv_h       | [0.0, 1.0]
2   | lr0             | [10^-10, 10^-1]    | 16  | hsv_s       | [0.0, 1.0]
3   | lrf             | [10^-10, 10^-1]    | 17  | hsv_v       | [0.0, 1.0]
4   | momentum        | [0.0, 1.0]         | 18  | degrees     | [-180, 180]
5   | weight_decay    | [0.0, 0.01]        | 19  | translate   | [0.0, 1.0]
6   | warmup_epochs   | [1.0, 10.0]        | 20  | scale       | [0.0, …]
7   | warmup_momentum | [0.0, 1.0]         | 21  | shear       | [-180, 180]
8   | warmup_bias_lr  | [0.0, 0.5]         | 22  | perspective | [0.0, 0.001]
9   | box             | [1.0, 20.0]        | 23  | flipud      | [0.0, 1.0]
10  | cls             | [0.0, 1.0]         | 24  | fliplr      | [0.0, 1.0]
11  | dfl             | [1.0, 2.0]         | 25  | bgr         | [0.0, 1.0]
12  | pose            | [10.0, 15.0]       | 26  | mosaic      | [0.0, 1.0]
13  | kobj            | [0.0, 5.0]         | 27  | mixup       | [0.0, 1.0]
14  | nbs             | {4, 8, 12, 64}     | 28  | copy_paste  | [0.0, 1.0]
    |                 |                    | 29  | erasing     | [0.0, 0.9]

¹ Description of the hyperparameters can be found in [53].
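As an illustration of how a search space such as Table 3 can be consumed by a Bayesian optimizer, the sketch below declares a few of the 29 dimensions with scikit-optimize [54] and minimizes the negated weighted-average score obtained from an Ultralytics YOLO training run. This is not the authors' exact pipeline: the dataset file "defects.yaml", the checkpoint name "yolo12n.pt", the epoch and call budgets, the subset of dimensions, and the assignment of the three weight factors to P, R, and mAP50 are all illustrative assumptions.

```python
# Minimal sketch, assuming a hypothetical "defects.yaml" dataset and a pretrained
# YOLOv12 nano checkpoint; only a few of the 29 Table 3 dimensions are shown.
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args
from ultralytics import YOLO

search_space = [
    Integer(4, 32, name="batch"),                          # row 1 (step-of-4 constraint omitted)
    Real(1e-10, 1e-1, prior="log-uniform", name="lr0"),    # row 2
    Real(0.0, 1.0, name="momentum"),                       # row 4
    Real(-180.0, 180.0, name="degrees"),                   # row 18
    Real(0.0, 1.0, name="mosaic"),                         # row 26
]

W_P, W_R, W_MAP50 = 0.4, 0.4, 0.2   # example weight factors; mapping to P/R/mAP50 is assumed here

@use_named_args(search_space)
def objective(**hyperparams):
    hyperparams["batch"] = int(hyperparams["batch"])       # skopt returns numpy integers
    model = YOLO("yolo12n.pt")                             # fresh pretrained model per trial
    model.train(data="defects.yaml", epochs=50, verbose=False, **hyperparams)
    metrics = model.val()                                  # evaluate on the validation split
    score = (W_P * metrics.box.mp                          # mean precision
             + W_R * metrics.box.mr                        # mean recall
             + W_MAP50 * metrics.box.map50)                # mAP at IoU 0.5
    return -score                                          # gp_minimize minimizes, so negate

result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print("best weighted score:", -result.fun)
```

Sweeping W_P, W_R, and W_MAP50 over a grid of weight vectors, as in Tables 4 and 5, then amounts to repeating this optimization once per vector and recording the resulting test-set metrics.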
Table 4. The Precision (P), Recall (R), and mAP50 scores for Dataset I and its subset Dataset III, shown before and after the hyperparameter tuning using 26 different vectors of weight factors. Gray-shaded cells indicate the highest score achieved for each performance metric and dataset.

Tuning State               | Dataset I: P / R / mAP50 | Dataset III: P / R / mAP50
before tuning              | 0.6927 / 0.7184 / 0.7025 | 0.5243 / 0.4552 / 0.4045
highest after tuning       | 0.7957 / 0.7358 / 0.7569 | 0.7335 / 0.2661 / 0.2612
tuned with (1.0, 0.0, 0.0) | 0.7698 / 0.6647 / 0.702  | 0.0002 / 0.0127 / 0.0002
tuned with (0.8, 0.2, 0.0) | 0.7868 / 0.6878 / 0.7246 | 0.0003 / 0.0323 / 0.0002
tuned with (0.6, 0.4, 0.0) | 0.6996 / 0.7183 / 0.6832 | 0.4163 / 0.1886 / 0.1523
tuned with (0.4, 0.6, 0.0) | 0.7235 / 0.6862 / 0.6989 | 0.3723 / 0.2521 / 0.1758
tuned with (0.2, 0.8, 0.0) | 0.7644 / 0.7225 / 0.7394 | 0.4075 / 0.1989 / 0.1687
tuned with (0.0, 1.0, 0.0) | 0.6927 / 0.7184 / 0.7025 | 0.0053 / 0.2661 / 0.0314
tuned with (0.8, 0.0, 0.2) | 0.7142 / 0.6587 / 0.6734 | 0 / 0 / 0
tuned with (0.6, 0.2, 0.2) | 0.7431 / 0.7058 / 0.7138 | 0.3397 / 0.243 / 0.1694
tuned with (0.4, 0.4, 0.2) | 0.7957 / 0.7358 / 0.7569 | 0 / 0 / 0
tuned with (0.2, 0.6, 0.2) | 0.7182 / 0.7313 / 0.7113 | 0.6456 / 0.1436 / 0.1675
tuned with (0.0, 0.8, 0.2) | 0.6927 / 0.7184 / 0.7025 | 0 / 0 / 0
tuned with (0.6, 0.0, 0.4) | 0.7174 / 0.6827 / 0.7311 | 0 / 0 / 0
tuned with (0.4, 0.2, 0.4) | 0.6927 / 0.7184 / 0.7025 | 0.4113 / 0.2579 / 0.2121
tuned with (0.2, 0.4, 0.4) | 0.6754 / 0.6966 / 0.6783 | 0 / 0 / 0
tuned with (0.0, 0.6, 0.4) | 0.7076 / 0.716 / 0.7145  | 0.6193 / 0.2497 / 0.2542
tuned with (0.4, 0.0, 0.6) | 0.7626 / 0.6972 / 0.7227 | 0.5254 / 0.1967 / 0.2164
tuned with (0.2, 0.2, 0.6) | 0.7329 / 0.7098 / 0.7253 | 0 / 0 / 0
tuned with (0.0, 0.4, 0.6) | 0.717 / 0.7341 / 0.7254  | 0.3422 / 0.1952 / 0.1453
tuned with (0.2, 0.0, 0.8) | 0.7678 / 0.7087 / 0.7361 | 0.4977 / 0.2494 / 0.249
tuned with (0.0, 0.2, 0.8) | 0.7636 / 0.7233 / 0.7355 | 0 / 0 / 0
tuned with (0.0, 0.0, 1.0) | 0.703 / 0.7109 / 0.7251  | 0.2421 / 0.2434 / 0.1959
tuned with (0.4, 0.3, 0.3) | 0.7513 / 0.7074 / 0.7226 | 0.3875 / 0.1916 / 0.1545
tuned with (0.3, 0.4, 0.3) | 0.7496 / 0.7284 / 0.7375 | 0.6337 / 0.2171 / 0.1659
tuned with (0.3, 0.3, 0.4) | 0.7749 / 0.7301 / 0.7266 | 0.5553 / 0.1165 / 0.1409
tuned with (1/3, 1/3, 1/3) | 0.7859 / 0.6975 / 0.7253 | 0 / 0 / 0
tuned with (0.5, 0.5, 0.0) | 0.7887 / 0.67 / 0.7179   | 0.7335 / 0.2659 / 0.2612
Table 5. The Precision (P), Recall (R), and mAP50 scores for Dataset II and its subset Dataset IV, shown before and after the hyperparameter tuning using 26 different vectors of weight factors. Gray-shaded cells indicate the highest score achieved for each performance metric and dataset.

Tuning State               | Dataset II: P / R / mAP50 | Dataset IV: P / R / mAP50
before tuning              | 0.8369 / 0.5845 / 0.6983  | 0.6335 / 0.3704 / 0.4021
highest after tuning       | 0.8398 / 0.727 / 0.7636   | 0.8562 / 0.3179 / 0.1215
tuned with (1.0, 0.0, 0.0) | 0.5356 / 0.1092 / 0.1019  | 0.0003 / 0.0389 / 0.0002
tuned with (0.8, 0.2, 0.0) | 0.5074 / 0.0323 / 0.021   | 0.4007 / 0.0019 / 0.0002
tuned with (0.6, 0.4, 0.0) | 0.0002 / 0.0351 / 0.0002  | 0.0002 / 0.0183 / 0.0001
tuned with (0.4, 0.6, 0.0) | 0.5241 / 0.2028 / 0.1592  | 0.0005 / 0.0341 / 0.0003
tuned with (0.2, 0.8, 0.0) | 0.8369 / 0.5845 / 0.6983  | 0 / 0 / 0
tuned with (0.0, 1.0, 0.0) | 0.8369 / 0.5845 / 0.6983  | 0.0023 / 0.2852 / 0.0544
tuned with (0.8, 0.0, 0.2) | 0.6722 / 0.6595 / 0.6407  | 0.0002 / 0.0328 / 0.0001
tuned with (0.6, 0.2, 0.2) | 0.7796 / 0.6913 / 0.73    | 0.2429 / 0.1611 / 0.0808
tuned with (0.4, 0.4, 0.2) | 0.5168 / 0.4881 / 0.4381  | 0.0005 / 0.074 / 0.0004
tuned with (0.2, 0.6, 0.2) | 0.2742 / 0.2331 / 0.195   | 0.005 / 0.2858 / 0.1117
tuned with (0.0, 0.8, 0.2) | 0.8056 / 0.6323 / 0.6878  | 0.4004 / 0.008 / 0.0002
tuned with (0.6, 0.0, 0.4) | 0.3203 / 0.1718 / 0.0706  | 0.2418 / 0.041 / 0.0371
tuned with (0.4, 0.2, 0.4) | 0.5074 / 0.0323 / 0.021   | 0.0028 / 0.1078 / 0.0021
tuned with (0.2, 0.4, 0.4) | 0.6786 / 0.727 / 0.7366   | 0.5273 / 0.0824 / 0.0756
tuned with (0.0, 0.6, 0.4) | 0.8369 / 0.5845 / 0.6983  | 0.0041 / 0.3179 / 0.0224
tuned with (0.4, 0.0, 0.6) | 0.7636 / 0.6648 / 0.7068  | 0.0007 / 0.1175 / 0.0006
tuned with (0.2, 0.2, 0.6) | 0.694 / 0.6466 / 0.6949   | 0.3129 / 0.133 / 0.0393
tuned with (0.0, 0.4, 0.6) | 0.8229 / 0.665 / 0.7636   | 0.0065 / 0.0603 / 0.0041
tuned with (0.2, 0.0, 0.8) | 0.8001 / 0.6795 / 0.7215  | 0.4649 / 0.1311 / 0.0484
tuned with (0.0, 0.2, 0.8) | 0.6645 / 0.678 / 0.6609   | 0.328 / 0.1342 / 0.1215
tuned with (0.0, 0.0, 1.0) | 0.8398 / 0.117 / 0.1022   | 0.3229 / 0.26 / 0.113
tuned with (0.4, 0.3, 0.3) | 0.772 / 0.6453 / 0.728    | 0.0001 / 0.0154 / 0.0001
tuned with (0.3, 0.4, 0.3) | 0.6643 / 0.68 / 0.7056    | 0.0001 / 0.0637 / 0.0001
tuned with (0.3, 0.3, 0.4) | 0.683 / 0.6592 / 0.6624   | 0.8 / 0.0584 / 0.0966
tuned with (1/3, 1/3, 1/3) | 0.7889 / 0.6856 / 0.7356  | 0.8562 / 0.0019 / 0.0006
tuned with (0.5, 0.5, 0.0) | 0.7245 / 0.6223 / 0.6514  | 0.7296 / 0.1019 / 0.1074
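To make the before/after comparison underlying Figure 3 and the "safe spot" discussion concrete, the following sketch computes the per-metric change relative to the un-tuned baseline and flags weight vectors that did not degrade any metric. The three rows used here are taken from Table 4 (Dataset I) only for illustration; the dictionary layout and helper names are assumptions, and Figure 3's standard deviations are computed over all 26 weight vectors rather than this subset.

```python
# Minimal sketch: deltas of each metric after tuning vs. the un-tuned baseline,
# plus a simple non-degradation check over a few Table 4 (Dataset I) rows.
import numpy as np

baseline = {"P": 0.6927, "R": 0.7184, "mAP50": 0.7025}   # "before tuning" row, Dataset I

after_tuning = {
    (1.0, 0.0, 0.0): {"P": 0.7698, "R": 0.6647, "mAP50": 0.7020},
    (0.4, 0.4, 0.2): {"P": 0.7957, "R": 0.7358, "mAP50": 0.7569},
    (0.2, 0.8, 0.0): {"P": 0.7644, "R": 0.7225, "mAP50": 0.7394},
}

# Change in each metric for each weight vector; positive = improvement.
deltas = {w: {m: s[m] - baseline[m] for m in baseline} for w, s in after_tuning.items()}

# Spread of the changes per metric, analogous to the standard deviation shown in Figure 3.
for metric in baseline:
    changes = [d[metric] for d in deltas.values()]
    print(f"std of {metric} changes: {np.std(changes):.4f}")

# Weight vectors that preserved or improved every metric -- candidate "safe spots".
safe = [w for w, d in deltas.items() if all(v >= 0 for v in d.values())]
print("non-degrading weight vectors in this subset:", safe)
```

Applying the same check across both datasets and their subsets in Tables 4 and 5 is how candidate weight factors can be screened before committing to a full tuning budget.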