From Quality Grading to Defect Recognition: A Dual-Pipeline Deep Learning Approach for Automated Mango Assessment

Lin, Shinfeng; Chiu, Hongting

doi:10.3390/electronics15030549


by Shinfeng Lin * and Hongting Chiu

Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien 974301, Taiwan

* Author to whom correspondence should be addressed.

Electronics 2026, 15(3), 549; https://doi.org/10.3390/electronics15030549

Submission received: 29 December 2025 / Revised: 24 January 2026 / Accepted: 26 January 2026 / Published: 27 January 2026

(This article belongs to the Special Issue Next-Generation Machine Learning and Deep Learning Models for Complex Data, Vision, and Intelligent Applications)


Abstract

Mango is a high-value agricultural commodity, and accurate and efficient appearance quality grading and defect inspection are critical for export-oriented markets. This study proposes a dual-pipeline deep learning framework for automated mango assessment, in which surface defect classification and quality grading are jointly implemented within a unified inspection system. For defect assessment, the task is formulated as a multi-label classification problem involving five surface defect categories, eliminating the need for costly bounding box annotations required by conventional object detection models. To address the severe class imbalance commonly encountered in agricultural datasets, a copy–paste-based image synthesis strategy is employed to augment scarce defect samples. For quality grading, mangoes are categorized into three quality levels. Unlike conventional CNN-based approaches relying solely on spatial-domain information, the proposed framework integrates decision-level fusion of spatial-domain and frequency-domain representations to enhance grading stability. In addition, image preprocessing is investigated, showing that adaptive contrast enhancement effectively emphasizes surface textures critical for quality discrimination. Experimental evaluations demonstrate that the proposed framework achieves superior performance in both defect classification and quality grading compared with existing detection-based approaches. The proposed classification-oriented system provides an efficient and practical integrated solution for automated mango assessment.

Keywords: deep learning; convolutional neural networks; mango quality grading; image processing; machine learning

1. Introduction

With the expansion of global fruit trade and increasingly stringent export standards, automated fruit inspection has become a critical component of modern agricultural supply chains. Packaging facilities must process large volumes of produce while maintaining efficient, consistent, and objective classification and grading. Among commercial fruits, mangoes pose particular challenges for automated inspection due to their irregular surface textures, non-uniform coloration, wide ripeness variations, and diverse surface defects. These characteristics impose high requirements on the robustness and generalization capability of vision-based inspection systems [1,2].

Conventional mango grading and defect inspection still rely heavily on manual visual assessment, which is labor-intensive, time-consuming, and inherently subjective. Such approaches are easily affected by inconsistent judgment and environmental factors such as illumination changes, often leading to unstable grading performance and reduced throughput. Consequently, there is a strong demand for automated inspection systems that can operate reliably at an industrial scale while minimizing human intervention [3].

In recent years, deep learning-based computer vision techniques have become the dominant approach for automated fruit grading and surface defect analysis. Chuquimarca et al. [4] provided a comprehensive survey of CNN-based methods for evaluating fruit appearance, size, and defects, along with commonly used real and synthetic datasets. Beyond conventional visual imaging, Chopra et al. [5] integrated spectrophotometry with machine learning to develop an automated apple grading system, achieving 82% accuracy on validation data and 72% accuracy on real-world samples, demonstrating the potential of spectral information for fruit quality assessment.

In mango defect analysis, many studies focus on object detection and segmentation frameworks to localize surface defects. Methods based on YOLO [6], Faster R-CNN, RetinaNet, and Mask R-CNN have been widely applied at the bounding-box or pixel levels. For instance, Matsui et al. [7] enhanced a YOLO-based framework by introducing an improved loss function (SBCE) to better detect small surface defects, highlighting the importance of loss design. Wu et al. [8] conducted five-category mango defect detection on nearly 50,000 images using a Mask R-CNN framework with an X101-FPN backbone, achieving an average precision (AP) of 67.2%.

Although detection and segmentation-based methods provide precise defect localization, they require extensive manual annotation, resulting in high cost and limited scalability. From a practical perspective, precise localization is not always necessary. In many studies [9,10], the primary objective is to determine whether a fruit should be accepted or rejected based on the presence of defects rather than their exact locations. Motivated by this observation, classification-oriented approaches have been explored as an alternative. Lee et al. [11] proposed a multi-camera apple classification system that reduced missed defect inspections, demonstrating improved annotation efficiency. Similarly, Nithya et al. [12] developed a CNN-based mango defect classification system using image-level labels, showing that effective defect recognition can be achieved without bounding-box supervision. However, these studies typically adopt binary classification (normal vs. defective), which may be insufficient for high-value fruits that require defect-specific pricing, motivating further investigation into multi-class defect classification.

Mango quality grading represents another essential task in automated inspection. Li et al. [13] proposed a CNN-based grading framework combining CGAN and YOLOv4, where CGAN augmented limited training data and YOLOv4 performed grade classification. While data augmentation improved performance, the reliance on detection-based annotation increased labeling costs, suggesting that multi-class classification models may be a more efficient alternative. Wu et al. [14] evaluated several CNN architectures, including AlexNet, VGG, and ResNet, for mango quality grading. Although Mask R-CNN–based background removal improved results, manual annotation was still required for transfer learning, motivating the search for more generalizable background removal methods with smoother segmentation boundaries.

Both defect classification and quality grading are typically deployed in automated production lines, where real-time performance is critical. Naik [15] explored a non-destructive mango grading method using MobileNet combined with an SVM classifier, achieving high accuracy with low inference latency suitable for real-time applications. In addition, class imbalance is a common issue in agricultural inspection due to uncertainties in image acquisition, leading to insufficient representation of certain defect categories and degraded model performance [16].

From a modeling perspective, most existing approaches rely on spatial-domain CNNs to analyze surface color, texture, and overall appearance. However, spatial information alone is often insufficient to distinguish subtle quality differences, particularly under uneven illumination or noisy conditions. Recent studies [17] have shown that frequency-domain representations can effectively capture texture-related information and provide complementary cues that are less sensitive to illumination variations.

Based on these observations, this study proposes a unified deep learning framework that jointly addresses mango surface defect classification and quality grading through two task-specific branches. Surface defect assessment is reformulated as a multi-label classification problem involving five defect categories, eliminating the need for costly bounding-box annotations while enabling the identification of co-occurring defects. To mitigate class imbalance, a copy–paste–based image synthesis strategy is employed to augment scarce samples, and self-attention mechanisms are incorporated to further enhance classification accuracy.

Meanwhile, a dedicated quality grading pipeline integrates spatial-domain and frequency-domain CNNs to exploit complementary appearance and texture representations. Adaptive contrast enhancement is applied to emphasize grade-related surface textures, improving robustness under varying illumination and surface noise. Overall, the proposed framework aims to provide a comprehensive and deployment-friendly solution for large-scale automated mango assessment.

The main contributions of this work are summarized as follows:

Problem reformulation for defect inspection: Mango surface defect inspection is reformulated as a classification-oriented multi-label problem, demonstrating that effective defect recognition can be achieved without bounding-box or pixel-level annotations and is better aligned with industrial inspection requirements.

System-level integration for automated decision making: A unified inspection framework integrates defect screening and three-level quality grading into a single decision flow, enabling simultaneous rejection of defective mangoes and grading of acceptable products in automated production lines.

Annotation-efficient and robust empirical validation: The proposed framework is designed with annotation efficiency as a core objective and is validated on large-scale, imbalanced datasets, demonstrating stable grading and improved recognition of rare defect categories under practical conditions.

2. Materials and Methods

This study proposes a dual-pipeline deep learning framework designed to perform both mango quality grading and surface defect classification. For the quality grading task, we adopt two separate neural network architectures—ResNet-18 for the frequency-domain input and ResNet-34 for the spatial-domain input. Additionally, we apply background removal and Adaptive Contrast Enhancement (ACE) [18] to the input mango images to enhance surface texture visibility and improve classification accuracy.

For the five-class defect classification task, we employ ResNet-34 as the backbone network and incorporate self-attention modules to enhance feature representation. In addition, we apply the Copy and Paste data augmentation technique [19] to increase the training data for scarce classes, addressing the class imbalance issue.

The two tasks operate independently, and the system can be deployed either jointly or separately based on the needs of industrial inspection pipelines.

Section 2.1 presents the composition of the dataset, while Section 2.2 outlines the preprocessing procedures applied to the images. Section 2.3 introduces the design of the proposed quality-grading pipeline, and Section 2.4 details the defect-classification pipeline that completes the overall framework. Section 2.5 further describes the end-to-end deployment logic of the proposed system. Finally, Section 2.6 presents the training settings and implementation details used in all experiments.

2.1. Dataset

The datasets used in our experiments are sourced from the Taiwan AI CUP competition, specifically the Irwin mango image dataset [20]. The dataset consists of two major subsets.

The first subset corresponds to mango appearance quality grading and is divided into three grades: A, B, and C. It consists of 5600 images for training and 800 images for testing. Owing to the limited data availability, the test set was also used as a validation set to monitor training convergence and prevent overfitting; however, it was not used for hyperparameter tuning. Grade A mangoes exhibit uniform coloration without visible black spots or scratches, representing the highest quality. Grade B mangoes show slight color inconsistency or minor surface defects and are regarded as medium quality, whereas Grade C mangoes display pronounced color non-uniformity and frequently contain large dark patches, representing the lowest quality. The sample distributions for the three quality grades are illustrated in Figure 1a.

The second subset contains five types of mango surface defects, with the sample distribution shown in Figure 1b. This subset, which comprises more than 50,000 images in total, was originally annotated for object detection and was reconstructed in this study as a multi-label classification dataset. From this subset, 6000 images were randomly selected for training, 900 images served as a validation set for monitoring training convergence (and are also reported as a test set), and an additional 990 images formed an independent test set for final performance evaluation.

This subsampling strategy was adopted based on two primary considerations. First, the number of samples in the D2 defect category is extremely limited in the original dataset. Under such severe class imbalance, training directly on the full dataset of over 50,000 images would further exacerbate the skewed class distribution and hinder effective model learning. Therefore, the complete dataset was not used; instead, random subsampling was employed for experimental design. Second, to maintain a comparable data scale between the defect classification task and the mango quality grading experiment, the size of the training set was deliberately controlled. This design choice aims to simulate practical application scenarios in which large-scale annotated datasets may not be readily available, thereby enhancing the applicability and robustness of the proposed method in real-world settings.

The five defect categories include:

D1: latex residue
D2: mechanical scratch
D3: anthracnose
D4: color inconsistency
D5: black spot disease

The distribution of samples in each category is shown in Figure 1b. A significant imbalance can be observed, particularly for D2 (mechanical scratches), which contains only 177 images. This scarcity is likely due to scratches being non-intrinsic defects, typically caused by handling or conveyor belt friction, and therefore naturally less frequent.

All images in both subsets have a resolution of 1280 × 720 pixels and were captured under normal indoor lighting conditions. Representative samples of the quality-grading classes and defect types are shown in Figure 2 and Figure 3, respectively.

2.2. Data Preprocessing

2.2.1. Background Removal

Since the mango images in the dataset contain production-line backgrounds, background removal is necessary to prevent irrelevant visual information from affecting classification performance. We adopt the well-known U²-Net [21] model for this task. Although U²-Net was originally proposed for salient object detection, its pretrained weights generalize well to datasets outside its training domain. Our experiments show that these pretrained weights can remove the background of our mango dataset with minimal configuration, and the model remains effective even when multiple mangoes appear in the same frame. Example results are shown in Figure 4. In terms of computational efficiency, the average processing time for background removal using U²-Net is approximately 88 ms per image, measured on an NVIDIA RTX 3060 GPU. To avoid redundant computation during training and reduce overall training time, background removal is performed as an offline preprocessing step, and the resulting images are stored for subsequent experiments.

2.2.2. Contrast Enhancement

After background removal and resizing, contrast enhancement is applied to emphasize surface details. Specifically, we employ Adaptive Contrast Enhancement (ACE). As illustrated in Figure 5 and Figure 6, the mango surface texture becomes visually more pronounced after enhancement. The pixel value distribution further confirms this improvement: originally, most pixel intensities were concentrated in the range of 40–60, indicating low contrast. For example, around the intensity level of 50, approximately 17,500 pixels were present before enhancement; this number decreases to about 12,000 after stretching, with values in other intensity ranges correspondingly increasing. This redistribution reflects the expansion of mid-range intensities and indicates an overall improvement in image contrast.

The ACE operation is implemented on the CPU, with an average processing time of approximately 48 ms per image. Similarly to background removal, contrast enhancement is also conducted offline, and the preprocessed images are saved in advance to avoid repetitive computational overhead during training.

It is worth noting that excessively strong ACE parameters may amplify noise and degrade image quality. Since illumination conditions vary across images, selecting ACE parameters based on a small number of samples can easily lead to over-enhancement and performance deterioration. To address this issue, we conduct repeated batch-level evaluations on diverse images to identify robust parameter settings. Based on empirical testing, the optimal ACE parameters are determined as a window size (m, n) of 55 and a scaling factor α of 0.2. This configuration effectively enhances surface details while preventing excessive amplification of irrelevant noise in the majority of images.

M(i, j) = \frac{1}{(2n + 1)(2m + 1)} \sum_{x = i - n}^{i + n} \sum_{y = j - m}^{j + m} f(x, y)

(1)

\sigma^{2}(i, j) = \frac{1}{(2n + 1)(2m + 1)} \sum_{x = i - n}^{i + n} \sum_{y = j - m}^{j + m} \left( f(x, y) - M(i, j) \right)^{2}

(2)

G = \alpha \frac{M_{g}}{\sigma(i, j)}, \quad 0 < \alpha < 1

(3)

I(i, j) = M(i, j) + G \left( f(i, j) - M(i, j) \right)

(4)

Specifically, the process begins by computing the local mean value M within a window of size (2n + 1) × (2m + 1) using Equation (1), where f(x, y) denotes the intensity of the pixel at coordinates (x, y). The local standard deviation σ within the same window is then calculated using Equation (2). In Equation (3), α represents the gain factor, and M_g denotes the global mean intensity of the image, so the gain G is large in flat regions (small σ) and small in regions that already have high contrast. Finally, Equation (4) is applied to amplify the high-frequency components of the image, thereby enhancing image details. The resulting enhanced pixel intensity at location (i, j) is denoted as I(i, j).
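Equations (1)–(4) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the study. The function names `box_mean` and `ace` are hypothetical, and the input is assumed to be a single-channel image. The reported window size of 55 is interpreted here as the full window width, so n = m = 27. Edge padding at the borders and the `max_gain` clamp are also assumptions, added so that near-zero local variance does not produce an unbounded gain.

```python
import numpy as np

def box_mean(img, n, m):
    """Mean over a (2n+1) x (2m+1) window, using edge padding and an integral image."""
    padded = np.pad(img, ((n, n), (m, m)), mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1), dtype=np.float64)
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    kh, kw = 2 * n + 1, 2 * m + 1
    window_sum = (integral[kh:kh + h, kw:kw + w]
                  - integral[:h, kw:kw + w]
                  - integral[kh:kh + h, :w]
                  + integral[:h, :w])
    return window_sum / (kh * kw)

def ace(gray, n=27, m=27, alpha=0.2, max_gain=5.0, eps=1e-6):
    """Adaptive contrast enhancement following Equations (1)-(4)."""
    f = gray.astype(np.float64)
    local_mean = box_mean(f, n, m)                       # Eq. (1)
    # Eq. (2), rewritten as E[f^2] - M^2 over the same window.
    local_var = box_mean(f * f, n, m) - local_mean ** 2
    local_std = np.sqrt(np.maximum(local_var, 0.0))
    gain = alpha * f.mean() / (local_std + eps)          # Eq. (3), M_g = global mean
    gain = np.minimum(gain, max_gain)                    # assumption: clamp the gain
    out = local_mean + gain * (f - local_mean)           # Eq. (4)
    return np.clip(out, 0, 255).astype(np.uint8)
```

A perfectly flat image passes through unchanged, because f − M is zero everywhere regardless of the gain. Contrast is amplified only where the pixel deviates from its local mean.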

2.3. Quality Grading Pipeline

The goal of the quality grading pipeline is to categorize mangoes into three predefined quality levels based on their surface appearance. To improve prediction robustness under varying imaging conditions, the proposed system integrates both spatial-domain and frequency-domain information.

In the spatial-domain branch, background-removed and contrast-enhanced images are fed into a ResNet-34 network. In the frequency-domain branch, the images undergo a discrete wavelet transform (DWT) [22] before being processed by a ResNet-18 network.
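To make the frequency-domain input concrete, the sketch below computes a single-level 2D Haar DWT with NumPy. The wavelet family, the number of decomposition levels, and the way the sub-bands are arranged into a network input are not specified here, so the Haar basis, the single level, and the four-channel stacking in `dwt_channels` are all assumptions made for illustration. The function names are hypothetical, and sub-band naming conventions (LH versus HL) differ between libraries.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level orthonormal 2D Haar DWT of a single-channel image.

    Returns (approximation, horizontal-difference, vertical-difference,
    diagonal-difference) sub-bands, each half the input size.
    """
    x = np.asarray(img, dtype=np.float64)
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2]  # crop to even size
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0
    horiz = (a - b + c - d) / 2.0   # difference between left and right columns
    vert = (a + b - c - d) / 2.0    # difference between top and bottom rows
    diag = (a - b - c + d) / 2.0
    return approx, horiz, vert, diag

def dwt_channels(img):
    """Stack the four sub-bands as channels (assumed CNN input layout)."""
    return np.stack(haar_dwt2(img), axis=0)
```

Because the transform is orthonormal, the total energy of the four sub-bands equals that of the input, and a uniform region produces zero detail coefficients.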

The class probabilities produced by the two branches are then fused at the decision level to generate the final prediction, as illustrated in Figure 7. Specifically, a probability-gap–based fusion strategy is adopted (illustrated in Figure 8). If the difference between the highest and second-highest predicted probabilities from the spatial-domain model is smaller than a confidence threshold TH, the spatial-domain prediction is considered uncertain, and the system defaults to the prediction from the frequency-domain model. Otherwise, the spatial-domain result is used.

To determine an appropriate confidence threshold, we analyze the relationship between the probability margin (i.e., the difference between the top two predicted probabilities) and prediction correctness. We observe that when this margin falls below approximately 0.2, the reliability of the spatial-domain predictions decreases noticeably. The selected confidence threshold is therefore dataset-dependent and may require re-tuning when applied to different datasets or application scenarios.

This decision-level fusion strategy enables the system to exploit the complementary strengths of both domains and improves robustness in ambiguous or low-contrast cases.
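The probability-gap rule described above can be written as a short function. This is a minimal sketch of the spatial-first scheme, assuming each branch outputs a softmax probability vector over the three grades. The name `fuse_predictions` is hypothetical, and the default threshold of 0.2 is the dataset-dependent value discussed above.

```python
import numpy as np

def fuse_predictions(p_spatial, p_freq, threshold=0.2):
    """Return the fused class index for one image (spatial-first scheme)."""
    p_spatial = np.asarray(p_spatial, dtype=float)
    p_freq = np.asarray(p_freq, dtype=float)
    top2 = np.sort(p_spatial)[-2:]          # second-highest, highest
    margin = top2[1] - top2[0]
    if margin < threshold:
        # Spatial prediction is uncertain: defer to the frequency-domain branch.
        return int(np.argmax(p_freq))
    return int(np.argmax(p_spatial))
```

For example, a spatial output of (0.45, 0.40, 0.15) has a margin of 0.05, so the frequency-domain prediction is used, whereas (0.80, 0.15, 0.05) has a margin of 0.65 and the spatial prediction is kept.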

2.4. Defect Classification Pipeline

The defect classification pipeline is designed to identify five different types of mango surface defects. Unlike conventional object detection approaches that require extensive manual annotation of bounding boxes, our method formulates this task as a multi-label classification problem, as illustrated in Figure 9. ResNet-34 is adopted as the backbone network. Inspired by Non-Local Neural Networks [23], which demonstrate performance gains by inserting non-local blocks at different network stages, we incorporate a self-attention module into layer 4 of ResNet-34 after considering the trade-off between computational cost and performance improvement.

Although many object detection studies rely on open-source annotation tools such as LabelMe or LabelImg, these tools are primarily designed for bounding-box or polygon-level annotations and do not natively support efficient image-level multi-label annotation.

To facilitate fast and efficient multi-label annotation, we therefore developed a lightweight annotation tool, as illustrated in Figure 10. Each mango image is directly assigned a binary label vector indicating the presence or absence of each defect category, enabling rapid and consistent multi-label annotation without the need for region-level labeling.
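The image-level label format is simple enough to show directly. The sketch below encodes a set of observed defects as a five-element binary vector and decodes it back. The helper names and the D1 to D5 ordering of the vector are assumptions for illustration and do not describe the annotation tool itself.

```python
DEFECTS = ("D1", "D2", "D3", "D4", "D5")  # assumed ordering of the label vector

def encode_labels(present):
    """Map a collection of defect codes to a binary presence vector."""
    unknown = set(present) - set(DEFECTS)
    if unknown:
        raise ValueError(f"unknown defect codes: {sorted(unknown)}")
    return [1 if d in present else 0 for d in DEFECTS]

def decode_labels(vector):
    """Map a binary presence vector back to the list of defect codes."""
    return [d for d, flag in zip(DEFECTS, vector) if flag]
```

A mango showing both latex residue and color inconsistency would be stored as `[1, 0, 0, 1, 0]`, and a defect-free mango as `[0, 0, 0, 0, 0]`.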

Real-world datasets often exhibit highly imbalanced defect distributions, with minor defect categories occurring infrequently. Such imbalance can adversely affect model learning. To address this challenge, we adopt the Copy and Paste image synthesis technique proposed by Ghiasi et al. [19] to artificially augment the number of samples in scarce defect categories. Specifically, we manually extract D2 defect regions from the original images to create masks, which are then randomly scaled, rotated, and blended onto clean mango surfaces. Synthesized images that exhibit visually implausible artifacts are manually filtered out, including cases where the pasted defect extends beyond the mango region, appears at an unrealistic scale (excessively large or small), or violates natural surface appearance. Through this process, the D2 category is expanded from 177 original samples to 672, significantly improving dataset diversity, as illustrated in Figure 11.
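The blending step of this synthesis can be sketched as follows. The sketch assumes RGB images stored as NumPy arrays, a binary mango mask for the clean target (for example, from the background-removal step), and a defect mask with values in [0, 1]. Random scaling and rotation of the patch are omitted for brevity. The automatic containment check only mirrors one of the manual filtering criteria described above and is not part of the described procedure. The function name `paste_defect` is hypothetical.

```python
import numpy as np

def paste_defect(target, target_mask, patch, patch_mask, top, left):
    """Alpha-blend a defect patch onto a clean mango image.

    target:      (H, W, 3) uint8 clean mango image
    target_mask: (H, W) mango mask, nonzero on the fruit
    patch:       (h, w, 3) uint8 cropped defect region
    patch_mask:  (h, w) blending weights in [0, 1]
    Returns the synthesized image, or None if the defect would fall
    outside the image or outside the mango region.
    """
    h, w = patch_mask.shape
    if top < 0 or left < 0 or top + h > target.shape[0] or left + w > target.shape[1]:
        return None
    on_fruit = target_mask[top:top + h, left:left + w] > 0
    if np.any((patch_mask > 0) & ~on_fruit):
        return None  # pasted defect would extend beyond the mango region
    out = target.astype(np.float64)
    alpha = patch_mask.astype(np.float64)[..., None]
    roi = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch + (1.0 - alpha) * roi
    return np.clip(out, 0, 255).astype(np.uint8)
```

Using soft mask values near the defect boundary, rather than a hard 0/1 mask, reduces visible seams in the synthesized image.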

2.5. End-to-End Deployment Logic

In a practical packhouse deployment, mango inspection follows a sequential inspection strategy commonly adopted in industrial mango sorting systems [24], where surface defect screening and quality grading are jointly considered. The five-category surface defect classification task is performed as the first screening stage to separate mangoes into pass and fail groups.

Mangoes that fail the defect screening stage are removed from the premium processing line. In addition to disposal, the distribution of defect categories is recorded and reported to the cultivation department for further analysis of potential causes. Mangoes that pass the defect screening are subsequently evaluated by the quality grading pipeline, which produces three-level grade predictions. Mangoes classified as Grade C are designated for low-price markets, while those predicted as Grade A or Grade B are labeled as “Premium” and “Standard”, respectively, for high-value sales.

Through this sequential decision strategy, the proposed framework produces a unified end-to-end inspection outcome with three system-level decisions: Reject, Downgrade, and Accept. This deployment logic reflects real-world industrial inspection requirements and enables simultaneous defect filtering and quality grading within a single operational workflow.
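The sequential decision flow can be summarized in a small function. This is a sketch under two assumptions that are not stated explicitly above: a mango fails screening when any of the five defect flags is set, and a Grade C prediction corresponds to the Downgrade decision while Grades A and B correspond to Accept. The function name `inspect_mango` and the market labels in the return value are illustrative.

```python
def inspect_mango(defect_flags, grade):
    """Return (decision, market_label) for one mango.

    defect_flags: five 0/1 values from the defect classification pipeline
    grade:        "A", "B", or "C" from the quality grading pipeline
    """
    # Stage 1: defect screening (assumed rule: any detected defect fails).
    if any(defect_flags):
        return ("Reject", None)
    # Stage 2: quality grading for mangoes that pass screening.
    if grade == "A":
        return ("Accept", "Premium")
    if grade == "B":
        return ("Accept", "Standard")
    if grade == "C":
        return ("Downgrade", "Low-price")
    raise ValueError(f"unexpected grade: {grade!r}")
```

In a real line, the defect flags of rejected mangoes would also be logged so that the distribution of defect categories can be reported for further analysis.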

2.6. Training Settings

All experiments in this study were conducted on a custom-built workstation equipped with an AMD Ryzen 5 CPU, 16 GB of RAM, and running the Windows 10 operating system. The deep learning models were implemented using PyTorch [25] version 2.1.0, and all model training was performed on an NVIDIA RTX 3060 GPU.

2.6.1. Training Settings for Quality Grading

For the mango quality grading experiment, a ResNet-based architecture was adopted for both spatial-domain and frequency-domain inputs, sharing the same training configuration. Data augmentation was applied during training to enhance model generalization, including random affine transformations with rotation up to 20°, translation up to 20%, scaling in the range of 0.8–1.2, shear of 0.2, and random horizontal flipping with a probability of 0.5. The batch size was set to 32. The final fully connected layer of the backbone network was replaced by a custom classifier composed of two hidden layers with 512 and 64 neurons, respectively. Dropout with a rate of 0.5 was applied after each fully connected layer to mitigate overfitting, followed by batch normalization and ReLU activation. A softmax activation was used at the output layer to produce normalized prediction scores. The model was optimized using the Adam optimizer with an initial learning rate of 0.001 and a weight decay of 1 \times 10^{-4}. The Cross-Entropy loss function was employed for training. A step-based learning rate scheduler (StepLR) was applied, reducing the learning rate by a factor of 0.5 every three epochs. Each model was trained for a maximum of 50 epochs. In practice, the best validation performance was consistently achieved between 14 and 20 epochs, after which performance saturated or slightly degraded.
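The effect of the step schedule is easy to work out by hand. The sketch below reproduces the closed form of a step decay, lr = base_lr × gamma^(epoch // step_size), with the stated settings. Zero-based epoch indexing is an assumption, and `step_lr` is a hypothetical helper, not a library call.

```python
def step_lr(epoch, base_lr=1e-3, step_size=3, gamma=0.5):
    """Learning rate at a given zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate at the start of training and around the reported best epochs.
schedule = {epoch: step_lr(epoch) for epoch in (0, 3, 14, 20)}
```

Under these settings the learning rate has already been halved four times by epoch 14 (6.25 × 10^{-5}) and six times by epoch 20 (about 1.56 × 10^{-5}), which is consistent with validation performance saturating in that range.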

2.6.2. Training Settings for Defect Classification

For the mango defect classification experiment, the same ResNet backbone was employed, while the training configuration was adapted to the multi-label nature of defect recognition. During training, only random horizontal flipping with a probability of 0.5 was applied for data augmentation. Rotation, scaling, and affine transformations were intentionally excluded, as preliminary experiments showed that such augmentations consistently degraded classification accuracy. This performance drop is likely caused by partial cropping or distortion of small defect regions, which are critical for accurate defect identification. The batch size was fixed at 32 for all experiments. The network was trained using the Binary Cross-Entropy loss with logits (BCEWithLogitsLoss), which integrates a sigmoid activation internally to improve numerical stability. Optimization was performed using the Adam optimizer with an initial learning rate of 0.001. A StepLR scheduler was adopted to decay the learning rate by a factor of 0.5 every three epochs. Training was conducted for up to 50 epochs, with optimal validation performance typically observed between 15 and 25 epochs.
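The numerical-stability remark about the loss can be illustrated with a NumPy re-implementation. The sketch below uses the standard stable form max(x, 0) − x·y + log(1 + exp(−|x|)), which is algebraically equal to binary cross-entropy applied to sigmoid(x) but never exponentiates a large positive number. It is written for illustration only and is not the PyTorch source. Mean reduction over all elements is assumed.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    # Stable form: avoids computing sigmoid(x) and then log of a value near 0.
    loss = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(loss.mean())
```

With this formulation a confidently wrong logit such as −1000 for a positive label yields a finite loss of 1000, whereas computing log(sigmoid(−1000)) naively would underflow to log(0).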

3. Results

3.1. Evaluation Metrics

For the quality grading task, we adopt the Weighted Average Recall (WAR), which is consistent with the official evaluation metric used by the AI CUP organizers and other studies utilizing the same dataset. The metric is computed as shown in Equation (5):

\mathrm{WAR} = \sum_{i = 1}^{N} \omega_{i} \times \mathrm{Recall}_{i}

(5)

In the formula above, N denotes the total number of classes, ωᵢ represents the assigned weight for class i, and Recall_i corresponds to the recall for that particular class.
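Equation (5) can be computed as in the sketch below. The definition of the weights is not spelled out above, so the default here assumes the common convention that ω_i is the fraction of test samples belonging to class i. Custom weights can be passed instead. The function name is hypothetical.

```python
import numpy as np

def weighted_average_recall(y_true, y_pred, weights=None):
    """WAR = sum_i w_i * Recall_i over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = np.array([(y_pred[y_true == c] == c).mean() for c in classes])
    if weights is None:
        # Assumed convention: weight each class by its share of the samples.
        weights = np.array([(y_true == c).mean() for c in classes])
    return float(np.dot(weights, recalls))
```

Under the support-fraction convention, WAR is numerically identical to overall accuracy, since each class's recall is weighted by exactly the share of samples it contributes.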

For the five-category defect classification task, we employ two commonly used metrics in multi-label classification: Macro-F1 and Micro-F1, defined in Equations (6)–(8):

\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

(6)

\mathrm{Macro\text{-}F1} = \frac{1}{n} \sum_{i = 1}^{n} \mathrm{F1\text{-}score}_{i}

(7)

\mathrm{Micro\text{-}F1} = 2 \times \frac{\mathrm{precision}_{(mi)} \times \mathrm{recall}_{(mi)}}{\mathrm{precision}_{(mi)} + \mathrm{recall}_{(mi)}}

(8)
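The difference between the two averages matters for imbalanced data, so a small sketch may help. It computes both scores from binary multi-label matrices using the equivalent count form F1 = 2TP / (2TP + FP + FN). Micro-averaged precision and recall are assumed to be computed from true-positive, false-positive, and false-negative counts pooled over all classes, which is the usual definition. The function name `f1_scores` is hypothetical, and a class with no positives and no predictions is assigned an F1 of 0 by assumption.

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for (num_samples, num_classes) 0/1 arrays."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    tp = (t & p).sum(axis=0).astype(np.float64)
    fp = (~t & p).sum(axis=0).astype(np.float64)
    fn = (t & ~p).sum(axis=0).astype(np.float64)

    def f1(tp_, fp_, fn_):
        denom = 2.0 * tp_ + fp_ + fn_
        return np.where(denom > 0, 2.0 * tp_ / np.maximum(denom, 1.0), 0.0)

    macro = f1(tp, fp, fn).mean()          # Eq. (7): average of per-class F1
    micro = f1(tp.sum(), fp.sum(), fn.sum())  # Eq. (8): F1 of pooled counts
    return float(macro), float(micro)
```

If a frequent class is predicted perfectly on three samples and a rare class is missed on its only sample, Macro-F1 is 0.5 while Micro-F1 is about 0.857, which shows why Macro-F1 is the more sensitive indicator of minority-class performance.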

3.2. Comparison of Experimental Results

3.2.1. Quality Grading

To determine the appropriate model architecture, we conducted experiments using ResNet variants of different depths, as well as several alternative models (Table 1). Based on performance comparisons, ResNet-34 was selected as the primary backbone network for the spatial-domain branch.

To evaluate the contribution of each module within the proposed system, we performed a structured set of ablation studies using ResNet-34, which exhibited the most reliable performance in the backbone comparison. We first examined how different input resolutions influenced classification outcomes. The results indicate that larger resolutions, particularly 448 × 448, consistently led to higher accuracy.

In the image enhancement stage, different preprocessing techniques were examined. As shown in Table 2, Adaptive Contrast Enhancement (ACE) achieved the most significant improvement in classification performance, suggesting its effectiveness in emphasizing fine-grained surface textures and minor defects relevant to quality assessment.

Following the spatial-domain experiments, we evaluated the proposed framework in the frequency domain. Table 3 presents the results of both DCT-based and wavelet-based CNN models, along with corresponding ablation studies. Comparisons between Rows 1 and 2 indicate that applying ACE prior to frequency transformation remains beneficial, leading to consistent performance gains. Rows 3 and 4 show that input resolution has only a marginal impact on frequency-domain models, with an accuracy difference of approximately 0.3%. Rows 5 and 6 compare ResNet-18 and ResNet-34 backbones, revealing that increasing network depth slightly degrades performance in the frequency domain. Finally, the wavelet-based CNN achieved a higher WAR of 83.7% compared to 82.3% for the DCT-based model, as reported in Group 4.

To further improve performance, spatial-domain and frequency-domain predictions were integrated using a decision-level fusion strategy. The fusion results are summarized in Table 4, where the frequency-domain branch was implemented using either DCT-based or DWT-based CNNs. Two priority schemes, frequency-first and spatial-first, were evaluated. Under the spatial-first configuration with a confidence threshold of 0.2 and a DWT-based frequency branch, the proposed fusion framework achieved the highest accuracy of 87.2%, outperforming the best individual model by 0.4 percentage points. A comparison with existing studies is provided in Table 5.

3.2.2. Defect Classification

For mango defect classification, we again evaluated multiple backbone architectures, guided by findings from the quality grading task. As shown in Table 6, ResNet-34 remained the most effective backbone for the five-class defect classification problem.

An analysis of per-class performance revealed substantial variation across defect categories. Precision values for classes D1–D5 were 83%, 46%, 92%, 99%, and 88%, respectively. The D2 class exhibited notably low precision due to its severe data scarcity, as illustrated in Figure 1b. Although this imbalance had limited impact on micro-averaged metrics, the Macro-F1 score decreased to 68.2%, indicating degraded performance on minority classes. To address this issue, several data augmentation strategies were explored. FastGAN [31] failed to generate effective D2 samples due to dominance by majority classes, while GANs fine-tuned exclusively on D2 produced low-quality images (Figure 12).

Subsequently, we applied the Copy-Paste sample synthesis method [19], as described in Section 2.4, to the D2 defect category. Through this approach, the D2 dataset was expanded to 672 images, as illustrated in Figure 11. Incorporating this augmented dataset resulted in a significant improvement in model performance.

To systematically evaluate the contribution of each component in the proposed framework, we conducted a series of ablation experiments, as summarized in Table 7. In addition, the final performance of our method was compared with results reported in related studies, as shown in Table 8.

In Table 7, Rows 1 and 2 compare performance without and with the Copy-Paste strategy used to increase the number of D2 samples. The comparison between Rows 2 and 3 demonstrates the effectiveness of introducing self-attention modules at the fourth stage of the ResNet-34 backbone. Rows 3 and 4 were evaluated under identical training configurations and model parameters, with the only difference being the evaluation dataset. The test data used in the first three rows were also employed to monitor training convergence, without providing any feedback for parameter adjustment or model selection. In contrast, Row 4 reports performance on a fully independent test set that was never accessed during the training process. On the independent test set, Precision decreased from 88% to 83.8%, whereas the Macro-F1 score improved from 74.6% to 78.2%, which supports the generalization capability of the proposed framework.

To account for the stochastic nature of neural network training, we take the defect classification task as an example and repeat the experiment three times using the Copy-Paste + Self-Attention configuration. Each run is conducted with identical hyperparameter settings but different random seeds. The mean precision is 84.3% with a standard deviation of 0.55%, indicating that the proposed method exhibits stable and consistent performance across multiple runs.

4. Discussion

In the model selection experiments for the mango quality grading task, the superior performance of ResNet-34 indicates that a moderate network depth can achieve better generalization capability without overfitting the training set. Although MobileNetV2 exhibits slightly lower recognition performance than ResNet-34, its computational cost is significantly lower (0.3 GFLOPs compared to 3.6 GFLOPs for ResNet-34). This highlights the potential of MobileNetV2 for deployment on resource-constrained platforms. Subsequent spatial-domain experiments further demonstrate that higher input resolutions are particularly beneficial for mango quality grading, emphasizing the importance of preserving fine-grained surface details. The effectiveness of the ACE algorithm also suggests that contrast enhancement facilitates the extraction of discriminative features. In contrast, the frequency-domain experiments show relatively lower sensitivity to input resolution. This may be attributed to the fact that frequency-domain representations rely less on precise spatial locations and instead focus more on global texture characteristics and spectral distributions. Furthermore, deeper network architectures in the frequency-domain branch do not yield additional performance gains, implying that excessive network depth may introduce redundant features or hinder optimization when processing transformed inputs. The strong performance of wavelet-based CNNs further supports the advantage of multi-scale decomposition in capturing feature information at different scales [22].

The decision-level fusion strategy (late fusion) effectively exploits the complementary nature of spatial-domain and frequency-domain information. By selecting either spatial-domain or frequency-domain predictions based on the confidence differences across classes, the proposed system leverages the strengths of both domains, resulting in improved robustness and overall performance. In future work, we plan to further investigate feature-level fusion (early fusion) by integrating spatial-domain and frequency-domain CNN networks more tightly.

In the defect classification experiments, the results highlight the challenges posed by extreme class imbalance. Conventional data augmentation methods based on generative adversarial networks (GANs) perform poorly when training data are severely limited, whereas the copy-paste strategy effectively enriches samples of minority classes. However, it should be noted that the manual filtering of unrealistic synthesized defect images may introduce a certain degree of subjectivity. For example, determining what constitutes an unrealistic defect size can be ambiguous. In addition, the scalability of this manual filtering process, as well as its applicability to other fruit varieties or defect types, remains an open issue and warrants further investigation.

In addition, ablation studies indicate that introducing a self-attention module into the layer4 stage of ResNet-34 improves classification accuracy without incurring excessive computational overhead. As demonstrated in Table 8, reformulating the original mango defect detection task as a multi-label defect classification problem not only alleviates the substantial annotation burden associated with pixel-level defect masks but also leads to improved classification performance.
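For readers unfamiliar with attention over convolutional feature maps, the following NumPy sketch shows one common form: scaled dot-product self-attention across spatial positions with a residual connection, in the spirit of non-local blocks [23]. It is an assumption-laden illustration, not the authors' module: the projection matrices `w_q`, `w_k`, `w_v`, `w_o`, the embedding size of 64, and the scaling by the square root of the embedding size are all choices made for this example. The 512-channel, 7 × 7 input corresponds to the last stage of a standard ResNet-34 at 224 × 224 resolution.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, w_q, w_k, w_v, w_o):
    """Self-attention over the spatial positions of one feature map.

    feat: (C, H, W) activations; w_q, w_k, w_v: (C, D); w_o: (D, C).
    Returns an array with the same shape as `feat`.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T                 # (HW, C): one token per position
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # (HW, D) each
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)   # (HW, HW) weights
    y = attn @ v                                 # each position mixes all others
    out = x + y @ w_o                            # residual connection
    return out.T.reshape(c, h, w)

rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 7, 7))          # stand-in for a layer4 output
d = 64
w_q, w_k, w_v = (rng.standard_normal((512, d)) * 0.01 for _ in range(3))
w_o = rng.standard_normal((d, 512)) * 0.01
out = self_attention(feat, w_q, w_k, w_v, w_o)
print(out.shape)                                 # (512, 7, 7)
```

The attention matrix has only 49 × 49 entries at this stage, which is consistent with the observation that the module adds little computational overhead when placed late in the network.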

In multi-label classification, each class is associated with an independent decision threshold and produces probability outputs in the range of 0–1, rather than enforcing a normalized probability distribution across classes as in multi-class classification. Consequently, threshold optimization for individual classes in multi-label classification [32] represents an important research direction. In future work, we plan to further investigate adaptive threshold optimization strategies to enhance multi-label classification performance.
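The role of per-class thresholds can be made concrete with a small sketch. It applies independent sigmoid outputs to five defect classes, decodes them into a bit string in the D1 to D5 order used in Figure 10, and includes a simple grid search that picks one threshold per class by F1 on held-out data. The logits, the threshold values, and the helper names `decode` and `tune_thresholds` are hypothetical; the grid search is one straightforward option, not the strategy of [32].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(probs, thresholds):
    """Turn (N, C) probabilities into bit strings such as '01010' (D1..D5)."""
    bits = (probs >= thresholds).astype(int)
    return ["".join(str(b) for b in row) for row in bits]

def f1(y, p):
    """Binary F1 from boolean arrays; 0.0 if there are no positives at all."""
    tp = np.sum(y & p)
    fp = np.sum(~y & p)
    fn = np.sum(y & ~p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_thresholds(probs, y_true, grid):
    """For each class, pick the grid value with the highest F1 on held-out data."""
    y_true = np.asarray(y_true, dtype=bool)
    best = []
    for c in range(probs.shape[1]):
        scores = [f1(y_true[:, c], probs[:, c] >= t) for t in grid]
        best.append(grid[int(np.argmax(scores))])
    return np.array(best)

# One image whose D2 score is weak, as may happen for a rare class.
probs = sigmoid(np.array([[-2.0, -0.5, -1.0, 0.8, -3.0]]))
print(decode(probs, np.full(5, 0.5)))                        # ['00010']
print(decode(probs, np.array([0.5, 0.3, 0.5, 0.5, 0.5])))    # ['01010']
```

With a uniform threshold of 0.5 the D2 label is missed, whereas a lower threshold for that class alone recovers it without affecting the other four decisions.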

5. Conclusions

This study proposed an integrated deep learning framework that simultaneously addresses mango quality grading and surface defect classification. For defect classification, the recognition of five types of surface defects was formulated as a multi-label classification problem, and a copy-paste–based data augmentation strategy was introduced to alleviate class imbalance, particularly for rare defect categories. Although generative approaches such as FastGAN were evaluated, they proved ineffective for extremely scarce classes, whereas the copy-paste augmentation improved performance.

For quality grading, decision-level fusion of the output probability scores from spatial-domain and frequency-domain CNNs, together with adaptive contrast enhancement, effectively captured fine surface textures essential for grade discrimination. Experimental results demonstrate that the proposed system achieves strong performance, with a Micro-F1 score of 84% for defect classification and a weighted average recall (WAR) of 87.2% for quality grading.

Despite these advantages, several limitations remain. The proposed framework was evaluated on a single mango dataset, and its generalization capability to other mango varieties or fruit types requires further validation. In addition, the effectiveness of adaptive contrast enhancement for fruits with less distinctive surface textures, such as apples, remains to be investigated.

At the current stage, the proposed system demonstrates practical applicability for industrial production environments, with the potential to reduce reliance on manual visual inspection and alleviate operator fatigue in conventional mango sorting lines. Looking forward, integrating ripeness estimation and other maturity-related indicators into the dual-pipeline framework represents a promising research direction. Such extensions would further enhance the completeness of automated mango evaluation and contribute to more intelligent, efficient, and scalable smart agriculture systems.

Author Contributions

Conceptualization, S.L.; methodology, S.L. and H.C.; software, H.C.; validation, H.C.; formal analysis, S.L. and H.C.; investigation, H.C.; resources (dataset provision), AI CUP; data curation, H.C.; writing—original draft preparation, H.C.; writing—review and editing, S.L. and H.C.; visualization, H.C.; supervision, S.L.; project administration, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available from AI CUP: Mango Grades Classification Competition, Aidea, 2020. Available online: https://www.aidea-web.tw/topic/72f6ea6a-9300-445a-bedc-9e9f27d91b1c (accessed on 20 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Shen, X.; Li, L.; Ma, Y.; Xu, S. VLCIM: A Vision–Language Cyclic Interaction Model for Industrial Defect Detection. IEEE Trans. Instrum. Meas. 2025, 74, 2538713.
2. Kong, C.; Chen, B.; Li, H.; Wang, S.; Rocha, A.; Kwong, S. Detect and Locate: Exposing Face Manipulation by Semantic- and Noise-Level Telltales. IEEE Trans. Inf. Forensics Secur. 2022, 17, 1741–1756.
3. Hayat, A.; Morgado-Dias, F.; Choudhury, T.; Singh, T.P.; Kotecha, K. FruitVision: A Deep Learning Based Automatic Fruit Grading System. Open Agric. 2024, 9, 20220276.
4. Chuquimarca, L.E.; Vintimilla, B.X.; Velastin, S.A. A Review of External Quality Inspection for Fruit Grading Using CNN Models. J. Agric. Food Res. 2024, 14, 1–20.
5. Chopra, H.; Singh, H.; Bamrah, M.S.; Mahbubani, F.; Verma, A.; Hooda, N.; Rana, P.S.; Singla, R.K.; Singh, A.K. Efficient Fruit Grading System Using Spectrophotometry and Machine Learning Approaches. IEEE Sens. J. 2021, 21, 16162–16169.
6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
7. Matsui, A.; Ishibashi, R.; Meng, L. YOLO Loss Optimization for Detecting Fruit Defects. In Proceedings of the IIKI 2024, Shiga, Japan, 7 December 2024; pp. 6–11.
8. Wu, W.-T.; Lin, C.-S. Analysis and Prediction for Defective Irwin Mangos Based on Neuron Networks for Image Recognition. Int. J. Sci. Eng. 2022, 12, 91–108.
9. Su, W.; Yang, Y.; Zhou, C.; Zhuang, Z.; Liu, Y. Multiple Defect Classification Method for Green Plum Surfaces Based on Vision Transformer. Forests 2023, 14, 1323.
10. Chuquimarca, L.; Vintimilla, B.; Velastin, S. Classifying Healthy and Defective Fruits with a Multi-Input Architecture and CNN Models. arXiv 2024, arXiv:2410.11108.
11. Lee, J.-H.; Vo, H.-T.; Kwon, G.-J.; Kim, H.-G.; Kim, J.-Y. Multi-Camera-Based Sorting System for Surface Defects of Apples. Sensors 2023, 23, 3968.
12. Nithya, R.; Santhi, B.; Manikandan, R.; Rahimi, M.; Gandomi, A.H. Computer Vision System for Mango Fruit Defect Detection Using Deep Convolutional Neural Network. Foods 2022, 11, 3483.
13. Li, L.-H.; Jiang, L.-Q.; Peng, Y.-F.; Liu, Y.-S.; Chung, K.-L. Combining Conditional Generative Adversarial Networks and YOLOv4 for Mango Classification. In Proceedings of the 2022 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Tainan, Taiwan, 1–3 December 2022; pp. 54–59.
14. Wu, S.-L.; Tung, H.-Y.; Hsu, Y.-L. Deep Learning for Automatic Quality Grading of Mangoes: Methods and Insights. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 15–17 December 2020; pp. 446–453.
15. Naik, S. Non-Destructive Mango (Mangifera Indica L., CV. Kesar) Grading Using Convolutional Neural Network and Support Vector Machine. In Proceedings of the International Conference on Sustainable Computing in Science, Technology and Management (SUSCOM), Jaipur, India, 26–28 February 2019.
16. Sun, H.; Zhang, S.; Ren, R.; Su, L. Surface Defect Detection of “Yuluxiang” Pear Using Convolutional Neural Network with Class-Balance Loss. Agronomy 2022, 12, 2076.
17. Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet Convolutional Neural Networks for Texture Classification. arXiv 2017, arXiv:1707.07394.
18. Narendra, P.M.; Fitch, R.C. Real-Time Adaptive Contrast Enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 1981, 3, 655–661.
19. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.-Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation. arXiv 2021, arXiv:2012.07177.
20. AI CUP. Mango Grades Classification Competition. Aidea. 2020. Available online: https://www.aidea-web.tw/topic/72f6ea6a-9300-445a-bedc-9e9f27d91b1c (accessed on 20 March 2025).
21. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jägersand, M. U²-Net: Going Deeper with Nested U-Structure for Salient Object Detection. Pattern Recognit. 2020, 106, 107404.
22. Zhao, X. Wavelet-Attention CNN for Image Classification. arXiv 2022, arXiv:2201.09271.
23. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
24. Ministry of Agriculture, Taiwan. Agricultural Knowledge Portal—Mango Theme Pavilion. Available online: https://kmweb.moa.gov.tw/subject/subject.php?id=10899 (accessed on 19 January 2026).
25. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 8026–8037.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
27. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
28. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 2, pp. 1097–1105.
29. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
30. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
31. Liu, B.; Zhu, Y.; Song, K.; Elgammal, A. Towards Faster and Stabilized GAN Training for High-Fidelity Few-Shot Image Synthesis. arXiv 2021, arXiv:2101.04775.
32. Lin, Y.-J.; Lin, C.-J. On the Thresholding Strategy for Infrequent Labels in Multi-Label Classification. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023), Birmingham, UK, 21–25 October 2023; ACM: New York, NY, USA, 2023; pp. 1–10.

Figure 1. Bar charts showing the dataset distribution for the mango experiments: (a) mango quality grading and (b) mango defect classification.

Figure 2. Sample images used in the mango quality grading experiment. From left to right are Grade A, Grade B, and Grade C mangoes.

Figure 3. Sample images used in the mango defect classification experiment. From the top-left to the top-right are: (1) latex adhesion defect (D1), (2) mechanical damage (D2), and (3) anthracnose (D3). From the bottom-left to the bottom-right are: (4) no defect, (5) black spot disease (D5), and (6) uneven color (D4).

Figure 4. Comparison of mango images before and after background removal using U²-Net.

Figure 5. Texture variation map produced by applying the ACE sliding-window computation to the detailed mango image.

Figure 6. Changes in surface texture before and after applying Adaptive Contrast Enhancement (ACE) (left) and the comparison of pixel grayscale distributions (right).

Figure 7. Flowchart of the proposed method for the three-category mango quality classification experiment, where the left branch represents the frequency-domain pipeline and the right branch represents the spatial-domain pipeline.

Figure 8. Flowchart of the decision-level fusion process (spatial-priority mode).

Figure 9. Overview of the proposed framework for five-class mango defect classification.

Figure 10. Interface of the fast labeling tool (the final label shown is 01010, corresponding to defect categories D2 and D4).

Figure 11. Illustration of the Copy-Paste image synthesis method.

Figure 12. The upper row shows mango images generated by training Fast-GAN with 6000 mango images; the lower row shows results generated by transfer learning using only 177 D2-category (scratch) images.

Table 1. Performance comparison of different neural network models applied to mango quality classification.

Model	Precision (WAR)
ResNet-34 [26]	86.8%
ResNet-50	85.3%
ResNet-101	85.9%
MobileNetV2 [27]	85.8%

Table 2. Ablation study on the effects of image processing techniques and input resolutions using the ResNet-34 backbone.

Background Removal	Contrast Enhancement	Image Size	WAR
X (raw image)	X	224	84.0%
U²-net	X	224	85.0%
U²-net	X	448	86.3%
U²-net	CLAHE	224	84.4%
U²-net	CLAHE	448	85.8%
U²-net	ACE	224	86.3%
U²-net	ACE	448	86.8%

Table 3. Evaluation of Frequency-Domain Model Performance under Various Parameter Configurations.

Group	Method	Image Size	Model	ACE	WAR (%)
1	DCT	224	ResNet-18	X	81.2
1	DCT	224	ResNet-18	V	82.3
2	DCT	224	ResNet-18	V	82.3
2	DCT	448	ResNet-18	V	82
3	DCT	448	ResNet-18	V	82
3	DCT	448	ResNet-34	V	81
4	DCT	224	ResNet-18	V	82.3
4	DWT	224	ResNet-18	V	83.7
5	DWT	448	ResNet-18	V	83.5
5	DWT	224	ResNet-18	V	83.7

Table 4. Effectiveness of Decision-Level Fusion of Output Probability Scores from Spatial- and Frequency-Domain CNNs (Th denotes threshold).

Frequency Method	Frequency-Domain WAR (%)	Spatial-Domain WAR (%)	Priority	Th	Fused WAR (%)
DCT	82.3	86.8	frequency	0.83	86.7
DCT	82.3	86.8	spatial	0.16	86.9
DWT	83.7	86.8	frequency	0.7	87.0
DWT	83.7	86.8	spatial	0.2	87.2

Table 5. Comparison of the proposed method with different methods.

	Model	WAR
Taiwan AI CUP [20]	AlexNet [28]	69.6%
	VGG16 [29]	64.5%
	DenseNet [30]	52.4%
	Top 1	83.6%
CGAN + YOLOV4 [13] (416 × 416)	CGAN + YOLOV4	84.88%
Proposed method (spatial domain 224 × 224)	ResNet34 + ACE	86.3%
Proposed method (spatial domain 448 × 448)	ResNet34 + ACE	86.8%
Proposed method (frequency domain 224 × 224 + spatial domain 448 × 448)	DWT + spatial domain	87.2%

Table 6. Accuracy comparison of various model architectures for the mango defect classification task.

Model	Precision (Average = Samples)
ResNet-18	83.78%
ResNet-34	85.77%
ResNet-50	83%

Table 7. Ablation study on classification accuracy improvements achieved by different components (the asterisk (*) in the fourth row indicates that an independent test set was used).

Model	Copy-Paste	Self-Attention	Precision (ma)	Macro-F1	Micro-F1
ResNet34	x	x	82.2%	68.2%	83.5%
ResNet34	v	x	87.8%	72%	84.3%
ResNet34	v	v	88%	74.6%	85.7%
ResNet34 *	v	v	83.8%	78.2%	84%

Table 8. Comparison of the proposed method with different methods.

	Model	Precision (ma)	Macro-F1	Micro-F1
Mask R-CNN + GrabCut [8]	Mask R-CNN	62.7%	63.9%	X
Proposed method	ResNet34 (multi-label)	82.2%	68.2%	83.5%
Proposed method	ResNet34 + Copy-Paste + self-attention	83.8%	78.2%	84%


Share and Cite

MDPI and ACS Style

Lin, S.; Chiu, H. From Quality Grading to Defect Recognition: A Dual-Pipeline Deep Learning Approach for Automated Mango Assessment. Electronics 2026, 15, 549. https://doi.org/10.3390/electronics15030549