An Efficient Multi-Scale Feature Fusion Network for Tiny Defect Detection on Ceramic Cup Surfaces

Xiao, Shikang; Deng, Xiaojun; Sun, Yuanhao

doi:10.3390/pr14101560


by Shikang Xiao, Xiaojun Deng * and Yuanhao Sun

School of Computer Science and Artificial Intelligence, Hunan University of Technology, Zhuzhou 412007, China

* Author to whom correspondence should be addressed.

Processes 2026, 14(10), 1560; https://doi.org/10.3390/pr14101560

Submission received: 1 April 2026 / Revised: 8 May 2026 / Accepted: 11 May 2026 / Published: 12 May 2026

(This article belongs to the Section Automation Control Systems)


Abstract

In ceramic cup manufacturing, manual inspection is prone to missed detections and false positives, particularly for small surface defects. To address these challenges, this study presents a YOLOv11m-based detection framework, termed CEL-YOLOv11m, for precise identification of small-scale defects on ceramic surfaces. Specifically, a multi-scale convolution module (EMSC) is introduced to enhance the backbone feature extraction structure. By integrating convolution kernels of varying sizes, the module improves multi-scale feature representation, while grouped convolution is employed to reduce computational overhead. In the feature aggregation stage, a CRGseg-based structure is incorporated, and a refinement component (RCM) is designed to strengthen fine-grained information for small targets. Additionally, a cross-scale feature fusion strategy is applied to improve contextual representation across different resolutions. For the detection stage, a Layer-shared Detail-Enhanced Convolutional Detection Head (LSDECD) is adopted to improve fine-grained localization while reducing computation through parameter sharing. Experiments conducted on a self-constructed ceramic defect dataset and the VisDrone2019 benchmark show that the proposed framework achieves competitive performance compared with representative methods. On the ceramic defect dataset, the model attains an mAP@0.5 of 54.8% with an inference speed of 89.9 FPS, providing a favorable trade-off between detection accuracy and computational efficiency while maintaining strong precision in small defect detection.

Keywords:

ceramic cup defects; YOLOv11; multi-scale features; object detection

1. Introduction

As essential daily-use products, porcelain cups are widely used in many fields. However, due to the complexity of the manufacturing process, surface defects such as blemishes, voids, and cracks are likely to occur during firing and post-processing stages. Serious defects can directly affect both the functionality and appearance of porcelain cups, leading to economic losses for manufacturers. At present, many enterprises still rely on traditional manual visual inspection methods [1], which depend heavily on the subjective judgment of inspectors. This often results in low detection efficiency, missed detections, and false alarms, making it difficult to satisfy the demands of large-scale industrial production. Therefore, developing an accurate and efficient defect detection method for ceramic manufacturing [2] is of considerable practical and economic significance.

Traditional machine vision methods have been widely explored for surface defect detection. For example, Li Junhua et al. [3] combined an improved scale-invariant feature transform algorithm with color moment fusion features and employed a support vector machine (SVM) for defect classification. Mingshuai Yin et al. [4] adopted Shift-Invariant Wavelet Transform (SWT), Fast Fourier Transform (FFT), low-pass filtering, defect extraction, and edge detection for defect identification. Although these methods can provide relatively stable results while reducing manual intervention, they often exhibit limited robustness under complex backgrounds and insufficient flexibility when facing new defect categories.

With the rapid development of deep learning, ceramic surface defect detection methods can generally be divided into one-stage and two-stage detectors. One-stage methods include the YOLO and SSD series, while two-stage methods include Fast R-CNN [5], Faster R-CNN [6], and Mask R-CNN [7]. Owing to its high inference speed and practical adaptability, the YOLO family has been widely adopted in ceramic defect inspection tasks. For instance, Pan Jinjing et al. [8] proposed an improved YOLOv5-based ceramic surface defect detector by introducing a global attention mechanism and optimizing the loss function. Wu Hangxing et al. [9] developed a ceramic tile defect detection method based on YOLOv5s by incorporating the CIoU metric into anchor matching and enhancing the backbone residual structure. Yu Songsen et al. [10] proposed a large-format tile defect detection method based on YOLOv8, where large separable kernel attention and full-dimensional dynamic convolution were introduced to improve small-defect recognition.

Recent studies have further demonstrated the importance of feature enhancement for tiny-object detection. Bai et al. [11] proposed SFFEF-YOLO for UAV imagery, which combines fine-grained feature extraction with feature fusion and achieved strong performance on the VisDrone2019 benchmark. Gu et al. [12] introduced a multimodal dynamic gated fusion framework for UAV object detection. These studies indicate that effective feature representation and multi-scale fusion are crucial for tiny-target detection, which also motivates the present work on ceramic surface defect inspection.

2. YOLOv11 Network Architecture

YOLO (You Only Look Once) is a fast single-stage object detection network that formulates object detection as a regression task by directly predicting bounding box coordinates and class probabilities from input images. The object detection experiments in this study were implemented using Ultralytics YOLO11 (version 11.0.0, Ultralytics Inc., Frederick, MD, USA), hereafter referred to as YOLOv11 in this manuscript. Building upon previous YOLO architectures, YOLOv11 introduces the C3k2 and C2PSA modules and further extends the NMS-free training strategy of YOLOv10, enabling end-to-end object detection while improving detection performance and efficiency.

The YOLOv11 model is available in multiple scales, including YOLOv11n, YOLOv11s, YOLOv11m, YOLOv11l, and YOLOv11x. Among them, YOLOv11m provides a favorable balance between detection accuracy and inference speed, making it suitable for ceramic surface defect detection tasks in industrial production environments.

As shown in Figure 1, the YOLOv11 network mainly consists of three components: Backbone, Neck, and Head. The Backbone is responsible for feature extraction, where the traditional C2f structure is replaced by the C3k2 module. The C3k module adopts customizable convolution kernels, allowing flexible adjustment according to different computational resources and task requirements. In addition, the C2PSA module, which is based on the Pyramid Squeeze Attention (PSA) mechanism, is introduced to enhance feature representation capability through attention modeling. The Neck continues to adopt the FPN and PANet structures for multi-scale feature fusion.

3. CEL-YOLOv11m

To reduce missed detections and false detections of defective products on the porcelain cup production line, this paper develops the CEL-YOLOv11m framework, whose overall network structure is shown in Figure 2. First, the EMSC (Efficient Multi-Scale Conv) structure is designed to enhance the C3k2 structure of the YOLOv11m Backbone. Second, the CGRFPN (Context-Guided Spatial Feature Reconstruction Feature Pyramid Network) structure is used to improve the Neck. Finally, the LSDECD (Layer-Shared Detail-Enhanced Convolutional Detection Head) is introduced in the Head to enhance its detail-capturing capability and further improve detection accuracy.

3.1. A Multi-Scale Convolution Module

The design of the EMSC structure combines the concepts of GhostNet [13] and MobileNet [14], as shown in Figure 3. Its core idea is to extract multi-scale feature information using convolutions with a low parameter count and low computational complexity. The feature maps output by the intermediate layers of a model usually contain rich and sometimes redundant features, which can be generated from other feature maps using operations with few parameters. The EMSC structure first divides the input feature maps into four equal parts along the channel dimension, and then performs independent convolutions on these parts using kernels of 1 × 1, 3 × 3, 5 × 5, and 7 × 7, respectively. Because each group is processed separately and no information is exchanged between groups, a pointwise convolution (PWConv) is applied afterwards to fuse the channel features. This grouping strategy performs multi-scale feature extraction in parallel, enhancing the ability to capture both global and local information. The savings in computation and parameters grow as the number of input channels increases in deeper layers. For this reason, only the last C3k2 structure of the Backbone is replaced.
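To make the efficiency argument concrete, the following minimal sketch counts the weights of the layout described above and compares them with a single dense convolution over all channels. The function names, the choice of 64 channels, and the assumption that each group maps its slice to the same width without bias terms are illustrative only and are not taken from the authors' implementation.

```python
def emsc_param_count(channels, kernel_sizes=(1, 3, 5, 7)):
    """Weights in a grouped multi-kernel stage followed by a pointwise fusion.

    Assumes (hypothetically) that the channels split evenly across the kernel
    sizes, that each group keeps its width, and that biases are ignored.
    """
    groups = len(kernel_sizes)
    if channels % groups != 0:
        raise ValueError("channels must be divisible by the number of groups")
    per_group = channels // groups
    grouped = sum(per_group * per_group * k * k for k in kernel_sizes)
    pointwise = channels * channels  # 1 x 1 PWConv mixing all channels
    return grouped + pointwise


def dense_conv_param_count(channels, kernel_size):
    """Weights in one standard convolution with equal input/output width."""
    return channels * channels * kernel_size * kernel_size


if __name__ == "__main__":
    c = 64
    print("multi-kernel + PWConv:", emsc_param_count(c))
    print("dense 3x3:", dense_conv_param_count(c, 3))
    print("dense 7x7:", dense_conv_param_count(c, 7))
```

Under these assumptions, the grouped layout covers receptive fields up to 7 × 7 with fewer weights than one dense 3 × 3 layer, and the absolute saving grows with the square of the channel count.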

3.2. CRGseg

The Neck of YOLOv11 still follows the PANet [15] structure for fusing features. The bottom-up path aggregation of PANet enhances the utilization of high-level semantic information, but it still fails to enhance the detailed information of small targets, thus limiting the performance of small target detection. In addition, PANet may lead to redundant transfer of information across different feature layers during the feature aggregation process. Furthermore, the details of low-level features (e.g., texture information) may be lost during high-level semantic fusion. In this paper, we introduce the spatial feature reconstruction network structure CRGseg [16], which first models the axial global context of the rectangular key region by capturing the global context in both horizontal and vertical directions through the rectangular self-calibration module (RCM). Second, we use the pyramid context extraction module (PCE), which effectively integrates feature information at different levels and enhances the context-awareness of the model. Third, the Fuse Block Multi (FBM) and Dynamic Interpolation Fusion (DIF) modules are used to perform multi-scale feature fusion through dynamic interpolation. The DIF module enables efficient integration of multi-scale features, thereby improving the model’s target recognition ability and multi-scale feature representation capability, especially in complex backgrounds.

3.3. Rectangular Self-Calibration Module

The rectangular self-calibration module (RCM) consists of a rectangular self-calibration attention mechanism (RCA), a batch normalization (BN) layer, and a multilayer perceptron (MLP), as shown in Figure 4.

Rectangular self-calibration attention captures the global context along both axes through horizontal and vertical pooling, and then models the rectangular region of interest using broadcast addition. A shape self-calibration function is also invoked to calibrate the region of interest, aligning it more closely with the foreground object. The shape self-calibration function first calibrates the shape in the horizontal direction using horizontal strip convolutions, adjusting each row of elements to align the horizontal shape with the foreground object. The features are then batch-normalized, and nonlinearities are introduced using the ReLU activation function. Similarly, in the vertical direction, vertical strip convolutions are used to calibrate the shape. The calibration is performed using two strip convolutions with large kernels, which decouple the convolutions in both directions, allowing them to adapt to various shapes. The shape self-calibration function ξ_{C} can be expressed as in Equation (1):

ξ_{C}(\bar{y}) = δ(ψ_{k \times 1}(ϕ(ψ_{1 \times k}(\bar{y}))))    (1)

where \bar{y} denotes the input to the function (the broadcast sum of the pooled features, see Equation (3)), ψ denotes the large-kernel strip convolution, k denotes the length of the strip convolution kernel, ϕ denotes batch normalization (BN) followed by ReLU activation, and δ denotes the sigmoid function.

A feature fusion function ξ_{F} is invoked to further extract the local details of the features using a 3 × 3 depthwise convolution. The calibrated attention features then weight the result through the Hadamard product, fusing the attention features with the input features, as shown in Equation (2):

ξ_{F}(x, y) = ψ_{3 \times 3}(x) ⊙ y    (2)

where ψ_{3 \times 3} denotes a depthwise convolution with a 3 × 3 kernel, y is the attention feature computed by Equation (1), and ⊙ denotes the Hadamard product. The overall structure of the RCM is shown in Equation (3):

F_{out} = ρ(ξ_{F}(x, ξ_{C}(H_{P}(x) \oplus V_{P}(x)))) + F_{in}    (3)

where x denotes the input feature (written F_{in} in the residual term), \oplus denotes broadcast addition, H_{P} denotes the horizontal pooling operation, V_{P} denotes the vertical pooling operation, and ρ denotes batch normalization (BN) followed by the multilayer perceptron (MLP).
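The pooled term H_{P}(x) \oplus V_{P}(x) in Equation (3) can be illustrated on a single-channel feature map. The sketch below assumes average pooling and assumes that horizontal pooling collapses each row while vertical pooling collapses each column; the strip convolutions, sigmoid, depthwise convolution, and MLP are deliberately omitted, and all function names are hypothetical.

```python
def horizontal_pool(x):
    """Average each row over its width: H x W -> list of H values."""
    return [sum(row) / len(row) for row in x]


def vertical_pool(x):
    """Average each column over its height: H x W -> list of W values."""
    height = len(x)
    return [sum(row[j] for row in x) / height for j in range(len(x[0]))]


def broadcast_add(col, row):
    """Combine an H-vector and a W-vector into an H x W map."""
    return [[c + r for r in row] for c in col]


def rectangular_context(x):
    """Broadcast sum of the two pooled vectors for one channel."""
    return broadcast_add(horizontal_pool(x), vertical_pool(x))


if __name__ == "__main__":
    feature = [[1.0, 2.0],
               [3.0, 4.0]]
    print(rectangular_context(feature))
```

Each output position combines the statistics of its own row and its own column, which is why the resulting attention map highlights rectangular regions.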

3.4. Pyramid Context Extraction Module

The structure of the PCE module is shown in Figure 5. The P3, P4, and P5 features obtained from the backbone network are first subjected to pyramid pooling operations and then concatenated to obtain multi-scale feature information. The concatenated feature map passes through three rectangular self-calibration modules (RCM), which dynamically calibrate the input features through adaptive adjustment of rectangular features, enabling the model to learn key regional features more accurately. The RCM dynamically assigns weights to the feature map and performs region-based feature reinforcement to reduce the influence of irrelevant regions on the feature representation. The calibrated feature map is then split to recover the processed P3, P4, and P5 features. The PCE module effectively integrates feature information at different levels and enhances the feature maps at different scales after concatenation, improving the context-awareness capability of the model.

3.5. The Fuse Block

The structure of the FBM module is shown in Figure 6; it fuses multi-scale features efficiently through a dynamic attention mechanism. First, the low-resolution and high-resolution feature maps each pass through an independent convolutional layer. The convolved low-resolution feature map then passes through the sigmoid activation function to produce a dynamic weight map whose values lie in the range [0, 1]. Subsequently, the weight map is up-sampled using bilinear interpolation to match the spatial size of the high-resolution feature map and is multiplied element-wise with the extracted high-resolution features to achieve weighted fusion. This module effectively integrates multi-scale features, uses the attention mechanism to dynamically regulate the contributions of different features, and emphasizes the key regions, thus improving the model's ability to represent multi-scale information.
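The gating step can be sketched on single-channel maps as follows. This is a simplified illustration under several assumptions: the two convolutional layers are omitted, nearest-neighbor repetition stands in for bilinear interpolation, the scale factor is fixed at 2, and the gate is taken from the coarser map (which is what up-sampling the weight map implies). All names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def upsample_nearest_2x(m):
    """Double height and width by repeating values (stand-in for bilinear)."""
    out = []
    for row in m:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def gated_fusion(fine, coarse):
    """Weight a fine map by a sigmoid gate computed from a coarser map."""
    gate = upsample_nearest_2x([[sigmoid(v) for v in row] for row in coarse])
    if len(gate) != len(fine) or len(gate[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the coarse map size")
    return [[f * g for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(fine, gate)]


if __name__ == "__main__":
    fine = [[2.0, 4.0], [6.0, 8.0]]
    coarse = [[0.0]]  # sigmoid(0) = 0.5, so every fine value is halved
    print(gated_fusion(fine, coarse))
```

Because the gate lies in [0, 1], it can only attenuate features, so regions that receive a gate near 1 are emphasized relative to the rest of the map.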

The structure of the DIF module is shown in Figure 7, which achieves efficient fusion of multi-scale features through interpolation and convolution operations. First, it receives two input feature maps, one of which is spatially resized by bilinear interpolation upsampling to match the size of the other feature map. Then, the channel dimension is transformed using a 1 × 1 convolution. Finally, the interpolated feature maps and the target-sized feature maps undergo element-wise summation to achieve information fusion. DIF is structurally simple and computationally efficient, and it can quickly align and fuse features of different scales, effectively improving the model's ability to recognize targets in complex backgrounds.

3.6. LSDECD

The detection head in the original YOLOv11 model contains detection layers for the P3, P4, and P5 scales. On porcelain cup surfaces, defects of the same type can vary in size and shape, and light reflection, complex backgrounds, and other factors make them harder to detect. At the same time, the head must remain lightweight to maintain high inspection throughput. To address these problems, a shared-parameter head, the Layer-Shared Detail-Enhanced Convolutional Detection Head (LSDECD), is introduced into YOLOv11. The structure is shown in Figure 8, where the features at the three scales first undergo independent convolution operations and Group Normalization (GN). Detail-enhanced convolution (DEConv) then replaces standard convolution to improve feature representation and generalization ability, thereby enhancing detection performance.

The LSDECD detection head first applies an independent 1 × 1 convolution and group normalization (GN) to each of the P3, P4, and P5 feature maps to reduce the number of channels and align the feature maps for subsequent operations. The feature maps after dimensionality reduction are merged and further fused to extract higher-order features using 3 × 3 detail-enhanced convolution (DEConv) with shared parameters. The output after convolution is divided into two paths: one for bounding box regression (Conv_Reg), which outputs the predicted coordinates and is followed by a Scale module that compensates for the different bounding box scales handled at each detection level; the other for classification (Conv_Cls), which outputs the probability of each target category.

While batch normalization (BN) is more commonly used, group normalization (GN) is better suited to this head: BN statistics become unstable when the batch size is small, which can lead to unstable training or degraded performance, whereas GN does not depend on the batch size. The FCOS [17] paper shows that GN improves the localization and classification performance of the detection head, which is consistent with the results of our experiments.
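The batch-size independence of GN follows directly from where its statistics are computed. The minimal sketch below normalizes one sample at a time; the learnable scale and shift are omitted, the data layout (a list of channels, each a flat list of values) is an assumption made for brevity, and the function name is hypothetical.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample given as a list of channels (each a flat list).

    Statistics come only from this sample's own channels, so the result
    cannot depend on the batch size. Learnable scale/shift are omitted.
    """
    channels = len(sample)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)
        out.extend([[(v - mean) / std for v in ch] for ch in group])
    return out


if __name__ == "__main__":
    sample = [[1.0, 3.0], [5.0, 7.0]]  # two channels, two values each
    print(group_norm(sample, num_groups=2))
```

Since no other sample enters the computation, the same input yields the same normalized output whether the batch contains one image or thirty-two.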

The two 3 × 3 detail-enhanced convolutions used in the LSDECD detection head, referred to as DEConv [18], are structured as follows. There are two branches in DEConv: one passes through the standard convolution, referred to as Vanilla Convolution (VC), and the other passes through a differential convolution layer consisting of Center Difference Convolution (CDC), Angle Difference Convolution (ADC), Vertical Difference Convolution (VDC), and Horizontal Difference Convolution (HDC). Standard convolution is used to obtain intensity-level information, while differential convolution is used to enhance gradient-level information. Differential convolution learns and enhances gradient-based features to improve representation and generalization. The reparameterization technique is also applied in the DEConv module, and the structure is shown in Figure 9.

When several 2D convolution kernels of the same size operate on the same input with the same stride and padding and their outputs are summed, the result equals that of a single convolution: summing the kernel weights at corresponding positions yields an equivalent kernel, and one convolution of the input with this kernel produces the same final output. DEConv can therefore be converted into a single standard convolution, reducing computational and storage overheads. Moreover, the shared-parameter design and reparameterization strategy enable LSDECD to reduce redundant computations in the detection stage, thereby improving computational efficiency despite the increase in representational capacity.
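The linearity property behind this reparameterization can be checked numerically. The sketch below is a generic illustration with arbitrary example kernels (a gradient-like kernel and a Laplacian-like kernel), not the actual CDC, ADC, VDC, or HDC kernels of DEConv; stride 1 and no padding are assumed, and the helper names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation, stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def merge_kernels(kernels):
    """Sum same-sized kernels position by position into one kernel."""
    merged = [[0] * len(kernels[0][0]) for _ in kernels[0]]
    for k in kernels:
        merged = add_maps(merged, k)
    return merged


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    k1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    k2 = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
    branch_sum = add_maps(conv2d_valid(image, k1), conv2d_valid(image, k2))
    single = conv2d_valid(image, merge_kernels([k1, k2]))
    print(branch_sum == single)
```

The multi-branch form needs one convolution per branch, whereas the merged form needs a single convolution, which is the source of the reduced inference cost.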

4. Experiments

4.1. Experimental Environment

The configuration of the experimental environment in this paper is shown in Table 1.

Training was conducted for 100 epochs with a batch size of 32 and an input resolution of 640 × 640. The SGD optimizer was used with an initial learning rate of 0.005, final learning rate ratio of 0.01, momentum of 0.937, and weight decay of 0.0005. Pretrained weights and AMP mixed-precision training were enabled. The random seed was fixed to 0, and deterministic mode was activated to ensure reproducibility. Mosaic augmentation was applied with a probability of 1.0 and disabled during the last 10 epochs. All experiments were implemented using the Ultralytics YOLOv11 framework based on PyTorch 2.1.0.
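For readers who wish to reproduce this setup, the reported hyperparameters map onto Ultralytics training arguments roughly as sketched below. The dictionary uses the argument names of the Ultralytics training configuration; the model and dataset file names in the commented usage are placeholders, not the authors' files.

```python
# Hyperparameters reported above, keyed by Ultralytics training argument names.
train_args = {
    "epochs": 100,
    "batch": 32,
    "imgsz": 640,
    "optimizer": "SGD",
    "lr0": 0.005,          # initial learning rate
    "lrf": 0.01,           # final learning rate ratio
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "pretrained": True,
    "amp": True,           # mixed-precision training
    "seed": 0,
    "deterministic": True,
    "mosaic": 1.0,         # Mosaic probability
    "close_mosaic": 10,    # disable Mosaic for the last 10 epochs
}

# Hypothetical usage (requires the ultralytics package and a dataset YAML;
# both file names below are placeholders):
#   from ultralytics import YOLO
#   YOLO("cel-yolo11m.yaml").train(data="ceramic_cup.yaml", **train_args)
```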

4.2. Dataset

The original dataset contains 362 defect images collected from a real ceramic production line, covering four common defect categories: speckles, mud slag, dissolved holes, and mounting cracks, which were manually annotated using the Labelme tool [19]. To avoid potential data leakage, the original images were first divided into training, validation, and test subsets with a ratio of 7:2:1 at the image level. Data augmentation was then applied only to the training subset, including horizontal flipping, vertical flipping, saturation adjustment, noise injection, and Mosaic augmentation. The Mosaic probability was set to 1.0 during training and was disabled in the last 10 epochs. Validation and test subsets consisted only of original non-augmented images. Examples of the augmented ceramic defect images are presented in Figure 10, illustrating the diversity introduced through the enhancement strategies. Although the dataset size is relatively limited, all samples were collected under varying illumination conditions, viewpoints, and surface textures, providing practical diversity. In addition, cross-dataset validation on VisDrone2019 was conducted to evaluate the transferability of the proposed method. Future work will expand the dataset with multi-factory and multi-device samples.
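The leakage-free ordering described above (split first, augment only the training subset afterwards) can be sketched as follows. The seed, the truncating rounding rule, and the file names are assumptions made for illustration; the exact per-subset image counts used by the authors are not stated.

```python
import random


def split_before_augmentation(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle once, then cut into train/val/test at the image level."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test


if __name__ == "__main__":
    images = [f"img_{i:03d}.png" for i in range(362)]  # placeholder names
    train, val, test = split_before_augmentation(images)
    print(len(train), len(val), len(test))
    # Augmentation would be applied to `train` only, after this point.
```

Because every original image belongs to exactly one subset before any augmented copy is generated, no augmented variant of a validation or test image can appear in the training data.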

4.3. Evaluation Metrics

The performance metrics used in this paper are Precision, Recall, mAP@0.5(%), mAP@0.5:0.95(%), Parameters, Computation (GFLOPs), and Speed (frames per second, FPS).

Precision = \frac{TP}{TP + FP}    (4)

Recall = \frac{TP}{TP + FN}    (5)

mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P(R_{i}) \, dR_{i}    (6)

where TP is the number of true positive samples correctly detected, FP is the number of negative samples incorrectly detected as positive, and FN is the number of positive samples incorrectly detected as negative. mAP@0.5(%) represents the mean Average Precision (AP) value at a 50% IoU threshold. mAP@0.5:0.95(%) represents the mean AP across IoU thresholds ranging from 50% to 95%. Inference speed (FPS) was measured on a single NVIDIA RTX 3090 GPU using batch size = 1 after model warm-up.
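Equations (4)–(6) can be written out as a short sketch. The trapezoidal integration used here is the plain area under the precision-recall curve; detection toolkits commonly use an interpolated envelope instead, so values may differ slightly from those reported by a given framework. The function names and the toy numbers in the demo are hypothetical.

```python
def precision(tp, fp):
    """Equation (4): fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (5): fraction of ground-truth objects that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under a precision-recall curve by trapezoidal integration.

    Expects recalls sorted in ascending order.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(per_class_ap):
    """Equation (6): mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


if __name__ == "__main__":
    print(precision(tp=8, fp=2), recall(tp=8, fn=8))
    ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
    print(mean_average_precision([ap, 1.0]))
```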

All GFLOPs values were computed under a unified input resolution of 640 × 640 using the built-in profiling tools of the Ultralytics YOLOv11 framework based on PyTorch 2.1.0. The measurements were conducted using batch size = 1 after model initialization under identical inference settings. GFLOPs represent theoretical floating-point operations during forward inference and therefore may not exhibit strict linear consistency with practical runtime speed due to implementation efficiency, memory access patterns, and hardware-level parallelism.

4.4. Backbone C3k2 Improvement Analysis

To verify the effectiveness of the EMSC-enhanced C3k2 structure in the Backbone for detecting small defects on the surface of porcelain cups, YOLOv11m is used as the benchmark model. For comparison, existing modules from the literature, including PKI [20], PPA [21], Star [22], and RVB [23], are also used to improve C3k2. Comparative experiments are conducted on the self-constructed dataset.

The experimental results in Table 2 show that C3K2-EMSC exhibits better detection performance than the other improved variants, confirming the effectiveness of the design. C3K2-EMSC improves detection accuracy while maintaining a lower computational overhead, and both its mAP@0.5(%) and mAP@0.5:0.95(%) are higher than those of the other variants. Its mAP@0.5(%) reaches 53.8%, a relative improvement of 9.8% (4.8 percentage points) over the 49% of C3K2-PKI, and its mAP@0.5:0.95(%) also improves by 4.4%.

Although the number of parameters (19.8 M) is slightly higher than C3K2-Star’s (18.6 M), its overall computational overhead of 68.1 G and inference speed of 103.5 FPS remain competitive.

4.5. Ablation Study

To verify the effectiveness of the proposed modules in CEL-YOLOv11m, YOLOv11m was used as the baseline model, and ablation experiments were conducted on the self-built ceramic defect dataset. The results are presented in Table 3.

The experimental results show that all proposed modules contribute positively to detection performance. After introducing the EMSC module, mAP@0.5(%) improves from 51.3% to 52.0%, while mAP@0.5:0.95(%) increases from 33.4% to 33.9%. Meanwhile, the computational cost remains nearly unchanged (68.2 GFLOPs to 68.1 GFLOPs), indicating that EMSC enhances feature extraction efficiently.

When only CGRFPN is introduced, mAP@0.5(%) reaches 52.8% and mAP@0.5:0.95(%) reaches 34.4%. Although CGRFPN introduces additional computational overhead, its primary contribution lies in strengthening contextual interaction and cross-scale feature aggregation. The effectiveness of this module becomes more evident when combined with EMSC and LSDECD, indicating that CGRFPN mainly functions as a complementary feature-fusion component within the overall framework. When only LSDECD is adopted, mAP@0.5(%) improves to 53.6% and mAP@0.5:0.95(%) improves to 35.2%, while GFLOPs decrease to 63.5. This demonstrates that the shared-parameter detection head can improve accuracy while maintaining computational efficiency.

Combining EMSC and CGRFPN further improves mAP@0.5(%) to 54.1% and mAP@0.5:0.95(%) to 36.0%, indicating a clear synergistic effect between backbone enhancement and neck optimization.

Finally, the complete CEL-YOLOv11m model achieves the best overall performance, reaching 54.8% mAP@0.5(%) and 37.1% mAP@0.5:0.95(%), with 72.1 GFLOPs and 25.56 M parameters. These results demonstrate that the proposed method achieves a favorable tradeoff between detection accuracy and computational cost. Among the individual modules, LSDECD provides the largest standalone improvement, while the combination of all modules yields the highest overall performance.

4.6. Repeated Comparison Under Different Random Seeds

Before comparing with multiple baseline models, repeated experiments were first conducted to evaluate the statistical reliability of the proposed method under different random seeds. The results are summarized in Table 4.

CEL-YOLOv11m consistently outperformed YOLOv11m in all three runs. The average mAP@0.5(%) increased from 50.7% to 54.2%, while the average mAP@0.5:0.95(%) improved from 33.2% to 36.5%. These results demonstrate that the performance gains are stable rather than caused by random variation.

4.7. Comparative Experiment

To further evaluate the effectiveness of the proposed framework, comparative experiments were conducted with representative detectors, including YOLOv8, YOLOv9 [24], YOLOv10 [25], Faster R-CNN, Mask R-CNN, and RT-DETR [26], as shown in Table 5. All baseline models were retrained under identical settings, including input resolution, optimizer configuration, training epochs, and evaluation protocol, to ensure fair comparison.

It should be noted that all models were trained and evaluated on the same self-built ceramic defect dataset rather than directly adopting COCO benchmark results. Therefore, the absolute mAP values reported in this study should not be directly compared with those commonly reported on large-scale benchmarks such as COCO, where objects are generally larger and training samples are substantially more abundant.

From the results in Table 5, CEL-YOLOv11m achieves competitive overall performance among the compared models. In terms of mAP@0.5(%), the proposed method reaches 54.8%, which is higher than several representative baselines, including YOLOv11m (51.3%), YOLOv10m (50.3%), YOLOv9m (49.2%), and YOLOv8m (50.7%). In addition, CEL-YOLOv11m attains 37.1% mAP@0.5:0.95(%), indicating improved localization accuracy for small defects.

Although CEL-YOLOv11m contains slightly more parameters than YOLOv11m, it achieves higher detection accuracy with moderate computational overhead. Compared with YOLOv11l, which has a similar parameter scale, the proposed framework reduces computational cost while maintaining favorable detection performance.

Overall, the proposed CEL-YOLOv11m framework provides a promising balance between detection accuracy and efficiency, making it suitable for tiny ceramic defect detection tasks.

4.8. Heatmap Analysis

To further verify the effectiveness of the CEL-YOLOv11m model, four types of defect images from the custom dataset are selected for detection and visualized as heatmaps. The results, shown in Figure 11, intuitively demonstrate the improvement in the model’s detection performance.

The analysis of the heatmap detection results shows that the YOLOv11m benchmark model is affected by complex environmental factors, such as changes in lighting conditions and light reflection on the surface of the cup. As a result, the detection of small target defects is ineffective, leading to false negatives and false positives. In this paper, the CEL-YOLOv11m model achieves efficient integration of multi-scale features, improving the model’s object recognition capability and feature representation in complex backgrounds. The algorithm captures global context in both horizontal and vertical directions, enabling the model to focus on key areas and improving small target detection.

In summary, the algorithm demonstrates improved performance in detecting small defects on the surface of porcelain cups.

4.9. Cross-Domain Validation on the VisDrone2019 Dataset

To further evaluate the cross-domain transferability of the proposed framework, additional experiments were conducted on the public VisDrone2019-DET dataset [27], which contains 10 object categories captured in complex urban aerial scenes. The comparative results are summarized in Table 6.

As shown in Table 6, CEL-YOLOv11m consistently outperforms the baseline YOLOv11m on both datasets. On the self-built ceramic defect dataset, mAP@0.5(%) improves from 51.3% to 54.8%, mAP@0.5:0.95(%) increases from 33.4% to 37.1%, and the F1-score rises from 55.6% to 58.5%. On the VisDrone2019 benchmark, mAP@0.5(%) improves from 41.2% to 42.5%, mAP@0.5:0.95(%) increases from 24.8% to 25.5%, and the F1-score rises from 45.0% to 46.0%.

These results indicate that the proposed EMSC, CGRFPN, and LSDECD modules not only enhance tiny-defect detection performance on the industrial dataset, but also provide stable gains on another small-object detection benchmark, demonstrating certain cross-domain transferability.

Although some specialized methods report higher absolute accuracy on VisDrone2019, these approaches were specifically optimized for UAV imagery or adopt ensemble inference strategies. In contrast, the proposed CEL-YOLOv11m is primarily designed for industrial ceramic surface defect detection while maintaining favorable robustness and computational efficiency.

5. Conclusions

In this study, an efficient CEL-YOLOv11m framework was proposed for tiny defect detection on ceramic cup surfaces. By integrating the EMSC module, CGRFPN structure, and LSDECD detection head into the YOLOv11m architecture, the proposed method improves multi-scale feature extraction, contextual information fusion, and fine-grained localization capability.

Experimental results on the self-built ceramic defect dataset show that CEL-YOLOv11m achieves 54.8% mAP@0.5 and 37.1% mAP@0.5:0.95, outperforming the baseline YOLOv11m under the same training settings. Additional validation on the VisDrone2019 benchmark also shows consistent improvements, indicating that the proposed architecture has a degree of cross-domain transferability for small-object detection tasks.

Compared with the original YOLOv11m model, the proposed framework provides a favorable balance between detection accuracy and computational efficiency, which is promising for practical ceramic defect inspection applications.
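To make the accuracy and cost trade-off concrete, the overhead relative to the baseline can be computed from the first and last rows of Table 3. This is a small sketch; the variable names and the helper `relative_change` are illustrative, and the values are copied from the table.

```python
# Values copied from Table 3 (baseline row and full-model row).
BASELINE = {"gflops": 68.2, "params_m": 20.0, "map50": 51.3}
PROPOSED = {"gflops": 72.1, "params_m": 25.56, "map50": 54.8}

def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old

gflops_overhead = relative_change(PROPOSED["gflops"], BASELINE["gflops"])
params_overhead = relative_change(PROPOSED["params_m"], BASELINE["params_m"])
map50_gain = PROPOSED["map50"] - BASELINE["map50"]

print(f"GFLOPs:  {gflops_overhead:+.1f}%")
print(f"Params:  {params_overhead:+.1f}%")
print(f"mAP@0.5: {map50_gain:+.1f} points")
```

With these numbers, the 3.5-point mAP@0.5 gain comes with roughly 5.7% more GFLOPs and 27.8% more parameters than the baseline.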

Nevertheless, the current self-built dataset is relatively limited in scale, and more comprehensive validation on larger industrial datasets is still needed. Future work will focus on collecting multi-source defect datasets, improving robustness under varying production conditions, and further optimizing deployment efficiency for practical industrial applications. The proposed framework emphasizes a practical balance between detection accuracy and computational efficiency rather than pursuing minimum model complexity alone.

Author Contributions

Conceptualization, S.X.; Methodology, S.X. and Y.S.; Software, S.X.; Validation, S.X. and Y.S.; Formal analysis, S.X. and Y.S.; Investigation, S.X.; Resources, S.X. and Y.S.; Data curation, S.X. and Y.S.; Writing—original draft, S.X. and Y.S.; Visualization, Y.S.; Supervision, X.D.; Project administration, X.D.; Funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Hunan Provincial Natural Science Foundation of China (Grant No. 2024JJ7148) and the Hunan Provincial Natural Science Foundation Youth Student Basic Research Project (Grant No. 2025JJ60931).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank the reviewers for their constructive feedback.

Conflicts of Interest

The authors declare no competing interests.

References

1. Su, Y.M.; He, R.J.; Liu, Y.J.; Tian, J. Surface defect detection method based on small sample learning. Inf. Control 2025, 54, 502–512.
2. Wen, W. The road to industrial design of contemporary ceramics. Ceram. Sci. Art. 2023, 57, 28–29.
3. Li, J.H.; Quan, X.X.; Wang, Y.L. Research on Defect Detection Algorithm of Ceramic Tile Surface with Multi-feature Fusion. Comput. Eng. Appl. 2020, 56, 191–198.
4. Yin, M.S.; Zeng, X.; Huang, J.W. Surface Defect Detection of Bolt Thread of Ceramic Matrix Composite Material Based on Machine Vision. China Ceram. Ind. 2021, 28, 19–22.
5. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2015; pp. 1440–1448.
6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 2961–2969.
8. Pan, J.J.; Zeng, C.; Zhang, J. Ceramic Surface Defect Detection Algorithm Based on Improved YOLOv5. Mod. Inf. Technol. 2024, 8, 70–75.
9. Wu, H.X.; Zhang, H.Y.; Tan, X.Q.; Lin, H.F. An Algorithm for Enhanced Tile Surface Defect Detection Based on YOLOv5s. J. Lujiang Univ. 2023, 31, 66–74.
10. Yu, S.S.; Lin, Z.F.; Xu, G.P.; Xu, J.Y. Lightweight large-format tile defect detection algorithm based on improved YOLOv8. J. Comput. Appl. 2025, 45, 647–654.
11. Bai, C.; Zhang, K.; Jin, H.; Qian, P.; Zhai, R.; Lu, K. SFFEF-YOLO: Small object detection network based on fine-grained feature extraction and fusion for unmanned aerial images. Image Vis. Comput. 2025, 156, 105469.
12. Gu, Y.; Chen, W.; Peng, D. UAV-based multimodal object detection via feature enhancement and dynamic gated fusion. Pattern Recognit. 2025, 172, 112722.
13. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 1580–1589.
14. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
15. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 8759–8768.
16. Ni, Z.; Chen, X.; Zhai, Y.; Tang, Y.; Wang, Y. Context-guided spatial feature reconstruction for efficient semantic segmentation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 239–255.
17. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 9627–9636.
18. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans. Image Process. 2024, 33, 1002–1015.
19. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60.
20. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 27706–27716.
21. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. Hcf-net: Hierarchical context fusion network for infrared small object detection. In 2024 IEEE International Conference on Multimedia and Expo (ICME); IEEE: New York, NY, USA, 2024; pp. 1–6.
22. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 5694–5703.
23. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 15909–15920.
24. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21.
25. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
26. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 16965–16974.
27. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399.

Figure 1. Structural framework of YOLOv11m.

Figure 2. Structural framework of CEL-YOLOv11m.

Figure 3. Structure of EMSC.

Figure 4. Structure of RCM.

Figure 5. Structure of PCE.

Figure 6. Structure of FBM.

Figure 7. Structure of DIF.

Figure 8. Structure of the LSDECD detection head.

Figure 9. Structure of DEConv.

Figure 10. Enhanced results of the Defect Dataset on Ceramic Cup Surfaces.

Figure 11. Heatmap results.

Table 1. Experimental environment setting.

Name	Version
CPU	Intel Xeon Platinum 8362 2.80 GHz
GPU	NVIDIA GeForce RTX 3090, 24 GB
Programming language	Python 3.10
Operating System	Ubuntu 22.04
Deep learning framework	PyTorch 2.1.0, CUDA 12.1

Table 2. Comparative experimental results of different improvements.

Models	C3k2	GFLOPS (G)	Params (M)	mAP@0.5(%)	mAP@0.5:0.95(%)	FPS (frames/s)
YOLOv11m	PKI	79.4	20	49.0	30.9	101.1
	PPA	72.2	21.2	50.7	31.0	44.4
	Star	62.8	18.6	49.3	32.7	56
	RVB	58.7	17.5	51.8	33.3	72.9
	EMSC	68.1	19.8	53.8	35.3	103.5

Table 3. Ablation study results on the self-built ceramic defect dataset.

EMSC	CGRFPN	LSDECD	GFLOPS (G)	Params (M)	mAP@0.5(%)	mAP@0.5:0.95(%)
□	□	□	68.2	20	51.3	33.4
√	□	□	68.1	19.8	52	33.9
□	√	□	91.1	24.8	52.8	34.4
□	□	√	63.5	23.3	53.6	35.2
√	√	□	91.6	24.8	54.1	36
√	√	√	72.1	25.56	54.8	37.1

Table 4. Repeated comparison results under different random seeds.

Run	Model	mAP@0.5(%)	mAP@0.5:0.95(%)
1	YOLOv11m	50.2	32.2
1	CEL-YOLOv11m	54.3	34.8
2	YOLOv11m	51.1	34.1
2	CEL-YOLOv11m	54.0	38.1
3	YOLOv11m	50.7	33.2
3	CEL-YOLOv11m	54.2	36.5
Avg	YOLOv11m	50.7	33.2
Avg	CEL-YOLOv11m	54.2	36.5
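The averages in Table 4 can be checked with a few lines of code. The sketch below recomputes the per-model means from the three runs and also reports the sample standard deviation, which is an extra summary not given in the table. The helper name `summarize` and the data layout are illustrative.

```python
from statistics import mean, stdev

# Per-run results copied from Table 4 (runs 1 to 3), all in %.
RUNS = {
    "YOLOv11m": {
        "mAP@0.5": [50.2, 51.1, 50.7],
        "mAP@0.5:0.95": [32.2, 34.1, 33.2],
    },
    "CEL-YOLOv11m": {
        "mAP@0.5": [54.3, 54.0, 54.2],
        "mAP@0.5:0.95": [34.8, 38.1, 36.5],
    },
}

def summarize(values):
    """Return (mean rounded to 1 decimal, sample std rounded to 2 decimals)."""
    return round(mean(values), 1), round(stdev(values), 2)

summary = {
    model: {metric: summarize(vals) for metric, vals in metrics.items()}
    for model, metrics in RUNS.items()
}

for model, metrics in summary.items():
    for metric, (avg, sd) in metrics.items():
        print(f"{model:13s} {metric:13s} mean={avg} std={sd}")
```

The recomputed means match the Avg rows of the table (50.7 and 33.2 for YOLOv11m, 54.2 and 36.5 for CEL-YOLOv11m). With only three runs, the standard deviation is a rough indication of seed-to-seed variation rather than a precise estimate.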

Table 5. Comparative experimental results with representative object detectors.

Model	Type	GFLOPS (G)	Params (M)	F1-Score (%)	mAP@0.5(%)	mAP@0.5:0.95(%)
Faster_RCNN	Two-stage	284	60.6	47.3	45.1	23.7
Mask_RCNN	Two-stage	67.1	19.7	52.2	49.3	32.7
RTDETR-L	DETR	66.0	19.6	49.5	48.0	32.7
RTDETR-R18	DETR	58.3	17.2	51.5	49.0	31.7
YOLOv8l	one-stage	145.5	39.4	53.6	50.8	31.5
YOLOv8m	one-stage	68.1	19.8	52.4	50.7	31.0
YOLOv9e	one-stage	169.8	46.8	50.4	49.5	30.6
YOLOv9m	one-stage	60.8	15.6	49.9	49.2	31.5
YOLOv10l	one-stage	120.3	24.3	51.9	50.6	34.1
YOLOv10m	one-stage	59.1	15.3	52.6	50.3	33.6
YOLOv11l	one-stage	87.3	25.3	54.7	51.2	34.1
YOLOv11m	one-stage	68.2	20.0	55.6	51.3	33.4
CEL-YOLOv11m	one-stage	72.1	25.56	58.5	54.8	37.1

Table 6. Cross-domain comparison on the VisDrone2019 benchmark dataset.

Dataset	Model	Category	mAP@0.5(%)	mAP@0.5:0.95(%)	F1-Score (%)
Self-built	YOLOv11m	Baseline	51.3	33.4	55.6
Self-built	CEL-YOLOv11m	Proposed	54.8	37.1	58.5
VisDrone2019	YOLOv11m	Baseline	41.2	24.8	45.0
VisDrone2019	CEL-YOLOv11m	Proposed	42.5	25.5	46.0
VisDrone2019	SFFEF-YOLO	Literature	50.1 *	31.0 *	—
VisDrone2019	DPNet-ensemble	Literature	54.0 *	—	—

Note: * Results are directly cited from the corresponding original publications and were not reproduced in this study. Due to differences in training settings and implementation details, these values are provided for reference only.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
