Article

RDT-YOLO: An Improved Lightweight Model for Fish Maw Authenticity Detection

1 College of Artificial Intelligence, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
2 School of Optoelectronic Engineering, Guilin University of Electronic Technology, Guilin 541004, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(23), 4588; https://doi.org/10.3390/electronics14234588
Submission received: 21 October 2025 / Revised: 17 November 2025 / Accepted: 22 November 2025 / Published: 23 November 2025

Abstract

With the rapid expansion of the global fish maw industry, the increasing prevalence of counterfeit products has made authenticity detection a critical challenge. Traditional detection methods rely on organoleptic assessment, chemical analysis, or molecular techniques, which limits their practical application. This paper treats fish maw authenticity detection as an object detection problem and proposes RDT-YOLO, a lightweight detection algorithm based on YOLO11n. Specifically, to address the challenges of insufficient fine texture feature extraction and computational redundancy in fish maw detection, we design hierarchical reparameterized feature extraction modules that utilize reparameterization technology to enhance texture feature extraction capability at different scales. To mitigate information loss during multi-scale feature fusion, we develop a Dynamic Adaptive Multi-Scale Pyramid Processing (DAMSPP) module that incorporates dynamic convolution mechanisms for adaptive feature aggregation. Additionally, we propose an Adaptive Task-Aligned Detection Head (ATADH) that combines task interaction and shared convolution to reduce model parameters while improving detection accuracy. Furthermore, a Wise-ShapeIoU loss function is introduced by incorporating a focusing coefficient into Shape-IoU, enhancing model detection performance through improved bounding box shape optimization. Experimental validation demonstrates that RDT-YOLO achieves 91.9% precision, 89.6% recall, and 94% mAP@0.5 while reducing parameters, model size, and computational complexity by 75.6%, 73.8%, and 63.8%, respectively, compared to YOLO11s. When evaluated against YOLOv10s and YOLOv12s, RDT-YOLO shows mAP@0.5 improvements of 0.8% and 0.5%, respectively. This work provides an automated solution for fish maw authenticity detection with potential for broader food safety applications.

1. Introduction

According to FAO reports, global aquaculture has developed rapidly [1], with China serving as a major market for premium dried marine products such as shark fin, abalone, and fish maw [2]. Fish maw is a dried marine product made from specific fish swim bladders through traditional processing techniques. It is rich in proteins, vitamins, and trace elements, with medicinal properties such as hemostatic and anti-inflammatory effects [3,4]. However, as consumer demand for fish maw continues to grow, an increasing number of fish species are being utilized in its production [5]. The market value of fish maw is primarily determined by its species and quality, and the resulting price disparities among different species and grades have led to issues such as counterfeiting and mislabeling [6,7,8]. Food fraud has become an increasingly serious global problem [9]. Therefore, research on fish maw authenticity detection is highly necessary.
Traditional fish maw authenticity detection methods primarily depend on organoleptic assessment, chemical analysis, and molecular biology techniques. Organoleptic assessment requires extensive expertise, but remains highly subjective with considerable error rates [10]. Chemical analysis encompasses various spectroscopic [11], chromatographic, and mass spectrometry techniques [12], while molecular biology techniques involve DNA-based analyses such as PCR [13], DNA barcoding, and sequencing [14]. Although both chemical and molecular methods offer high accuracy, they are time-consuming, costly, and require professional instrumentation and skilled operators, which constrains their practical application [15,16]. Therefore, it becomes critical to establish non-destructive and portable image detection technology for fish maw authenticity detection.
Recent advances in AI have achieved breakthroughs in the application of deep learning technologies in food science [17]. Particularly in computer vision, deep learning technology has transformed object detection methodologies, establishing neural network architectures as the dominant techniques for object detection tasks. Contemporary object detection methods are primarily categorized into two-stage detectors such as R-CNN [18], Fast R-CNN [19], and Faster R-CNN [20], and one-stage detectors including SSD [21], RetinaNet [22], and the You Only Look Once (YOLO) [23,24,25,26,27] series. While two-stage methods achieve higher accuracy through sequential region proposal and classification, one-stage detectors excel in computational efficiency and real-time processing.
YOLO is extensively used in food detection scenarios due to its fast inference speed, fewer parameters, smaller model size, and high accuracy. The versatility of YOLO models is evidenced by their successful application in numerous food authentication tasks. For instance, Geng et al. [28] proposed a fishmeal adulteration identification method based on microscopic images and deep learning, using Mobilenetv2 as the qualitative identification model and combining YOLOv3-Mobilenetv2 for component identification, achieving accurate adulteration detection. Kong et al. [29] proposed an authentication method for Astragalus membranaceus based on AM-YOLO, which improved the accuracy and inference speed of genuine and counterfeit herbal medicine recognition. Jubayer et al. [30] applied YOLOv7 for honey pollen detection and classification, enhancing honey authenticity identification. Zhang et al. [31] designed a lightweight Faster-YOLO algorithm based on YOLOv8n, incorporating group convolution hybrid attention mechanisms to solve rice adulteration classification problems. Jose et al. [32] proposed a spice classification method based on YOLOv8 with variable correlation kernel convolutional neural network, which improved the accuracy and precision of visually similar spice recognition. These studies demonstrate the effectiveness and versatility of YOLO-based approaches in food authentication applications.
However, fish maw species identification presents unique challenges that remain inadequately addressed, despite the successful application of YOLO models in various food detection tasks. Unlike other food items with distinct visual features, fish maw samples from different species exhibit highly similar appearances with only subtle variations in surface textures and patterns, demanding more sophisticated feature extraction capabilities. Moreover, the significant variations in size and shape across different fish maw processing methods necessitate robust multi-scale detection mechanisms. Additionally, practical deployment scenarios such as on-site inspection at wholesale markets or retail stores require lightweight models that are suitable for portable devices while maintaining high detection accuracy. These specific requirements highlight the urgent need for developing specialized detection algorithms tailored to fish maw authenticity verification.
To address these challenges, this study proposes RDT-YOLO, an enhanced YOLO11-based model specifically designed for fish maw authenticity detection. Unlike existing generic object detection models, RDT-YOLO introduces a systematic framework tailored to the unique characteristics of fish maw authentication, including subtle texture differences, significant size variations, and the need for lightweight deployment. The overall framework of this study is illustrated in Figure 1.
The main contributions and innovations of this work are as follows:
  • This is the first work to apply deep learning-based object detection technology to fish maw authenticity detection, transforming a traditional analytical chemistry problem into a computer vision task and establishing a comprehensive dataset focusing on easily confused species.
  • We design a hierarchical dual-stage feature extraction architecture with C3k2-RVB for shallow layers and C3k2-RVB-iEMA for deep layers. By replacing SE attention with our designed iEMA module in a RepViT block, we enhance spatial feature perception through cross-spatial learning specifically optimized for fine-grained texture recognition in fish maw detection.
  • We design the DAMSPP module with novel Adaptive Inception Convolution (AIConv) and Dynamic Kernel Weighting (DKW) mechanism. This adaptively captures fish maw texture information at different scales through dynamic feature enhancement, overcoming the limitations of traditional SPPF fixed pooling operations.
  • We develop an Adaptive Task-Aligned Detection Head (ATADH), which synergistically combines task decomposition, shared convolution, and DyDCNv2 mechanisms to achieve effective collaboration between classification and localization tasks while maintaining a lightweight architecture.
  • We integrate the Wise-IoU V3 focusing coefficient into Shape-IoU to form a Wise-ShapeIoU loss function, addressing the challenge of varying quality samples in real-world fish maw detection scenarios and achieving improved bounding box optimization.

2. Materials

2.1. Data Acquisition

The dataset used in this study was primarily sourced from Shanghao Jiao retail store in Shantou City, Guangdong Province, China and from Jiexun Aquaculture Co., Ltd. in Raoping County, China, employing multi-channel collection methods to enhance representativeness and diversity. The final fish maw dataset encompasses five major categories: Red-mouth fish maw, White croaker fish maw, Annan fish maw, Douhu fish maw, and Duck-tongue shaped maw.
The biological sources of various fish maws exhibit significant differences. The sources of Red-mouth fish maw are quite complex, primarily derived from fish such as Protonibea diacanthus and Megalonibea fusca. Due to the scarcity and extreme rarity of these fish species in nature, Red-mouth fish maw commands high prices in the market and is considered a precious variety among fish maws. In contrast, the source fish for several other types of fish maw are relatively more common: Annan fish maw is sourced from Otolithoides biauritus of the Otolithoides genus; both White croaker fish maw and Duck-tongue shaped maw are saltwater fish maws derived from fish of the Pseudotolithus genus; while Douhu fish maw is sourced from Galeoides. These types of fish maw, due to their source fish being relatively common, are priced much lower in the market than Red-mouth fish maw. However, their physical characteristics are extremely similar; therefore, they are often used to counterfeit Red-mouth fish maw.
Due to the striking physical similarities across fish maw varieties and the intricate nature of market deception, constructing a dataset capable of accurately distinguishing these subtle differences is of paramount importance. To this end, this study employed a Canon EOS 6D camera (Canon Inc., Tokyo, Japan) with 4032 × 3024 pixel resolution for high-quality image acquisition. To ensure data diversity and representativeness, image capture encompassed various lighting conditions, placement positions, quality grades, and shooting angles, including single-target and multi-target scenarios as well as complex background environments. Figure 2 presents representative sample images, clearly demonstrating the typical characteristics and subtle differences among the five fish maw types.

2.2. Data Preprocessing

A data preprocessing workflow was implemented to ensure dataset quality. First, quality assessment was conducted on the collected fish maw images, removing samples that were blurred, improperly exposed, or of substandard resolution. Second, all selected images were uniformly rescaled to 640 × 640 pixels. During the annotation phase, precise bounding box annotations were created on the Roboflow platform for all five fish maw categories. Finally, the labeled dataset was randomly split into training, validation, and test subsets in a 7:2:1 ratio.

2.3. Data Augmentation

The YOLO11n model was initially trained for 200 epochs to establish baseline performance. Results showed that the ZJCZ class achieved the highest recognition accuracy, while the BM and DH categories showed relatively lower performance. Data augmentation techniques were employed to address this performance imbalance [33].
Augmentation methods included ±15° random rotation, brightness adjustment within ±25%, and Gaussian blur with up to 1.5-pixel radius. These techniques simulate various viewing angles, lighting conditions, and imaging quality variations. Each original training image generated one augmented version, expanding the training samples from 1749 to 3498, while validation and test sets remained unchanged. Figure 3 illustrates typical augmentation effects. Table 1 presents the sample distribution after augmentation.
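The paper does not state which library implemented these transforms; the sketch below is an illustrative Albumentations pipeline (an assumption) that approximates the stated settings, using a small odd kernel as a stand-in for the 1.5-pixel blur radius and keeping the YOLO-format bounding boxes aligned with each augmented image.

```python
# Illustrative augmentation pipeline; Albumentations is assumed, not stated in the paper.
import albumentations as A
import cv2

augment = A.Compose(
    [
        A.Rotate(limit=15, border_mode=cv2.BORDER_CONSTANT, p=1.0),   # ±15° random rotation
        A.RandomBrightnessContrast(brightness_limit=0.25,
                                   contrast_limit=0.0, p=1.0),        # ±25% brightness adjustment
        A.GaussianBlur(blur_limit=(3, 3), p=0.5),                     # mild blur (proxy for ≤1.5 px radius)
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_ids"]),
)

def augment_once(image, bboxes, class_ids):
    """Generate one augmented copy of an image and its YOLO-format boxes."""
    out = augment(image=image, bboxes=bboxes, class_ids=class_ids)
    return out["image"], out["bboxes"], out["class_ids"]
```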

3. Methods

3.1. RDT-YOLO Network Structure

RDT-YOLO is developed on the YOLO11 framework and specifically engineered to tackle challenges related to inadequate fine-grained textural feature mining, constrained cross-scale feature integration performance, and computational parameter redundancy in fish maw authenticity detection. C3k2-RVB and C3k2-RVB-iEMA modules have been developed as substitutes for the backbone C3k2 component in YOLO11, effectively reducing model parameters while significantly enhancing recognition capability for fine texture features of fish maw, thereby solving recognition accuracy problems caused by similar textures and complex backgrounds. To address the limitations of traditional SPPF fixed pooling, the DAMSPP module is developed as well. This module adaptively captures fish maw texture information at different scales through dynamic feature enhancement and multi-scale feature extraction, significantly improving feature representation capability.
Hierarchical feature representations extracted from the backbone architecture are subsequently processed through the neck module to achieve multi-scale feature integration. The resulting three-layer feature maps are then transmitted to the detection head. The ATADH is developed to substitute for the original detection head, significantly reducing parameters through task interaction mechanisms while efficiently decoding feature information to determine target position, category, and confidence. Additionally, the Wise-ShapeIoU loss function is formulated to enhance detection accuracy. Figure 4 illustrates the architectural framework of the optimized RDT-YOLO, with the core components detailed below.

3.2. Reparameterized Feature Extraction Module

Fish maw authenticity detection fundamentally requires distinguishing species with highly similar appearances through subtle surface texture variations. As shown in Figure 5a, traditional C3k2 modules face two critical limitations: first, the bottleneck structure loses fine-grained information through dimensionality reduction, causing feature confusion among similar textures; second, the absence of spatial attention mechanisms prevents adaptive focus on discriminative texture patterns. These deficiencies directly lead to misclassifications when confronting visually similar fish maw species.
To address these challenges, we propose a hierarchical dual-stage architecture implementing progressive texture learning. In shallow layers, the RepViT structure [34] forms the C3k2-RVB module, shown in Figure 5b, which employs reparameterization to preserve fine-grained details while maintaining efficiency. In the deep layers we develop the C3k2-RVB-iEMA module, shown in Figure 5c, by integrating our designed iEMA module, which captures directional texture patterns through cross-spatial learning mechanisms via bidirectional 1D pooling.

3.2.1. iEMA Feature Enhancement Module

In fish maw authenticity detection, traditional attention mechanisms face significant challenges when processing complex surface texture features. To address this limitation, we propose the integrated Efficient Multi-scale Attention (iEMA) feature enhancement module, illustrated in Figure 6, which combines EMA attention with an inverted residual structure.
The EMA [35] mechanism captures spatial correlations through bidirectional 1D global average pooling along the width and height dimensions:
z_c^H = \frac{1}{H}\sum_{h=0}^{H-1} x_c(h, w)
z_c^W = \frac{1}{W}\sum_{w=0}^{W-1} x_c(h, w)
where z_c^H and z_c^W respectively encode spatial information along the height and width axes for the c-th channel.
The complete iEMA module follows an inverted residual architecture with integrated attention. The forward computation can be formulated as follows:
X_{out} = X_{in} + \mathrm{Conv}_{1\times 1}\big(\mathrm{DWConv}_{3\times 3}\big(\mathrm{EMA}(\mathrm{Conv}_{1\times 1}(X_{in}))\big)\big)
where Conv_{1×1}(·) performs pointwise convolution for channel modulation, EMA(·) applies efficient spatial attention through bidirectional 1D pooling, and DWConv_{3×3}(·) extracts local texture patterns via depthwise convolution. The residual connection ensures information preservation throughout the transformation.
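As a rough illustration, the PyTorch sketch below wraps a simplified EMA-style attention (bidirectional 1D pooling with per-direction gating, not the full grouped EMA of [35]) in the inverted residual form of the equation above; the expansion ratio and the gating layout are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleEMA(nn.Module):
    """Simplified EMA-style attention: bidirectional 1D pooling followed by
    per-direction gating (a reduced stand-in for the grouped EMA of [35])."""
    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> (B, C, 1, W)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        a_h = torch.sigmoid(self.conv_h(self.pool_h(x)))  # height-wise attention map
        a_w = torch.sigmoid(self.conv_w(self.pool_w(x)))  # width-wise attention map
        return x * a_h * a_w                              # cross-spatial reweighting


class iEMA(nn.Module):
    """Inverted-residual wrapper: X_out = X_in + Conv1x1(DWConv3x3(EMA(Conv1x1(X_in)))).
    The expansion factor of 2 is an assumption."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pw1 = nn.Conv2d(channels, hidden, 1)                          # pointwise expansion
        self.att = SimpleEMA(hidden)                                       # EMA-style attention
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)   # depthwise 3x3
        self.pw2 = nn.Conv2d(hidden, channels, 1)                          # pointwise projection

    def forward(self, x):
        return x + self.pw2(self.dw(self.att(self.pw1(x))))               # residual connection
```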

3.2.2. RepViT-iEMA Block

The RepViT block achieves computational efficiency through structural reparameterization techniques; during training, multi-branch architectures like RepVGGDW enable rich feature learning, while computational efficiency is enhanced at inference by reparameterizing multiple branches to a unified convolution operation. As shown in Figure 7a, the original RepViT block follows the token mixer and channel mixer paradigm.
X_{token} = \mathrm{SE}(\mathrm{RepVGGDW}(X_{in}))
X_{channel} = \mathrm{Conv}_{1\times 1}\big(\mathrm{GELU}(\mathrm{Conv}_{1\times 1}(X_{token}, 2C))\big)
X_{out} = X_{in} + X_{channel}
However, the SE attention mechanism in the RepViT block ignores spatial dimension information. This makes it insufficient for capturing fine texture variations, which is an essential element of fish maw authenticity assessment. To address this limitation, we replace the SE attention with our proposed iEMA module, forming the RepViT-iEMA block, as illustrated in Figure 7b:
X_{token} = \mathrm{iEMA}(\mathrm{RepVGGDW}(X_{in}))
X_{channel} = \mathrm{Conv}_{1\times 1}\big(\mathrm{GELU}(\mathrm{Conv}_{1\times 1}(X_{token}, 2C))\big)
X_{out} = X_{in} + X_{channel}
where RepVGGDW(·) performs reparameterizable depthwise convolution, GELU(·) denotes the Gaussian Error Linear Unit activation function that provides smooth, non-monotonic activation characteristics, C represents the input channel dimension, and 2C indicates channel expansion in the inverted residual structure. This modification retains the advantages of reparameterization while enhancing spatial feature perception capabilities through iEMA’s cross-spatial learning mechanism.
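A minimal sketch of the block, following the equations above, is given below; the RepVGGDW branch is written in its training-time multi-branch form (3×3 depthwise, 1×1 depthwise, and identity) that can be folded into a single depthwise convolution at inference, and the iEMA class from the previous sketch is reused.

```python
import torch.nn as nn

class RepVGGDW(nn.Module):
    """Training-time multi-branch depthwise block (3x3 DW + 1x1 DW + identity);
    the branches can be reparameterized into one 3x3 depthwise conv at inference."""
    def __init__(self, channels):
        super().__init__()
        self.dw3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels))
        self.dw1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, groups=channels, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return self.dw3(x) + self.dw1(x) + x            # identity branch


class RepViTiEMABlock(nn.Module):
    """RepViT-iEMA block: token mixer (RepVGGDW + iEMA) followed by a 2C channel
    mixer with GELU, with an outer residual connection, as in the equations."""
    def __init__(self, channels):
        super().__init__()
        self.token_mixer = nn.Sequential(RepVGGDW(channels), iEMA(channels))  # iEMA from the sketch above
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(channels, 2 * channels, 1), nn.GELU(),
            nn.Conv2d(2 * channels, channels, 1))

    def forward(self, x):
        return x + self.channel_mixer(self.token_mixer(x))
```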

3.3. Dynamic Adaptive Multi-Scale Pyramid Processing

Fish maw detection requires adaptive multi-scale processing for varying sizes and shapes. Traditional SPPF modules suffer from fixed pooling that loses fine-grained details and lacks the adaptivity to adjust based on input characteristics, resulting in suboptimal representations.
Here, we design Dynamic Adaptive Multi-Scale Pyramid Processing (DAMSPP), as shown in Figure 8, which dynamically adjusts feature extraction through an Adaptive Multi-Scale Inception Block (AMSIB), shown in Figure 9. The AMSIB incorporates Adaptive Inception Convolution with Dynamic Kernel Weighting to learn scale-specific importance, which is combined with shared convolution for efficiency. This adaptively captures texture at multiple scales while preserving fine-grained details.

3.3.1. Adaptive Inception Convolution

This paper first designs Adaptive Inception Convolution (AIConv), the core of which lies in the use of a Dynamic Kernel Weighting (DKW) mechanism to achieve global adaptive processing of input feature maps.
Three parallel depthwise separable convolution branches are designed in the multi-scale feature extraction stage, respectively adopting square, horizontal, and vertical convolution kernel configurations to capture information in different spatial directions. Multi-scale feature extraction can be represented as follows:
X_i = \mathrm{DWConv}_i(X_{in}), \quad i \in \{s, h, v\}
where X_{in} represents the input tensor and DWConv_s, DWConv_h, and DWConv_v denote the square, horizontal, and vertical depthwise convolutions, respectively.
Meanwhile, the convolution kernel sizes satisfy the following constraint:
k_h = k_v = 3 \times k_s + 2
In the weight generation stage, the DKW mechanism analyzes global information from input features and dynamically generates adaptive weights for each branch. The calculation formula is as follows:
W = \phi\big(\mathrm{Conv}_{1\times 1}(\mathrm{GAP}(X_{in}))\big)
where \phi(·) is the softmax function and GAP is Global Average Pooling. This mechanism first performs global statistical aggregation on the input features, then expands the channel dimension from C to 3C through pointwise convolution to generate the corresponding attention weights for the three processing branches.
Each sub-tensor W_i corresponds to the weight coefficient of the respective branch. The output of AIConv is obtained through weighted fusion:
\tilde{X} = \sum_{i \in \{s, h, v\}} W_i \odot X_i
X_{out} = \mathrm{SiLU}(\mathrm{BN}(\tilde{X}))
where ⊙ represents element-wise multiplication and \tilde{X} represents the dynamically weighted fusion result; the fused features are processed through batch normalization and a SiLU activation function to achieve stable training and accelerated convergence.
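The sketch below shows one way to realize AIConv and DKW in PyTorch under the constraint k_h = k_v = 3k_s + 2; taking the softmax over the three branch weights of each channel is our reading of the weight-generation equation above, and the branch naming is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIConv(nn.Module):
    """Adaptive Inception Convolution sketch: square, horizontal, and vertical
    depthwise branches fused by Dynamic Kernel Weighting (GAP -> 1x1 conv -> softmax)."""
    def __init__(self, channels, k_s=3):
        super().__init__()
        k_l = 3 * k_s + 2                                # k_h = k_v = 3*k_s + 2
        self.dw_s = nn.Conv2d(channels, channels, k_s, padding=k_s // 2, groups=channels)
        self.dw_h = nn.Conv2d(channels, channels, (1, k_l), padding=(0, k_l // 2), groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (k_l, 1), padding=(k_l // 2, 0), groups=channels)
        self.dkw = nn.Conv2d(channels, 3 * channels, 1)  # expands C to 3C branch weights
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        branches = torch.stack([self.dw_s(x), self.dw_h(x), self.dw_v(x)], dim=1)   # (B, 3, C, H, W)
        w = self.dkw(F.adaptive_avg_pool2d(x, 1)).view(b, 3, c, 1, 1)               # (B, 3, C, 1, 1)
        w = torch.softmax(w, dim=1)                      # normalize over the three branches
        fused = (w * branches).sum(dim=1)                # dynamic weighted fusion
        return F.silu(self.bn(fused))                    # BN + SiLU
```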

3.3.2. Multi-Scale Feature Mixer

To improve cross-scale feature fusion and information aggregation, we design and implement a Multi-Scale Feature Mixer (MSFM) module based on AIConv. The mathematical definition of this mixer is presented below.
Y_i = \mathrm{Split}(X_{in}), \quad i = 1, 2
\hat{Y}_1 = \mathrm{AIConv}_{3\times 3}(Y_1)
\hat{Y}_2 = \mathrm{AIConv}_{5\times 5}(Y_2)
\tilde{Y} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(\hat{Y}_1, \hat{Y}_2)\big)
Specifically, the input feature X_{in} is first split evenly along the channel dimension into two sub-features Y_1 and Y_2. The two sub-features are then fed into AIConv units configured with different receptive field sizes, achieving differentiated feature extraction. Finally, feature concatenation and channel modulation generate the fused output \tilde{Y}.
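Building on the AIConv sketch above, the mixer can be written compactly; the even channel split and the 3/5 square-kernel sizes follow the equations, while the rest is illustrative.

```python
import torch
import torch.nn as nn

class MSFM(nn.Module):
    """Multi-Scale Feature Mixer sketch: channel split, two AIConv units with
    different receptive fields, concatenation, and 1x1 channel modulation.
    Reuses the AIConv sketch above; assumes an even channel count."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch1 = AIConv(half, k_s=3)
        self.branch2 = AIConv(half, k_s=5)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        y1, y2 = torch.chunk(x, 2, dim=1)                # split along the channel dimension
        return self.fuse(torch.cat([self.branch1(y1), self.branch2(y2)], dim=1))
```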

3.3.3. Adaptive Multi-Scale Inception Block

Finally, the MetaFormer design paradigm [36] is adopted to construct AMSIB, with MSFM as the core mixing operator combined with the Convolutional Gated Linear Unit [37] from TransNeXt as the feedforward network. To achieve fine-grained feature fusion control, learnable scale parameters are introduced. The detailed computational process is as follows:
Z_1 = X_{in} + \gamma_1 \cdot \mathrm{MSFM}(\mathrm{BN}(X_{in}))
Z_2 = Z_1 + \gamma_2 \cdot \mathrm{CGLU}(\mathrm{BN}(Z_1))
where \gamma_1 and \gamma_2 are trainable scale modulation parameters initialized to 0.01 to ensure training stability. A residual connection strategy is adopted to ensure effective information transfer and stable gradient backpropagation.
In summary, AMSIB adaptively weights multi-scale features based on input content, enabling effective capture of fine texture information on fish maw surfaces while preserving detail and enhancing overall feature representation.
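A condensed sketch of the block is given below; for brevity the CGLU feedforward is replaced by a plain two-layer MLP stand-in, so only the MetaFormer layout and the 0.01-initialized residual scales follow the equations above.

```python
import torch
import torch.nn as nn

class AMSIB(nn.Module):
    """MetaFormer-style block: MSFM token mixer plus a feedforward branch, each
    gated by a learnable residual scale initialized to 0.01.
    The MSFM class from the previous sketch is reused; the MLP stands in for CGLU."""
    def __init__(self, channels, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(channels)
        self.mixer = MSFM(channels)
        self.norm2 = nn.BatchNorm2d(channels)
        self.ffn = nn.Sequential(
            nn.Conv2d(channels, mlp_ratio * channels, 1), nn.GELU(),
            nn.Conv2d(mlp_ratio * channels, channels, 1))
        self.gamma1 = nn.Parameter(0.01 * torch.ones(1, channels, 1, 1))
        self.gamma2 = nn.Parameter(0.01 * torch.ones(1, channels, 1, 1))

    def forward(self, x):
        x = x + self.gamma1 * self.mixer(self.norm1(x))   # Z1
        x = x + self.gamma2 * self.ffn(self.norm2(x))     # Z2
        return x
```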

3.4. Adaptive Task-Aligned Detection Head

Fish maw detection requires synergistic collaboration between classification and localization tasks, as accurate species identification depends on precise positional context. Traditional YOLO11 detection heads employ a decoupled dual-branch design with two critical limitations: first, independent processing prevents effective information exchange between tasks, resulting in misalignment where high classification confidence may correspond to poor localization accuracy; second, separate branches create redundant feature extraction and excessive computational overhead. These deficiencies degrade detection accuracy and efficiency, particularly for fish maw samples with irregular shapes and complex backgrounds.
To address these challenges, inspired by TOOD [38] and as illustrated in Figure 10, we propose an Adaptive Task-Aligned Detection Head (ATADH). First, task-aligned sample assignment ensures consistency between classification and localization objectives. Second, multi-layer shared convolution with hierarchical feature fusion reduces parameter redundancy while enabling effective task information sharing. Third, integration of a Task Decomposition module with dynamic kernel modulation and a DyDCNv2 mechanism achieves effective task separation through hierarchical attention while adaptively modeling irregular fish maw geometries. This synergistic design significantly reduces parameters through task interaction while efficiently determining target position, category, and confidence, effectively addressing fish maw detection challenges in complex environments.
ATADH first employs a multi-layer shared convolution structure for initial feature extraction, resulting in enhanced feature representation capability through hierarchical feature fusion design. Specifically, input features are sequentially processed through two shared convolutional layers, with outputs from each layer concatenated and fused to form rich multi-level feature representations. This approach not only captures semantic information at different levels but also promotes gradient flow, improving training effectiveness.
The Task Decomposition module serves as the core component of ATADH, achieving effective separation and collaboration between classification and regression tasks. This module employs hierarchical attention mechanisms to identify and strengthen features that are more critical to the current task, avoiding feature conflicts between tasks. The computational formulas for the Task Decomposition module are as follows:
f' = \mathrm{GAP}(f)
W = \sigma\big(\mathrm{Conv}_{1\times 1}(\mathrm{ReLU}(\mathrm{Conv}_{1\times 1}(f')))\big)
K_{dynamic} = W \odot K_{base}
f_{task} = \mathrm{ReLU}\big(\mathrm{GN}(\mathrm{Conv}_{dynamic}(f, K_{dynamic}))\big)
where f is the input feature map, GAP is Global Average Pooling, f' is the globally averaged pooled feature, \sigma(·) is the sigmoid function, W is the dynamically generated hierarchical attention weight, K_{base} denotes the learnable base convolution kernel parameters, K_{dynamic} is the dynamically modulated convolution kernel, GN(·) is Group Normalization, and ⊙ denotes tensor broadcasting multiplication.
The input feature f undergoes global average pooling to obtain f', which then generates the attention weights W through two convolutional layers; W dynamically modulates the convolution kernel that transforms the original feature f, ultimately yielding task-specific features. In the task decomposition architecture, the classification module and regression module operate independently to achieve effective task separation, with the former specifically extracting semantic features and the latter focusing on geometric features.
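The sketch below illustrates one task branch of this step. Because W is spatially constant and acts per channel, multiplying the kernel by W is equivalent to multiplying the input feature by W before a fixed convolution, so the modulation is applied to the features here for readability; the reduction ratio, the 1×1 base kernel, and the group count are assumptions rather than the authors' settings.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskDecomposition(nn.Module):
    """One task-decomposition branch: GAP -> two 1x1 convs -> sigmoid weights W,
    then conv(W * f, K_base) + GroupNorm + ReLU, equivalent to modulating K_base by W."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.fc1 = nn.Conv2d(channels, hidden, 1)
        self.fc2 = nn.Conv2d(hidden, channels, 1)
        self.base = nn.Conv2d(channels, channels, 1, bias=False)                 # K_base
        self.gn = nn.GroupNorm(math.gcd(16, channels), channels)

    def forward(self, f):
        # hierarchical attention weights W from the globally pooled feature f'
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(F.adaptive_avg_pool2d(f, 1)))))
        return F.relu(self.gn(self.base(w * f)))                                 # task-specific features
```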
The DyDCNv2 module utilizes intermediate layer features to dynamically compute offsets and masks, modulating convolution operations to adapt to irregular fish maw shapes and texture variations. Its computational formulas are as follows:
\{\Delta x, \Delta y, M\} = \mathrm{Conv}_{spatial}(f)
\hat{M} = \sigma(M)
y_{i,j}^{\mathrm{DyDCNv2}} = \sum_{m,n} x_{i+m+\Delta x_{i,j,m,n},\; j+n+\Delta y_{i,j,m,n}} \cdot W_{m,n} \cdot \hat{M}_{i,j,m,n}
where \hat{M} represents the modulation mask obtained after sigmoid activation, while \Delta x and \Delta y denote the learned horizontal and vertical offsets, respectively; in addition, Conv_{spatial} serves as a specialized convolutional layer for offset and mask generation. In the deformable convolution operation, W denotes the convolutional kernel weight parameter, with i, j indicating spatial coordinates in the output feature map and m, n representing relative coordinates within the sampling grid. The sigmoid function is denoted by \sigma(·).
The input feature f simultaneously generates offsets and masks through spatial convolution, with the masks activated by Sigmoid. The dynamic deformable convolution adjusts sampling positions based on the learned offsets and modulates response intensity according to mask weights, achieving adaptive modeling of irregular geometric shapes.
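A compact PyTorch sketch of this operation using torchvision's modulated deformable convolution is shown below; the 3×3 kernel, single offset group, and zero-initialized offset branch are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DyDCNv2(nn.Module):
    """Sketch of modulated deformable convolution: a plain conv predicts
    per-position offsets and a mask, and deform_conv2d applies the sampling."""
    def __init__(self, in_channels, out_channels, k=3):
        super().__init__()
        self.k = k
        self.offset_mask = nn.Conv2d(in_channels, 3 * k * k, k, padding=k // 2)   # Conv_spatial
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels, k, k) * 0.01)
        nn.init.zeros_(self.offset_mask.weight)          # start from regular sampling
        nn.init.zeros_(self.offset_mask.bias)

    def forward(self, f):
        om = self.offset_mask(f)
        offset = om[:, : 2 * self.k * self.k]                 # Δx, Δy for every sampling point
        mask = torch.sigmoid(om[:, 2 * self.k * self.k :])    # modulation mask
        return deform_conv2d(f, offset, self.weight, padding=self.k // 2, mask=mask)
```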
In conclusion, ATADH effectively addresses core challenges of fish maw authenticity detection such as complex background interference, different target scales, and irregular texture shapes by integrating dynamic feature adaptation, lightweight design, and adaptive decomposition. This technical integration approach substantially improves the detection accuracy and system stability of fish maw authenticity identification in complex environments.

3.5. Optimized Loss Function

The YOLO11 architecture implements CIoU as the regression optimization metric for fish maw detection. This loss function integrates overlap ratios, centroid displacement, and dimensional consistency for predicted versus actual annotation boundaries, as formulated in Equations (28)–(30):
L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2
\alpha = \frac{v}{(1 - IoU) + v}
where IoU denotes the spatial overlap coefficient between the predicted and reference detection regions, \rho^2(b, b^{gt}) represents the squared Euclidean distance between centroid positions, c signifies the diagonal span of the minimal enclosing rectangular region, \alpha serves as the adaptive weight coefficient, v quantifies aspect-ratio consistency, w and h represent the predicted bounding box dimensions, and w^{gt} and h^{gt} denote the ground-truth box dimensions.
Analysis of Equations (29) and (30) reveals a fundamental limitation: when aspect ratios achieve consistency, the penalty components α and v become ineffective. Contemporary regression optimization research emphasizes geometric relationships, while underestimating the morphological diversity and dimensional variations that are critical for fish maw detection targets.
For specimens that exhibit substantial morphological diversity, including fish maw, Shape-IoU [39] incorporates both target morphological characteristics and dimensional factors into regression loss computation, as detailed in Equations (31)–(36):
L_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}
ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}
hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}
distance^{shape} = \frac{hh \times (x - x^{gt})^2 + ww \times (y - y^{gt})^2}{c^2}
\Omega^{shape} = \sum_{t = w, h} (1 - e^{-\omega_t})^{\theta}, \quad \theta = 4
\omega_w = hh \times \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = ww \times \frac{|h - h^{gt}|}{\max(h, h^{gt})}
where distance^{shape} characterizes the morphological distance loss, scale indicates the scaling parameter, \Omega^{shape} represents the morphological value loss, ww and hh constitute the weighting coefficients in the horizontal and vertical directions, and x, y, x^{gt}, and y^{gt} specify the centroid positions of the model predictions and reference annotations.
Despite improvements over CIoU, Shape-IoU remains insufficient for addressing low-quality data samples in practical fish maw detection applications. Real-world detection encounters quality degradation factors including capture angle variations, illumination inconsistencies, and specimen desiccation differences, creating substantial dataset quality disparities that generate non-uniform gradient contributions during training.
Drawing from Wise-IoU V3 frameworks [40], the proposed Wise-ShapeIoU methodology integrates the focusing coefficient r with Shape-IoU to implement dynamic gradient allocation based on anchor box quality assessment. This approach minimizes detrimental contributions from low-quality samples while maintaining optimization effectiveness, as formulated in Equations (37)–(39):
L_{Wise\text{-}ShapeIoU} = L_{WiseIoU} + distance^{shape} + 0.5 \times \Omega^{shape}
L_{WiseIoU} = L_{IoU} \cdot r, \quad r = \frac{\beta}{\delta \cdot \alpha^{\beta - \delta}}
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)
where r characterizes the non-monotonic focusing coefficient, \beta quantifies the outlier degree of an anchor box, \alpha and \delta constitute hyperparameters, L_{IoU}^{*} is the current IoU loss detached from gradient computation, and \overline{L_{IoU}} is its running mean, which normalizes the gradient gain.
The focusing coefficient mechanism enables prioritization of medium-quality anchor boxes while reducing emphasis on high-quality anchors and minimizing negative gradient interference generated by low-quality samples. This quality-aware optimization addresses environmental variability and imaging condition challenges inherent in fish maw detection, achieving improved detection performance through more accurate quality-aware anchor matching.
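To make the composition of the three terms concrete, the sketch below computes the loss for corner-format boxes; the α and δ values and the externally maintained running mean of the IoU loss (updated, for example, as an exponential moving average of the detached IoU loss across batches) are illustrative assumptions rather than the paper's settings.

```python
import torch

def wise_shape_iou(pred, target, iou_mean, scale=1.0, alpha=1.9, delta=3.0, eps=1e-7):
    """Per-box Wise-ShapeIoU loss for (x1, y1, x2, y2) boxes; `iou_mean` is a
    running mean of the IoU loss maintained outside this function (WIoU v3 style)."""
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # IoU and its loss
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    l_iou = 1.0 - inter / union

    # squared diagonal of the smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Shape-IoU distance and shape terms
    ww = 2 * w2 ** scale / (w2 ** scale + h2 ** scale)
    hh = 2 * h2 ** scale / (w2 ** scale + h2 ** scale)
    dist_shape = (hh * (cx1 - cx2) ** 2 + ww * (cy1 - cy2) ** 2) / c2
    omega_w = hh * (w1 - w2).abs() / torch.max(w1, w2)
    omega_h = ww * (h1 - h2).abs() / torch.max(h1, h2)
    omega_shape = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    # WIoU v3 non-monotonic focusing coefficient
    beta = l_iou.detach() / (iou_mean + eps)              # outlier degree
    r = beta / (delta * alpha ** (beta - delta))
    return r * l_iou + dist_shape + 0.5 * omega_shape
```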

4. Experiments and Results

4.1. Experimental Settings and Implementation Details

The experimental environment consisted of a Windows 11 operating system, a 12th-Gen Intel® Core™ i5-12400F 2.50 GHz CPU, an NVIDIA GeForce RTX 4060 GPU, and the PyTorch 1.12.1 deep learning framework with CUDA 11.3. The hyperparameters for model training were configured as follows: 200 training epochs, batch size of 16, initial learning rate of 0.01, no pretrained weights, SGD optimizer with momentum of 0.937 and weight decay of 0.0005, and input image resolution of 640 × 640 pixels.
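Assuming the standard Ultralytics training interface (the paper does not provide its training script), the configuration above corresponds roughly to the following call; the dataset YAML name is hypothetical.

```python
# Illustrative training call with the Ultralytics API; an assumption, not the authors' script.
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")          # build the baseline from scratch (no pretrained weights)
model.train(
    data="fish_maw.yaml",             # hypothetical dataset config for the five classes
    epochs=200,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    pretrained=False,
)
```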

4.2. Evaluation Metrics

For comprehensive model evaluation, this study employed multiple core evaluation metrics. Among these, the mean Average Precision (mAP) calculations incorporated two threshold configurations, mAP@0.5 and mAP@0.5:0.95. Precision (P) quantifies the proportion of correct positive predictions among all positive classifications made by the model, while Recall (R) measures the capability to identify positive samples across the complete ground truth dataset. Average Precision (AP) computes the area under the precision–recall curve. mAP@0.5 averages the AP values of all object categories at an IoU threshold of 0.5, while mAP@0.5:0.95 averages AP values across multiple IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. The mathematical definitions are presented below:
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP_i = \int_0^1 P_i(R_i)\, dR_i
mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i
where n represents the total class count, TP denotes correctly predicted positive instances, FP encompasses negative instances misclassified as positive, and FN quantifies positive instances undetected by the model.

4.3. Comparison Experiment

4.3.1. Benchmark Algorithm of Prior Experiments

To determine the most suitable YOLO series version for fish maw authenticity detection, a comprehensive comparison of various YOLO models was first conducted. The assessment covered mAP@0.5, mAP@0.5:0.95, parameters, computational overhead, and model size, with the results summarized in Table 2. The comparative analysis revealed that although YOLOv8n achieved the highest accuracy with a 90.6% mAP@0.5 score, its parameter count, computational overhead, and model size were relatively large. In contrast, YOLO11n demonstrated excellent balanced performance, achieving mAP@0.5 of 90.5% and mAP@0.5:0.95 of 84.9% with only a 5.2 MB model size and 2.58 M parameters. This precision–efficiency equilibrium renders YOLO11n exceptionally appropriate for fish maw authenticity detection applications in computationally constrained deployment scenarios.

4.3.2. Comparison of Different SPPF

To evaluate the performance of the DAMSPP component designed within this research framework, multiple controlled experiments were established. Three schemes were selected to compare with SPPF, namely, SPPELAN [41], AIFI [42], and DAMSPP, with comprehensive findings documented in Table 3.
Empirical results reveal that although processing overhead and model complexity increased slightly after implementing the DAMSPP module, system performance exhibited substantial enhancement. Across four critical performance indicators, DAMSPP achieved 88.8% precision, 85.4% recall, 92.2% mAP@0.5, and 86.7% mAP@0.5:0.95. Therefore, the aforementioned experimental data confirm the validity of the proposed DAMSPP architecture.

4.3.3. Comparison of Different Detection Heads

Comprehensive comparative evaluations were conducted to assess the efficacy of the ATADH detection architecture developed within this research framework. Four detection head schemes, namely, DyHead [43], Aux [44], MultiSEAMHead [45], and ATADH, were selected for comparison with the original YOLO11 detection head, as demonstrated in Table 4.
The experimental outcomes reveal that implementation of the ATADH detection architecture yielded optimal performance across all key indicators. Specifically, regarding detection precision metrics, ATADH attains 88.6% precision, 86.4% recall, 92.1% mAP@0.5, and 87% mAP@0.5:0.95. In terms of computational efficiency, ATADH has a parameter count of only 2.2 M with a 4.5 MB storage footprint, successfully realizing a compact model architecture. Therefore, the ATADH detection head architecture demonstrates superior effectiveness across comprehensive performance evaluation metrics.

4.3.4. Comparison of Different Convolution Modules

To validate the effectiveness of DyDCNv2 on ATADH, we compared four schemes under identical conditions: standard convolution, depthwise separable convolution, DCNv2, and DyDCNv2, with the evaluation results presented in Table 5. The experimental results demonstrate that DyDCNv2 achieves mAP@0.5 and mAP@0.5:0.95 of 92.1% and 87%, respectively, while maintaining a similar parameter count. Compared to standard convolution, these metrics show respective improvements of 1.5 and 1.9 percentage points, fully validating the effectiveness of the DyDCNv2 dynamic deformable convolution mechanism.

4.3.5. Comparison of Different Loss Functions

To validate the performance of the proposed loss function, comparative experiments were conducted between Wise-ShapeIoU and mainstream loss functions including CIoU, DIoU [46], GIoU [47], EIoU [48], and PIoU [49], with the results presented in Table 6 and the loss curves during training compared in Figure 11.
Comparative analysis indicates that WIoU V3 and ShapeIoU exhibit enhanced performance in both training convergence and final results. Wise-ShapeIoU integrates the advantages of both the WIoU V3 and ShapeIoU loss functions, showing improved convergence characteristics during training and achieving optimal performance across all key indicators. Specifically, Wise-ShapeIoU achieves 91.9% precision, 89.6% recall, 94% mAP@0.5, and 88.9% mAP@0.5:0.95, demonstrating optimized detection performance.

4.3.6. Comparison of RDT-YOLO with Mainstream Models

To systematically examine the capability of the proposed RDT-YOLO architecture for fish maw authenticity detection tasks, controlled comparative evaluations were established using several mainstream models with standardized experimental setups and datasets to enable objective assessment. Comprehensive findings are documented in Table 7. Although YOLOv12s achieved an mAP@0.5:0.95 of 89.3%, which is 0.4% higher than RDT-YOLO, its parameter count, computational overhead, and storage requirements substantially exceeded those associated with RDT-YOLO. Furthermore, RDT-YOLO outperformed competing mainstream architectures across four critical detection metrics, achieving 91.9% precision, 89.6% recall, 94% mAP@0.5, and 88.9% mAP@0.5:0.95, representing respective enhancements of 6.2%, 7.8%, 3.5%, and 4% compared to the reference architecture YOLO11n.
When evaluated against two-stage architectures, including Faster R-CNN and Cascade R-CNN, RDT-YOLO demonstrates significant enhancement in mAP@0.5 by 8.3% and 5.8%, respectively. When compared with one-stage models, including TOOD and SSD, RDT-YOLO realizes performance enhancements of 10.2% and 13.7% in mAP@0.5, respectively. In comparison with the transformer-based detection models DINO and RT-DETR, RDT-YOLO exhibits mAP@0.5 enhancements of 10% and 10.7%, respectively. Compared with other one-stage models, including YOLOv10s and YOLOv12s, RDT-YOLO achieves mAP@0.5 gains of 0.8% and 0.5%, respectively. Significantly, RDT-YOLO’s mAP@0.5 performance exceeds that of the YOLO11s architecture by 0.5%, despite the latter’s increased parameter complexity within the same model family.
Regarding computational overhead, although RDT-YOLO requires 1.4 G more FLOPs than the baseline YOLO11n model, it maintains the lowest FLOPs, parameter count, and model size among all the other compared models. Overall, RDT-YOLO achieves the best comprehensive performance among the compared algorithms while attaining superior detection precision; simultaneously, it demonstrates exceptional performance in terms of both model compactness and processing efficiency. This renders it highly viable for time-sensitive detection scenarios, which has specific relevance to practical fish maw authenticity detection applications.

4.4. Ablation Experiment

4.4.1. Positional Ablation Study of C3k2-RVB and C3k2-RVB-iEMA Modules

To assess the efficacy of the hierarchical placement approach, controlled experiments were conducted, as documented in Table 8. In the table, Base represents the original C3k2 module configuration as the control condition; Strategies A and B implement uniform substitution approaches, replacing all backbone C3k2 modules with C3k2-RVB and C3k2-RVB-iEMA modules, respectively; finally, Strategy C introduces the hierarchical hybrid methodology, in which the first two shallow C3k2 modules are replaced with C3k2-RVB modules for enhanced basic feature extraction while the last two deep C3k2 modules are substituted with C3k2-RVB-iEMA modules for improved semantic feature representation.
Strategy C demonstrates optimal results in both mAP@0.5 and mAP@0.5:0.95 evaluations, outperforming Base, Strategy A, and Strategy B while preserving competitive architectural complexity. These findings confirm that Strategy C is more suitable for practical applications in fine-grained texture detection of fish maw.

4.4.2. RDT-YOLO Model Overall Ablation Study

For a comprehensive assessment of the proposed enhanced components for the YOLO11n base model, controlled ablation studies were established and executed across the test set. Each group of experiments was performed under the same environmental configuration and parameter settings. Base, A, B, C, and D correspond to the reference architecture, C3k2-RVB and C3k2-RVB-iEMA components, DAMSPP module, ATADH module, and Wise-ShapeIoU loss function, respectively. Six performance metrics served as comparative benchmarks: P, R, mAP@0.5, mAP@0.5:0.95, parameters (Param), and model size (Size). The overall ablation findings are documented in Table 9.
As shown in Table 9, with YOLO11n serving as the reference framework, both detection accuracy metrics demonstrate measurable gains when each architectural enhancement operates independently. When all four components function synergistically, mAP@0.5 improves further compared with adding any single module.
Specifically, adding the improved C3k2-RVB and C3k2-RVB-iEMA modules to YOLO11n enables the network to extract more hierarchical information, improving mAP@0.5 to 91.8%, with concurrent reductions in parameter count and storage footprint. Further adding the DAMSPP module improves P to 91.1%, while both primary detection metrics achieve substantial enhancements to 92.4% and 86.8%, respectively, demonstrating the effectiveness of the DAMSPP module’s adaptive feature processing and hierarchical extraction capabilities. Further adding the designed ATADH detection head enhances inter-task interaction capability, with results showing comprehensive metric enhancement: detection accuracy advances to 93.4% and 88.5% for the respective measures, while parameters decrease to 2.3 M and model size reduces to 4.8 MB. Finally, introducing the Wise-ShapeIoU loss function further elevates detection precision to 94%, with P and R reaching 91.9% and 89.6%, respectively.
Relative to the reference architecture, the cumulative enhancements are as follows: P increases by 6.2%, R by 7.8%, mAP@0.5 by 3.5%, and mAP@0.5:0.95 by 4%, while parameters are reduced by 11% and storage footprint by 7.7%. These findings validate that the enhanced methodology significantly advances fish maw authenticity detection accuracy while achieving computational efficiency through reduced parameters and model size.

4.5. Visual Analytics

For a more intuitive demonstration of the improved RDT-YOLO model’s effectiveness, Figure 12 presents the visualization results of YOLOv10s, YOLO11n, YOLO11s, YOLOv12s, and RDT-YOLO across detection scenarios of varying complexity, including simple scenarios, complex background environments, occlusion scenarios, and dense small target scenarios. As can be observed from the figure, traditional YOLO models commonly suffer from detection errors, background false positives, sensitivity to occlusion interference, and difficulty in distinguishing dense targets, with YOLO11n performing particularly poorly in dense scenarios. In contrast, RDT-YOLO leverages its optimized multi-scale feature fusion mechanism and enhanced feature extraction capabilities to achieve accurate target localization across all test scenarios, effectively handling complex occlusion situations and accurately distinguishing individual targets in dense scenarios. The visualization results fully validate the effectiveness of RDT-YOLO, demonstrating that the proposed method exhibits stronger environmental adaptability and robustness, effectively lowering the likelihood of missed detections and false positives.
To further analyze the detection performance of the model, three types of typical scenario images were selected from the test set and KPCA-CAM was employed for heatmap visualization, with the results shown in Figure 13. Analysis of the heatmap results reveals that RDT-YOLO demonstrates more precise attention focusing capability across all scenarios compared to YOLO11n and YOLO11s. The improved model reduces the impact of background noise on target detection to varying degrees, and exhibits stronger feature extraction capability when handling occlusion scenarios. Compared to the baseline algorithm, RDT-YOLO can automatically learn and focus on discriminative texture features. This significantly reduces background interference and inter-target feature confusion, improving both the accuracy and robustness of fish maw authenticity detection.

4.6. Generalization Experiments

To validate the cross-domain generalization capability of RDT-YOLO, we selected the NEU-DET steel surface defect dataset [50] released by Northeastern University for testing. This dataset was chosen because steel defect detection shares similar technical characteristics with fish maw authenticity identification. Both require precise texture feature extraction and subtle difference discrimination capabilities, allowing for effective validation of the model’s generalization performance. Maintaining the same environmental configuration as the fish maw detection experiments, we conducted comparative validation of RDT-YOLO against mainstream object detection models, with the experimental results presented in Table 10.
As shown in Table 10, RDT-YOLO achieves 78.9% mAP@0.5 and 43% mAP@0.5:0.95 on the NEU-DET dataset, outperforming all comparison models. Compared to the baseline YOLO11n model, RDT-YOLO improves mAP@0.5 and mAP@0.5:0.95 by 3.5 and 2.1 percentage points, respectively, while reducing parameters and model size by 10.9% and 7.7%, respectively. Compared to YOLOv10s, YOLO11s, and YOLOv12s, RDT-YOLO achieves superior detection performance while maintaining significant advantages in parameters and computational efficiency.
Figure 14 presents a visualization comparing the results between YOLO11n and RDT-YOLO on representative samples from the NEU-DET dataset. The first row shows the original images, the second row displays the detection results for YOLO11n, and the third row presents the detection results for RDT-YOLO. It can be clearly observed that YOLO11n suffers from severe missed detections and misclassifications, while RDT-YOLO accurately localizes defect positions and correctly identifies defect types, effectively resolving these issues.
The generalization experimental results fully validate that RDT-YOLO possesses excellent cross-domain transfer capability, demonstrating applicability not only to fish maw authenticity detection but also to industrial defect detection and other scenarios requiring fine-grained visual feature recognition, thereby exhibiting broad application prospects.

5. Discussion

5.1. Technical Necessity and Method Advantages

Traditional fish maw authenticity identification primarily employs expert sensory evaluation and chemical composition analysis methods. While these approaches demonstrate high accuracy, they face practical challenges such as elevated detection costs, extended time cycles, and heavy dependence on specialized equipment and technical personnel. As a result, they fail to effectively meet the actual demands for highly efficient and large-scale detection in commercial environments. As the market value of fish maw continues to rise and counterfeiting techniques become more sophisticated, there is an increasing demand for automated detection solutions that are efficient, precise, and readily implementable.
Although deep learning has been successfully applied in other areas of food safety detection, including agricultural product quality grading, meat freshness assessment, and tea grade detection, research on fish maw authenticity detection remains limited to traditional analytical chemistry and spectroscopy methods. This study is the first to introduce deep learning object detection technology into the field of fish maw authenticity detection. The proposed RDT-YOLO model transforms fish maw authenticity detection into an object detection problem, achieving end-to-end automated detection.
Compared to traditional methods, this model has significant advantages: it avoids complex sample preprocessing and manual feature design and can automatically learn and extract discriminative visual features of fish maw; it has fast inference speed with real-time detection capability; it has good scalability based on deep learning frameworks, facilitating model optimization and functional expansion; and it has low detection costs, requiring only ordinary digital cameras or smartphones for image acquisition without expensive professional analytical equipment.

5.2. Limitations Analysis and Future Research Directions

Although RDT-YOLO performs excellently in fish maw authenticity detection tasks, it still has limitations. The current dataset is limited in scale and mainly sourced from specific suppliers and regions, which may introduce sample bias. Fish maw samples with different origins, processing techniques, and storage conditions exhibit subtle differences in appearance characteristics, which may affect model generalization ability. Additionally, although the proposed model demonstrates optimized computational efficiency, further lightweight model compression remains necessary for implementation on edge devices with limited computational resources. The current detection categories are relatively fixed, and the model requires continuous updates and expansion as new counterfeiting methods emerge.
Future research directions include constructing larger-scale and more diverse fish maw datasets covering samples from different origins, varieties, quality grades, and counterfeiting types; exploring data augmentation and generative adversarial networks to balance data distribution; researching multimodal fusion detection by combining visual features with physicochemical information including near-infrared spectral techniques and Raman spectral analysis to establish a complete and stable identification framework; and developing lightweight models through network pruning, knowledge distillation, quantization, and other techniques to improve inference efficiency.

6. Conclusions

Fish maw authenticity detection is crucial for market regulation and consumer protection in the rapidly growing global fish maw industry. This research presents the application implementation of RDT-YOLO in fish maw authenticity detection based on computer vision technology. Incorporating an innovative multimodular architecture, including C3k2-RVB, C3k2-RVB-iEMA, and DAMSPP modules along with an ATADH detection head, RDT-YOLO achieves an optimal balance between accuracy and computational efficiency. The proposed model attains exceptional performance of 94% mAP@0.5 while maintaining a lightweight architecture that requires only 4.8 MB of storage space. This superior performance over conventional detection methods in terms of both accuracy and parameter efficiency enables operational deployment for real-time commercial applications. While this research presents a scalable and efficient method for fish maw detection, subsequent investigations are still required in order to improve model generalization performance. Expanding the dataset, incorporating broader commercial contexts and more fish maw varieties, and extending the algorithm’s applicability to additional food safety detection domains could improve detection robustness.

Author Contributions

Conceptualization: C.X., M.L. and T.L.; methodology: C.X. and M.L.; software: M.L. and W.G.; validation: W.Z.; investigation: M.L. and Y.Z.; resources: T.L., S.L. and X.G.; data curation: S.G.H.; writing—original draft preparation: C.X. and M.L.; writing—review and editing: T.L., S.L. and X.G.; visualization: X.G.; supervision: T.L.; project administration: S.L.; funding acquisition: T.L. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62373390), the Guangdong Basic and Applied Basic Research Foundation (Grant Nos. 2023A1515011230 and 2022B1515120059), and the Science and Technology Planning Project of Guangzhou (Grant No. 2023E04J1238).

Data Availability Statement

The data are available at https://data.mendeley.com/datasets/hf54hhzz6t (accessed on 8 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. FAO. The State of World Fisheries and Aquaculture 2024—Blue Transformation in Action; Technical Report; Food and Agriculture Organization of the United Nations: Rome, Italy, 2024. [Google Scholar] [CrossRef]
  2. Ben-Hasan, A.; de Mitcheson, Y.S.; Cisneros-Mata, M.A.; Jimenez, E.A.; Daliri, M.; Cisneros-Montemayor, A.M.; Nair, R.J.; Thankappan, S.A.; Walters, C.J.; Christensen, V. China’s fish maw demand and its implications for fisheries in source countries. Mar. Policy 2021, 132, 104696. [Google Scholar] [CrossRef]
  3. Chen, Y.; Jin, H.; Yang, F.; Jin, S.; Liu, C.; Zhang, L.; Huang, J.; Wang, S.; Yan, Z.; Cai, X.; et al. Physicochemical, antioxidant properties of giant croaker (Nibea japonica) swim bladders collagen and wound healing evaluation. Int. J. Biol. Macromol. 2019, 138, 483–491. [Google Scholar] [CrossRef]
  4. Dai, C.; Dai, L.; Yu, F.J.; Li, X.N.; Wang, G.X.; Chen, J.; Wang, C.; Lu, Y.P. Chemical and biological characteristics of hydrolysate of crucian carp swim bladder: Focus on preventing ulcerative colitis. J. Funct. Foods 2020, 75, 104256. [Google Scholar] [CrossRef]
  5. Sadovy de Mitcheson, Y.; To, A.W.l.; Wong, N.W.; Kwan, H.Y.; Bud, W.S. Emerging from the murk: Threats, challenges and opportunities for the global swim bladder trade. Rev. Fish Biol. Fish. 2019, 29, 809–835. [Google Scholar] [CrossRef]
  6. Giusti, A.; Malloggi, C.; Tinacci, L.; Nucera, D.; Armani, A. Mislabeling in seafood products sold on the Italian market: A systematic review and meta-analysis. Food Control 2023, 145, 109395. [Google Scholar] [CrossRef]
  7. Sun, Y.; Pei, Y.; Li, Q.; Xia, F.; Xu, D.; Shen, G.; Feng, J. An expert system for species discrimination and grade identification of fish maw. Microchem. J. 2024, 207, 112141. [Google Scholar] [CrossRef]
  8. Kendall, H.; Naughton, P.; Kuznesof, S.; Raley, M.; Dean, M.; Clark, B.; Stolz, H.; Home, R.; Chan, M.; Zhong, Q.; et al. Food fraud and the perceived integrity of European food imports into China. PLoS ONE 2018, 13, e0195817. [Google Scholar] [CrossRef]
  9. Böhme, K.; Calo-Mata, P.; Barros-Velázquez, J.; Ortea, I. Recent applications of omics-based technologies to main topics in food authentication. TrAC Trends Anal. Chem. 2019, 110, 221–232. [Google Scholar] [CrossRef]
  10. Zhu, S.Q.; Shi, Y.Q.; Huang, C.H.; Cao, C.F.; Ji, F.; Luo, B.Z.; Zheng, X.C. Research progress of identification techniques for original species of isinglass. J. Food Saf. Qual. 2022, 13, 3593–3601. [Google Scholar] [CrossRef]
  11. Yin, H.; Yang, Q.; Huang, F.; Li, H.; Wang, H.; Zheng, H.; Huang, F. Multimodal fish maw type recognition based on Wasserstein generative adversarial network combined with gradient penalty and spectral fusion. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2025, 327, 125430. [Google Scholar] [CrossRef]
  12. Sha, X.M.; Jiang, W.L.; Hu, Z.Z.; Zhang, L.J.; Xie, Z.H.; Lu, L.; Yuan, T.; Tu, Z.C. Traceability and identification of fish gelatin from seven cyprinid fishes by high performance liquid chromatography and high-resolution mass spectrometry. Food Chem. 2023, 400, 133961. [Google Scholar] [CrossRef]
  13. Silva, A.J.; Kawalek, M.; Williams-Hill, D.M.; Hellberg, R.S. PCR cloning combined with DNA barcoding enables partial identification of fish species in a mixed-species product. Front. Ecol. Evol. 2020, 8, 28. [Google Scholar] [CrossRef]
  14. Xing, B.; Chen, X.; Wu, Q.; Wang, Y.; Wang, C.; Xiang, P.; Sun, R. Species authentication and conservation challenges in Chinese fish maw market using Mini-DNA barcoding. Food Control 2025, 167, 110779. [Google Scholar] [CrossRef]
  15. Zareef, M.; Arslan, M.; Hassan, M.M.; Ahmad, W.; Ali, S.; Li, H.; Ouyang, Q.; Wu, X.; Hashim, M.M.; Chen, Q. Recent advances in assessing qualitative and quantitative aspects of cereals using nondestructive techniques: A review. Trends Food Sci. Technol. 2021, 116, 815–828. [Google Scholar] [CrossRef]
  16. Nimbkar, S.; Auddy, M.; Manoj, I.; Shanmugasundaram, S. Novel techniques for quality evaluation of fish: A review. Food Rev. Int. 2023, 39, 639–662. [Google Scholar] [CrossRef]
  17. Wu, K.; Ji, Z.; Wang, H.; Shao, X.; Li, H.; Zhang, W.; Kong, W.; Xia, J.; Bao, X. A Comprehensive Review of AI Methods in Agri-Food Engineering: Applications, Challenges, and Future Directions. Electronics 2025, 14, 3994. [Google Scholar] [CrossRef]
  18. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  19. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  22. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  25. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  26. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  27. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
  28. Geng, J.; Liu, J.; Kong, X.; Shen, B.; Niu, Z. The fishmeal adulteration identification based on microscopic image and deep learning. Comput. Electron. Agric. 2022, 198, 106974. [Google Scholar] [CrossRef]
  29. Kong, X.; Yu, J.; Guo, J.; Tian, Q.; Li, F.; Liu, Y. AM-YOLO: An Efficient Object Detection Model for Astragalus Mongholicus Authentication with Dynamic Gated Feature Fusion. IEEE Access 2025, 13, 163731–163748. [Google Scholar] [CrossRef]
  30. Jubayer, M.F.; Ruhad, F.M.; Kayshar, M.S.; Rizve, Z.; Alam Soeb, M.J.; Izlal, S.; Md Meftaul, I. Detection and Identification of Honey Pollens by YOLOv7: A Novel Framework toward Honey Authenticity. ACS Agric. Sci. Technol. 2024, 4, 747–758. [Google Scholar] [CrossRef]
  31. Zhang, Y.; Xing, X.; Zhu, L.; Li, X.; Wang, J.; Du, Y.; Han, R. A rapid identification technique for rice adulteration based on improved YOLOV8 model. Meas. Sci. Technol. 2025, 36, 026207. [Google Scholar] [CrossRef]
  32. Jose, R.; Ponmozhi, K. Modified Residual Variable Correlation Kernel Convolutional Neural Network for Classifying Spices. Anal. Lett. 2025, 58, 3133–3154. [Google Scholar] [CrossRef]
  33. Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning data augmentation strategies for object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 566–583. [Google Scholar]
  34. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920. [Google Scholar]
  35. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  36. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–23 June 2022; pp. 10819–10829. [Google Scholar]
  37. Shi, D. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17773–17783. [Google Scholar]
  38. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  39. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale. arXiv 2024, arXiv:2312.17663. [Google Scholar] [CrossRef]
  40. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar] [CrossRef]
  41. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
  42. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  43. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7373–7382. [Google Scholar]
  44. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  45. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. Yolo-facev2: A scale and occlusion aware face detector. Pattern Recognit. 2024, 155, 110714. [Google Scholar] [CrossRef]
  46. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 9–11 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  47. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  48. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  49. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou loss: Towards accurate oriented object detection in complex environments. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings; Part V 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 195–211. [Google Scholar]
  50. He, Y.; Song, K.; Meng, Q.; Yan, Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504. [Google Scholar] [CrossRef]
Figure 1. Fish maw authenticity detection framework.
Figure 2. Samples from a portion of the fish maw dataset: (a) Red-mouth fish maw (ZJCZ); (b) Annan fish maw (AN); (c) White croaker fish maw (BM); (d) Douhu fish maw (DH); (e) Duck-tongue shaped maw (YL).
Figure 3. Illustration of image enhancement strategies: (a) original and blur up to 1.5px; (b) brightness adjustment between −25% and +25%; (c) rotation between −15° and +15°; (d) mixed augmentation.
Figure 4. Model structure diagram of RDT-YOLO.
Figure 5. Structure of the C3k2-RVB-iEMA module: (a) C3k2 module; (b) C3k2-RVB module; (c) C3k2-RVB-iEMA module.
Figure 6. Structure of the iEMA module.
Figure 7. Reparameterization process diagram: (a) RepViT block and (b) RepViT-iEMA block.
Figure 8. Structure of the DAMSPP module.
Figure 9. Structure of the AMSIB module.
Figure 10. Structure of the ATADH module.
Figure 11. Loss function variation curves during training.
Figure 12. Comparison of test results for different models: (a) simple single-target detection scenario; (b) complex single-target detection with background interference; (c) occluded target detection with mutual occlusion; (d) dense target detection with closely arranged targets.
Figure 13. Comparison of heat maps: (a) single-target detection; (b) occluded target detection; (c) dense target detection.
Figure 14. Visualization comparison experiments on the NEU-DET dataset: (a) crazing; (b) inclusion; (c) patches; (d) pitted surface; (e) rolled in scale; (f) scratches.
Table 1. Distribution of fish maw dataset before and after data augmentation.
Categories | Before Data Augmentation | Training Set (after) | Validation Set (after) | Testing Set (after) | Total (after)
AN | 382 | 544 | 74 | 36 | 654
BM | 346 | 490 | 71 | 30 | 591
DH | 363 | 506 | 71 | 39 | 616
YL | 453 | 618 | 96 | 48 | 762
ZJCZ | 950 | 1340 | 184 | 96 | 1620
Table 2. Benchmark algorithm of prior experiment.
Model | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | FLOPs/G | Size/MB
YOLOv8n | 90.6 | 84.4 | 3.01 | 8.1 | 6
YOLOv9t | 89.5 | 84.3 | 2.8 | 11.7 | 23.4
YOLOv10n | 90 | 84.7 | 2.7 | 8.2 | 5.5
YOLO11n | 90.5 | 84.9 | 2.58 | 6.3 | 5.2
YOLOv12n | 89.2 | 83.1 | 2.51 | 5.8 | 5.2
Table 3. Comparison of different SPPF module improvements.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | Size/MB
SPPF | 85.7 | 81.8 | 90.5 | 84.9 | 2.58 | 5.2
SPPELAN | 88.7 | 78.4 | 88.3 | 82.5 | 2.43 | 4.9
AIFI | 87.1 | 84.2 | 91.5 | 85.7 | 3.21 | 6.4
DAMSPP | 88.8 | 85.4 | 92.2 | 86.7 | 2.81 | 5.7
Table 4. Comparison of different detection head modules.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | Size/MB
11-Head | 85.7 | 81.8 | 90.5 | 84.9 | 2.58 | 5.2
DyHead | 88 | 81.2 | 89.7 | 85 | 3.1 | 6.3
Aux | 85.5 | 82.1 | 90.5 | 85.1 | 2.31 | 4.7
MultiSEAMHead | 87.4 | 80.6 | 91.8 | 85.7 | 4.6 | 9.2
ATADH | 88.6 | 86.4 | 92.1 | 87 | 2.2 | 4.5
Table 5. Comparison of different convolution schemes in ATADH.
Model | Convolution | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | FLOPs/G | Size/MB
ATADH | Conv | 88.5 | 82.1 | 90.6 | 85.1 | 2.17 | 8 | 4.4
ATADH | DWConv | 89.6 | 79.1 | 89.8 | 84.2 | 2.14 | 7.4 | 4.4
ATADH | DCNv2 | 81.9 | 82.5 | 90.4 | 83.8 | 2.18 | 7.6 | 4.4
ATADH | DyDCNv2 | 88.6 | 86.4 | 92.1 | 87 | 2.2 | 7.9 | 4.5
Table 6. Comparison of different loss functions.
Method | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/%
CIoU | 89.4 | 88.1 | 93.4 | 88.5
DIoU | 87.9 | 89.2 | 93.9 | 87.6
GIoU | 88.1 | 88.6 | 93.5 | 87.3
EIoU | 91.5 | 84.6 | 93.3 | 87.5
PIoU | 90.7 | 86.5 | 93.4 | 88.3
WIoU V3 | 91.6 | 86.9 | 93.7 | 88.7
ShapeIoU | 90.9 | 86.1 | 93.7 | 88.8
Wise-ShapeIoU | 91.9 | 89.6 | 94 | 88.9
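As a rough illustration of how the best-performing entry in Table 6 can be constructed, the sketch below scales a Shape-IoU regression loss [39] by the non-monotonic focusing coefficient of Wise-IoU v3 [40]. This is a simplified sketch under our own assumptions: the hyperparameters (scale, alpha, delta) and the handling of the running mean of the loss follow the cited papers' defaults rather than values reported here, and the exact combination used in RDT-YOLO may differ.

```python
import torch

def shape_iou_loss(pred, target, scale=0.0, eps=1e-7):
    """Shape-IoU loss [39] for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # Plain IoU
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared diagonal of the smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Ground-truth shape weights, shape-weighted centre distance, and shape cost
    ww = 2 * w2 ** scale / (w2 ** scale + h2 ** scale + eps)
    hh = 2 * h2 ** scale / (w2 ** scale + h2 ** scale + eps)
    dist = hh * (cx1 - cx2) ** 2 / c2 + ww * (cy1 - cy2) ** 2 / c2
    omega_w = hh * (w1 - w2).abs() / torch.max(w1, w2).clamp(min=eps)
    omega_h = ww * (h1 - h2).abs() / torch.max(h1, h2).clamp(min=eps)
    shape_cost = (1 - torch.exp(-omega_w)) ** 4 + (1 - torch.exp(-omega_h)) ** 4

    return 1 - iou + dist + 0.5 * shape_cost


def wise_shape_iou_loss(pred, target, loss_mean, alpha=1.9, delta=3.0):
    """Shape-IoU scaled by a WIoU-v3-style focusing coefficient [40].
    loss_mean is a detached running mean of the per-box loss (the outlier normaliser)."""
    base = shape_iou_loss(pred, target)
    beta = base.detach() / (loss_mean + 1e-7)      # outlier degree of each box
    r = beta / (delta * alpha ** (beta - delta))   # non-monotonic focusing coefficient
    return r * base
```

In training, loss_mean would typically be maintained as an exponential moving average of base.detach().mean(), updated at every step so that the focusing coefficient adapts to the current quality of the predictions.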
Table 7. Comparison of different mainstream models.
Model | Backbone | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | FLOPs/G | Size/MB
Faster-RCNN | ResNet50 | 77.2 | 75.8 | 85.7 | 71.5 | 41.37 | 134 | 377
Cascade-RCNN | ResNet50 | 88.9 | 78.3 | 88.2 | 72.6 | 69.21 | 162 | 484
SSD | VGG16 | 74.7 | 84.7 | 83.8 | 66.2 | 24.28 | 61.15 | 92.11
TOOD | ResNet50 | 83.9 | 63.3 | 80.3 | 73.7 | 32.03 | 123 | 294
DINO | ResNet50 | 81.4 | 69.6 | 84 | 79.2 | 47.55 | 179 | 374
RTDETR | ResNet18 | 89.8 | 80.1 | 83.3 | 77.5 | 19.88 | 57 | 38.6
YOLOv10s | - | 90.7 | 87.2 | 93.2 | 89.1 | 8.04 | 24.5 | 15.8
YOLO11n | - | 85.7 | 81.8 | 90.5 | 84.9 | 2.58 | 6.3 | 5.2
YOLO11s | - | 86.8 | 87 | 93.5 | 88.8 | 9.41 | 21.3 | 18.3
YOLOv12s | - | 87.8 | 83.7 | 93.5 | 89.3 | 9.08 | 19.3 | 17.8
RDT-YOLO (Ours) | - | 91.9 | 89.6 | 94 | 88.9 | 2.3 | 7.7 | 4.8
Table 8. Experiments with different positions of C3k2-RVB and C3k2-RVB-iEMA modules.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | FLOPs/G | Size/MB
Base | 85.7 | 81.8 | 90.5 | 84.9 | 2.58 | 6.3 | 5.2
A | 87.7 | 81.6 | 90.7 | 84.8 | 2.44 | 6 | 5
B | 82.9 | 83.4 | 90.7 | 85.1 | 2.45 | 6.1 | 5.1
C | 85.6 | 85.6 | 91.9 | 86.4 | 2.45 | 6.1 | 5.1
Table 9. Ablation study results of the proposed RDT-YOLO model.
Base | +A | +B | +C | +D | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | Size/MB
√ | - | - | - | - | 85.7 | 81.8 | 90.5 | 84.9 | 2.58 | 5.2
√ | √ | - | - | - | 85.5 | 85.6 | 91.8 | 84.9 | 2.45 | 5.1
√ | - | √ | - | - | 88.8 | 85.4 | 92.2 | 86.7 | 2.81 | 5.7
√ | - | - | √ | - | 88.6 | 86.4 | 92.1 | 87 | 2.2 | 4.5
√ | - | - | - | √ | 91.2 | 82.6 | 92 | 86.3 | 2.58 | 5.2
√ | √ | √ | - | - | 91.1 | 84.7 | 92.4 | 86.8 | 2.68 | 5.5
√ | √ | - | √ | - | 89.8 | 87.6 | 92.1 | 86.7 | 2.07 | 4.3
√ | - | √ | √ | - | 87.9 | 86.3 | 93.2 | 87.6 | 2.43 | 4.9
√ | √ | √ | √ | - | 89.4 | 88.1 | 93.4 | 88.5 | 2.3 | 4.8
√ | √ | √ | √ | √ | 91.9 | 89.6 | 94 | 88.9 | 2.3 | 4.8
Note: √ in the table indicates that the module is added.
Table 10. Defect detection results on the NEU-DET dataset.
Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Param/M | FLOPs/G | Size/MB
Faster-RCNN | 72.4 | 69.5 | 76.1 | 38.9 | 41.37 | 134 | 377
Cascade-RCNN | 72 | 69.9 | 76.6 | 41.4 | 69.21 | 162 | 484
SSD | 71.7 | 68.2 | 71.2 | 35.2 | 24.28 | 61.15 | 92.11
TOOD | 69.5 | 63.2 | 65.13 | 37.6 | 32.03 | 123 | 294
DINO | 56.5 | 51.6 | 60.2 | 32.9 | 47.55 | 179 | 374
RTDETR | 69.6 | 61.2 | 68.6 | 38 | 19.88 | 57 | 38.6
YOLOv10s | 72.9 | 65.9 | 71.6 | 38 | 8.04 | 24.5 | 15.8
YOLO11n | 69.6 | 69.3 | 75.4 | 40.9 | 2.58 | 6.3 | 5.2
YOLO11s | 77.4 | 65.3 | 76.5 | 41.2 | 9.41 | 21.3 | 18.3
YOLOv12s | 79.4 | 67.3 | 77.5 | 42.1 | 9.08 | 19.3 | 17.8
RDT-YOLO (Ours) | 75.9 | 73.2 | 78.9 | 43 | 2.3 | 7.7 | 4.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
