1. Introduction
According to FAO reports, global aquaculture has developed rapidly [1], with China serving as a major market for premium dried marine products such as shark fin, abalone, and fish maw [2]. Fish maw is a dried marine product made from the swim bladders of specific fish species through traditional processing techniques. It is rich in proteins, vitamins, and trace elements, and possesses medicinal properties such as hemostatic and anti-inflammatory effects [3,4]. However, as consumer demand for fish maw keeps growing, an increasing number of fish species are being used in its production [5]. The market value of fish maw is primarily determined by its species and quality, and the resulting price disparities among species and grades have led to problems such as counterfeiting and mislabeling [6,7,8]. Food fraud has become an increasingly serious global problem [9]. Research on fish maw authenticity detection is therefore highly necessary.
Traditional fish maw authenticity detection methods depend primarily on organoleptic assessment, chemical analysis, and molecular biology techniques. Organoleptic assessment requires extensive expertise yet remains highly subjective, with considerable error rates [10]. Chemical analysis encompasses various spectroscopic [11], chromatographic, and mass spectrometry techniques [12], while molecular biology techniques involve DNA-based analyses such as PCR [13], DNA barcoding, and sequencing [14]. Although both chemical and molecular methods offer high accuracy, they are time-consuming, costly, and require professional instrumentation and skilled operators, which constrains their practical application [15,16]. It is therefore critical to develop non-destructive, portable image-based detection technology for fish maw authenticity verification.
Recent advances in artificial intelligence have driven breakthroughs in the application of deep learning to food science [17]. In computer vision in particular, deep learning has transformed object detection, establishing neural network architectures as the dominant techniques for detection tasks. Contemporary object detection methods fall into two categories: two-stage detectors such as R-CNN [18], Fast R-CNN [19], and Faster R-CNN [20], and one-stage detectors including SSD [21], RetinaNet [22], and the You Only Look Once (YOLO) series [23,24,25,26,27]. While two-stage methods achieve higher accuracy through sequential region proposal and classification, one-stage detectors excel in computational efficiency and real-time processing.
YOLO is extensively used in food detection scenarios owing to its fast inference speed, few parameters, small model size, and high accuracy, and its versatility is evidenced by successful application in numerous food authentication tasks. For instance, Geng et al. [28] proposed a fishmeal adulteration identification method based on microscopic images and deep learning, using MobileNetV2 as the qualitative identification model combined with YOLOv3-MobileNetV2 for component identification, achieving accurate adulteration detection. Kong et al. [29] proposed an authentication method for Astragalus membranaceus based on AM-YOLO, which improved the accuracy and inference speed of genuine versus counterfeit herbal medicine recognition. Jubayer et al. [30] applied YOLOv7 to honey pollen detection and classification, enhancing honey authenticity identification. Zhang et al. [31] designed a lightweight Faster-YOLO algorithm based on YOLOv8n, incorporating group convolution hybrid attention mechanisms to solve rice adulteration classification problems. Jose et al. [32] proposed a spice classification method based on YOLOv8 with a variable correlation kernel convolutional neural network, which improved the accuracy and precision of visually similar spice recognition. These studies demonstrate the effectiveness and versatility of YOLO-based approaches in food authentication.
However, fish maw species identification presents unique challenges that remain inadequately addressed, despite the successful application of YOLO models to various food detection tasks. Unlike food items with distinct visual features, fish maw samples from different species exhibit highly similar appearances with only subtle variations in surface textures and patterns, demanding more sophisticated feature extraction capabilities. Moreover, the significant variations in size and shape across different fish maw processing methods necessitate robust multi-scale detection mechanisms. Additionally, practical deployment scenarios such as on-site inspection at wholesale markets or retail stores require lightweight models suitable for portable devices that still maintain high detection accuracy. These requirements highlight the urgent need for specialized detection algorithms tailored to fish maw authenticity verification.
To address these challenges, this study proposes RDT-YOLO, an enhanced YOLO11-based model specifically designed for fish maw authenticity detection. Unlike existing generic object detection models, RDT-YOLO introduces a systematic framework tailored to the unique characteristics of fish maw authentication, including subtle texture differences, significant size variations, and the need for lightweight deployment. The overall framework of this study is illustrated in Figure 1.
The main contributions and innovations of this work are as follows:
This is the first work to apply deep learning-based object detection technology to fish maw authenticity detection, transforming a traditional analytical chemistry problem into a computer vision task and establishing a comprehensive dataset focusing on easily confused species.
We design a hierarchical dual-stage feature extraction architecture with C3k2-RVB for shallow layers and C3k2-RVB-iEMA for deep layers. By replacing SE attention with our designed iEMA module in a RepViT block, we enhance spatial feature perception through cross-spatial learning specifically optimized for fine-grained texture recognition in fish maw detection.
We design the DAMSPP module with novel Adaptive Inception Convolution (AIConv) and Dynamic Kernel Weighting (DKW) mechanism. This adaptively captures fish maw texture information at different scales through dynamic feature enhancement, overcoming the limitations of traditional SPPF fixed pooling operations.
We develop an Adaptive Task-Aligned Detection Head (ATADH), which synergistically combines task decomposition, shared convolution, and DyDCNv2 mechanisms to achieve effective collaboration between classification and localization tasks while maintaining a lightweight architecture.
We integrate the Wise-IoU V3 focusing coefficient into Shape-IoU to form a Wise-ShapeIoU loss function, addressing the challenge of varying quality samples in real-world fish maw detection scenarios and achieving improved bounding box optimization.
2. Materials
2.1. Data Acquisition
The dataset used in this study was primarily sourced from the Shanghao Jiao retail store in Shantou City, Guangdong Province, China, and from Jiexun Aquaculture Co., Ltd. in Raoping County, China, employing multi-channel collection to enhance representativeness and diversity. The final fish maw dataset encompasses five major categories: Red-mouth fish maw, White croaker fish maw, Annan fish maw, Douhu fish maw, and Duck-tongue shaped maw.
The biological sources of the various fish maws differ considerably. Red-mouth fish maw has the most complex provenance, derived primarily from fish such as Protonibea diacanthus and Megalonibea fusca. Because these species are scarce and extremely rare in nature, Red-mouth fish maw commands high market prices and is considered a precious variety. In contrast, the source fish of the other types are comparatively common: Annan fish maw comes from Otolithoides biauritus; both White croaker fish maw and Duck-tongue shaped maw are saltwater maws derived from fish of the genus Pseudotolithus; and Douhu fish maw comes from fish of the genus Galeoides. Because their source fish are relatively common, these maws are priced far below Red-mouth fish maw. Their physical characteristics, however, are extremely similar, so they are often used to counterfeit Red-mouth fish maw.
Due to the striking physical similarities across fish maw varieties and the intricate nature of market deception, constructing a dataset capable of accurately distinguishing these subtle differences is of paramount importance. To this end, this study employed a Canon EOS 6D camera (Canon Inc., Tokyo, Japan) with 4032 × 3024 pixel resolution for high-quality image acquisition. To ensure data diversity and representativeness, image capture encompassed various lighting conditions, placement positions, quality grades, and shooting angles, including single-target and multi-target scenarios as well as complex background environments.
Figure 2 presents representative sample images, clearly demonstrating the typical characteristics and subtle differences among the five fish maw types.
2.2. Data Preprocessing
A data preprocessing workflow was implemented to ensure dataset quality. First, quality assessment was conducted on the collected fish maw images, removing samples that were blurred, improperly exposed, or of substandard resolution. Second, all retained images were uniformly scaled to 640 × 640 pixels. During the annotation phase, precise bounding box annotations were created on the Roboflow platform for all five fish maw categories. Finally, the labeled dataset was randomly partitioned in a 7:2:1 ratio into training, validation, and testing subsets.
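As a concrete illustration, the sketch below reproduces this resize-and-split step in Python with Pillow; the directory layout and file naming are hypothetical, and the Roboflow annotation files would be copied alongside each image in the same way.

```python
import random
from pathlib import Path
from PIL import Image

def build_splits(src_dir: str, dst_dir: str, seed: int = 0) -> None:
    """Resize images to 640x640 and split them 7:2:1 into train/val/test."""
    images = sorted(Path(src_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n = len(images)
    bounds = {"train": (0, int(0.7 * n)),
              "val": (int(0.7 * n), int(0.9 * n)),
              "test": (int(0.9 * n), n)}
    for split, (lo, hi) in bounds.items():
        out = Path(dst_dir) / split
        out.mkdir(parents=True, exist_ok=True)
        for path in images[lo:hi]:
            # Standardize dimensions before saving into the split folder.
            Image.open(path).convert("RGB").resize((640, 640)).save(out / path.name)

build_splits("raw_fish_maw", "fish_maw_dataset")
```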
2.3. Data Augmentation
The YOLO11n model was initially trained for 200 epochs to establish baseline performance. Results showed that the ZJCZ class achieved the highest recognition accuracy, while the BM and DH categories performed relatively poorly. Data augmentation techniques were employed to address this performance imbalance [33].
Augmentation methods included ±15° random rotation, brightness adjustment within a bounded range, and Gaussian blur with up to a 1.5-pixel radius. These techniques simulate variations in viewing angle, lighting conditions, and imaging quality. Each original training image generated one augmented version, expanding the training set from 1749 to 3498 samples, while the validation and test sets remained unchanged.
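The following Python sketch shows how these three operations can be applied with Pillow; the brightness bounds (0.8–1.2) are illustrative since only the rotation and blur limits are specified above, and bounding box annotations would need to be transformed together with any rotated image.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    """Apply random rotation, brightness shift, and Gaussian blur to one image."""
    # ±15° rotation; black fill for the exposed corners.
    img = img.rotate(random.uniform(-15, 15), expand=False, fillcolor=(0, 0, 0))
    # Brightness adjustment; the 0.8-1.2 factor range is an assumed bound.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
    # Gaussian blur with radius up to 1.5 pixels.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.5)))
    return img
```

Libraries such as Albumentations perform the same operations while updating YOLO-format bounding boxes jointly, which is the safer choice when rotation is enabled.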
Figure 3 illustrates typical augmentation effects.
Table 1 presents the sample distribution after augmentation.
3. Methods
3.1. RDT-YOLO Network Structure
RDT-YOLO is developed on the YOLO11 framework and specifically engineered to tackle inadequate fine-grained texture feature mining, limited cross-scale feature integration, and parameter redundancy in fish maw authenticity detection. The C3k2-RVB and C3k2-RVB-iEMA modules are developed as substitutes for the backbone C3k2 component of YOLO11, effectively reducing model parameters while significantly enhancing recognition of fine fish maw textures, thereby addressing the accuracy problems caused by similar textures and complex backgrounds. To overcome the limitations of the traditional SPPF's fixed pooling, the DAMSPP module is also developed. This module adaptively captures fish maw texture information at different scales through dynamic feature enhancement and multi-scale feature extraction, significantly improving feature representation.
Hierarchical feature representations extracted from the backbone architecture are subsequently processed through the neck module to achieve multi-scale feature integration. The resulting three-layer feature maps are then transmitted to the detection head. The ATADH is developed to substitute for the original detection head, significantly reducing parameters through task interaction mechanisms while efficiently decoding feature information to determine target position, category, and confidence. Additionally, the Wise-ShapeIoU loss function is formulated to enhance detection accuracy.
Figure 4 illustrates the architectural framework of the optimized RDT-YOLO, with the core components detailed below.
3.2. Reparameterized Feature Extraction Module
Fish maw authenticity detection fundamentally requires distinguishing species with highly similar appearances through subtle surface texture variations. As shown in Figure 5a, traditional C3k2 modules face two critical limitations: first, the bottleneck structure loses fine-grained information through dimensionality reduction, causing feature confusion among similar textures; second, the absence of spatial attention mechanisms prevents adaptive focus on discriminative texture patterns. These deficiencies directly lead to misclassifications of visually similar fish maw species.
To address these challenges, we propose a hierarchical dual-stage architecture implementing progressive texture learning. In shallow layers, the RepViT structure [34] forms the C3k2-RVB module, shown in Figure 5b, which employs reparameterization to preserve fine-grained details while maintaining efficiency. In deep layers, we develop the C3k2-RVB-iEMA module, shown in Figure 5c, by integrating our designed iEMA module, which captures directional texture patterns through cross-spatial learning via bidirectional 1D pooling.
3.2.1. iEMA Feature Enhancement Module
In fish maw authenticity detection, traditional attention mechanisms face significant challenges when processing complex surface texture features. To address this limitation, we propose the integrated Efficient Multi-scale Attention (iEMA) feature enhancement module, illustrated in Figure 6, which combines EMA attention with an inverted residual structure.
The EMA [35] mechanism captures spatial correlations through bidirectional 1D global average pooling along the width and height dimensions:

$$z_c^{H}(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z_c^{W}(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$

where $z_c^{H}(h)$ and $z_c^{W}(w)$ respectively encode spatial information along the height and width axes for the $c$-th channel.
The complete iEMA module follows an inverted residual architecture with integrated attention. The forward computation can be formulated as follows:

$$\mathbf{Y} = \mathbf{X} + \mathrm{DW}\big(\mathrm{EMA}(\mathrm{PW}(\mathbf{X}))\big)$$

where $\mathrm{PW}(\cdot)$ performs pointwise convolution for channel modulation, $\mathrm{EMA}(\cdot)$ applies efficient spatial attention through bidirectional 1D pooling, and $\mathrm{DW}(\cdot)$ extracts local texture patterns via depthwise convolution. The residual connection ensures information preservation throughout the transformation.
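A minimal PyTorch sketch of this structure is given below, with the EMA stage simplified to its bidirectional 1D pooling path; the exact layer ordering and gating follow the description above rather than a released implementation.

```python
import torch
import torch.nn as nn

class iEMA(nn.Module):
    """Sketch of the iEMA block: pointwise conv -> EMA-style attention ->
    depthwise conv, wrapped in a residual connection (layout inferred from
    the textual description; the full EMA has additional grouped branches)."""
    def __init__(self, c: int):
        super().__init__()
        self.pw = nn.Conv2d(c, c, 1)                        # channel modulation
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)   # local texture extraction
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))       # average over width  -> (B,C,H,1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))       # average over height -> (B,C,1,W)
        self.gate = nn.Conv2d(c, c, 1)

    def forward(self, x):
        y = self.pw(x)
        # Bidirectional 1D pooling encodes spatial context along H and W,
        # then broadcasts back to a full-resolution attention map.
        attn = torch.sigmoid(self.gate(self.pool_h(y) + self.pool_w(y)))
        y = self.dw(y * attn)
        return x + y  # residual connection preserves the input information
```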
3.2.2. RepViT-iEMA Block
The RepViT block achieves computational efficiency through structural reparameterization: during training, multi-branch structures such as RepVGGDW enable rich feature learning, while at inference the branches are reparameterized into a single convolution operation. As shown in Figure 7a, the original RepViT block follows the token mixer and channel mixer paradigm.
However, the SE attention mechanism in the RepViT block ignores spatial dimension information, making it insufficient for capturing the fine texture variations that are essential to fish maw authenticity assessment. To address this limitation, we replace the SE attention with our proposed iEMA module, forming the RepViT-iEMA block illustrated in Figure 7b:

$$\mathbf{X}' = \mathrm{iEMA}\big(\mathrm{RepDW}(\mathbf{X})\big), \qquad \mathbf{Y} = \mathbf{X}' + \mathrm{PW}_{2C \to C}\Big(\mathrm{GELU}\big(\mathrm{PW}_{C \to 2C}(\mathbf{X}')\big)\Big)$$

where $\mathrm{RepDW}(\cdot)$ performs reparameterizable depthwise convolution, GELU denotes the Gaussian Error Linear Unit activation function that provides smooth, non-monotonic activation, $C$ represents the input channel dimension, and $2C$ indicates channel expansion in the inverted residual structure. This modification retains the advantages of reparameterization while enhancing spatial feature perception through iEMA's cross-spatial learning mechanism.
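Continuing the sketch, a simplified RepViT-iEMA block can be assembled from the iEMA module above; the reparameterizable RepVGGDW branch is stood in for by a plain 3 × 3 depthwise convolution, since multi-branch reparameterization only changes the inference-time form.

```python
class RepViTiEMA(nn.Module):
    """Sketch of the RepViT-iEMA block: a depthwise token mixer with iEMA in
    place of SE, followed by a 2x-expansion pointwise FFN with GELU."""
    def __init__(self, c: int):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)  # stands in for RepVGGDW
        self.attn = iEMA(c)                                # replaces SE attention
        self.ffn = nn.Sequential(
            nn.Conv2d(c, 2 * c, 1),  # expand C -> 2C
            nn.GELU(),
            nn.Conv2d(2 * c, c, 1))  # project 2C -> C

    def forward(self, x):
        x = self.attn(self.dw(x))   # token mixer with spatial attention
        return x + self.ffn(x)      # channel mixer with residual
```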
3.3. Dynamic Adaptive Multi-Scale Pyramid Processing
Fish maw detection requires adaptive multi-scale processing for varying sizes and shapes. Traditional SPPF modules suffer from fixed pooling that loses fine-grained details and lacks the adaptivity to adjust based on input characteristics, resulting in suboptimal representations.
Here, we design Dynamic Adaptive Multi-Scale Pyramid Processing (DAMSPP), as shown in Figure 8, which dynamically adjusts feature extraction through an Adaptive Multi-Scale Inception Block (AMSIB), shown in Figure 9. The AMSIB incorporates Adaptive Inception Convolution with Dynamic Kernel Weighting to learn scale-specific importance, combined with shared convolution for efficiency. This adaptively captures texture at multiple scales while preserving fine-grained detail.
3.3.1. Adaptive Inception Convolution
This paper first designs Adaptive Inception Convolution (AIConv), the core of which lies in the use of a Dynamic Kernel Weighting (DKW) mechanism to achieve global adaptive processing of input feature maps.
Three parallel depthwise separable convolution branches are designed for the multi-scale feature extraction stage, adopting square, horizontal, and vertical convolution kernel configurations respectively to capture information along different spatial directions:

$$\mathbf{F}_{s} = \mathrm{DW}_{s}(\mathbf{X}), \qquad \mathbf{F}_{h} = \mathrm{DW}_{h}(\mathbf{X}), \qquad \mathbf{F}_{v} = \mathrm{DW}_{v}(\mathbf{X})$$

where $\mathbf{X}$ represents the input tensor and $\mathrm{DW}_{s}$, $\mathrm{DW}_{h}$, $\mathrm{DW}_{v}$ represent the square, horizontal, and vertical depthwise convolutions, respectively. The band-kernel sizes of the horizontal and vertical branches are tied to the square-kernel size through a fixed constraint relationship.
In the weight generation stage, the DKW mechanism analyzes global information from the input features and dynamically generates adaptive weights for each branch:

$$\mathbf{A} = \mathrm{Softmax}\big(\mathrm{PW}_{C \to 3C}(\mathrm{GAP}(\mathbf{X}))\big)$$

where Softmax is the softmax function and GAP is Global Average Pooling. This mechanism first performs global statistical aggregation on the input features, then expands the channel dimension from $C$ to $3C$ through pointwise convolution to generate corresponding attention weights for the three processing branches.

Each sub-tensor $\mathbf{A}_{t}$ ($t \in \{s, h, v\}$) corresponds to the weight coefficient of the respective branch. The output of AIConv is obtained through weighted fusion:

$$\mathbf{Y} = \mathrm{SiLU}\Big(\mathrm{BN}\big(\mathbf{A}_{s} \odot \mathbf{F}_{s} + \mathbf{A}_{h} \odot \mathbf{F}_{h} + \mathbf{A}_{v} \odot \mathbf{F}_{v}\big)\Big)$$

where ⊙ represents element-wise multiplication and $\mathbf{Y}$ represents the dynamically weighted fusion result; the fused features are processed through batch normalization and a SiLU activation function to achieve stable training and accelerated convergence.
3.3.2. Multi-Scale Feature Mixer
To improve cross-scale feature fusion and information aggregation, we design and implement a Multi-Scale Feature Mixer (MSFM) module based on AIConv:

$$[\mathbf{X}_{1}, \mathbf{X}_{2}] = \mathrm{Split}(\mathbf{X}), \qquad \mathbf{Y} = \mathrm{PW}\big(\mathrm{Concat}\big(\mathrm{AIConv}_{1}(\mathbf{X}_{1}), \mathrm{AIConv}_{2}(\mathbf{X}_{2})\big)\big)$$

Specifically, the input feature is first split evenly along the channel dimension, yielding two subfeatures $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$. The two subfeatures are then fed into AIConv units configured with different receptive field sizes, achieving differentiated feature extraction. Finally, the fused output is generated through feature concatenation and channel modulation.
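A corresponding sketch, reusing the AIConv class above, could look as follows; the receptive field sizes assigned to the two branches are assumptions.

```python
class MSFM(nn.Module):
    """Sketch of the Multi-Scale Feature Mixer: split channels in half, run each
    half through an AIConv with a different receptive field, then concatenate
    and fuse. Assumes an even channel count."""
    def __init__(self, c: int):
        super().__init__()
        self.branch1 = AIConv(c // 2, k=3)  # smaller receptive field
        self.branch2 = AIConv(c // 2, k=5)  # larger receptive field (illustrative)
        self.fuse = nn.Conv2d(c, c, 1)      # channel modulation after concatenation

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        return self.fuse(torch.cat((self.branch1(x1), self.branch2(x2)), dim=1))
```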
3.3.3. Adaptive Multi-Scale Inception Block
Finally, the MetaFormer design paradigm [36] is adopted to construct AMSIB, with MSFM as the core mixing operator combined with the Convolutional Gated Linear Unit [37] from TransNeXt as the feedforward network. To achieve fine-grained control over feature fusion, learnable scale parameters are introduced:

$$\mathbf{X}' = \mathbf{X} + \lambda_{1} \cdot \mathrm{MSFM}\big(\mathrm{Norm}(\mathbf{X})\big), \qquad \mathbf{Y} = \mathbf{X}' + \lambda_{2} \cdot \mathrm{ConvGLU}\big(\mathrm{Norm}(\mathbf{X}')\big)$$

where $\lambda_{1}$ and $\lambda_{2}$ are trainable scale modulation parameters initialized to 0.01 to ensure training stability. A residual connection strategy is adopted to ensure effective information transfer and stable gradient backpropagation.
In summary, AMSIB adaptively weights multi-scale features based on input content, enabling effective capture of fine texture information on fish maw surfaces while preserving detail and enhancing overall feature representation.
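Putting the pieces together, a simplified AMSIB might read as below; the Convolutional GLU is approximated by a depthwise value branch modulated by a sigmoid gate, and the 0.01-initialized scale parameters follow the description above.

```python
class AMSIB(nn.Module):
    """Sketch of the MetaFormer-style AMSIB: MSFM as the token mixer and a gated
    FFN standing in for TransNeXt's Convolutional GLU, each scaled by a learnable
    parameter initialized to 0.01 inside a residual connection."""
    def __init__(self, c: int):
        super().__init__()
        self.norm1, self.norm2 = nn.BatchNorm2d(c), nn.BatchNorm2d(c)
        self.mixer = MSFM(c)
        self.ffn_gate = nn.Conv2d(c, c, 1)
        self.ffn_value = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.ffn_out = nn.Conv2d(c, c, 1)
        self.scale1 = nn.Parameter(torch.full((1, c, 1, 1), 0.01))
        self.scale2 = nn.Parameter(torch.full((1, c, 1, 1), 0.01))

    def forward(self, x):
        x = x + self.scale1 * self.mixer(self.norm1(x))
        y = self.norm2(x)
        # Simplified convolutional gating: depthwise value branch times a gate.
        return x + self.scale2 * self.ffn_out(
            torch.sigmoid(self.ffn_gate(y)) * self.ffn_value(y))
```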
3.4. Adaptive Task-Aligned Detection Head
Fish maw detection requires synergistic collaboration between classification and localization tasks, as accurate species identification depends on precise positional context. Traditional YOLO11 detection heads employ a decoupled dual-branch design with two critical limitations: first, independent processing prevents effective information exchange between tasks, resulting in misalignment where high classification confidence may correspond to poor localization accuracy; second, separate branches create redundant feature extraction and excessive computational overhead. These deficiencies degrade detection accuracy and efficiency, particularly for fish maw samples with irregular shapes and complex backgrounds.
To address these challenges, inspired by TOOD [38] and as illustrated in Figure 10, we propose an Adaptive Task-Aligned Detection Head (ATADH). First, task-aligned sample assignment ensures consistency between classification and localization objectives. Second, multi-layer shared convolution with hierarchical feature fusion reduces parameter redundancy while enabling effective task information sharing. Third, the integration of a Task Decomposition module with dynamic kernel modulation and a DyDCNv2 mechanism achieves effective task separation through hierarchical attention while adaptively modeling irregular fish maw geometries. This synergistic design significantly reduces parameters through task interaction while efficiently determining target position, category, and confidence, effectively addressing fish maw detection challenges in complex environments.
ATADH first employs a multi-layer shared convolution structure for initial feature extraction, resulting in enhanced feature representation capability through hierarchical feature fusion design. Specifically, input features are sequentially processed through two shared convolutional layers, with outputs from each layer concatenated and fused to form rich multi-level feature representations. This approach not only captures semantic information at different levels but also promotes gradient flow, improving training effectiveness.
The Task Decomposition module serves as the core component of ATADH, achieving effective separation and collaboration between classification and regression tasks. This module employs hierarchical attention mechanisms to identify and strengthen features that are more critical to the current task, avoiding feature conflicts between tasks. The computational formulas for the Task Decomposition module are as follows:
$$z = \mathrm{GAP}(f)$$

$$W = \sigma\big(\mathrm{Conv}_{2}(\mathrm{Conv}_{1}(z))\big)$$

$$\omega' = W \odot \omega$$

$$f_{task} = \mathrm{GN}(\omega' \ast f)$$

where $f$ is the input feature map, GAP is Global Average Pooling, $z$ is the globally average-pooled feature, $\sigma$ is the sigmoid function, $W$ is the dynamically generated hierarchical attention weight, $\omega$ is the learnable base convolution kernel, $\omega'$ is the dynamically modulated convolution kernel, GN is Group Normalization, ⊙ denotes tensor broadcasting multiplication, and $\ast$ denotes convolution.

The input feature $f$ undergoes global average pooling to obtain $z$, which then generates the attention weights $W$ through two convolutional layers; $W$ dynamically modulates the convolution kernel $\omega$ to transform the original feature $f$, ultimately yielding task-specific features. In the task decomposition architecture, the classification and regression modules operate independently to achieve effective task separation, with the former extracting semantic features and the latter focusing on geometric features.
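The per-sample kernel modulation can be sketched as follows; the reduction ratio and the 1 × 1 base kernel are hypothetical choices, and a production version would fuse the batched convolutions rather than looping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskDecomposition(nn.Module):
    """Sketch of the Task Decomposition step: global context generates attention
    weights W that modulate a learnable base kernel omega before it is applied
    to the input feature f."""
    def __init__(self, c: int, reduction: int = 8):
        super().__init__()
        # Two conv layers generate W from the pooled feature z, then sigmoid.
        self.fc = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())
        self.omega = nn.Parameter(torch.randn(c, c, 1, 1) * 0.01)  # base kernel
        self.gn = nn.GroupNorm(16, c)  # assumes c divisible by 16

    def forward(self, f):
        w = self.fc(F.adaptive_avg_pool2d(f, 1))  # z = GAP(f) -> W, shape (B,C,1,1)
        outs = []
        for b in range(f.size(0)):  # per-sample dynamic kernel modulation
            kernel = self.omega * w[b].view(-1, 1, 1, 1)
            outs.append(F.conv2d(f[b:b + 1], kernel))
        return self.gn(torch.cat(outs, dim=0))
```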
The DyDCNv2 module utilizes intermediate layer features to dynamically compute offsets and masks, modulating convolution operations to adapt to irregular fish maw shapes and texture variations. Its computational formulas are as follows:
$$(\Delta x, \Delta y, \hat{M}) = \mathrm{Conv}_{offset}(f), \qquad M = \sigma(\hat{M})$$

$$y(i, j) = \sum_{m,n} W(m, n) \cdot f\big(i + m + \Delta x_{m,n},\; j + n + \Delta y_{m,n}\big) \cdot M_{m,n}$$

where $M$ represents the modulation mask, $\Delta x$ and $\Delta y$ denote the learned horizontal and vertical offsets respectively, and $\mathrm{Conv}_{offset}$ serves as a specialized convolutional layer for offset and mask generation. In the deformable convolution operation, $W$ denotes the convolutional kernel weight parameter, with $i$, $j$ indicating spatial coordinates in the output feature map and $m$, $n$ representing relative coordinates within the sampling grid; the sigmoid function is denoted by $\sigma$.
The input feature f simultaneously generates offsets and masks through spatial convolution, with the masks activated by Sigmoid. The dynamic deformable convolution adjusts sampling positions based on the learned offsets and modulates response intensity according to mask weights, achieving adaptive modeling of irregular geometric shapes.
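This step maps naturally onto torchvision's modulated deformable convolution, as sketched below for a 3 × 3 kernel (18 offset channels plus 9 mask channels); the shared offset-and-mask convolution follows the formulation above.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DyDCNv2(nn.Module):
    """Sketch of the DyDCNv2 step: one conv predicts offsets and a mask from f;
    the mask is sigmoid-activated and modulates a deformable 3x3 convolution."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # 3x3 kernel -> 9 sampling points: 18 offset channels (x, y) + 9 mask channels.
        self.offset_mask = nn.Conv2d(c_in, 27, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(c_out, c_in, 3, 3) * 0.01)

    def forward(self, f):
        om = self.offset_mask(f)
        offset, mask = om[:, :18], torch.sigmoid(om[:, 18:])
        # Sampling positions shift by the offsets; responses scale by the mask.
        return deform_conv2d(f, offset, self.weight, padding=1, mask=mask)
```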
In conclusion, ATADH effectively addresses core challenges of fish maw authenticity detection such as complex background interference, different target scales, and irregular texture shapes by integrating dynamic feature adaptation, lightweight design, and adaptive decomposition. This technical integration approach substantially improves the detection accuracy and system stability of fish maw authenticity identification in complex environments.
3.5. Optimized Loss Function
The YOLO11 architecture implements CIoU as the regression optimization metric for fish maw detection. This loss function integrates the overlap ratio, centroid displacement, and dimensional consistency of predicted versus annotated boxes, as formulated in Equations (28)–(30):

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^{2}(\mathbf{b}, \mathbf{b}^{gt})}{c^{2}} + \alpha v$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$

$$\alpha = \frac{v}{(1 - IoU) + v}$$

where $IoU$ denotes the spatial overlap coefficient of the reference and predicted detection regions, $\rho^{2}(\mathbf{b}, \mathbf{b}^{gt})$ represents the squared Euclidean distance between centroid positions, $c$ signifies the diagonal span of the minimal enclosing rectangular region, $\alpha$ serves as the adaptive weight coefficient, $v$ quantifies aspect ratio consistency, $w$ and $h$ represent the predicted bounding box dimensions, and $w^{gt}$ and $h^{gt}$ denote the ground-truth box dimensions.
Analysis of Equations (29) and (30) reveals a fundamental limitation: when the aspect ratios become consistent, the penalty components $\alpha$ and $v$ become ineffective. Contemporary regression optimization research emphasizes geometric relationships while underestimating the morphological diversity and dimensional variation that are critical for fish maw detection targets.
For specimens exhibiting substantial morphological diversity, including fish maw, Shape-IoU [39] incorporates both target morphological characteristics and dimensional factors into the regression loss computation, as detailed in Equations (31)–(36):

$$ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$$

$$hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}$$

$$distance^{shape} = hh \times \frac{(x_{c} - x_{c}^{gt})^{2}}{c^{2}} + ww \times \frac{(y_{c} - y_{c}^{gt})^{2}}{c^{2}}$$

$$\omega_{w} = hh \times \frac{|w - w^{gt}|}{\max(w, w^{gt})}$$

$$\omega_{h} = ww \times \frac{|h - h^{gt}|}{\max(h, h^{gt})}$$

$$\mathcal{L}_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}, \qquad \Omega^{shape} = \sum_{t = w, h}\big(1 - e^{-\omega_{t}}\big)^{\theta}$$

where $distance^{shape}$ characterizes the morphological distance loss, $scale$ indicates the scaling parameter, $\Omega^{shape}$ represents the morphological value loss, $ww$ and $hh$ constitute weighting coefficients in the horizontal and vertical dimensions, and $x_{c}$, $y_{c}$, $x_{c}^{gt}$, and $y_{c}^{gt}$ specify the centroid positions of the model predictions and reference annotations.
Despite improvements over CIoU, Shape-IoU remains insufficient for addressing low-quality data samples in practical fish maw detection applications. Real-world detection encounters quality degradation factors including capture angle variations, illumination inconsistencies, and specimen desiccation differences, creating substantial dataset quality disparities that generate non-uniform gradient contributions during training.
Drawing on the Wise-IoU V3 framework [40], the proposed Wise-ShapeIoU integrates the focusing coefficient $r$ with Shape-IoU to implement dynamic gradient allocation based on anchor box quality. This approach minimizes the detrimental contribution of low-quality samples while maintaining optimization effectiveness, as formulated in Equations (37)–(39):

$$\mathcal{L}_{Wise\text{-}ShapeIoU} = r \times \mathcal{L}_{Shape\text{-}IoU}$$

$$r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$$

$$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty)$$

where $r$ characterizes the non-monotonic focusing coefficient, $\beta$ quantifies the outlier degree of an anchor box, $\alpha$ and $\delta$ constitute hyperparameters, and $\mathcal{L}_{IoU}^{*}$, the detached IoU loss normalized by its running mean $\overline{\mathcal{L}_{IoU}}$, determines the gradient gain magnitude.
The focusing coefficient mechanism enables prioritization of medium-quality anchor boxes while reducing emphasis on high-quality anchors and minimizing negative gradient interference generated by low-quality samples. This quality-aware optimization addresses environmental variability and imaging condition challenges inherent in fish maw detection, achieving improved detection performance through more accurate quality-aware anchor matching.
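As an illustration, the focusing coefficient can be wrapped around any per-anchor Shape-IoU loss as sketched below; the α and δ values are the defaults suggested in the Wise-IoU paper, and the running-mean update is one common choice rather than the paper's exact bookkeeping.

```python
import torch

class WiseShapeIoUScaler:
    """Sketch of the Wise-IoU V3 focusing coefficient applied to a Shape-IoU
    loss. alpha/delta follow common Wise-IoU V3 defaults; the running mean of
    the IoU loss is tracked with an exponential moving average."""
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_iou_loss = 1.0  # running mean of L_IoU

    def __call__(self, iou_loss: torch.Tensor, shape_iou_loss: torch.Tensor) -> torch.Tensor:
        # Outlier degree beta: detached IoU loss over its running mean.
        beta = iou_loss.detach() / max(self.mean_iou_loss, 1e-6)
        self.mean_iou_loss = (1 - self.momentum) * self.mean_iou_loss \
            + self.momentum * iou_loss.detach().mean().item()
        # Non-monotonic focusing coefficient r peaks for medium-quality anchors
        # and down-weights both very good and very poor ones.
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        return r * shape_iou_loss
```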
5. Discussion
5.1. Technical Necessity and Method Advantages
Traditional fish maw authenticity identification primarily employs expert sensory evaluation and chemical composition analysis methods. While these approaches demonstrate high accuracy, they face practical challenges such as elevated detection costs, extended time cycles, and heavy dependence on specialized equipment and technical personnel. As a result, they fail to effectively meet the actual demands for highly efficient and large-scale detection in commercial environments. As the market value of fish maw continues to rise and counterfeiting techniques become more sophisticated, there is an increasing demand for automated detection solutions that are efficient, precise, and readily implementable.
Although deep learning has been successfully applied in other areas of food safety detection, including agricultural product quality grading, meat freshness assessment, and tea grade detection, research on fish maw authenticity detection remains limited to traditional analytical chemistry and spectroscopy methods. This study is the first to introduce deep learning object detection technology into the field of fish maw authenticity detection. The proposed RDT-YOLO model transforms fish maw authenticity detection into an object detection problem, achieving end-to-end automated detection.
Compared with traditional methods, this model offers significant advantages: it avoids complex sample preprocessing and manual feature design, automatically learning discriminative visual features of fish maw; it provides fast inference with real-time detection capability; it scales well within deep learning frameworks, facilitating model optimization and functional expansion; and it keeps detection costs low, requiring only an ordinary digital camera or smartphone for image acquisition rather than expensive professional analytical equipment.
5.2. Limitations Analysis and Future Research Directions
Although RDT-YOLO performs excellently in fish maw authenticity detection tasks, it still has limitations. The current dataset is limited in scale and mainly sourced from specific suppliers and regions, which may introduce sample bias. Fish maw samples with different origins, processing techniques, and storage conditions exhibit subtle differences in appearance characteristics, which may affect model generalization ability. Additionally, although the proposed model demonstrates optimized computational efficiency, further lightweight model compression remains necessary for implementation on edge devices with limited computational resources. The current detection categories are relatively fixed, and the model requires continuous updates and expansion as new counterfeiting methods emerge.
Future research directions include constructing larger-scale and more diverse fish maw datasets covering samples from different origins, varieties, quality grades, and counterfeiting types; exploring data augmentation and generative adversarial networks to balance data distribution; researching multimodal fusion detection by combining visual features with physicochemical information including near-infrared spectral techniques and Raman spectral analysis to establish a complete and stable identification framework; and developing lightweight models through network pruning, knowledge distillation, quantization, and other techniques to improve inference efficiency.