1. Introduction
Chickens are a major source of meat and eggs worldwide [1,2], and their health directly affects food safety and quality. Currently, the livestock industry relies on drugs to control more than 80 types of diseases in laying hens [3], but the health risks caused by drug residues have raised serious concerns. Therefore, the development of timely and accurate disease detection technologies for chickens has become a critical need to reduce drug use and prevent the spread of diseases.
Currently, the most common method for diagnosing sick chickens is manual observation, but this method has drawbacks such as strong subjectivity and high risks of zoonotic diseases. With the development of information technology, automated diagnostic methods for diseased chickens based on machine vision [4,5] have made significant progress (e.g., infrared thermal imaging [6], comb contour analysis [7], foot feature classification [8], and Lab color space detection [9]). However, their reliance on manually designed features limits their generalization capabilities. In recent years, deep learning technology [10,11] has driven animal and plant pathology detection into a new phase: Zhou et al. [12] used Faster R-CNN to detect abnormal chicken feces, while Mizanu et al. [13] utilized YOLO-V3 to segment regions of interest (ROI) from feces images and employed ResNet50 for disease classification of the segmented images. Chen et al. [14] used YOLOv5s and tracking technology for the early detection of respiratory diseases in chickens, while Thakur [15] proposed a Transformer-based automatic disease detection model, “PlantViT”, for identifying plant diseases.
Chicken manure serves as a non-invasive, sensitive key biological indicator: abnormal changes in its visual features can provide direct pathological evidence for early warning of diseased chickens. However, current single-modal representations in deep learning struggle to fully capture the complex pathological features of chicken manure, facing three challenges in particular: (1) Weak feature expression: visual features of feces during the disease incubation period are not prominent, and feces features differ among individual diseased chickens, leading to blurred feature boundaries. (2) Modal limitations: single-modal images or text cannot comprehensively characterize pathology, so models are unable to extract sufficiently specific and robust features from one modality alone. (3) Environmental interference: factors such as litter color and uneven lighting reduce model robustness.
Multi-source data fusion is emerging as a new approach to overcome these challenges [16]. Dai et al. [17] used image–text fusion (ITF) to identify pests, while Wang Chunshan et al. [18] embedded disease text modality information on the basis of the disease image modality, achieving joint feature representation learning. Lee et al. [19] combined multimodal data with hybrid data augmentation techniques to simultaneously predict crop type, detect diseases, and assess disease severity. Ma et al. [20] dynamically fused multispectral images (MultimodalNet) to predict crop yield, Chen Wenjun et al. [21] constructed a deep-near-infrared lightweight framework (YOLO-DNA), and Liu Yang et al. [22] integrated tomato image, near-infrared spectral, and tactile modality information through feature fusion. These studies confirm that strategies such as feature concatenation, weighted fusion, and GANs [23] can significantly improve model adaptability. However, existing methods still face the challenge of balancing information redundancy and computational efficiency.
Given the unique characteristics of detecting sick chicken feces, this paper proposes a multimodal fusion model called MMCD, with the following core innovations:
- (1) Dual-modal feature complementarity: fusing ResNet [10] visual features with BERT [24] text semantic features to enhance the model’s sensitivity to weak pathological features.
- (2) Dynamic gating [25] fusion mechanism: replacing feature concatenation with Gated Cross-Attention (GCA) to filter environmental noise and reinforce key pathological features.
- (3) Lightweight architecture: GCA reduces redundant computations, avoids overfitting to weakly correlated information, and improves fusion efficiency and training stability.
- (4) Data augmentation strategy: low-quality datasets are constructed using image degradation techniques to enhance robustness in complex scenarios.
Experiments demonstrate that MMCD significantly improves early pathological identification accuracy and generalization performance, providing a new paradigm for intelligent monitoring systems in livestock farms.
3. Experimental Results and Analysis
3.1. Experimental Parameter Setting
The models involved in this study were all run on the same equipment; the specific configuration is shown in Table 3. Image feature extraction used the ResNet50 network, and text feature extraction used the BERT model. The input text length (pad_size) was set to 30, the input image size to 224 × 224 pixels, the number of iterations (epochs) to 100, and the training batch size to 16. The output feature dimension of both modalities was set to 768. The hyperparameters were as follows: the initial learning rate was set to 0.00001, the cosine annealing algorithm was used for learning rate decay, Adam was selected as the optimizer, and He initialization was used.
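The hyperparameters above can be sketched in PyTorch as follows. This is an illustrative configuration only (the placeholder model is not the real MMCD; only the optimizer, learning rate schedule, and initialization follow the settings described in this section):

```python
import torch
from torch import nn, optim

def he_init(module: nn.Module):
    """He (Kaiming) initialization for convolutional and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Placeholder classifier head: two 768-d modal features in, 4 classes out.
model = nn.Sequential(nn.Linear(768 * 2, 256), nn.ReLU(), nn.Linear(256, 4))
model.apply(he_init)

optimizer = optim.Adam(model.parameters(), lr=1e-5)  # initial LR = 0.00001
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # 100 epochs

# One scheduler step per epoch; the LR decays along a cosine curve.
for _ in range(10):
    optimizer.step()   # would follow loss.backward() in a real training loop
    scheduler.step()
```

After 10 of 100 epochs the learning rate has decayed below its initial value, following the cosine annealing curve.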
Accuracy, Recall, Precision, and the harmonic mean of Precision and Recall (F1 score) were used as evaluation indexes. Accuracy is the proportion of all samples that the model predicts correctly, reflecting its overall predictive ability. Recall measures the proportion of actual positive samples that are correctly predicted as positive, while Precision measures the proportion of samples predicted as positive that truly are positive. The F1 score, their harmonic mean, measures the combined performance of the model’s Precision and Recall.
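The four metrics can be computed from predictions as follows. The paper does not specify the averaging mode for multi-class Precision/Recall/F1, so macro averaging over the four categories is an assumption here, and the sample labels are purely illustrative:

```python
def classification_metrics(y_true, y_pred, classes):
    """Accuracy plus macro-averaged Precision, Recall, and F1 score."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec); recalls.append(rec); f1s.append(f1)
    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example using the paper's four categories (labels are made up).
y_true = ["Health", "Cocci", "Ncd", "Salmo", "Health", "Cocci"]
y_pred = ["Health", "Cocci", "Health", "Salmo", "Health", "Ncd"]
acc, prec, rec, f1 = classification_metrics(
    y_true, y_pred, ["Health", "Cocci", "Ncd", "Salmo"])
```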
3.2. Comparative Experiments
To address the low classification accuracy of detecting diseased chicken feces from the image modality alone, which stems from limited feature availability, this section introduces a multi-modal feature fusion strategy. Specifically, three fusion methods are employed: simple concatenation (denoted ResBERT-C in the table), a Transformer encoder [31] (denoted ResBERT-T), and the novel model proposed in this study (MMCD). Furthermore, to assess the effectiveness of the fusion strategies, these methods are compared against a baseline model using only the image modality (denoted ResNet50) and its enhanced version (denoted ResNet50-MD).
As illustrated in Table 4, the ResNet50 model, used as the foundational image processing model, attained an Accuracy of merely 85.58%, the lowest among all evaluated models. This outcome suggests that exclusive dependence on visual features is inadequate for the precise identification of diseased chicken feces, particularly under conditions of low image quality, high noise, and minimal inter-class variability. In contrast, the ResNet50-MD model, which integrates a multi-scale feature extraction mechanism, achieved an improved Accuracy of 89.24%, with Precision and Recall rising to 89.44% and 89.24%, respectively, while using fewer parameters (18.06 M) and lower computational complexity (2.51 G). These findings substantiate the efficacy of the multi-scale design in capturing local image information, especially in modeling complex texture structures.
Compared with the single-modal models, the overall performance of the dual-modal fusion models is significantly improved. The ResBERT-C model effectively compensates for the lack of image information by introducing textual semantic descriptions: its Accuracy improves to 95.21% and its F1 score reaches 95.15%, with 60.71 M parameters and a computational complexity of 3.78 GFLOPs, showing a good balance between performance and efficiency. Although the ResBERT-T model further deepens text modeling, its parameter count rises sharply to 145.79 M and its computational complexity to 4.86 G, yet its Accuracy (95.11%) and F1 score (95.10%) are not significantly better than ResBERT-C’s. This indicates that, despite the large increase in model complexity, the performance improvement tends to saturate, possibly owing to redundant parameters. The MMCD model proposed in this paper makes structural innovations in the multi-modal fusion strategy: it introduces a cross-modal attention mechanism and an efficient convolution module and optimizes the interaction path of image–text information. MMCD achieves the highest Accuracy (97.96%), Precision (98.13%), Recall (97.96%), and F1 score (97.97%) while maintaining a moderate parameter scale (64.25 M) and a relatively low computational cost (3.80 GFLOPs).
As can be seen from Table 5, the Precision and Recall of the basic image model ResNet50 in the Health category are only 75.26% and 89.26%, respectively, indicating obvious misdetection and an insufficient ability to model the boundary between healthy feces and early, mildly abnormal feces. At the same time, ResNet50’s F1 score in the Ncd category is only 83.23%, possibly due to feature confusion. The improved ResNet50-MD model improves in all categories, especially Salmo, where the F1 score reaches 90.84%, verifying the important role of the multi-scale strategy in recognizing fine details of feces images.
The performance of the multimodal fusion models is consistent across categories, with F1 scores above 90% in all four categories, especially Cocci and Salmo (F1 scores of 96.98% and 97.90%, respectively). This indicates that semantic descriptions can assist image discrimination, particularly for diseases with small differences in appearance but significant differences in semantics, where text information provides a valuable basis for discrimination. The Precision and Recall of the proposed MMCD model reach 98% or more in the four categories, and the F1 score of the Ncd category reaches 95.74%. In addition, the Recall of the Health category is as high as 96.85%, significantly better than that of the other models. As can be seen from Figure 10, the Accuracy of the proposed MMCD model on the test set is significantly improved compared with the single-modal model and the models using the other fusion methods.
The above experiments show that the proposed MMCD model has stronger fine-grained identification and generalization abilities and is better at handling the fuzzy boundary between healthy and diseased feces. Moreover, MMCD achieves a better trade-off among accuracy, computational efficiency, and model complexity, showing strong potential for practical application.
3.3. Ablation Experiments
3.3.1. GCA Activation Function and Cross-Attention Head Number Ablation Experiments
To validate the impact of different activation functions and cross-attention head counts on model classification accuracy, this section designs ablation experiments targeting the internal activation functions and cross-attention head counts of GCA. The activation functions selected were Sigmoid, Tanh, and ReLU, while the number of attention heads was set to 4, 8, and 12, covering a range from “computational efficiency” to “fine-grained feature capture,” aligning with GCA’s requirements for cross-modal association modeling.
As shown in Table 6, the accuracy of the Sigmoid + 8 combination (97.96%) is significantly higher than that of Tanh + 8 (95.82%) and ReLU + 8 (94.99%), indicating that its “noise suppression” characteristic is better suited to multimodal fecal detection tasks. In fecal detection, the association between textual descriptions (e.g., “dark brown and viscous”) and image features remains relatively stable; Sigmoid suppresses irrelevant modal content (e.g., bedding noise), enabling the model to focus more precisely on effective features. All activation functions reach their performance peak at eight heads, and accuracy decreases beyond eight heads (e.g., Sigmoid drops from 97.96% to 96.95%). Excessive head counts lead to attention dispersion and increased interference between modalities (weak associations between text and images are incorrectly amplified), while eight heads strike a balance between modality-correlation granularity and computational efficiency. Tanh’s bidirectional regulation (which can enhance effective features) is theoretically better suited to multimodal tasks, but it underperforms Sigmoid in the experiments, because in low-quality fecal classification “suppressing noise” takes priority over “enhancing features”. ReLU’s non-negative activation loses negative input information and performs worst in scenarios with significant modal fluctuations such as low-quality images (ReLU + 8 accuracy is only 94.99%).
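The gating behavior discussed above can be sketched as follows. The paper does not publish the exact GCA layer layout, so this is a plausible minimal sketch under stated assumptions: 768-dimensional modal features, 8-head cross-attention with image tokens as queries and text tokens as keys/values, and a sigmoid gate on the attended output:

```python
import torch
from torch import nn

class GatedCrossAttention(nn.Module):
    """Illustrative GCA block (assumed structure, not the published code):
    image features attend to text features, then a sigmoid gate controls
    how much of the attended text evidence is mixed back in."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, txt_feat):
        # Cross-attention: image tokens query the text tokens.
        attended, _ = self.attn(img_feat, txt_feat, txt_feat)
        # Sigmoid gate in [0, 1] suppresses weakly correlated (noisy) evidence.
        g = self.gate(torch.cat([img_feat, attended], dim=-1))
        return self.norm(img_feat + g * attended)

img = torch.randn(2, 49, 768)   # e.g., 7x7 ResNet feature-map tokens
txt = torch.randn(2, 30, 768)   # pad_size = 30 BERT token features
out = GatedCrossAttention()(img, txt)  # same shape as the image tokens
```

Because the gate saturates toward 0 for weakly associated inputs, swapping Sigmoid for Tanh or ReLU changes the gate's range and is exactly the variation the ablation in Table 6 probes.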
As shown in Table 7, in the Health category, most combinations demonstrated high performance (e.g., Sigmoid + 4 with Precision: 99.52%, Recall: 91.96%, F1 score: 95.59%). However, the Sigmoid + 8 combination further optimizes semantic feature extraction and multimodal information fusion through the synergy of the Sigmoid activation function and 8-head cross-attention, ultimately achieving 100% Precision, Recall, and F1 score, indicating extremely strong stability and accuracy in identifying healthy samples. The Cocci category exhibits significant shortcomings under certain combinations: Tanh + 4, for example, has a Recall of only 86.86% and an F1 score as low as 92.07%, reflecting severe missed detection of cocci infection samples. The Sigmoid combinations perform better in this category, with Sigmoid + 8 maintaining 100% Precision while raising Recall to 92.70% and achieving an F1 score of 96.21%, significantly reducing the risk of missed detection and effectively balancing identification accuracy and coverage. For the Ncd category, some combinations exhibit an imbalance between Precision and Recall: ReLU + 4, for example, has a Recall of only 88.58% despite a high Precision of 94.94%, and its F1 score of 91.65% indicates insufficient identification stability. The Sigmoid + 8 combination achieves 100% Recall in this category while improving Precision to 91.84% and reaching an F1 score of 95.74%, ensuring comprehensive capture of Newcastle disease samples while significantly reducing misclassifications. The Salmo category already performs well under most combinations, but Sigmoid + 8 still demonstrates a clear advantage.
In contrast, the Precision of ReLU + 4 is only 89.32%, and while Tanh + 4 has a high Recall (99.61%), its Precision is relatively low (91.49%). Meanwhile, Sigmoid + 8 achieves a perfect 100% in Precision, Recall, and F1 score for this category, fully validating its powerful capability in distinguishing subtle pathological features such as Salmonella infection.
Overall, the Sigmoid + 8 combination achieved optimal or near-optimal performance across all four categories, particularly achieving “zero misclassifications and zero false negatives” in the Health and Salmo categories while effectively balancing recognition accuracy and coverage in the Cocci and Ncd categories. With an overall Accuracy of 97.96%, it emerged as the best choice among all combinations, fully demonstrating the synergistic advantages of the Sigmoid activation function and 8-head cross-attention in the task of multi-modal classification of sick chicken feces.
3.3.2. Ablation Experiments for Each MMCD Module
In order to thoroughly assess the contributions and synergistic effects of each component within the MMCD model, this study implemented a systematic ablation experiment. The process commenced with a basic Concat concatenation model, referred to as ResBERT-C. Subsequently, the DSconv module, the Manhattan attention mechanism (MASA), and the Gated Cross-Attention fusion module (GCA) were sequentially incorporated. This iterative integration culminated in the complete MMCD bimodal fusion model, facilitating analysis of the impact of each module on the model’s performance. The results of these experiments are presented in Table 8 and Table 9.
It can be seen from Table 8 that the base model ResBERT-C has the weakest performance on all indicators, indicating that it cannot fully extract and fuse the information of the image and text modalities. After the introduction of DSconv, the F1 score increased by 1.37 percentage points while the parameter count fell by 4.2% (58.18 M vs. 60.71 M) at a similar computational cost (3.68 G vs. 3.67 G). This shows that the Depthwise Separable convolution effectively improves the efficiency of feature representation by focusing on key feature regions, verifying the necessity of the local feature screening mechanism. After adding the MASA module (ResBERT + MASA), although the parameter count increased by 16.5% (70.74 M vs. 60.71 M), the F1 score improved to 97.04% and the Accuracy to 97.05%, which verifies the effectiveness of MASA in feature re-aggregation and its ability to capture cross-level semantic associations. With the introduction of the GCA module (ResBERT + GCA) at a similar parameter scale (71.75 M vs. 70.74 M), GCA obtains better parameter efficiency than MASA and improves the F1 score to 96.94%. Combining two or more modules further improves performance; for example, the Accuracy of the ResBERT + MASA + GCA model is 96.95%, close to optimal. The MMCD model, fully fusing all three modules, achieves the best performance on all indicators with only 64.25 M parameters and a computation of 3.80 G, showing a superior balance between accuracy and computational efficiency.
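The parameter savings attributed to DSconv come from the standard factorization of a convolution into a per-channel (depthwise) spatial filter followed by a 1 × 1 pointwise mix. A minimal sketch of the building block (the channel count of 256 is illustrative, not taken from the paper's architecture):

```python
import torch
from torch import nn

def dsconv(in_ch: int, out_ch: int, k: int = 3) -> nn.Module:
    """Depthwise separable convolution: per-channel spatial conv
    followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1),                               # pointwise
    )

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(256, 256, 3, padding=1)
separable = dsconv(256, 256)
# For 256 -> 256 channels with a 3x3 kernel, the separable version uses
# roughly 8.6x fewer parameters (68,352 vs. 590,080, biases included),
# which is where the ablation's parameter reduction comes from.
```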
As can be seen from Table 9, in the Health category the baseline model already performed well (Precision: 99.10%, Recall: 98.66%, F1 score: 98.88%). The MASA module further optimized semantic consistency and spatial attention, and the full MMCD model finally achieved 100% Precision, Recall, and F1 score, indicating strong generalization in identifying healthy samples. Under the baseline model, the Recall of the Cocci category was only 84.67% and the F1 score was correspondingly low, reflecting missed detections. After introducing the DSconv and MASA modules, Recall rose stably above 91% and the F1 score increased in step, significantly reducing the risk of missed detection. Finally, the MMCD model raised Recall to 92.70% while maintaining 100% Precision, effectively balancing recognition ability against the risk of misjudgment. For the Ncd category, although the baseline model’s Recall reached 100%, its Precision was only 88.24%, indicating many misjudgments. After introducing the MASA module, the model’s ability to discriminate this category was significantly enhanced, improving both Precision and F1 score. The MMCD model improves Precision to 91.84% and the F1 score to 95.74% while maintaining a high Recall, achieving more stable and accurate recognition of this disease. The Salmo category already performed well under the baseline model, but there was still room for module-level improvement. After introducing MASA and GCA, the Precision and Recall of the Salmo category both rose, and the final MMCD model achieved 100% Precision, Recall, and F1 score, fully verifying its advantage in handling subtle pathological differences.
In summary, the ablation experiments clearly reveal the functional role of each component module in the MMCD model and its contribution to overall performance; the performance gains brought by each module differ across categories. The combination of MASA and GCA shows significant synergistic advantages in multimodal scenarios. Although DSconv improves convolution efficiency, it must be introduced carefully into the fusion structure to avoid redundancy. The MMCD model, built from the organic combination of DSconv, MASA, and GCA, shows stable and excellent classification performance in all categories, indicating that the synergy of multi-level semantic alignment and multi-modal feature fusion is a key path to high-precision avian disease diagnosis. The complete MMCD model shows robustness and high accuracy in multi-class recognition, verifying the technical effectiveness of the multi-modal collaborative design strategy in this paper.
As shown in Figure 11, the model’s misclassifications occurred in the Health and Ncd categories, while all other categories were correctly classified. Analysis of the image information and text descriptions of the misclassified images in Table 10 shows that the misclassifications were due to feature similarity between the two categories in some samples, which are prone to misjudgment under low-light or blurry conditions. Additionally, the boundary features in Ncd images are less distinct than in the other categories: Health samples exhibit a “long, narrow shape”, while some Ncd samples are “irregularly clustered”. When Ncd feces take on a long, narrow shape, the boundary between them and the Health category becomes blurred. Keywords such as “white components” and “green” appear in both categories, so if image features are unclear, the discriminative power of textual semantics is weakened. Overall, however, the dual-modal fusion model for detecting pathological chicken feces still achieves a significant improvement in classification accuracy compared with single-modal image classification.
The experimental findings presented above indicate that the integration of a Gated Cross-Attention (GCA) module is effective in fusing image and text modalities, thereby enhancing stability and generalization capabilities. In contrast to simple concatenation, the gating mechanism allows for the dynamic adjustment of weights assigned to different modal information, which enables the model to more flexibly leverage pertinent information during classification tasks. Concurrently, the cross-attention mechanism adeptly captures semantic associations between image and text modalities, resulting in a notable improvement in classification accuracy. Nonetheless, despite the observed enhancements across various metrics with cross-modal Transformer encoders, the increased number of layers contributes to heightened computational complexity, which can lead to suboptimal performance in certain categories. This shows that a balance between model complexity and performance needs to be found in multimodal tasks.
4. Discussion
In this study, the three major challenges in poultry fecal pathology detection, namely weak feature expression, modal limitations, and environmental interference (as described in the Introduction), were effectively addressed by constructing a multimodal fusion model, MMCD. Experiments demonstrate that the bimodal strategy of fusing visual and textual semantics significantly improves the discriminative ability for pathological features. The core value of this study is reflected in three aspects. First, technological innovation and performance breakthrough: integrating the MASA mechanism with DSconv in the ResNet50 backbone successfully mitigates the feature confusion caused by the high visual similarity of feces, while reducing the model’s parameters by 7.5 M and its computation by 1.62 G, providing a lightweight foundation for agricultural edge computing scenarios. Extracting textual semantic features (e.g., breeding records, symptom descriptions) with a pre-trained BERT compensates for the inability of a single image modality to characterize weak pathological features; this reduces the dependence on manual annotation and addresses the problem of the “fuzzy boundary of latent features” raised in the Introduction. The proposed Gated Cross-Attention (GCA) module effectively suppresses environmental noise such as litter color and uneven lighting through dynamic weighted screening of key pathological features, reduces the number of parameters by 41% compared with the cross-modal Transformer, breaks through the information redundancy bottleneck of existing methods [17,18,19,20,21,22,23], and improves classification Accuracy over Concat by 2.51–2.82%, demonstrating the superiority of adaptive fusion. Second, performance: compared with traditional unimodal methods, the model in this study has advantages in Accuracy (+8.69%), Recall (+8.72%), Precision (+8.67%), and F1 score (+8.72%), confirming the ability of multimodal fusion to resolve the “high visual similarity of fecal matter of multiple etiologies” (as described in the Abstract); furthermore, through the image degradation augmentation strategy, the model maintains stable performance in complex farming environments, directly addressing the challenge that “environmental disturbances reduce model robustness” emphasized in the Introduction. Third, contributions to the field and paradigm innovation: for the first time, the image–text fusion paradigm is introduced into livestock and poultry pathology detection, extending the application boundary of multimodal learning in agricultural health monitoring. Compared with existing disease studies [17,18,19,22], this study verifies the enhancement effect of the text modality on animal micro-pathological features. By improving the timeliness and accuracy of fecal pathology identification (the core objective of the Abstract), this work provides technical support for reducing antibiotic misuse: early and accurate diagnosis can reduce the reliance on drugs for the prevention and control of more than 80 diseases of laying hens [3], directly addressing the problem of “health risks caused by drug residues” emphasized in the Introduction. The lightweight design of the GCA gives the model the potential to be deployed at the farm edge, promoting intelligent monitoring systems from “sensing” toward “decision making”.
Despite the excellent performance of MMCD, challenges remain for scaled-up application. First, the text modality relies on expert descriptions, and semantic subjectivity may cause fluctuations in model discrimination. In the future, automatic caption generation [32] based on pre-trained vision–language models (e.g., BLIP [33]) could be explored, converting fecal images into standardized text through model fine-tuning to reduce reliance on expert descriptions and eliminate manual description bias. Second, in complex farming scenarios, the model’s computational footprint (64.25 M parameters / 3.80 G) still struggles to meet the real-time deployment requirements of low-power devices, and its efficiency needs further optimization for low-latency edge deployment. A cross-modal knowledge distillation framework [34] could be designed to compress the bimodal knowledge into a lightweight network (e.g., MobileNetV3 [35]) to improve inference speed at the edge. Third, feature similarities between certain categories (e.g., “Health” and “Ncd”) are easily misjudged in low-light or blurred images, which raises the risk of misdetection. In the future, depth sensors [36] could be introduced to construct an RGB-Depth-Text multimodal system, using depth information to resolve the geometric features (e.g., feces height and density) of targets that overlap in 2D vision and thus alleviate inter-class confusion.
5. Conclusions
The health status of chickens can be evaluated through an analysis of fecal characteristics. Chickens in poor health often display notable deviations in fecal color, shape, and texture, including darkened coloration, a thin consistency, or the presence of atypical components. Nonetheless, reliance solely on visual assessment may not provide a comprehensive reflection of the chickens’ health status, particularly when features are subtle or overlapping, potentially resulting in inaccurate diagnostic outcomes. Furthermore, in practical farming settings, variables such as the performance of data collection devices, lighting conditions, and surface reflections on feces can contribute to image data that are characterized by low resolution, elevated noise levels, and blurred focus.
To solve the above problems, this paper constructs a low-quality dataset under different illumination and noise conditions and proposes an innovative diagnosis model for sick chicken feces that fuses image and text information. Compared with using computer vision alone to extract feces from images for classification and recognition, the accuracy is greatly improved and the tedious data labeling process is significantly simplified. Specifically, this paper uses the improved ResNet network to extract fecal image features, combines them with text description features extracted by the BERT model, and realizes efficient information integration through the multimodal fusion method GCA, reducing the cost of manually cropping and labeling images during training while ensuring good diagnostic performance.
The experimental results show that, compared with the single-modal models, the proposed dual-modal fusion model (MMCD) delivers a significant improvement in diagnostic performance, especially in evaluation indicators such as Accuracy, Recall, and F1 score. Although the cross-modal Transformer encoder also contributes positively to feature fusion, its overall performance fails to surpass the GCA fusion method owing to its high computational complexity. The Gated Cross-Attention (GCA) method further optimizes the information interaction between modalities, significantly improving the stability and generalization ability of the model. The improved dual-modal fusion model reached an Accuracy of 97.96% for the diagnosis of sick chicken feces, and its indicators exceeded those of the basic ResNet50 by 11.61%, 12.38%, 12.24%, and 12.38%, respectively. Compared with the improved ResNet50, the proposed model’s indicators increased by 8.69%, 8.72%, 8.67%, and 8.72%; compared with simple Concat concatenation, by 2.51%, 2.75%, 2.82%, and 2.75%; and compared with the cross-modal Transformer encoder, by 2.69%, 2.85%, 2.87%, and 2.85%. Compared with the basic ResNet50, the MMCD model using the improved ResNet50 reduced the parameter count by 7.5 M and the computation by 1.62 G.