YOLOv7 Optimization Model Based on Attention Mechanism Applied in Dense Scenes

With the development of computer vision and target detection technology, real-time detection of dense scenes has become an important application requirement in various industries, which is of great significance for improving production efficiency and ensuring public safety. However, the current mainstream target detection algorithms have problems such as insufficient accuracy or inability to achieve real-time detection when detecting dense scenes.


Introduction
Object detection of dense small targets has long been a challenging topic in computer vision. In dense scenes, many targets need to be detected, the background is complex, and occlusion is common, all of which greatly reduce accuracy. At the same time, practical applications such as mask recognition in high-traffic scenarios and crop fruit detection still rely on human inspection, where the workload is large, the recognition accuracy is unstable, and the execution efficiency is low. A high-accuracy object detection method for dense scenes is therefore of great research significance.
Deep-learning-based object detection can be divided into two main types: one-stage regression-based methods and two-stage methods based on the RPN (Region Proposal Network) [1]. The one-stage algorithm starts from the original YOLO (You Only Look Once) [2] and gradually develops into SSD (Single Shot MultiBox Detector) [3], YOLOv2 [4], RetinaNet [5], YOLOv3 [6], etc. It takes the whole image directly as the network input and obtains the position of the bounding box and the class of the target with only one forward propagation. The one-stage algorithm is fast in detection but suffers from low accuracy and poor detection of small objects. The two-stage algorithm starts from the original R-CNN [7] and gradually develops into SPPNet (Spatial Pyramid Pooling) [8], Fast-RCNN [9], and Faster-RCNN [10]. Two-stage algorithms first generate region proposals by heuristic methods (selective search [11]) or CNN networks (RPN) and then classify those proposals. Although their accuracy is higher than that of one-stage algorithms, they repeat a large amount of feature computation and train slowly. As the one-stage algorithm has been continuously optimized, its drawbacks have been compensated for in YOLOv4 [12], YOLOv5 [13], YOLOv6 [14], and YOLOv7 [15], the later members of the YOLO series.
Most previous studies have used older model versions, and suitable application scenarios for the latest YOLOv7 have not yet been established. This paper therefore improves the YOLOv7 deep learning model and proposes the YOLOv7B-CBAM network model for small target detection in dense scenes. The main innovations and contributions of this paper are as follows:
(1) An improved YOLOv7B-CBAM model based on the attention mechanism is proposed, which uses the CBAM attention mechanism to enhance the YOLOv7 model and achieve high-precision, real-time detection in complex and dense scenes.
(2) Three improved YOLOv7 models based on different attention mechanisms are compared on the VOC dataset, and the YOLOv7B-CBAM model achieves the highest accuracy, which demonstrates the superiority of the proposed model in accuracy.
(3) Real-time, high-accuracy detection on two different datasets demonstrates the generalization and applicability of the proposed model in different complex scenes.

Related Work

Computer Vision and Deep Learning
Object detection in images, one of the most actively developed areas of computer vision, is now widely used in many fields. In the field of defect detection, Ref. [16] reviewed product defect detection based on ultrasonic detection, filtering, and computer vision, and gave a detailed analysis of defect classification, feature description, etc. In [17], real-time detection of surface defects on arc magnets was achieved by applying transfer learning to the lightweight YOLOv5s, which guarantees high defect detection accuracy even when training on a small number of samples. For UAV (unmanned aerial vehicle) datasets, the problems still to be solved in target detection and classification are summarized in [18], which illustrates the wide application of deep-learning-based computer vision and object detection in the UAV domain. In the field of plant and pest detection, four different models are compared on a pine insect pest dataset, and a hybrid model is proposed that can be applied to monitoring and predicting various insect species in agriculture and forestry [19]. The Faster DR-IACNN, which has higher accuracy, is proposed to achieve real-time detection and provide guidance for grape leaf disease detection and other plant pest fields [20]. These studies show that object detection and computer vision based on deep learning have significant advantages in practical applications, saving manpower and simplifying the recognition process.

YOLO
The YOLO series is among the fastest and most accurate families of one-stage object detection algorithms, and its continuous development is the main reason it has stayed mainstream. YOLO appeared at the beginning of 2016, and YOLOv2, released at the end of the same year, added BN (batch normalization) layers and predicted bounding boxes using anchor boxes. In 2018, YOLOv3 adopted an FPN (Feature Pyramid Network) structure with upsampling and a deeper backbone. YOLOv4 appeared in April 2020 and added the SPP and PAN (Path Aggregation Network) [21] structures, while YOLOv5, which appeared in June, reduced the model size by 90% compared to YOLOv4 with equivalent accuracy. The series continued with YOLOX [22], YOLOv6, and YOLOv7, which uses E-ELAN. The practical usefulness of the YOLO series has been demonstrated since YOLOv3.
YOLOv3 and YOLOv4 have been widely used in various fields. In [23], the proportional and scale-aware YOLO method is proposed, which addresses the detection of objects with large aspect ratio differences, such as the human body, as well as the detection of smaller objects, and performs well on VOC 2012 and on pedestrian detection. However, YOLO models based on older versions are less accurate than the YOLOv5 and YOLOv7 models widely used today. In [24], experiments on Camellia oleifera fruit detection showed that the average accuracy of YOLOv7 was better than that of YOLOv5s. In those experiments, YOLOv7 was also more accurate than YOLOv5s in detecting obscured fruits, which supports the superiority of the YOLOv7 algorithm. For hat and mask recognition in complex kitchen scenarios, an embedded model using YOLOv5s achieved real-time detection with an accuracy of 85.7% [25]. These studies indicate that the YOLO series performs well in dense scenes and that YOLOv7 is more accurate on small targets.

Attention Mechanism
One-stage algorithms have high detection speed, but the accompanying low accuracy has always been their shortcoming. The attention mechanism, in contrast, extracts features in a way that significantly improves recognition and classification accuracy, and it has consistently produced good results when used to improve YOLO models. The attention mechanism first drew wide notice when it was used on RNNs in [26], and it was first introduced into the image field in [27]. Later, the CNN-based attention mechanism RA-CNN was proposed in [28], and a variety of attention mechanisms subsequently emerged, such as channel attention, spatial attention, and self-attention [29]. SE-Net, proposed in [30], brought the channel attention mechanism to wide attention; it mainly models the correlation between different channels and was followed by ECA-Net [31], GCT [32], and so on. The spatial attention mechanism, starting from STN [33], is mainly used to improve the feature expression of key regions by enhancing specific target regions and weakening irrelevant background regions; GE-Net [34] was developed later. In recent years, hybrid attention mechanisms that combine channel attention and spatial attention have been most widely used, mainly CBAM [35] (Convolutional Block Attention Module), BAM [36], scSE [37], DANet [38], CA [39] (coordinate attention), etc. In [40], improving YOLOv5s with the CA mechanism produced a model 30% smaller than the original while still maintaining good detection accuracy. The CAM and parallel residual attention blocks were used in [41,42] to improve the accuracy of the then most accurate models for vehicle model recognition and human pose estimation, respectively. These results indicate that mixed attention can effectively improve the robustness of the network as well as its accuracy in practical applications. In this paper, several attention mechanisms, mainly mixed attention mechanisms, are used to modify the model, and comparison experiments are conducted with CBAM, CA, and SimAM [43] (simple, parameter-free attention module).

Research Status and Application Analysis of Object Detection Based on YOLO
The above research indicates that the YOLO series has advantages over other models in dense scenes and that the YOLOv7 model has high accuracy in detecting small targets. Therefore, this paper uses the YOLOv7 model to solve the problem of small target detection in dense scenes. In addition, the mixed attention mechanism performs well in YOLOv5 and can effectively improve the model's accuracy. Given the limitations of the original YOLOv7 model, this paper makes several improvements to it in order to meet the requirements of a high detection rate and a low false detection rate in dense scenes and to improve detection accuracy. Starting from the original YOLOv7 network, some of the traditional convolutional layers are replaced with standard convolution combined with the attention mechanism to take full advantage of the semantic information in the features. To find the attention mechanism that gives the best accuracy, this paper selected CBAM, CA, and the newer SimAM to modify the model, and found that CBAM performed best.

Convolutional Block Attention Module
CBAM is a lightweight module that combines channel attention and spatial attention and was proposed with SENet as its benchmark. It greatly improves model performance at the cost of only a small increase in computation and parameters. CBAM emphasizes meaningful features along two dimensions, channel and spatial, which correspond to the channel attention and spatial attention modules, respectively. In CBAM, the input intermediate feature map F is first passed through the one-dimensional channel attention module M_c to obtain the intermediate output F'. The final result F'' is then obtained through the two-dimensional spatial attention module M_s, which proceeds as follows:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

where ⊗ denotes element-by-element multiplication. The channel attention module and spatial attention module are defined as follows:

$$M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$$

$$M_s(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]))$$

where σ denotes the sigmoid function, MLP indicates a multi-layer perceptron, and f^{7×7} is a convolution operation with a 7 × 7 kernel. The specific CBAM structure is shown in Figure 1. CBAM makes the network focus on important features of the input intermediate feature map, suppresses unnecessary features, and reduces the noise from irrelevant clutter, so that the network attends to the object more correctly. In this paper, CBAM takes the place of two initial conventional convolutional layers in the backbone part of the YOLOv7 network, which effectively extracts the image features and significantly improves the accuracy of the model. The modified model structure is shown in Figure 2; the red box on the left is the backbone part of the YOLOv7 model, and the red box on the right is the head part. The location of the added CBAM attention module is marked with a yellow color block in the figure, where "CBS, 3/1, 64" refers to the original convolutional layer with a convolutional kernel size of 3, a stride of 1, and a channel count of 64.
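The channel-then-spatial ordering described in this section can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the paper: the function names, the tiny random weights, the reduction ratio of 2, and the nested-list tensor layout [C][H][W] are all assumptions made for illustration, and a real model would use a deep learning framework with trained weights.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x, w1, w2):
    """M_c: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))) for x shaped [C][H][W]."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(map(max, ch)) for ch in x]

    def mlp(v):  # shared two-layer perceptron with ReLU in between
        hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(x, kernel):
    """M_s: sigmoid(conv_kxk([AvgPool; MaxPool])), pooling over the channel axis."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(kernel[0])
    pad = k // 2
    pooled = [
        [[sum(x[n][i][j] for n in range(c)) / c for j in range(w)] for i in range(h)],
        [[max(x[n][i][j] for n in range(c)) for j in range(w)] for i in range(h)],
    ]
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding at borders
                            acc += kernel[m][di][dj] * pooled[m][ii][jj]
            out[i][j] = sigmoid(acc)
    return out


def cbam(x, w1, w2, kernel):
    """F' = M_c(F) * F, then F'' = M_s(F') * F' (element-wise products)."""
    mc = channel_attention(x, w1, w2)
    f1 = [[[mc[n] * v for v in row] for row in x[n]] for n in range(len(x))]
    ms = spatial_attention(f1, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in f1]


# Demo with random, untrained weights (C=4, H=W=8, reduction ratio 2; all hypothetical).
random.seed(0)
C, H, W, R, K = 4, 8, 8, 2, 7
x = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]
w2 = [[random.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]
kernel = [[[random.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(K)] for _ in range(2)]
y = cbam(x, w1, w2, kernel)
print(len(y), len(y[0]), len(y[0][0]))  # same shape as the input: 4 8 8
```

Because both attention maps lie in (0, 1), the module only rescales features and never changes the tensor shape, which is why it can stand in for a layer without altering the surrounding network dimensions.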

Coordinate Attention
CA is a lightweight mechanism proposed to address the shortcomings of the SE and CBAM attention mechanisms; it enables networks to obtain information over a larger range by embedding location information into channel attention. To avoid compressing all spatial information and to capture accurate location information, the traditional channel attention is decomposed into two one-dimensional global poolings, which extract spatial information along the horizontal and vertical directions, respectively, and encode it. Specifically, the input feature maps of size (C, H, W) are pooled along the x and y directions instead of being globally pooled. Each channel is first pooled along the horizontal and vertical coordinates with kernels of size (H, 1) and (1, W), respectively, and the outputs of the c-th channel at height h and width w are, respectively:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

The two pooled results are concatenated and transformed by a 1 × 1 convolution, the result is split into two independent tensors f^h and f^w, a further 1 × 1 convolution on each tensor restores the channel dimension, and, finally, the attention vectors are obtained with the sigmoid activation function:

$$f = \delta(f^{1 \times 1}([z^h, z^w])), \qquad g^h = \sigma(f_h^{1 \times 1}(f^h)), \qquad g^w = \sigma(f_w^{1 \times 1}(f^w))$$

from which, finally, the output Y is obtained, and the whole process can be summarized as:

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
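As an illustration of the directional pooling and re-weighting steps of coordinate attention, the following pure-Python sketch pools a tiny [C][H][W] tensor along the width and the height and then rescales it. The function names are hypothetical, and the learned 1 × 1 convolutions are deliberately replaced by a plain sigmoid on the pooled values, which is a simplifying assumption and not the CA module itself.

```python
import math


def directional_pool(x):
    """Return (z_h, z_w) for x shaped [C][H][W].

    z_h[c][h] averages row h over the width;
    z_w[c][w] averages column w over the height.
    """
    z_h = [[sum(row) / len(row) for row in ch] for ch in x]
    z_w = [[sum(row[j] for row in ch) / len(ch) for j in range(len(ch[0]))] for ch in x]
    return z_h, z_w


def reweight(x, g_h, g_w):
    """y_c(i, j) = x_c(i, j) * g_h[c][i] * g_w[c][j]."""
    return [
        [[v * g_h[c][i] * g_w[c][j] for j, v in enumerate(row)] for i, row in enumerate(ch)]
        for c, ch in enumerate(x)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]  # C=1, H=2, W=3
z_h, z_w = directional_pool(x)
print(z_h)  # [[2.0, 5.0]]
print(z_w)  # [[2.5, 3.5, 4.5]]

# Stand-in for the learned 1x1 convolutions (an assumption): gate on the pooled values.
g_h = [[sigmoid(v) for v in row] for row in z_h]
g_w = [[sigmoid(v) for v in row] for row in z_w]
y = reweight(x, g_h, g_w)
```

Unlike global pooling, which would reduce this channel to the single value 3.5, the two directional pools keep one value per row and one per column, so the position of a strong response along each axis is preserved.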

Simple, Parameter-Free Attention Module
SimAM is an attention module for convolutional neural networks that is based on neuroscientific knowledge. Unlike the existing one-dimensional channel attention and two-dimensional spatial attention, SimAM does not use traditional pooling but assigns weights using the solution of an energy function derived from neuroscience theory and the principle of linear separability. It therefore adds no parameters and acts as a novel three-dimensional weighted attention mechanism. The energy function is as follows:

$$e_t(w_t, b_t, y, x_i) = \frac{1}{M - 1} \sum_{i=1}^{M-1} \left(-1 - (w_t x_i + b_t)\right)^2 + \left(1 - (w_t t + b_t)\right)^2 + \lambda w_t^2$$

and its analytical solution is as follows:

$$w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}, \qquad b_t = -\frac{1}{2}(t + \mu_t) w_t$$

where

$$\mu_t = \frac{1}{M - 1} \sum_{i=1}^{M-1} x_i, \qquad \sigma_t^2 = \frac{1}{M - 1} \sum_{i=1}^{M-1} (x_i - \mu_t)^2$$

and, thus, the minimum energy can be obtained by the following equation:

$$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$$

where the importance of each neuron is given by 1/e_t^*, and the features are enhanced as defined by the attention mechanism, resulting in the following formula:

$$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X$$

In this paper, SimAM is used in the same way as CA: a conventional convolutional layer in the MP2 module of the YOLOv7 head section is replaced by SimAM attention. Since SimAM adds no parameters, the number of parameters is actually reduced compared to the original network. The specific structure is shown in Figure 4. The dotted boxes with yellow blocks in the figure show the modified structure of MP2.
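A minimal sketch of the parameter-free weighting may help make the energy-based idea concrete. It assumes the closed-form inverse energy 1/e* = ((t − μ)² + 2σ² + 2λ) / (4(σ² + λ)) computed per channel, with the mean and variance taken over all H × W positions; the function names, the default λ = 1e-4, and the nested-list layout are assumptions for illustration, not the paper's code.

```python
import math


def simam_weights(channel, lam=1e-4):
    """Return sigmoid(1 / e*) for every position of one [H][W] channel."""
    vals = [v for row in channel for v in row]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)

    def weight(t):
        # Inverse of the minimum energy: large when t stands out from its channel.
        inv_energy = ((t - mu) ** 2 + 2 * var + 2 * lam) / (4 * (var + lam))
        return 1.0 / (1.0 + math.exp(-inv_energy))

    return [[weight(t) for t in row] for row in channel]


def simam(x, lam=1e-4):
    """Rescale every element of x ([C][H][W]) by its own attention weight."""
    out = []
    for ch in x:
        wts = simam_weights(ch, lam)
        out.append([[t * w for t, w in zip(row, wrow)] for row, wrow in zip(ch, wts)])
    return out


channel = [[0.0, 0.0], [0.0, 4.0]]
w = simam_weights(channel)
print(w[1][1] > w[0][0])  # True: the outlier position receives the larger weight
y = simam([channel])
```

The only quantities involved are per-channel statistics of the feature map itself, which is why the module introduces no learnable parameters.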

Summary
The attention mechanism is used in this paper mainly to solve the current problems of the original YOLOv7 model, such as the difficulty in guaranteeing accuracy in dense scenes, the lack of accuracy in detecting small targets, and the difficulty in distinguishing between targets. Three different attention mechanisms, CBAM, CA, and SimAM, are used to improve the YOLOv7 model and are compared experimentally. To ensure the rigor of the comparison, CBAM was also tested in the head part after good results were obtained in the backbone part. It was finally found that the YOLOv7 model improved by replacing convolutional layers in the backbone part with CBAM has the highest accuracy of all the models tested; the test results are given in Section 4.

The experimental datasets used in this paper are Pascal VOC2007 and VOC2012, whose 21,493 images are divided into 20 categories, including people, cats, cows, cars, buses, bicycles, sofas, TVs, bottles, etc. Among them, 50% of the "Pascal VOC 2007" images and all of the "Pascal VOC 2012" images are used as the training set, and the remaining 50% of the "Pascal VOC 2007" images are used as both the test set and the validation set. For all datasets in this paper, training runs for 400 epochs with a batch size of 8.

Comparative Experiment on Attention Mechanisms
In this paper, the effectiveness of the above three attention mechanisms in improving model accuracy is tested through comparative experiments. Three metrics, mAP, parameter size, and GFLOPs (giga floating-point operations), were compared in the same experimental setting using an image size of 640 × 640. The results are summarized in Table 2. "YOLOv7B-CBAM" denotes the use of CBAM to replace convolutional layers in the backbone, and "YOLOv7H-CA" denotes the use of CA to replace a convolutional layer of the MP2 module in the head. The experimental findings show the following:
(1) In terms of model accuracy, the best results are obtained using CBAM attention, with a 1.0% improvement over the original YOLOv7 model when replacing the head part and the largest improvement, 1.5%, when replacing the backbone part. The other two attention mechanisms also improve the model accuracy, but YOLOv7B-CBAM is 0.3% and 0.1% more accurate than YOLOv7H-CA and YOLOv7H-SimAM, respectively, indicating that they do not improve the model accuracy as much as CBAM.
(2) In terms of model size, the three attention mechanisms have only minor effects on the number of parameters and operations. Only the model proposed in this paper increases computation, by 0.2 GFLOPs relative to the original YOLOv7 model, which is equivalent to an increase of about 0.19% and is almost negligible. Since the main requirement of this paper is improved accuracy, YOLOv7B-CBAM has the best overall performance in the experimental results.
(3) To ensure the rigor of the results, the YOLOv7B-CA and YOLOv7B-SimAM models, which use attention mechanisms to replace layers in the backbone, are also tested in this part. The accuracies of these two models are 0.829 and 0.827, which are 0.049 and 0.051 lower than that of YOLOv7B-CBAM, and 0.034 and 0.036 lower than that of the original YOLOv7 model. In terms of model size, these two models are no better than YOLOv7B-CBAM or the original YOLOv7 model. The YOLOv7B-CA and YOLOv7B-SimAM models therefore perform poorly, which further indicates the superiority of YOLOv7B-CBAM, and they are not studied further in this paper.
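The accuracy figures quoted for the backbone variants can be cross-checked with simple arithmetic. The short sketch below derives the accuracies they imply for YOLOv7B-CBAM and the original YOLOv7; these implied values are inferred from the differences stated in the text, not copied from Table 2, and the variable names are illustrative.

```python
# Accuracies stated in the text for the two backbone variants.
ca_b, simam_b = 0.829, 0.827          # YOLOv7B-CA, YOLOv7B-SimAM
gap_to_cbam = (0.049, 0.051)          # how far each falls below YOLOv7B-CBAM
gap_to_base = (0.034, 0.036)          # how far each falls below original YOLOv7

# Both variants should imply the same reference accuracies.
cbam_implied = {round(ca_b + gap_to_cbam[0], 3), round(simam_b + gap_to_cbam[1], 3)}
base_implied = {round(ca_b + gap_to_base[0], 3), round(simam_b + gap_to_base[1], 3)}
print(cbam_implied)  # {0.878}
print(base_implied)  # {0.863}

# Implied gain of YOLOv7B-CBAM over the original model, in absolute terms.
gain = round(0.878 - 0.863, 3)
print(gain)  # 0.015
```

The implied gain of 0.015 agrees with the 1.5% improvement reported for replacing the backbone part, so the quoted differences are internally consistent.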

Application
As an important application in the field of computer vision, object detection has wide application prospects because it can identify and locate objects in images or videos and reduce labor costs. The current practical needs for real-time detection of dense scenes are mainly in public safety, where people must be monitored in real time in airports, stations, subways, and other crowded places, and in agriculture, where crops and animals must be detected and localized in real time to achieve intelligent agricultural management and precision agricultural production. Meeting these needs is of great significance for improving production efficiency and ensuring public safety. Achieving efficient and accurate real-time detection of dense scenes is therefore a current challenge for computer vision and object detection. Small object detection in dense scenes faces the following difficulties:
(1) High density: there are a large number of objects in the scene, and their mutual occlusion and overlap increase the difficulty of detection.
(2) Small objects: small objects occupy few pixels and are difficult to detect and localize accurately in the image.
(3) Diversity: objects in dense scenes may differ in class, shape, color, texture, and other features, which increases the difficulty of training and testing the model.
In this case, achieving fast real-time detection is a challenging task. The YOLOv7B-CBAM model, whose accuracy has been verified on the VOC dataset, is tested on a face mask dataset and a tomato dataset to confirm its applicability in the public safety and agricultural domains, respectively. Both datasets are open-source datasets from the Kaggle website, and the image size is uniformly 640 × 640 × 3 during training.

Detection in Tomato Dataset
Fruit detection has been widely studied in the field of object detection. For example, [44] systematically summarizes the development of picking robots, emphasizes the importance of fruit identification techniques, and analyzes the characteristics of different target detection techniques to illustrate the feasibility and necessity of applying them in picking robots. In [45], the problems encountered when using the Nano Aerial Bee (NAB) in agricultural environments and the directions for solving them are investigated, and the effectiveness of YOLOv7 in agricultural environments is tested on a flower detection dataset, demonstrating the robustness of YOLOv7 and the feasibility of its application. In fruit detection for picking robots, real-world scenes are often complex, with many small fruit and vegetable targets and a large amount of occlusion, which results in poor accuracy and difficulty in identifying fruits. Fruit detection therefore requires high model accuracy and robustness, which the original object detection model can hardly provide. This paper verifies the effectiveness of the above model in the application scenario of tomato fruit detection in a dense scene.
To verify the superiority of the YOLOv7B-CBAM model in dense scenarios, we completed comparative experiments on the tomato dataset.There are 895 images in the dataset, including 695 images for training, 95 images for validation, and 105 images for testing.The hardware and software environments for the experiments are consistent with those described in Section 4.
Figure 5 shows original images from the tomato dataset, and Figure 6 shows the same images after detection with the YOLOv7B-CBAM model. The occluded tomato in the lower left corner is accurately identified, and small objects in the dense scene in the upper right image are also recognized well. Table 3 shows the results obtained on this dataset for YOLOv5s, the original YOLOv7 model, and the YOLOv7 models optimized with the attention mechanism. The following conclusions can be drawn from the analysis results.
As shown in Figure 7, among the YOLOv7 models with an added attention mechanism, the proposed YOLOv7B-CBAM model has the best accuracy, which is 0.6% higher than that of the original YOLOv7 model and 0.9% higher than that of YOLOv7H-CBAM. This indicates that the CBAM module is effective and that the model in this paper can identify small objects in complex and dense scenes more accurately. In contrast, the YOLOv7H-CA and YOLOv7H-SimAM models performed poorly and had low accuracy on this dataset. A possible reason is that these two attention mechanisms are less robust, which makes it difficult to maintain accuracy in dense or complex scenes. The comparison shows that the improved YOLOv7B-CBAM model is more robust and can still maintain a high recognition rate in dense scenes and accurately identify tomatoes. In practical applications, the model can be integrated into an intelligent tomato-picking robot to automate the tomato-picking process. It can improve the picking efficiency and accuracy of such robots, reduce the cost and time of manual picking, and cope with complex picking environments, such as dense tomato bushes. In addition, the robustness of the model helps ensure good recognition in different picking scenarios, thus improving the stability and reliability of the robot. The model can be further enhanced by data augmentation, and more training data can be added for other types of fruit recognition, as well as for plant pest and disease detection or fruit quality detection.

Detection in Face Mask Dataset
Object detection has been widely used in many aspects of daily life and is crucial in the prevention and management of epidemics. Since early 2020, the COVID-19 epidemic has been spreading throughout the world, posing a significant threat to the world's health systems. COVID-19 spreads through direct transmission and contact, and masks block droplet nuclei carrying the virus so that wearers do not inhale them. Masks are therefore a crucial barrier against infection and can significantly lower the risk of COVID-19 infection and the potential for cross-infection in public settings. Moreover, in hospitals, airports, and similar locations, other infectious illnesses besides COVID-19 must also be prevented, so detecting whether masks are worn is extremely important. At present, however, mask-wearing detection relies mainly on human inspection, which is both a hygiene hazard and prone to missed detections caused by personnel fatigue. In addition, the large number of people in public places, the small targets, and face occlusion make detection difficult and demand high model precision, which current object detection models find hard to meet. The validity of the above model is verified in the application scenario of checking whether masks are worn properly in complex scenes.
In this paper, comparison experiments are completed on the face mask dataset to verify the accuracy of the YOLOv7 model and of the attention-based models proposed above. There are a total of 1420 photos in the dataset: 990 photos for training, 136 photos for testing, and 294 photos for validation. The hardware and software environments for the experiments are consistent with those described in Section 4. Figure 8 shows images from the face mask dataset before detection. In Figure 9a, the YOLOv7 model incorrectly identifies the person in the upper right picture as wearing a mask improperly, while in Figure 9b YOLOv7B-CBAM correctly identifies the person as not wearing a mask. Table 4 shows the results obtained on this dataset for YOLOv5, YOLOv6, YOLOv7, and the YOLOv7 models using the attention mechanism. The following conclusions can be drawn from the analysis results:
(1) As shown in Figure 10, when the YOLOv5 series is compared with the YOLOv7 model, only YOLOv5x, which has the largest number of parameters, is more accurate than YOLOv7, by 0.2%; however, YOLOv7 has only half the parameters and computation of YOLOv5x and is twice as fast. The accuracy of the YOLOv7 model is 10.2%, 3.6%, and 2.9% higher than that of YOLOv5s, YOLOv5m, and YOLOv5l, respectively. Model performance has thus gradually improved with the development of the YOLO series. The YOLOv7B-CBAM model proposed in this paper is in turn 2.8% more accurate than YOLOv5x, the most accurate of these models, which demonstrates its superiority in terms of accuracy.
(2) Among the YOLOv7 models with an added attention mechanism, as shown in Figure 11, the proposed YOLOv7B-CBAM model has the best accuracy, with a 3.0% improvement over the original YOLOv7 model and a 1.0% improvement over YOLOv7H-CBAM. Compared with YOLOv7H-CA and YOLOv7H-SimAM, the accuracy is improved by 3.0% and 3.6%, respectively. The model in this paper can therefore identify more accurately whether a mask is worn and whether it is worn properly. The CBAM module effectively extracts the important features of images and suppresses the noise from irrelevant information, and its accuracy surpasses that of the YOLOv7 models using other attention mechanisms as well as that of the original model.
(3) In terms of model size and recognition speed, YOLOv5s and YOLOv5m, which require less computation because of their smaller number of parameters, reach 128.2 and 90.9 FPS, respectively, both higher than the YOLOv7 model, but at the expense of accuracy. Among the YOLOv7 models with an added attention mechanism, YOLOv7H-SimAM has less computation and fewer parameters than the YOLOv7 model because SimAM is parameter-free, but its accuracy is not improved. In contrast, YOLOv7B-CBAM does not change the number of parameters but slightly increases the computational cost and decreases the FPS by 7 in exchange for an improvement in accuracy.
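To put the frame rates quoted above into perspective, frames per second can be converted into a per-frame time budget. The sketch below does this for the two FPS values stated in the text; the helper name and the 30 FPS camera rate used for the real-time check are assumptions for illustration only.

```python
def frame_time_ms(fps):
    """Average processing time per frame, in milliseconds."""
    return 1000.0 / fps


def keeps_up_with_camera(fps, camera_fps=30.0):
    """True if the detector is at least as fast as a (hypothetical) camera stream."""
    return frame_time_ms(fps) <= frame_time_ms(camera_fps)


for name, fps in [("YOLOv5s", 128.2), ("YOLOv5m", 90.9)]:
    print(f"{name}: {fps} FPS -> {frame_time_ms(fps):.2f} ms per frame")
# YOLOv5s: 128.2 FPS -> 7.80 ms per frame
# YOLOv5m: 90.9 FPS -> 11.00 ms per frame
```

Seen this way, a drop of a few FPS changes the per-frame time by only a fraction of a millisecond to a few milliseconds, which is a useful scale for judging the speed cost of an accuracy gain.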
This comparison shows that the improved YOLOv7B-CBAM model in this paper can more accurately identify the location of the face and determine whether masks are worn properly, which greatly reduces labor cost and improves the safety and hygiene of mask identification. Because of its high accuracy and low missed detection rate, the model can be applied not only to preventing the new coronavirus but also to other places where masks must be worn properly; for example, in hospital outpatient clinics to reduce the spread of bacteria and viruses, and in restaurant or cafeteria kitchens to prevent droplet contamination of food and ensure the hygiene and safety of the premises. In the future, the model may be applied to more face detection datasets to test existing models and improve their precision.

Summary
The above two comparative experiments demonstrate the accuracy of the proposed YOLOv7B-CBAM model for small object recognition in dense scenes, together with its fast detection speed and low computational cost. Among other currently popular models, SSD needs to process feature maps of different scales, and its performance may suffer in dense scenes. Fast-RCNN first performs region extraction on the image and then classifies and localizes the extracted regions. Fast-RCNN may therefore perform better in terms of accuracy, but it is slower and cannot meet the practical requirements of real-time detection. In addition, the model uses a special Anchor-Free technology, which does not require preset anchors and can adapt more flexibly than YOLOv5 to objects of different sizes and shapes. This makes the model promising for a wide range of applications in many related fields.
For mask-wearing recognition, the YOLOv7B-CBAM model can be applied to face mask detection to help relevant departments monitor whether the crowd is wearing masks. This application scenario covers public places, transportation, etc., and can improve public health and epidemic prevention and control. The model also has a wide range of applications in plant- or fruit-related fields. For example, it can be applied to the identification and classification of fruits, vegetables, and grains to help farmers better manage their crops and improve the efficiency and yield of agricultural production. The model can also be used for plant disease detection to help farmers take timely measures to reduce the spread of diseases and the resulting losses. It should be noted that although the model has high accuracy in mask-wearing recognition and tomato fruit recognition, it needs to be adjusted and optimized for the specific scenario and data in practical applications to improve its accuracy and stability.

Conclusions
To address the insufficient accuracy of small target recognition in dense scenes, this paper optimizes the network structure of the YOLOv7 model to improve its overall accuracy and robustness. By replacing conventional convolutional layers in the backbone part with the CBAM attention mechanism, invalid and redundant features are suppressed, leaving more useful information for target localization and classification and thus improving both detection and localization accuracy. The experimental findings on Pascal VOC show that YOLOv7B-CBAM is superior in terms of accuracy. To further verify the validity and accuracy of the model, it is tested on the face mask dataset and the tomato dataset. The results show that the modified model can achieve a detection speed of more than 60 FPS while ensuring high accuracy in checking whether masks are worn properly, which is advantageous for practical applications such as mask-wearing detection in public places and hospitals. This model also has higher accuracy on the tomato dataset than the YOLOv7 models improved using other attention mechanisms,
In the coordinate attention equations, [•, •] is the concatenation operation, f^{1×1} is the convolution operation with a 1 × 1 kernel, δ is the nonlinear activation function, and σ is the sigmoid activation function. In this paper, CA is used to modify the MP2 module in the head part of the original YOLOv7 network: a traditional convolutional layer in the original MP2 module is replaced with CA attention. That is, one branch performs convolution after pooling while the other branch applies CA attention in place of the convolution, and the two branches are finally concatenated. The specific structure is shown in Figure 3. Dotted boxes with yellow blocks in the figure show the modified structure of MP2, and the red CA module marks the layer replaced in YOLOv7. In relation to MP2, 2c refers to the pooling layer with 2c input channels.

Figure 5. Image of tomato dataset before detection.

Figure 8. Images of face mask dataset before detection.

Table 1 describes the software and hardware environments used for model training and testing.

Table 3. Results of tomato dataset.

Table 4. Results of face mask dataset.