Study on Lightweight Model of Maize Seedling Object Detection Based on YOLOv7

Abstract: Traditional maize seedling detection mainly relies on manual observation and experience, which is time-consuming and prone to errors. With the rapid development of deep learning and object-detection technology, we propose a lightweight model, LW-YOLOv7, to address the above issues. The new model can be deployed on mobile devices with limited memory and can detect maize seedlings in the field in real time. LW-YOLOv7 is based on YOLOv7 but incorporates GhostNet as the backbone network to reduce parameters. The Convolutional Block Attention Module (CBAM) enhances the network's attention to the target region. In the head of the model, the Path Aggregation Network (PANet) is replaced with a Bi-Directional Feature Pyramid Network (BiFPN) to improve semantic and location information. The SIoU loss function is used during training to enhance bounding-box regression speed and detection accuracy. Experimental results reveal that LW-YOLOv7 outperforms YOLOv7 in terms of accuracy and parameter reduction. Compared to other object-detection models such as Faster RCNN, YOLOv3, YOLOv4, and YOLOv5l, LW-YOLOv7 demonstrates increased accuracy, reduced parameters, and improved detection speed. The results indicate that LW-YOLOv7 is suitable for real-time object detection of maize seedlings in field environments and provides a practical solution for efficiently counting seedling-stage maize plants.


Introduction
Maize is a strategic crop with the largest planting area and production in China. It provides an important guarantee for food security [1]. Rapid calculation of the seedling-emergence rate during the seedling stage is crucial for predicting maize yield [2]. The traditional method is to count plants manually in the field, which requires a huge labor cost. To solve this problem, researchers have attempted to use visual sensors and computer-vision methods to achieve object detection. Yu et al. [3] proposed a new crop-segmentation method (AP-HI) based on computer vision that uses spatial distribution characteristics to judge whether crops have reached the emergence stage. The average accuracy rate of AP-HI is 96.68%, which is 2.37% higher than other detection algorithms. Zhao et al. [4] employed the conventional Otsu thresholding approach to segment rapeseed plant objects. The average relative error is 6.83%, while R2 is 84.6%. Xia et al. [5] proposed a cotton-overlapping plant identification and counting method based on SVM and the maximum-likelihood classification method that achieved an accuracy rate of 91.13%. However, when traditional object-detection methods are used to identify maize seedlings, the detection accuracy is seriously affected by the image background. The error rate of object detection increases with the complexity of the background (weeds, light, planting density, etc.).
With the development of deep learning, methods based on Deep Convolutional Neural Networks (DCNN) have gradually replaced traditional methods in the field of object detection. In DCNN, the parameters of a multi-layer network are fitted to image features, enabling end-to-end detection. One improved method introduced an attention-mechanism module and a spatial pyramid pooling structure; the experimental results indicate that the mAP value was 86.69% and the detection speed was 57.33 f/s. Kaya et al. [30] designed a novel method for detecting plant diseases. The method developed a multi-headed DenseNet-based architecture and improved detection accuracy through image fusion, achieving an average accuracy, recall, precision, and F1 score of 98.17%, 98.17%, 98.16%, and 98.12%, respectively. Zhao et al. [31] proposed a convolutional neural network based on inception and residual structures. The experimental results indicate an overall accuracy of 99.55% in identifying three diseases of corn, potato, and tomato. Song et al. [32] proposed a corn tassel pose-estimation method based on computer vision and oriented object detection. The evaluation metrics indicate that the proposed method has a correct estimation rate of 88.56% and 29.57 Giga floating-point operations (GFLOPs).
Although YOLO has achieved great success in various domains, there is little research on crop seedling detection. In the field of smart agriculture, research primarily focuses on pest and disease detection, as well as the detection of fruits and vegetables, while seedling detection has received limited attention. Moreover, there is a lack of publicly available datasets specifically designed to train deep-learning models for maize seedling detection. The complex background of maize seedlings in real environments presents a significant challenge for accurate detection using deep-learning models. Furthermore, the large number of parameters in such models contributes to slow inference speed and excessive memory usage. To address these challenges, we have collected a substantial amount of image data from field environments. Our focus has been on reducing the model parameters in the backbone network, enhancing the performance of the feature-fusion network, and resolving the issue of position loss. By making improvements to the YOLOv7 model, we have enhanced its capability and efficiency in maize seedling detection. These efforts have been aimed at overcoming the limitations posed by the lack of datasets, the complexity of the background, and the computational demands of deep-learning models.

Data Acquisition, Augmentation, and Annotation
In this research paper, a dataset of maize seedling images was collected in the northeastern region of China, specifically using the Xianyu 335 maize variety. The images were acquired using a DJI drone (4 RTK) and had an original size of 5472 × 3648 pixels. The image-acquisition period spanned from the emergence stage to the jointing stage of maize growth. During data collection, the drone followed a predetermined flight path and captured images from four different flight heights: 1.6 m, 2 m, 3 m, and 5 m. The camera was positioned to capture top-view images, and approximately 250 images were collected at each height. As a result, the total number of original images in the dataset amounted to 1000.
To mitigate the risk of overfitting during the training process [33], we increase the number of training samples through data augmentation. General data-augmentation methods include Cutout [34], Random Erasing [35], Mixup [36], Mosaic [37], salt-and-pepper noise, etc. Considering factors such as environmental complexity, plant morphology, and planting density, Cutout and Random Erasing would decrease the number of positive samples, which makes them unsuitable for small objects. Therefore, we use random brightness, random cropping, and salt-and-pepper noise as augmentation methods to reduce the loss of positive samples. The effect of data augmentation is shown in Figure 1. The total number of images after data augmentation is 2000; we divide all images at an 8:1:1 ratio and obtain 1600 training images, 200 validation images, and 200 test images, respectively. The distribution of the dataset is shown in Table 1: the 1600 training images comprise sub-groups of 1000, 175, 255, and 170 images; the 200 validation images comprise 100, 32, 30, and 38; and the 200 test images comprise 120, 27, 21, and 32.

During the data-annotation process, the open-source tool LabelImg was utilized, as depicted in Figure 2. Each image was labeled using a single-category bounding-box format. For top-view maize seedlings, each seedling was assigned a corresponding bounding box, ensuring that all pixels of the seedling were fully contained within the rectangular area.
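The three augmentation operations used here can be sketched as follows for a grayscale image. This is a minimal illustration; the brightness range, crop ratio, and noise density are our own assumptions, since the paper does not specify its exact parameters:

```python
import numpy as np

def augment(img, seed=0):
    """Return brightness-jittered, cropped, and salt-and-pepper variants of a
    grayscale (H, W) uint8 image. Parameter ranges are assumptions."""
    rng = np.random.default_rng(seed)
    # random brightness: scale pixel values by a factor in [0.6, 1.4]
    bright = np.clip(img.astype(float) * rng.uniform(0.6, 1.4), 0, 255).astype(np.uint8)
    # random crop keeping 80% of each side
    h, w = img.shape
    y, x = rng.integers(0, h // 5 + 1), rng.integers(0, w // 5 + 1)
    crop = img[y:y + 4 * h // 5, x:x + 4 * w // 5]
    # salt-and-pepper noise: corrupt roughly 1% of pixels to 0 or 255
    noisy = img.copy()
    mask = rng.random(img.shape) < 0.01
    noisy[mask] = rng.choice([0, 255], size=int(mask.sum())).astype(np.uint8)
    return bright, crop, noisy

b, c, n = augment(np.full((50, 50), 128, dtype=np.uint8))
print(b.shape, c.shape, n.shape)  # (50, 50) (40, 40) (50, 50)
```

In practice the crop would be resized back to the network's input resolution before training; that step is omitted here for brevity.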

YOLOv7
In the field of object detection, YOLO has always been one of the most popular deep-learning models. YOLO adopts a single neural-network structure that divides the entire image into multiple grids and predicts multiple bounding boxes for each grid, each containing object positions and class information. For each bounding box, YOLO predicts four coordinate values that represent the coordinates of the upper-left and lower-right corners of the bounding box, as well as the probabilities of belonging to different categories. Since this prediction process involves regression calculation between input data and model parameters, it can be considered a regression problem. Since 2016, YOLO has released multiple versions, each with different improvements and optimizations. In this paper, we utilized YOLOv7 [38], which is considered to be the most stable and reliable among the released versions. Compared to other YOLO models, YOLOv7 significantly improved both detection accuracy and speed on the maize dataset.
Building on its predecessor, YOLOv7 innovatively introduces the extended ELAN architecture, which improves the network's self-learning capability without destroying the original gradient path. ELAN is mainly composed of VoVNet combined with CSPNet and optimizes the gradient length of the overall network with the stacked structure in the computational block. By optimizing gradient paths, deeper networks can effectively learn and converge. In addition, YOLOv7 incorporates a cascade-based model-scaling method, which dynamically adjusts the model size to suit specific detection requirements. The main purpose of model scaling is to adjust some attributes of the model and generate models at different scales to meet the needs of different inference speeds. For example, the scaling model of EfficientDet [39] considers width, depth, and resolution. As for scaled YOLOv4 [40], its scaling model adjusts the number of stages. However, the above methods are mainly used in PlainNet and ResNet. Therefore, it is necessary to propose corresponding composite model-scaling methods for cascaded models. In the YOLOv7 model, when we scale the depth factor of a computational block, we must also calculate the change in that block's output channels. Then, we scale the width factor of the transition layer by an equal amount of variation. This method maintains the characteristics of the initial model design and preserves the optimal structure. These methods ensure that the model is optimized for the task at hand, further enhancing the effectiveness and performance of YOLO. This research paper adopts YOLOv7 as the baseline model and carries out further optimizations to enhance its performance. The main structure of YOLOv7 consists of three key components: input, backbone, and head. These components work together to enable efficient and accurate object detection. The network structure of YOLOv7 is shown in Figure 3.
The Input component of YOLOv7 incorporates two key elements: adaptive scaling and adaptive anchor box. Adaptive scaling is primarily used to adjust the size of the input image. This approach offers several advantages. Firstly, it can save memory when dealing with large-size images. Secondly, it enables the network to adapt to input images of varying sizes, thereby enhancing the model's generalization capability. Finally, adaptive scaling contributes to improved detection accuracy by ensuring that small objects occupy a more significant portion of the image, leading to better object localization. The adaptive anchor box is responsible for automatically selecting the number and size of prior boxes. By adjusting to different object scales and aspect ratios, the adaptive anchor box enhances the accuracy of object detection during testing. Moreover, it offers flexibility in accommodating diverse scenarios and tasks through the utilization of the K-means algorithm. This adaptability allows the model to adjust to different object scales and aspect ratios, improving its overall performance in various detection scenarios.
The backbone component of YOLOv7 is responsible for feature extraction and consists of three modules: CBS (Conv-BN-SiLU), ELAN (Efficient Layer Aggregation Network), and MP (Max-Pooling). The CBS module comprises a sequence of layers, including Convolutional (Conv) layers, Batch Normalization (BN) layers, and SiLU (Sigmoid Linear Unit) layers. It employs three different convolutional kernel sizes and strides, allowing it to capture features at various scales. The CBS module plays a crucial role in extracting informative features from the input data. The ELAN module is an efficient network structure designed to control the shortest and longest gradient paths within the network. By doing so, it encourages the network to learn more diverse and discriminative features, resulting in improved robustness. The ELAN module enhances the model's capability to extract meaningful representations from the input data. The MP module consists of two branches for down-sampling. It utilizes max-pooling operations to reduce the spatial dimension of the feature maps, effectively capturing essential information at different scales. Together, the CBS, ELAN, and MP modules within the backbone component of YOLOv7 work harmoniously to extract relevant features from the input data, enabling accurate and efficient object detection.
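The CBS building block described above can be sketched in PyTorch as follows. This is a minimal illustration; the actual network uses several kernel-size/stride combinations:

```python
import torch
import torch.nn as nn

def cbs(in_ch, out_ch, k, s):
    # CBS = Conv + BatchNorm + SiLU, the basic block of the YOLOv7 backbone.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU())

block = cbs(3, 32, 3, 2)               # stride 2 halves the spatial size
y = block(torch.randn(1, 3, 64, 64))
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

With stride 1 the spatial size is preserved, so stacked CBS blocks alternate between feature refinement and down-sampling.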
The head component is responsible for further processing the extracted features and performing object detection. It consists of several modules: Spatial Pyramid Pooling (SPPCSPC), Feature Pyramid Network (FPN), Path Aggregation Network (PANet), and the detection heads. The SPPCSPC module adapts to images of different resolutions by obtaining different receptive fields through max-pooling. The FPN and PANet enhance the network's ability to integrate different feature layers. RepVGG is introduced into the head for training and to achieve recognition and classification of images. YOLOv7 has three detection heads, which are used to detect objects of different sizes.

LW-YOLOv7
Although YOLOv7 performs well in real-time object detection, some issues remain when we use it to detect maize seedlings in images. Firstly, different images share many similar features; during training, similar features are extracted repeatedly from different images, which makes the model occupy a large amount of memory when deployed on mobile devices. Secondly, for detecting small objects, multiple convolutions and up-sampling operations lead to the loss of location information. Finally, YOLOv7 converges too slowly when calculating loss. To solve the above problems, we propose a lightweight model based on YOLOv7, which is improved as follows.
1. To tackle the problem of redundant features during training, we use the GhostNet module to replace the ordinary convolutions in the backbone of YOLOv7. At the same time, we introduce the CBAM attention module, which improves global attention through its channel-attention and spatial-attention sub-modules.
2. To preserve the position information of small objects, we use BiFPN to replace PANet in the head and enhance the representation ability of features by adding residual links.
3. In terms of loss convergence, the SIoU loss function is used instead of the CIoU loss function to reduce the degrees of freedom of the loss function, enhance network robustness, and improve the speed and accuracy of box regression.

Improvement of Backbone
In YOLOv7, a significant computational bottleneck arises from the CBS (Convolution, BN, SiLU) modules. These modules have high computational demand and require substantial memory space during training, posing challenges for deploying the model on resource-limited mobile devices. To address this issue, we integrate the GhostNet [41] model into YOLOv7. The GhostNet model, as shown in Figure 4, employs a two-step process to reduce computational complexity. Firstly, it utilizes ordinary convolutions with a kernel size of 1 × 1 to obtain intrinsic feature maps. These convolutions help capture essential information and reduce the dimensionality of the features. Secondly, GhostNet employs cheap operations to generate redundant feature maps based on intrinsic features. These redundant feature maps provide additional information without significantly increasing computational costs. By concatenating the intrinsic and redundant feature maps, GhostNet achieves better performance in object detection while minimizing computational costs. This approach effectively addresses the computational limitations of CBS modules, making it more feasible to deploy the model on resource-limited mobile devices. While GhostNet helps reduce the number of parameters in the backbone of YOLOv7, it may also lead to the omission of important features. To solve this issue, we have explored the integration of attention modules. Among them, the CBAM (Convolutional Block Attention Module) [42] has demonstrated the most promising results in terms of detection accuracy within our model. The CBAM, depicted in Figure 5, combines channel attention and spatial attention mechanisms to enhance the model's ability to capture informative features.
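The two-step Ghost process (a 1 × 1 primary convolution producing intrinsic feature maps, a cheap depthwise operation generating redundant "ghost" maps, and a concatenation of the two) can be sketched as follows. This is a simplified illustration of the GhostNet idea, not the exact module used in LW-YOLOv7:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of GhostNet's Ghost module: a 1x1 primary convolution yields
    intrinsic feature maps, a cheap depthwise convolution generates ghost
    maps from them, and the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_size=3):
        super().__init__()
        init_ch = out_ch // ratio            # intrinsic channels
        new_ch = out_ch - init_ch            # ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.SiLU())
        self.cheap = nn.Sequential(          # depthwise conv = cheap operation
            nn.Conv2d(init_ch, new_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(new_ch), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

print(GhostModule(16, 32)(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 32, 8, 8])
```

Because the depthwise branch touches each intrinsic channel independently, it costs far fewer multiply-accumulates than producing all 32 output channels with an ordinary convolution.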

Convolutional Block Attention Module
CBAM is a lightweight and effective attention module for feed-forward convolutional neural networks. It includes two sub-modules: the channel attention module (CAM) and the spatial attention module (SAM), which can be viewed in Figure 6. First, the one-dimensional channel attention map Mc(F) produced by the CAM is multiplied element-wise with the input feature F to obtain the channel-refined feature F'. Then, the SAM computes a two-dimensional spatial attention map Ms(F') from F', which is multiplied with F' to produce the refined output feature F''. The attention-generating process of the CBAM is shown in Equations (1) and (2), where ⊗ denotes element-wise multiplication, Mc represents the one-dimensional convolution operation of the CAM, and Ms represents the two-dimensional convolution operation of the SAM:

F' = Mc(F) ⊗ F (1)
F'' = Ms(F') ⊗ F' (2)

By introducing the GhostNet and CBAM modules, we modified the structure of ELAN and MP, as shown in Figure 7. Compared with the original network, the new network has fewer parameters and is more suitable for deployment on mobile devices with limited resources. It also improves feature-extraction capability through the CBAM module, alleviating the problem of small maize seedlings being easily missed in complex field environments. (a) Denotes GhostELAN; (b) denotes GhostMP. "1 × 1, 4c, c" denotes a GhostNet operation with a convolutional kernel size of 1 × 1, input channels 4c, and output channels c.
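Equations (1) and (2) can be sketched in PyTorch as follows. This is a minimal CBAM illustration; the reduction ratio and the 7 × 7 spatial kernel follow common defaults from the original CBAM paper and are assumptions here:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Mc(F): shared MLP over global average- and max-pooled descriptors."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Ms(F'): convolution over channel-wise mean and max maps."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s))

class CBAM(nn.Module):
    """F' = Mc(F) * F, then F'' = Ms(F') * F' (Equations (1) and (2))."""
    def __init__(self, ch):
        super().__init__()
        self.ca, self.sa = ChannelAttention(ch), SpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x       # channel-refined feature F'
        return self.sa(x) * x    # output feature F''

print(CBAM(16)(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```

Note that CBAM preserves the feature-map shape, so it can be dropped into GhostELAN/GhostMP without changing surrounding layer dimensions.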

Improvements of Head
YOLOv7's head mainly consists of FPN and PANet. Firstly, the FPN structure receives the features extracted by the forward pass of the deep convolutional network. Secondly, starting from the P5 layer, the FPN structure performs two successive up-sampling operations from top to bottom, obtaining feature maps with sizes of 40 × 40 and 80 × 80, respectively. Finally, the feature maps obtained in the previous step are added to the P4 and P3 layers. The structure of FPN is shown in Figure 8a.
Although FPN can propagate high-level semantic information to lower levels, the nearest neighbor interpolation method employed in up-sampling may not efficiently distribute information. To solve this problem, as an alternative to FPN, PANet is used in YOLOv7. PANet is shown in Figure 8b.
PANet is an instance segmentation framework, and it cannot directly perform object detection. However, it can enhance multi-scale information fusion. In PANet, the fusion of feature maps from different levels is achieved through path aggregation, which ensures the continuity and consistency of features.
Compared to the FPN structure, PANet has made significant improvements in object-detection task performance. However, it has some disadvantages, such as high computational cost, model-training difficulty, and errors in recognizing small objects.
In order to improve PANet, we referred to the BiFPN [18] structure, which is shown in Figure 8c.
BiFPN is a neural network optimized for object detection based on PANet and offers several advantages. Firstly, BiFPN is highly efficient. It has few parameters that reduce the computational complexity of model and can complete the calculation of the FPN structure in a short time. It is suitable for real-time object detection tasks. Secondly, it has high precision. BiFPN transfers contextual information through adaptive features. It can maintain the consistency of semantic information at different scales and improve the accuracy of object detection. Finally, it has high robustness. BiFPN can handle objects of different sizes and shapes, and it has better adaptability to image rotation, scaling, and translation.
In this paper, we remove nodes that have only a single input feature, reducing the computational load of the network. Adding a residual link in the output section enhances the ability to express features, and weighting each scale's features during fusion allows the contribution of each scale to be adjusted.
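The weighted fusion at the heart of BiFPN, in which each input feature map receives a learnable non-negative weight that is normalized before summation, can be sketched as follows. This is the "fast normalized fusion" variant from the BiFPN paper and a simplification of a full BiFPN layer:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN fast normalized fusion:
    O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i kept >= 0 via ReLU."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):               # inputs: list of same-shape maps
        w = torch.relu(self.w)               # non-negative weights
        w = w / (w.sum() + self.eps)         # normalize so weights sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = WeightedFusion(3)
out = fuse([torch.ones(1, 4, 8, 8)] * 3)
print(out.shape)  # torch.Size([1, 4, 8, 8])
```

Because the weights are learned per fusion node, the network can down-weight a scale whose features contribute little at that level, which is the "adjust the contribution of each scale" behavior described above.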

Improvements to the Loss Function
The accuracy and convergence speed of an object-detection model depend largely on the loss function. The loss function of YOLOv7 contains three parts: coordinate loss, confidence loss, and classification loss. YOLOv7 adopts the CIoU loss for the coordinate loss, and its expression is shown in Equation (3):

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv (3)

where ρ is the Euclidean distance between the center points of the predicted box and the ground truth, c represents the diagonal distance of the minimum enclosing box, and IoU is defined as the ratio of the intersection area between the predicted box and the ground truth to the union area of the two bounding boxes. α is a weight parameter, and v indicates the similarity between length and width; they are defined in Equations (4) and (5):

v = (4/π²) (arctan(w^gt/h^gt) − arctan(w/h))² (4)
α = v / ((1 − IoU) + v) (5)

In Equation (4), w^gt and h^gt represent the width and height of the ground truth, and w and h represent the width and height of the predicted box.
CIoU is a newer evaluation metric that improves upon IoU and its variants. CIoU takes into account not only the extent of overlap between the predicted box and the ground truth but also their center distance and aspect-ratio difference. However, it does not take into account the direction between the ground truth and the predicted box, which results in slow model convergence and low efficiency. Therefore, we introduce the SIoU loss function [43] to optimize the coordinate loss. Compared to CIoU, the advantages of SIoU include robustness to partial occlusion, better handling of objects at different scales, and faster convergence during training. The SIoU loss function comprises four parts: angle cost, distance cost, shape cost, and IoU cost. Its expression is shown in Equation (6):

L_SIoU = 1 − IoU + (Δ + Ω)/2 (6)

In Equation (6), Δ is the distance cost (which absorbs the angle cost) and Ω is the shape cost. Since the angle cost is included in the SIoU function, it reduces the probability of the penalty term being equal to 0. This not only accelerates the convergence of the loss function but also reduces prediction errors. The parameters of the SIoU loss function and their relationships are shown in Figure 9. B represents the predicted box and B^gt represents the ground truth. Cx is the width of the minimum enclosing box, and Cy is its height. Cw and Ch are the horizontal and vertical distances between the center points of the predicted box and the ground truth, respectively. σ is the distance between the two center points. sin α is the ratio of Ch to σ, and sin β is the ratio of Cw to σ.
If α is equal to 0 or π/2, the angle cost (Λ) is 0. If α is less than π/4, we prioritize minimizing α; otherwise, we prioritize minimizing β. The expression for the angle cost is given by Equation (7):

Λ = 1 − 2 sin²(arcsin(sin α) − π/4) (7)

The distance cost in the SIoU loss function is determined by the distance between the center points of the ground truth and the predicted box and is further modulated by the angle cost. The distance cost is redefined with these considerations, and its expression is given by Equation (11):

Δ = Σ_{t=x,y} (1 − e^(−γρ_t)) = (1 − e^(−γρ_x)) + (1 − e^(−γρ_y)) (11)

In Equation (11), ρ_x = (Cw/Cx)², ρ_y = (Ch/Cy)², and γ = 2 − Λ. As α approaches 0, the contribution of the distance cost decreases continuously; conversely, as α approaches π/4, the contribution of the distance cost keeps increasing. As the angle increases, γ assigns an angle-prioritized weight to the distance cost.
The shape cost is defined in Equation (15):

Ω = Σ_{t=w,h} (1 − e^(−ω_t))^θ, where ω_w = |w − w^gt| / max(w, w^gt) and ω_h = |h − h^gt| / max(h, h^gt) (15)

θ controls the attention paid to the shape cost. In this paper, the value of θ is set to 1 to optimize the aspect ratio, thereby restricting the free movement of the shape.
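Under the definitions above, the complete SIoU loss for a pair of boxes can be sketched in plain Python, combining Equations (6), (7), (11), and (15). The box format (x1, y1, x2, y2) and the epsilon handling are our own assumptions:

```python
import math

def siou_loss(pred, gt, theta=1.0, eps=1e-7):
    """SIoU loss between two boxes in (x1, y1, x2, y2) format. Variable
    names mirror Figure 9: Cx/Cy enclosing-box size, Cw/Ch center offsets."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # IoU term
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # enclosing box (Cx, Cy) and center offsets (Cw, Ch)
    Cx = max(px2, gx2) - min(px1, gx1)
    Cy = max(py2, gy2) - min(py1, gy1)
    Cw = abs((gx1 + gx2) / 2 - (px1 + px2) / 2)
    Ch = abs((gy1 + gy2) / 2 - (py1 + py2) / 2)
    sigma = math.hypot(Cw, Ch) + eps               # center distance
    # angle cost, Eq. (7): Lambda = 1 - 2 sin^2(arcsin(sin a) - pi/4)
    lam = 1 - 2 * math.sin(math.asin(min(Ch / sigma, 1.0)) - math.pi / 4) ** 2
    # distance cost, Eq. (11), with gamma = 2 - Lambda
    gamma = 2 - lam
    rho_x, rho_y = (Cw / (Cx + eps)) ** 2, (Ch / (Cy + eps)) ** 2
    delta = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))
    # shape cost, Eq. (15)
    w_w = abs(pw - gw) / (max(pw, gw) + eps)
    w_h = abs(ph - gh) / (max(ph, gh) + eps)
    omega = (1 - math.exp(-w_w)) ** theta + (1 - math.exp(-w_h)) ** theta
    # Eq. (6)
    return 1 - iou + (delta + omega) / 2

print(round(siou_loss((10, 10, 50, 50), (10, 10, 50, 50)), 6))  # 0.0
```

For identical boxes the loss is (numerically) zero, and it grows as the boxes diverge in overlap, position, or shape, which is the behavior the cost decomposition is designed to produce.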

LW-YOLOv7 Maize Seedling Detection Model
Building on the original advantages of YOLOv7, we propose the improved LW-YOLOv7 model to detect maize seedlings in real environments; while maintaining detection accuracy, it reduces the number of parameters and improves the convergence speed of the model. The overall framework of LW-YOLOv7 is shown in Figure 10. In the backbone network, we use ReLU as the activation function, and the total number of backbone layers is 54. The specific network architecture of the backbone is shown in Table 2.

Experimental Environment and Evaluation Indicators
The training task is implemented in Python 3.9 and PyTorch 1.13. The hardware used for training includes an Intel(R) Core(TM) i9-10980XE CPU @ 3.00 GHz, an NVIDIA GeForce RTX 3090 GPU, and 64 GB of memory.
During training, we need to calculate the sizes of the anchor boxes so that the model can better predict the bounding boxes of maize seedlings. K-means is an unsupervised learning algorithm. It treats all the bounding boxes in the training set as samples, taking the width and height of each bounding box as its two features. During clustering, we set the number of anchor boxes to 9 and use a new distance measure based on IoU instead of the Euclidean distance, shown in Equation (16):

d(box, centroid) = 1 − IoU(box, centroid) (16)

For each sample in the training set, we calculate its distance to all cluster centers and assign it to the nearest cluster. For each cluster, we then recalculate the cluster center; that is, the average of all samples in the cluster becomes the new center. Finally, K-means runs 1000 iterations of these steps to obtain the optimal anchor boxes.

In deep learning, an excessive input-image resolution may lead to issues such as insufficient memory and long computation times. Therefore, we set the image resolution to 640 × 640. Additionally, when training on the maize seedling data, the training results tended to stabilize at 80 epochs; to avoid overfitting, we set the total number of epochs to 100.

The learning rate is a very important hyperparameter in deep learning, controlling how fast model parameters are updated during gradient-descent optimization. If the learning rate is too large, the model may fail to converge and the training error may fluctuate strongly. Therefore, we set the initial learning rate to 0.01, the learning-rate decay factor to 0.1, and the final learning rate to 0.001. In YOLOv7, the batch size also has a significant impact on the training effectiveness and speed of the model. Batch size refers to the number of samples processed simultaneously in each round of gradient descent, and it affects the direction, magnitude, and speed of model-parameter updates.
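The anchor-clustering procedure with the 1 − IoU distance of Equation (16) can be sketched as follows. This is a simplified illustration on synthetic (width, height) pairs; the real pipeline runs for 1000 iterations over all training-set boxes:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster box (width, height) pairs with the d = 1 - IoU distance of
    Eq. (16). wh: (N, 2) float array. Returns k anchors sorted by area."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU between every box and every center, assuming aligned corners
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = wh.prod(1)[:, None] + centers.prod(1)[None, :] - inter
        assign = np.argmin(1 - inter / union, axis=1)   # nearest center
        for j in range(k):                              # recompute centers
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]

# synthetic boxes drawn from three size groups
wh = np.array([[10., 10.]] * 30 + [[50., 80.]] * 30 + [[120., 40.]] * 30)
print(kmeans_anchors(wh, k=3, iters=50).shape)  # (3, 2)
```

Using 1 − IoU rather than Euclidean distance keeps large and small boxes comparable, so the nine resulting anchors spread across the object scales actually present in the dataset.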
A larger batch size can improve the training speed of the model, but it may lead to overfitting or an inability to converge, among other issues. Therefore, we set the batch size to 16. Other hyperparameters and information used during the training process are listed in Table 3.

To assess the performance of the improved model, we use five evaluation metrics: precision, recall, mAP, parameters, and FPS. Precision is a performance metric commonly used in the evaluation of classification models; it measures the proportion of true positives among all instances that the model has classified as positive. Recall is an important metric for evaluating a model's detection ability, reflecting its ability to correctly identify all positive samples. Specifically, recall is defined as the proportion of true positive samples that are correctly detected out of all actual positive samples. Another commonly used metric is mAP (mean average precision), which evaluates the overall performance of model-detection results while taking into account the model's behavior at different thresholds. The calculation of mAP is as follows: sort all positive samples by confidence from high to low, calculate the precision at each confidence level, and then average the precision across all confidence levels. Parameters refers to the size of the best weight file generated by the deep-learning model. This metric is commonly used to measure model complexity, because more parameters usually indicate a more complex model. FPS refers to the number of frames per second that a model can process, which measures the model's inference speed. This metric is often hardware-dependent and can be improved by optimizing the computation graph, reducing the model size, etc.
The precision and recall calculation methods are given by Equations (17) and (18):

P = TP / (TP + FP) (17)
R = TP / (TP + FN) (18)

During the experiments, we set the IoU threshold to 0.5. If an object is correctly predicted and its IoU exceeds this threshold, it is counted as a positive sample; otherwise, it is counted as a negative sample. Precision measures the ratio of correctly predicted positive samples to all samples predicted as positive, and recall is the ratio of correctly predicted positive samples to the total number of actual positive samples. TP is the number of positive samples that are correctly identified, FP is the number of negative samples recognized as positive samples, and FN is the number of positive samples recognized as negative samples.
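Equations (17) and (18) reduce to a few lines of code; for instance:

```python
def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP); R = TP / (TP + FN)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# e.g. 90 correct detections, 10 false alarms, 30 missed seedlings
print(precision_recall(90, 10, 30))  # (0.9, 0.75)
```

The zero-denominator guards cover the degenerate case of no predictions or no ground-truth objects, which otherwise raises a division error.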
AP represents the area under the precision-recall curve. Specifically, its calculation is shown in Equation (19), where P(R) is the curve with recall as the abscissa and precision as the ordinate:

AP = ∫₀¹ P(R) dR (19)

In Equation (20), mAP refers to the average value of AP across all categories, where n represents the number of categories in the dataset:

mAP = (1/n) Σᵢ APᵢ (20)

In this paper, there is only one category to be detected, so AP = mAP.
FPS represents the number of images detected per second. In testing, to meet the requirements of real-time detection, the FPS value should be greater than 30. The total per-image time contains three parts: the image pre-processing time (Pre), the inference time of the network (Inf), and the non-maximum-suppression time (NMS). FPS is given by Equation (21):

FPS = 1 / (Pre + Inf + NMS) (21)

Comparison of Different Backbones Based on YOLOv7
In this experiment, to verify the feature extraction capabilities of different backbones, we select MobileNet-V2, MobileNet-V3, ShuffleNet-V2, and LW-YOLOv7 for comparison with the YOLOv7 algorithm. The comparison results are shown in Table 4. The precision of the YOLOv7+MobileNet-V2 algorithm decreased by 1.3%, the recall decreased by 1.8%, the mAP decreased by 1.8%, and the number of parameters decreased by 25.2 M. The precision of the YOLOv7+MobileNet-V3 algorithm decreased by 3.4%, the recall decreased by 0.4%, the mAP decreased by 2.9%, and the number of parameters decreased by 27.5 M. The precision of the YOLOv7+ShuffleNet-V2 algorithm decreased by 5.6%, the recall increased by 0.7%, the mAP decreased by 3.6%, and the number of parameters decreased by 28.7 M. The precision of the LW-YOLOv7 algorithm decreased by 1.6%, the recall increased by 3%, the mAP increased by 0.5%, and the number of parameters decreased by 15.2 M.
Based on the comparison results, it can be observed that using lightweight models may result in a decline in precision. Among them, the YOLOv7+ShuffleNet-V2 algorithm experienced the most significant decrease in precision, while the LW-YOLOv7 algorithm exhibited a relatively minor drop compared to other models. This is because lightweight models often adopt simplified network structures, which may reduce their feature representation capability and subsequently impact the models' accurate object detection. However, lightweight models have advantages in reducing parameter quantity and computational resource requirements, making them suitable for resource-constrained situations. Additionally, we can introduce methods such as CBAM and BiFPN to improve the detection accuracy of the model. This is because they can enhance the representation and fusion capabilities of feature maps.

Comparison between CIoU and SIoU
To assess the efficacy of the SIoU loss function, this experiment conducted a comparative analysis with the CIoU and DIoU loss functions. The results, as presented in Figure 11, demonstrate several important findings.
Firstly, the figure shows that the coordinate loss stabilizes after approximately 80 epochs of training, indicating that the network has converged and that further training would not significantly improve performance; based on this observation, the number of training epochs was set to 100. Secondly, the SIoU loss function converges significantly faster than the CIoU and DIoU loss functions, which means SIoU allows more efficient optimization and training of the model. Lastly, the error ratio between the ground truth and the predicted bounding box decreases by 9.49% to 45.78% when using the SIoU loss function, suggesting that SIoU improves the accuracy of the predicted bounding boxes and therefore the overall object-detection performance.
To summarize, the comparative analysis reveals that the SIoU loss function outperforms the CIoU and DIoU loss functions in terms of convergence speed and error reduction. These findings indicate that the SIoU loss function is more effective for assessing the accuracy of object-detection models, particularly in the context of maize detection at the seedling stage.
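To make the comparison concrete, the SIoU loss can be sketched as follows. This is a minimal single-box version following Gevorgyan's SIoU formulation (angle, distance, and shape costs); the function name and (cx, cy, w, h) box format are our own choices, and YOLOv7's actual implementation operates on batched tensors:

```python
import numpy as np

def siou_loss(box_p, box_g, theta=4.0, eps=1e-7):
    """SIoU loss sketch: 1 - IoU + (angle-aware distance cost + shape cost) / 2.
    Boxes are (center_x, center_y, width, height)."""
    px, py, pw, ph = box_p
    gx, gy, gw, gh = box_g

    # IoU of the two boxes
    x1 = max(px - pw / 2, gx - gw / 2); y1 = max(py - ph / 2, gy - gh / 2)
    x2 = min(px + pw / 2, gx + gw / 2); y2 = min(py + ph / 2, gy + gh / 2)
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Smallest enclosing box (for normalizing center distances)
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)

    # Angle cost: largest when the center offset is at 45 degrees
    sigma = np.hypot(gx - px, gy - py) + eps
    sin_alpha = min(abs(gy - py) / sigma, 1.0)
    angle = 1 - 2 * np.sin(np.arcsin(sin_alpha) - np.pi / 4) ** 2

    # Distance cost, re-weighted by the angle cost (gamma = 2 - angle)
    gamma = 2 - angle
    rho_x = ((gx - px) / (cw + eps)) ** 2
    rho_y = ((gy - py) / (ch + eps)) ** 2
    dist = (1 - np.exp(-gamma * rho_x)) + (1 - np.exp(-gamma * rho_y))

    # Shape cost: penalizes width/height mismatch
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape = (1 - np.exp(-omega_w)) ** theta + (1 - np.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2

# Identical boxes incur (near) zero loss; a shifted box is penalized.
print(round(siou_loss((0, 0, 2, 2), (0, 0, 2, 2)), 4))  # 0.0
print(siou_loss((1, 1, 2, 2), (0, 0, 2, 2)) > 0)        # True
```

The angle term is what distinguishes SIoU from CIoU/DIoU: it steers the prediction toward the nearest axis first, which is one explanation for the faster convergence observed above.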

Comparison of Different Object-Detection Models
To verify the effectiveness of the LW-YOLOv7 model proposed in this paper, we trained and tested it on our dataset and compared it with other popular object-detection models: Faster-RCNN, YOLOv3, YOLOv4, YOLOv5, and YOLOv8. The comparison results are shown in Table 5. Compared with LW-YOLOv7, the precision of Faster-RCNN decreased by 26.7%, the recall by 27.4%, and the mAP by 37.6%, while the number of parameters increased by 45%; its FPS of 14.3 f/s falls far short of LW-YOLOv7's (a relative gap of 529.37%). Therefore, the detection accuracy of Faster-RCNN was significantly lower than that of LW-YOLOv7, and it did not meet the requirements of real-time detection.
The precision of the YOLOv3 model decreased by 0.9%, the recall decreased by 3.3%, the mAP decreased by 1.9%, the number of parameters increased by 51.86%, and the FPS is 82 f/s. YOLOv3 can meet the requirement of real-time detection, but its accuracy is lower than that of LW-YOLOv7, and it takes up more memory and disk space on mobile devices.
The precision of the YOLOv4 model increased by 0.6%, the recall decreased by 2.5%, the mAP decreased by 1%, the number of parameters increased by 43.64%, and the FPS is 107 f/s. Because of its larger parameter count, YOLOv4 also requires more memory and disk space.
The precision of the YOLOv5 model decreased by 0.1%, the recall decreased by 5.7%, the mAP decreased by 3.9%, the number of parameters increased by 35.01%, and the FPS is 114 f/s. Therefore, YOLOv5 not only has lower detection accuracy than LW-YOLOv7 but also carries a significantly heavier computational load.
The precision of the YOLOv8 model increased by 1.9%, the recall decreased by 2.2%, the mAP was unchanged, the number of parameters increased by 32.19%, and the FPS is 125 f/s. Overall, the comparison shows that the LW-YOLOv7 proposed in this paper achieves the highest mAP, a smaller model size than the other popular object-detection algorithms, and a detection speed that meets the requirements of real-time detection.

Ablation Test Comparison
To assess the impact of different modules on the detection accuracy and speed of the model, we conducted ablation experiments; the results are shown in Table 6.
In Table 6, we use YOLOv7's metrics as the baseline (also shown in the first row of Table 4). The mAP, parameters, and FPS of YOLOv7 are 92.7%, 74.8 M, and 121 f/s, respectively.
After introducing the GhostNet module, the mAP decreased by 1.2%, but the total parameters also decreased by 21.39%. The FPS is 94 f/s, which still meets the real-time requirement.
Building on GhostNet, introducing the CBAM attention module brought the mAP and total parameters to 92.3% and 59.4 M, a decrease of 0.4% and 20.58%, respectively, relative to the baseline.
Introducing the BiFPN module, the mAP was 92.9%, which increased by 0.2%. The total parameters were 59.4 M, which decreased by 20.58%.
Introducing the SIoU function, the mAP was 93.2%, which increased by 0.5%. The total parameters were 59.4 M, which decreased by 20.58%.
The ablation results indicate that introducing GhostNet reduces the total parameters at the cost of a slight drop in detection accuracy. Adding the CBAM module increases the model's accuracy. Moreover, introducing the BiFPN module and the SIoU loss function further increases accuracy while the total parameters remain unchanged.
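As a quick arithmetic sanity check on the parameter figures above (the helper function below is ours; the numbers are taken from the text), the 59.4 M configurations indeed correspond to roughly a 20.58% reduction from the 74.8 M baseline:

```python
def pct_change(new, old):
    """Relative change of `new` versus the `old` baseline, in percent."""
    return (new - old) / old * 100

base = 74.8  # M parameters, the YOLOv7 baseline
# The CBAM, BiFPN, and SIoU rows all report 59.4 M parameters:
print(round(pct_change(59.4, base), 1))  # -20.6, i.e., the ~20.58% reduction
# The 21.39% reduction reported for the GhostNet step implies roughly:
print(round(base * (1 - 0.2139), 1))     # 58.8 (M parameters)
```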

Comparison of Class Activation Mapping Results
Deep-learning networks are often treated as black boxes during training and are not easily interpretable. To better understand the model's recognition process, it is important to analyze its internal workings and how it processes input data: its architecture, training data, feature-extraction methods, and prediction mechanisms, among other factors. Visualizations such as heatmaps and saliency maps can reveal which areas of an input image are most relevant to the model's predictions. Therefore, the experiment introduced Grad-CAM [44], a technique for visualizing the attention of the model and identifying which parts of the input image contribute most to its predictions. Specifically, Grad-CAM computes the gradients of the target-class output with respect to the feature maps of the final convolutional layer, uses them to weight the activation maps, and sums the weighted maps. The resulting heatmap highlights the regions of the input image that had the greatest influence on the prediction. The Grad-CAM heatmaps are shown in Figure 12. Based on these results, it is evident that the improved model proposed in this paper identifies and extracts the features of maize seedlings more effectively than its counterparts and is less susceptible to complex environmental factors. This approach therefore provides a better explanation of the deep-learning process: the model prioritizes and captures relevant features while filtering out extraneous information.
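The Grad-CAM computation described above can be sketched in a few lines. This is a minimal numpy version run on toy tensors; the shapes and names are illustrative, and a real run would take the activations and gradients from a backward pass through the trained detector:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM: weight each channel of the final conv layer's
    activations by its spatially averaged gradient, sum over channels,
    apply ReLU, and normalize. Inputs have shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # (C,) channel weights
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                          # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

# Toy tensors standing in for real feature maps and backprop gradients.
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.random((8, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape, float(heatmap.max()))  # (7, 7) 1.0
```

The normalized heatmap is then upsampled to the input resolution and overlaid on the image to produce visualizations like Figure 12.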

Comparison of Object-Detection Visualization Results
To verify the effectiveness of the LW-YOLOv7 model in the field environment, this study uses the model to perform real-time object detection on maize seedling images captured by drones and compares it with three other popular object-detection models (Faster RCNN, YOLOv5, YOLOv7). Faster RCNN is a two-stage object-detection model; YOLOv5, YOLOv7, and LW-YOLOv7 are all single-stage object-detection models. The results are shown in Figure 13.
This experiment selected maize seedling images captured at two shooting heights: in Figure 13, the shooting height of (a)-(d) is 1.6 m and that of (e)-(h) is 5 m.
As shown in Figure 13a-d, Faster RCNN produces positioning deviations, detection errors (mistakenly identifying weeds as seedling maize plants), and repeated detections. YOLOv5 solves the repeated-detection problem, but a large number of weeds and negative samples are still identified as maize seedlings. YOLOv7 solves the positioning deviation but still produces false detections. LW-YOLOv7, by contrast, resolves both the repeated detections and the recognition errors.
As shown in Figure 13e-h, Faster RCNN misses a large number of detections, and YOLOv5 and YOLOv7 also have some missed detections, whereas LW-YOLOv7 performs better on small objects than the other three models. When applied to maize seedling images, LW-YOLOv7 effectively avoids the positioning deviation and long run time of the Faster RCNN model, alleviates the missed- and repeated-detection problems of the single-stage models (YOLOv5 and YOLOv7), and reduces the probability of misidentification. The comparison of the four models demonstrates the advantages of LW-YOLOv7 in real-time object-detection tasks on maize seedling images and provides a basis for the rapid detection of seedling maize plants in the field environment. In Figure 13, the blue box denotes a duplicate prediction, the yellow box a wrong prediction, and the purple box a missed detection; the remaining boxes are correct predictions.

Conclusions
In this study, we present a new model to solve the difficult task of real-time detection in maize seedling images. The new model improves detection accuracy and reduces the number of parameters. Using the LW-YOLOv7 detection model, we can obtain the position and quantity of maize seedlings and judge the density and growth uniformity of maize. In addition, we can calculate the emergence rate and replant the missed areas in a timely manner to increase maize yield, which is important for evaluating maize growth and yield. When training the model, we can change the depth and width of the network to reduce the size of the weight file, which provides a theoretical basis for deploying our improved model on mobile devices with limited resources. The proposed maize seedling detection model can benefit both researchers engaged in object detection and practitioners who use it for real-time detection in agriculture, saving considerable manpower and time. In addition, the model can be integrated into unmanned aerial vehicle (UAV) systems for real-time detection, further enhancing its practicality: large areas of land can be monitored quickly and easily, seedlings that require attention can be identified, and action can be taken before problems arise.
While the proposed maize seedling model has many benefits, several challenges remain before it generalizes well across different datasets and environments. First, the dataset is specific: the model is trained and tested on one dataset, which may limit its adaptability to maize seedlings at different growth stages, and its performance may decline in scenarios that differ significantly from the training data. Second, the model has implicit class limitations. LW-YOLOv7 is a single-stage detector that predicts multiple bounding boxes and corresponding classes for each position in the input image; this design can make it difficult to accurately differentiate objects with overlapping or similar features. Finally, there are model complexity and computational-resource requirements. Despite having a smaller model size than other popular object-detection models, LW-YOLOv7 may still require substantial computational resources, and on low-performance devices such as mobile or embedded systems it may face challenges due to limited resources.
Despite these challenges, our findings demonstrate that the LW-YOLOv7 object-detection model offers great promise for addressing real-world problems. In the future, we plan to further improve the detection accuracy and speed of maize seedling detection and to further reduce the model weights to facilitate deployment of identification tasks on edge computing platforms.