Target Detection for Coloring and Ripening Potted Dwarf Apple Fruits Based on Improved YOLOv7-RSES

: Dwarf apple cultivation is one of the most important forms of the courtyard economy and has become a new engine for rural revitalization. The effective detection of coloring and ripening apples in complex environments is important for the sustainable development of smart agricultural operations. Addressing the low detection efficiency in the greenhouse and the challenges of deploying complex target detection algorithms on low-cost equipment, we propose an enhanced lightweight model rooted in YOLOv7. Firstly, we incorporate the Squeeze-and-Excitation (SE) attention mechanism to enhance the model's feature extraction capability during training. Then, the SCYLLA-IoU (SIoU) loss function is introduced to improve the extraction of occluded objects in complex environments. Finally, the model is simplified by introducing depthwise separable convolution and adding a ghost module after the up-sampling layers. The improved YOLOv7 model has the highest AP value, which is 10.00%, 5.61%, and 6.00% higher than those of YOLOv5, YOLOv7, and YOLOX, respectively. With an mAP of 95.65%, the improved YOLOv7 model provides higher apple detection accuracy than the other detection models and is suitable for the identification and detection of potted dwarf rootstock apples.


Introduction
Potting cultivated dwarfing rootstock apple [1] is an innovative approach to apple cultivation. It involves reducing the tree size [2][3][4], promoting early fruiting, increasing the yield, and facilitating the renewal and management of apple varieties [5]. This method not only enhances the efficiency of natural resources such as land and water but also preserves resources. It provides a new avenue for improving the economic benefits of orchards and promoting the development of rural industries. Concurrently, it fosters the growth of the courtyard economy, which emphasizes ecological balance and environmental protection. Through rational planting and breeding patterns, it can improve soil quality, maintain clean water sources, and enhance biodiversity. Additionally, the courtyard economy can promote the recycling of waste, thereby reducing environmental pollution. In agricultural production, real-time and accurate monitoring of fruit quantity, automatic harvesting [6,7], and precise yield prediction are beneficial for improving production efficiency and economic benefits [8]. Moreover, they help to drive the innovation and advancement of agricultural technology, promoting the sustainable development of agriculture. Consequently, detecting the growth conditions of potted dwarfing rootstock apple trees in an effective way holds substantial practical significance for advancing precision agriculture [9].
Apple varieties are typically characterized by attributes such as color, shape, and texture. In China, where apples are a significant horticultural crop, the adoption of machine learning (ML) and computer vision (CV) technologies has advanced the capabilities for apple detection and recognition. This progress supports the intelligent production of apples [10]. For example, Guo et al. proposed a detection algorithm based on image color region segmentation [11]. This algorithm assesses an object's eligibility as a detection target by analyzing its different features. However, traditional algorithms frequently encounter challenges in practical applications due to factors such as foliage occlusion, variable lighting conditions, and fruit entanglement. Similarly, the boundaries of corn stover sections have been effectively extracted using a shape feature-based image segmentation algorithm [12].
In recent years, the YOLO algorithm has been widely adopted in precision agriculture, significantly improving the speed and accuracy of crop detection [13,14]. YOLOv7, an advanced deep learning algorithm based on machine vision, effectively addresses the challenges of object detection [15]. For example, the improved YOLOv7 has been utilized for the recognition of apple inflorescence morphology [16]. Experimental results indicate that the improved YOLOv7 model achieves a recognition speed of 42.58 frames per second (fps), outperforming other models such as YOLOv5s, the improved YOLOv5, and the original YOLOv7. In [17], the YOLOv7 model was combined with a Squeeze-and-Excitation (SE) network and performed well in a detection task on the PASCAL VOC dataset. The improved model reduced the number of parameters by 12.3% and the FLOPs by 18.86% compared to the original YOLOv7. Furthermore, advancements in YOLOv3 based on the SIoU loss [18] have led to improved accuracy in detecting occluded pedestrians. The YOLOv7 model thus has significant theoretical and practical value in the detection of potted dwarf apple trees.
In this paper, the tree characteristics of potted dwarf rootstock apple trees are taken as the subject of the research, emphasizing and analyzing the two key periods for the automated management and yield prediction of fruit trees: the coloring and ripening periods. Fruit farmers and orchard managers can monitor the number and condition of fruits undergoing coloration in real time. This allows them to promptly implement management measures, such as adjusting light exposure, controlling moisture, and applying fertilizers, to optimize the coloration of the fruits. By analyzing images of these two specific periods with deep learning, this paper aims to improve the recognition accuracy to better serve the automated management and yield prediction of potted dwarf rootstock apples, thereby reducing environmental pollution and realizing green and eco-friendly orchard production.
The lightweight operation of the YOLOv7 model was ensured by adopting an efficient, re-parameterizable backbone network called RepBlock, which preserves accuracy while reducing computation time. Furthermore, the model introduces the SE attention mechanism and the SIoU loss function, which further reduce complexity with minimal accuracy loss, enhance the model's feature extraction capability in complex environments, and thus improve the detection of obscured objects. This paper focuses on the coloring and ripening process of potted dwarf rootstock apple trees and explores the utilization of tree features to enhance the accuracy of fruit tree identification and detection. The proposed model is designed to enhance the accuracy of recognizing and detecting potted dwarf rootstock apples, providing effective technical support for smart agriculture automated picking equipment.

Dataset Collection
This work focuses on the new Ruixiang red apple variety on dwarfing rootstock, grown in pots in a sunlit greenhouse. The sunlit greenhouse spans 4000 square meters and houses 1500 potted dwarf rootstock fruit trees. The rows are 3 m apart, with trees spaced 2 m apart within each row. The difference between dwarf rootstock fruit trees and traditionally planted fruit trees can be visually compared in Figure 1. The dataset was collected at the fruit tree dwarfing rootstock planting base of Shanxi Agricultural University, located in Xiaowang Village, Nandali Township, Xia County, Yuncheng City, Shanxi Province, with geographic coordinates ranging from 111°02′ to 111°41′ east longitude and 34°55′ to 35°19′ north latitude. The collection period was from 15 September to 30 October 2023, using a Canon 700D camera. The images were captured in various weather and light conditions (sunny or cloudy) and at different distances, far (100-150 cm) and close (50-100 cm), as shown in Figure 2. The apple fruit's reproductive period includes the flowering, fruit drop, expansion, coloring, and ripening stages. This collection focuses on the coloring and ripening stages of the fruit. It targets fruits freshly taken out of the film bag, which transition from green to yellow to red as they ripen, with diameters between 70 and 100 mm. During orchard production, it is crucial to monitor fruit color and ripeness in real time, automate picking, and accurately predict yield to boost efficiency and profits. Yuncheng, the research area, experiences year-round monsoon influence and has a warm temperate continental climate. It receives an average of 525 mm of rainfall annually, enjoys 2350 h of sunshine, maintains an average yearly temperature of 13.3 °C, and has a frost-free period lasting 212 days. In recent years, through robust agricultural industry restructuring, the apple cultivation area has surged to 2.12 million mu, yielding over 1.8 billion kilograms annually, establishing the region as a prime national hub for high-quality apple production. This paper used an on-site data collection approach to tackle uneven sample distribution, enhance dataset diversity, and reduce bias in model training. We focused on target detection technology for tasks like machine inspection and mechanical fruit picking, where identifying ripe fruits at different distances and executing picking operations in close proximity are crucial. We gathered apple images from various time slots, distances, and angles, totaling 1325 images.

Dataset Production
In this paper, we utilized the LabelImg software (version 1.8.6) for manual annotation of the images. We excluded apple images with over two-thirds occlusion to ensure annotation accuracy. This decision considers both the apple's appearance and shape and the fault tolerance of the picking robot's end-effector. After completing the annotation, an XML file was generated containing category and coordinate information. We divided the dataset into training, validation, and test sets using an 8:1:1 ratio. The test set comprises 150 images, including 50 showcasing apples in various complex scenes, and serves to evaluate the model's detection efficacy. Figure 2 presents sample images of apples in diverse complex scenarios. The sufficiency of image data is pivotal during the model training phase, as insufficient training data may result in overfitting. Presently, we possess 1325 original apple images, which partly capture apple characteristics. However, they may not fully cover the differences in light, weather, noise, clarity, and other factors found in natural environments. Hence, to enhance the performance and generalization capability of the target detection model, it is imperative to expand the apple image dataset. We took several augmentation steps, including linear brightness and contrast adjustments of the form g(x, y) = a·f(x, y) + b, where f(x, y) is the original image, g(x, y) is the preprocessed image, and a and b are the adjustment parameters, resulting in 7950 images, as shown in Figure 3. This expansion is crucial for enhancing the model's accuracy and generalization ability. The expanded image dataset will offer richer and more detailed target features for subsequent studies, thereby enhancing the model's capability to discern fruit shapes and features.
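The linear brightness/contrast adjustment described above can be sketched as follows. The parameter values and the use of NumPy are illustrative assumptions, since the paper does not report its exact augmentation settings:

```python
import numpy as np

def adjust_brightness_contrast(image, a=1.0, b=0.0):
    """Linear pixel transform g(x, y) = a * f(x, y) + b.

    `a` scales contrast and `b` shifts brightness; the values used in
    the paper are not specified, so these defaults are illustrative.
    """
    out = a * image.astype(np.float32) + b
    # Round and clamp back to the valid 8-bit range
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Example: brighten a synthetic 2x2 grayscale patch
patch = np.array([[100, 150], [200, 250]], dtype=np.uint8)
bright = adjust_brightness_contrast(patch, a=1.5, b=10)
```

Applying several such transforms with different (a, b) pairs, together with flips, rotations, and noise, is a common way to multiply a dataset severalfold, consistent with the 1325-to-7950 expansion reported here.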

YOLOv7 Principle and Structure
The YOLOv7 model has three main parts: the backbone feature extraction network, the neck network, and the detection head network [19], as depicted in Figure 4. The backbone processes the input image at a resolution of 640 × 640 pixels and extracts valuable information via feature extraction. This process mainly depends on the collaboration of the Conv+BN+SiLU (CBS) module, the Efficient Aggregation Network (ELAN) module, the MaxPool (MP) module, and the SPPCSPC module [20]. In the backbone, the Multi-Concat-Block and Transition-Block play pivotal roles. The Multi-Concat-Block comprises four branches, each conducting a varying number of normalized convolution operations on the input feature layer. The results from these four branches are combined and then processed through another normalization convolution operation, producing a final output feature layer of fixed size. This design boosts network depth to enhance accuracy while tackling gradient vanishing issues with its multi-branch stacking and skip connection structure.
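As a minimal sketch, the CBS unit described above can be expressed in PyTorch roughly as follows. This is a reconstruction, not the official YOLOv7 repository code; choices such as `bias=False` (the bias is redundant before BatchNorm) follow common practice:

```python
import torch
import torch.nn as nn

def cbs(in_ch, out_ch, k=3, s=1):
    """The CBS unit (Conv + BatchNorm + SiLU) that YOLOv7's backbone
    modules are built from. Padding k // 2 preserves spatial size at
    stride 1; stride 2 halves it, as in the MP/Transition blocks."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )

# A 640 x 640 RGB input, as in the text, mapped to 32 channels:
x = torch.randn(1, 3, 640, 640)
y = cbs(3, 32)(x)  # spatial size preserved at stride 1
```

Modules such as ELAN and the Multi-Concat-Block are then stacks and concatenations of these CBS units.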
The neck network, on the other hand, uses the PANet network structure [21,22]. This structure combines top-down and bottom-up information flows through a two-way fusion strategy. The neck consolidates information from different backbone and detection layers, enabling multi-scale fusion and extracting three enhanced feature layers.
Finally, the detection head network carefully examines feature points in each layer to determine the exact location, confidence level, and category of the target object.
YOLOv7 builds a feature pyramid structure like YOLOv5 on the three key feature layers of the backbone network to boost feature extraction. This architecture utilizes Multi-Concat-Block and Transition-Block modules to combine feature layers of varying scales through up-sampling and down-sampling operations, enhancing the extraction of high-quality feature layers for target detection. It is worth noting that YOLOv7 utilizes the unique SPPCSPC module in the early stage to perform deep feature extraction for feature layers of size (20, 20, 1024), aiming at expanding the receptive field of the network layer. The SPPCSPC module combines the CSP and SPP modules. It divides the feature layer into two parts: one passes through the module after standardized convolution, while the other undergoes normalized convolution. Additionally, the second part is pooled using the SPP module with four different kernel sizes to integrate features into a fixed-size output layer. This not only helps to improve the receptive field and accuracy of the network layer but also effectively circumvents the repeated extraction of feature information, reduces the computational complexity, and thus improves the computational speed.
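The four-branch pooling step inside SPPCSPC might be sketched as follows. The kernel sizes (5, 9, 13) plus an identity branch, which together give the "four different kernel sizes" mentioned above, follow the common YOLOv7 implementation and are an assumption here:

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Minimal sketch of the spatial pyramid pooling step inside SPPCSPC.

    Each max-pool uses stride 1 and padding k // 2, so every branch keeps
    the input's spatial size and the branches can be concatenated
    channel-wise into one fixed-size output."""

    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernels
        )

    def forward(self, x):
        # Identity branch + three pooled branches, stacked on channels
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

x = torch.randn(1, 1024, 20, 20)      # the (20, 20, 1024) layer in the text
y = SPPBlock()(x)                      # 4 branches x 1024 channels
```

The quadrupled channel count is then reduced back by the subsequent normalized convolution, which is where the fixed-size output layer comes from.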

Improvements to YOLOv7
The network architecture of YOLOv7 consists of three main parts: the input layer, the backbone network, and the detection head [19]. On the input side, YOLOv7 follows the Mosaic data enhancement approach proposed by YOLOv4, which is trained by randomly cropping four images and splicing them into a single image, thus enriching the dataset and improving the training efficiency while keeping the training and inference costs unchanged.

To strike a balance between detection accuracy and efficiency, YOLOv7 introduces a series of innovative strategies, including Reparametrized Convolution (RepConv), the Efficient Lightweight Aggregation Network (ELAN), and dynamic label assignment. However, it is noteworthy that YOLOv7 primarily relies on convolutional operations for feature extraction, which are inherently local operations. Convolutional layers often only model relationships between neighboring pixels, making it difficult to capture long-range dependencies. This limitation may result in the loss of features for small objects and missed detections for immature fruits. To address this issue, an attention mechanism is introduced to pay closer attention to the relationships between pixel features. Serving as a global operation, the attention mechanism computes weights between features through matrix operations, enabling the model to better capture long-range dependencies. The image features obtained from self-attention mechanisms complement those obtained from convolutional operations, collectively enhancing the model's performance.

In addition, the CIoU loss function originally used in YOLOv7 only accounts for the scale loss of the bounding boxes, without considering the mismatch between the predicted and true box orientations. Therefore, this paper adopts the SIoU loss function to replace the CIoU loss function, incorporating orientation loss into the model training process, aiming to further enhance the model's performance.

SE Attention Mechanism
Traditional CNNs emphasize feature representation within channels while overlooking the mapping relationships between channels. The SE module addresses inter-channel relationships through squeeze-and-excitation operations, thereby adaptively adjusting channel responses. In traditional convolutional networks, inter-channel relationships are often implicit and are confined to specific hierarchical levels. However, top-level channels are closely related to tasks; for instance, in segmentation networks, the number of top-level channels corresponds to the number of segmentation categories. In middle layers of the network, the number of channels is typically based on empirical data or test results, which may be derived from real-world applications or extensive experimental analysis. The SE network architecture, proposed in [23], integrates the idea of attention mechanisms into convolutional neural networks to achieve adaptive learning of the importance of each channel.

The SE module introduces a parameter-efficient channel attention mechanism, as illustrated in Figure 5. It achieves this through two steps: squeezing and excitation. In the squeezing phase, each feature map is aggregated into a scalar value, and in the excitation phase, fully connected layers transform these values into a weight vector, which is then used to weight the feature maps. The SE module enables CNNs to learn the importance of each channel, thereby improving model performance. Demonstrating strong performance across various image classification tasks, the SE network has been widely adopted in diverse visual tasks.

The convolution producing the feature map U can be written as

$$u_c = \mathbf{v}_c * \mathbf{X} = \sum_{s=1}^{C'} \mathbf{v}_c^{s} * \mathbf{x}^{s},$$

where $\mathbf{v}_c^{s}$ is a 2D convolution kernel acting on the corresponding channel of the input X. In the squeeze stage (global information integration), in order to comprehensively consider the information of each channel in the output feature map, this paper adopts global average pooling to integrate the global spatial information of each channel into a single-channel descriptor:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).$$

In the excitation stage (adaptive recalibration), in order to adequately capture the interdependencies between channels, this paper employs a simple gating mechanism with a sigmoid activation function to achieve adaptive recalibration of the channel responses:

$$s = F_{ex}(z, W) = \sigma\big(W_2\,\delta(W_1 z)\big),$$

where δ denotes the ReLU function and σ the sigmoid function.

In this paper, the SE module serves as a plug-and-play component designed to address inter-channel dependencies, as illustrated in Figure 6. It utilizes global average pooling to compress global spatial information, i.e., performing the squeeze operation. To fully leverage the aggregated information from the squeeze operation, the proposed model employs the FC-ReLU-FC-sigmoid operation to capture channel dependencies. The SE module can be added to any location of the YOLOv7 network, such as within the C3 module. When integrating the SE module into the YOLOv7 network via the common.py file, unlike the original C3 module, it incorporates the SE module into the bottom bottleneck section. This modification aims to facilitate better experimentation and evaluation of model performance.
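A minimal PyTorch sketch of the squeeze-excitation-rescale pipeline described above; the reduction ratio r = 16 is the value from the original SE paper, not necessarily the one used in this work:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze (global average pooling), excite
    (FC-ReLU-FC-sigmoid), then rescale each channel by its learned
    weight. A plug-and-play sketch intended for insertion into a
    backbone block."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))           # squeeze: (b, c) channel descriptor
        s = self.fc(z).view(b, c, 1, 1)  # excitation: weights in (0, 1)
        return x * s                     # channel-wise recalibration

x = torch.randn(2, 64, 20, 20)
out = SEBlock(64)(x)                     # same shape as the input
```

Because the block preserves the input shape, it can be dropped after any convolutional stage (e.g., inside a bottleneck) without changing the surrounding architecture.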

SIoU Loss Function
In the YOLOv7 algorithm, IoU is short for intersection over union, also known as the "intersection and union ratio". IoU plays a crucial role in target detection and semantic segmentation. The IoU value is defined as the ratio of the intersection to the union of two box areas, as shown in Figure 7.
The predicted box regression originally uses the CIoU loss, but CIoU does not take into account the mismatch between the directions of the predicted box and the true box. The SIoU loss function is therefore introduced, in which the penalty metrics are redefined by taking into account the angle of the vector between the desired regressions. Applied to conventional neural networks and datasets, SIoU has been shown to improve both training speed and inference accuracy. SIoU further considers the vector angles between the true and predicted boxes, redefining the associated loss function. SIoU consists of four components, the angle cost (Λ), the distance cost (∆), the shape cost (Ω), and the IoU cost, as shown in the angle cost parameter schematic in Figure 8. IoU represents the intersection over union between the predicted bounding box (B) and the ground-truth box (B^GT); ∆ signifies the distance cost, aiming to minimize the distance between the centroids of the predicted and ground-truth boxes; and Ω indicates the shape cost, which quantifies the deviation of the predicted bounding box's shape from that of the ground-truth box.

If α ≤ π/4, the convergence process will first minimize α; otherwise, it will minimize β. The angle cost (Λ) is calculated as follows [13]:

$$\Lambda = 1 - 2\sin^{2}\!\left(\arcsin(\sin\alpha) - \frac{\pi}{4}\right)$$

The distance cost (∆) is calculated as follows:

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma \rho_t}\right), \qquad \gamma = 2 - \Lambda,$$

where $\rho_x = \big((b^{gt}_{c_x} - b_{c_x})/c_w\big)^2$, $\rho_y = \big((b^{gt}_{c_y} - b_{c_y})/c_h\big)^2$, and $c_w$ and $c_h$ are the width and height of the smallest box enclosing both frames. The shape cost (Ω) is defined as follows:

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})}.$$

The IoU cost is defined as $1 - IoU$, and the total loss function is as follows:

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2}.$$
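The four cost terms can be combined into a single-box loss roughly as follows. This follows Gevorgyan's published SIoU formulation; the shape-cost exponent theta = 4 is a common default rather than a value reported in this paper:

```python
import math

def siou_loss(pred, gt, theta=4.0):
    """Sketch of the SIoU box loss for single boxes in (cx, cy, w, h) form."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # IoU cost: overlap of the two axis-aligned boxes
    iw = max(0.0, min(px + pw / 2, gx + gw / 2) - max(px - pw / 2, gx - gw / 2))
    ih = max(0.0, min(py + ph / 2, gy + gh / 2) - max(py - ph / 2, gy - gh / 2))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + 1e-9)

    # Angle cost: Lambda = 1 - 2 sin^2(arcsin(sin a) - pi/4)
    sigma = math.hypot(gx - px, gy - py) + 1e-9
    sin_alpha = abs(gy - py) / sigma
    angle = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, normalized by the smallest enclosing box
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)
    gamma = 2 - angle
    dist = sum(1 - math.exp(-gamma * rho)
               for rho in (((gx - px) / cw) ** 2, ((gy - py) / ch) ** 2))

    # Shape cost on width/height mismatch
    shape = sum((1 - math.exp(-w)) ** theta
                for w in (abs(pw - gw) / max(pw, gw),
                          abs(ph - gh) / max(ph, gh)))

    return 1 - iou + (dist + shape) / 2

loss_same = siou_loss((50, 50, 20, 20), (50, 50, 20, 20))  # ~0 for identical boxes
```

For identical boxes, every cost term vanishes, so the loss is (numerically) zero; the loss grows as the predicted box drifts in position, angle, or shape.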

Lightweight Improvement
RepBlock is obtained via structural re-parameterization; multi-branch networks often outperform single-path networks in classification tasks, yet this advantage can result in higher inference latency [24]. Our paper draws inspiration from RepVGG. We aim to balance accuracy and speed by adopting RepBlock, an efficient and tunable backbone network. RepBlock capitalizes on specific hardware acceleration features, and after training, we significantly reduce inference latency by transforming the multi-branch topology into a single 3 × 3 convolution layer (RepConv) with ReLU activation. Figure 9 illustrates the backbone network structure of RepBlock. In this paper, we performed an equivalent mapping on the original multi-branch architecture. Specifically, each branch had its convolutional kernels transformed into 3 × 3 convolutional kernels: 1 × 1 kernels were zero-padded to 3 × 3, and for the shortcut branch, which contains no convolutional layer, fixed-value identity kernels were introduced. After equivalence mapping, we merged the convolutional layer and BN layer of each branch, resulting in a single biased convolutional layer, as shown in Figure 10. The convolutional parameters and bias of the merged layer are denoted as W′ and b′, respectively. The structure of RepBlock after merging is illustrated in Figure 10, and the relationship between its input and output operations is represented by Equation (14).
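The conv-BN merging step described above can be sketched as follows, assuming a bias-free convolution as in RepVGG. The shapes and the toy 1 × 1 check are illustrative:

```python
import numpy as np

def fuse_conv_bn(W, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution, the core
    re-parameterization step behind RepBlock/RepVGG.

    W has shape (out_ch, in_ch, k, k); gamma/beta/mean/var are the BN
    scale, shift, running mean, and running variance per output channel.
    The conv is assumed bias-free, as in RepVGG."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel factor
    W_fused = W * scale[:, None, None, None]  # fold the scale into the kernel
    b_fused = beta - mean * scale             # fold the shift into a bias
    return W_fused, b_fused

# Quick check with a 1x1 convolution acting on a single pixel:
W = np.array([[[[2.0]]]])                     # (1, 1, 1, 1) kernel
x = 3.0
gamma, beta = np.array([1.5]), np.array([0.5])
mean, var = np.array([1.0]), np.array([4.0])
Wf, bf = fuse_conv_bn(W, gamma, beta, mean, var, eps=0.0)
y_bn = gamma * (W[0, 0, 0, 0] * x - mean) / np.sqrt(var) + beta
y_fused = Wf[0, 0, 0, 0] * x + bf             # identical output, one layer
```

Because BN at inference time is an affine per-channel map, conv-then-BN is exactly equal to a single biased convolution; summing the fused kernels of all branches then collapses the whole block into one 3 × 3 layer.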
In this paper, we utilized the convolutional kernel fusion technique described in Equation (15) to convert the three sets of branch parameters into a single set of convolutional parameters.

Experimental Environment
In this paper, the experimental tests were carried out on the deep learning server provided by the Smart Agriculture Laboratory of Shanxi Agricultural University. The server runs the Windows 10 operating system and is equipped with an Intel® Core™ i7-7700HQ processor (Intel, Santa Clara, CA, USA) (clocked at 3.80 GHz) and an NVIDIA GeForce RTX 3090 graphics card (NVIDIA, Santa Clara, CA, USA). The software used in the experiments and their versions are shown in Table 1.

Evaluation Indicators
To assess the detection performance of the model, this paper utilizes three metrics, precision (P), recall (R), and average precision (AP), as the evaluation criteria [21]. Precision, P, is the ratio of true positive detections to the total number of positive predictions made by the model and is calculated by the following formula [15]:

$$P = \frac{TP}{TP + FP} \tag{16}$$

R measures the ratio of true positive detections to the total number of actual positive instances. It is calculated as follows:

$$R = \frac{TP}{TP + FN} \tag{17}$$

where TP denotes the number of true positive detections, FP represents the number of false positive detections, and FN indicates the number of false negative detections.

The average precision (AP) represents the area under the precision-recall curve, which is plotted by combining points from both precision and recall metrics. The formula is

$$AP = \int_{0}^{1} P(R)\,dR \tag{18}$$

where P(R) is the value of P corresponding to R on the PR curve [22].
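Computing AP as the area under the P-R curve can be sketched as follows. The all-point interpolation scheme used here is an assumption, since the paper does not state which variant (11-point vs. all-point) it adopted:

```python
import numpy as np

def average_precision(precision, recall):
    """AP as the area under the P-R curve via all-point interpolation.

    `precision` and `recall` are parallel lists of operating points
    sorted by increasing recall."""
    # Add sentinel endpoints at recall 0 and 1
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left,
    # as in the VOC all-point method
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas wherever recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Two operating points: P = 1.0 at R = 0.5, then P = 0.5 at R = 1.0
ap = average_precision([1.0, 0.5], [0.5, 1.0])
```

For the two-point example, the area is 0.5 × 1.0 + 0.5 × 0.5 = 0.75; a perfect detector (P = 1 at R = 1) yields AP = 1.0.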

Improved Experimental Precision
The improved YOLOv7 network underwent systematic training on a training set containing 1340 images of potted dwarf apple trees. To verify the effectiveness of this method, we further evaluated 333 images of dwarf apple trees from an independent test set. According to the results presented in Table 2, the proposed model demonstrated excellent performance, with an accuracy rate of up to 87.22% and a recall rate of 90.02%. These results fully prove the reliability and practicality of the improved model. Figure 12 presents the test accuracy of the experimental models.

Discussion
To further validate the advantages and efficacy of the proposed improved YOLOv7 model for potted dwarf apple detection tasks in orchards, we meticulously adjusted the model parameters and conducted a series of comparative experiments. These experiments used industry-representative target detection models, namely YOLOv7, YOLOv5, and YOLOX. The experimental results demonstrate the performance differences between the algorithms in the dwarf apple recognition and detection task, as shown in Figure 13, by comparing the performance of YOLOX, YOLOv5, YOLOv7, and the improved YOLOv7 model in key metrics such as recognition accuracy, recall rate, and average precision. After comparing the close-up detection images, we observed that all four detection models exhibited certain detection capabilities. Among them, YOLOv7 stood out in the precision of its bounding boxes, with both their completeness and accuracy surpassing the other models. All models showed varying degrees of deficiencies in the completeness of detection annotations.

In the long-range small-target detection images, by introducing the SE attention mechanism, our method significantly enhanced the feature extraction capability. Under these circumstances, YOLOX and YOLOv5 failed to detect all targets completely, while YOLOv7 successfully detected all targets but misjudged the coloring and ripening periods. This indicates that the improved YOLOv7 not only effectively prevents the omission of small targets in long-range images but also enhances detection accuracy. Moreover, the improved algorithm demonstrated optimal detection performance when processing dense images. In summary, the improved YOLOv7 exhibited outstanding precise recognition capabilities in various environments, including long-range, close-up, and dense scenes.

As shown in Table 3, the detection accuracy of the YOLOX model for potted dwarf apples was 81.91%, the recall rate was 86.35%, and the average precision was 89.65%; the YOLOv5 model had a detection accuracy of 79.50%, a recall rate of 81.83%, and an average precision of 86.65%; the YOLOv7 model had a detection accuracy of 82.34%, a recall rate of 88.84%, and an average precision of 90.04%; and the improved YOLOv7 model had a detection accuracy of 87.22%, a recall rate of 90.02%, and an average precision of 96.65%. The harmonic means of the precision and recall rates for the YOLOX, YOLOv5, YOLOv7, and improved YOLOv7 models were 0.84, 0.83, 0.84, and 0.89, respectively. The experimental data were plotted as a PR curve, as shown in Figure 14.

Conclusions
The utilization of the improved YOLOv7 model for detecting the coloring and ripening periods of dwarf rootstock apples contributes to effective fruit tree harvesting and precise greenhouse management. It promotes the green development of fruit production and strengthens the innovation of smart agricultural technology, thus making a positive contribution to the sustainable development of the apple industry. The proposed model uses an efficient, re-parameterizable backbone network called RepBlock for a lightweight network, introduces the SE attention mechanism to optimize features, and uses the SIoU loss function to improve the ability to detect obscured apples. The proposed model shows significant improvements in precision, recall, and average precision while remaining lightweight. Compared to the YOLOv7 model, it achieved remarkable increases of 4.88%, 1.18%, and 5.61% in precision, recall, and average precision, respectively. In conclusion, the proposed model exhibits significant potential and advantages in detecting the coloring and maturity stages of dwarf rootstock apples. In future research, further optimization of the detection model can enhance its precision and real-time performance. Further research on deploying the improved model into practical applications, such as quantitative fertilization during the coloring stage and real-time harvesting during the ripening period, can save resources, protect the environment, and improve farming efficiency, thereby promoting the sustainable development of agriculture. Future research and applications will drive advancements in smart agriculture, offering precise management of resources like water, soil, and fertilizers. This reduces biochemical use, lowers labor intensity, and promotes sustainable agricultural development.

Figure 1. Potted dwarf apple trees and cultivated dwarf apple trees.

The final output of the SE block is obtained by rescaling U channel-wise by the scale factor s:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c.$$

Figure 10. RepBlock equivalence mapping. (a) The structure and parameters of RepBlock; (b) the structure and parameters of RepBlock after equivalence mapping.
As shown in Figure 11, this method effectively reduced the number of convolutional kernels by two-thirds and significantly improved the inference efficiency of the network.

Figure 13. Detection results of different models under various conditions. (a) Near distance; (b) long distance; (c) coloring stage; (d) maturity stage.

Figure 14. PR curves of different detection models.

Table 1. System environment of the experiment.

Table 3. Comparison of detection precision.