Article

ASHM-YOLOv9: A Detection Model for Strawberry in Greenhouses at Multiple Stages

by
Yan Mo
1,2,*,
Shaowei Bai
1 and
Wei Chen
3,*
1
School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
2
College of Aeronautics Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
3
College of Geoscience and Surveying Engineering, China University of Mining & Technology, Beijing 100083, China
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8244; https://doi.org/10.3390/app15158244
Submission received: 26 June 2025 / Revised: 19 July 2025 / Accepted: 20 July 2025 / Published: 24 July 2025

Abstract

Featured Application

A novel strawberry detection model was developed to accurately detect strawberries and determine the growth conditions of strawberry plants in greenhouses.

Abstract

Strawberry plants require different soil water-holding capacities and fertilizer amounts at different growth stages, so determining the growth stage has important guiding significance for irrigation, fertilization, and picking. Quick and accurate identification of strawberry plants at different stages can provide important information for automated strawberry planting management. We propose an improved multistage identification model for strawberries based on the YOLOv9 algorithm: the ASHM-YOLOv9 model. The original YOLOv9 showed limitations in detecting strawberries at different growth stages, particularly lower precision in identifying occluded fruits and immature stages. We enhanced the YOLOv9 model by introducing Alterable Kernel Convolution (AKConv) to improve recognition efficiency while maintaining precision. A squeeze-and-excitation (SE) network was added to increase the network's feature extraction and feature fusion abilities. Haar wavelet downsampling (HWD) was applied to optimize the Adaptive Downsampling (ADown) module of the initial model, thereby increasing the precision of object detection. Finally, the CIoU loss function was replaced by the Minimum Point Distance based IoU (MPDIoU) loss function to effectively solve the problem of low precision in identifying bounding boxes. The experimental results demonstrate that, under identical conditions, the improved model achieves a precision of 97.7%, a recall of 97.2%, an mAP50 of 99.1%, and an mAP50-95 of 90.7%, which are 0.6%, 3.0%, 0.7%, and 7.4% greater than those of the original model, respectively. The parameters, model size, and floating-point operations were reduced by 3.7%, 5.6%, and 3.8%, respectively. The improved model thus significantly outperforms both the original model and the other models compared. These experiments show that the model can provide technical support for multistage identification in strawberry cultivation.

1. Introduction

Strawberry is an important economic crop that is rich in nutrients and vitamins. China is a major producer, exporter, and consumer of strawberries, with its sown area and output ranking first in the world. According to statistics, the strawberry planting area in China reached 110,000 hectares in 2018, accounting for approximately one-third of the world's strawberry planting area [1].
The growth of strawberries can be divided into several main periods. The flowering period typically begins in spring, when strawberry plants develop flower buds that gradually open into flowers. The growth period follows flowering: the pollinated flowers enter the fruit-development phase, during which small fruits begin to form. During the ripening period, the fruits gradually enlarge and change color from green to red, indicating maturation [2]. The demands for nutrients and water vary significantly across these stages, making accurate stage identification crucial for scientific management, yield improvement, and quality enhancement [3]. Traditional manual identification faces several critical limitations: it is labor-intensive and time-consuming and cannot meet the demands of large-scale agricultural production; subjective assessment varies among operators, leading to inconsistent results; it scales poorly for continuous monitoring in greenhouse environments; and its high operational costs reduce agricultural profitability. These limitations necessitate automated, objective, and efficient identification systems. Existing computer vision approaches for crop identification face specific challenges: similarity between adjacent growth stages, particularly between the growth and ripening periods, where fruit sizes are nearly identical; complex greenhouse environments with variable lighting, leaf occlusion, and fruit clustering; and limited dataset diversity, as most existing studies focus on mature strawberries in field conditions rather than comprehensive multistage identification in controlled environments.
With the acceleration of industrialization, traditional manual picking techniques are no longer capable of fulfilling the requirements of modern agriculture and are time-consuming and costly, thereby necessitating innovative solutions. In recent years, the advent of automated strawberry picking technologies based on image recognition has led to revolutionary changes in this field. These technologies can accurately determine the ripeness of strawberries, thereby increasing their picking efficiency and precision [4]. Therefore, improving the ability to accurately identify the various growth stages of strawberries is not only an important step toward promoting agricultural modernization but also critical for ensuring food security, increasing farmers’ income, and promoting sustainable development. Precise delineation of strawberry phenological phases is indispensable for implementing stage-specific interventions, maximizing productivity, and maintaining harvest integrity—core imperatives for advancing sustainable, high-value horticulture [5,6].
In practical agricultural image recognition, traditional machine learning methods and image processing algorithms have gradually lost their advantages. These methods often require significant manual intervention and cumbersome feature extraction processes, leading to a heavy workload and insufficient accuracy [7]. In contrast, target detection algorithms, which rely on deep learning and convolutional neural networks (CNNs), can automatically extract intricate information from images, thereby remarkably increasing recognition accuracy [8,9]. This approach not only reduces labor costs but also demonstrates greater adaptability in handling complex scenes, resulting in more accurate recognition outcomes and advancing the development of agricultural intelligence.
In recent studies, object detection technology has overcome the bottleneck of previous technologies and is gradually developing in a more efficient and accurate direction [10]. During the detection phase, object detection approaches can be divided into two principal categories.
One type is the two-stage object detection algorithm, which is characterized by the necessity of generating candidate regions prior to identification. The faster region-based convolutional neural network (Faster R-CNN) and the mask region-based convolutional neural network (Mask R-CNN) are typical two-stage object detection algorithms [11,12]. In crop recognition, the counting accuracy of fruit quantity is often an important metric for assessing model performance. For example, Li [13] used a revised Faster R-CNN to identify and count ripe strawberries, replacing the original RoIPooling algorithm with the RoIAlign algorithm, which uses bilinear interpolation for calculation. The average identification precision of the model reached 87.33%. Toward a more fine-grained recognition of strawberry fruits, Tang [14] employed a modified Mask R-CNN to conduct fine classification and recognition of ripe strawberries. The original network was enhanced by incorporating self-calibrated convolution, resulting in an average precision of 93.7% for the upgraded model.
Another type is the single-stage object detection algorithm, which includes algorithms such as the single-shot multibox detector (SSD) and the You Only Look Once (YOLO) family [15,16]. Single-stage object detection algorithms do not require the generation of candidate regions and have the merits of a short training time and a light computational burden; they therefore have greater advantages in recognizing large quantities of crops. For example, Ridho [17] evaluated the precision of a strawberry fruit-picking robot based on SSD convolutional neural networks. By performing transfer learning on SSD-MobileNet V1 with three retraining sessions, the robot could operate at higher frame rates with 90% precision, detecting and differentiating both good and bad strawberries. Recently, the YOLO series of algorithms has demonstrated significant advantages in crop recognition, primarily characterized by speed, accuracy, and outstanding performance in recognizing different varieties of crops; algorithms such as YOLOv5, YOLOv7, and YOLOv8 have been widely applied in agriculture. Li [18] proposed a real-time multilevel strawberry detection algorithm based on YOLOv5 and introduced the adaptive spatial feature fusion (ASFF) module. Experiments were conducted on the image recognition of strawberries at different stages in complex scenes. The results show that YOLOv5-ASFF improved detection precision and performed excellently in all aspects, with precision and recall reaching 87.19% and 88.88%, respectively, 0.48% and 1.71% higher than those of the original YOLOv5s algorithm. This study provides ideas and technical support for strawberry identification in complex field scenes. Yang [19] introduced a novel deep learning methodology that enhances tea yield estimation by precisely counting tea buds in croplands.
They integrated a squeeze-and-excitation network into the YOLOv5 model, incorporating the Hungarian matching algorithm and the Kalman filtering algorithm to ensure precise and dependable quantification of tea buds. Xu [20] developed TrichomeYOLO, based on YOLOv5, to measure the length and density of maize trichomes in scanning electron microscope images. Chai [6] utilized YOLOv7 and AR technology to determine the maturity of strawberry plants. The results indicated that their model could identify the maturity of strawberry plants well, with mAP and F1 values of 89% and 92%, respectively. The research also indicated that the YOLOv7-tiny model can reduce the monitoring time, although its detection performance is slightly lower. Yang [21] proposed and evaluated the LS-YOLOv8s model, which adds an LW-Swin Transformer module, for strawberry maturity detection. The results indicate that LS-YOLOv8s increases mAP50 by 1.6% compared with YOLOv5s while using only 51.93% of the parameters, and its detection precision reached 94.4%, which provides good technical support for strawberry object detection.
Although existing studies have achieved relatively accurate strawberry object detection, the datasets used in most of them are based on mature strawberry plants in the field, and the flower and growth stages are covered only to a limited extent. In recent years, greenhouse strawberry cultivation has become increasingly widespread. Field surveys have revealed that strawberry plants grown in greenhouses often exhibit inconsistent growth stages. Therefore, it is increasingly important to perform object detection of greenhouse-grown strawberry plants at different stages. To address these issues, our research focuses on the YOLOv9 model, a newly released version of the YOLO series with higher precision and faster inference speed, improving and optimizing the algorithm to identify the different stages of strawberry growth and building a multistage object detection model for strawberries grown in greenhouses. Among existing crop recognition algorithms, two-stage detection algorithms (such as Faster R-CNN and Mask R-CNN) demonstrate superior accuracy but suffer from high computational overhead due to their complex candidate region generation, making them unsuitable for real-time detection requirements. In contrast, single-stage detection algorithms achieve a better balance between speed and accuracy. Among the YOLO series, YOLOv5 has become the mainstream choice for agricultural applications because of its mature architecture and stable performance; however, it still has limitations in multistage target detection in complex greenhouse environments. YOLOv7 improves detection accuracy through reparameterizable modules, but it increases model complexity and lacks adaptability across different growth stages.
YOLOv8 shows overall performance improvements but still lacks targeted optimization for recognizing strawberries at different growth stages in greenhouse environments. The newly released YOLOv9 model achieves higher precision and faster inference speed through more advanced architectural design, providing a better foundation platform for multistage strawberry detection in greenhouses. Based on this analysis, this study selects YOLOv9 as the base model and focuses on algorithm improvement and optimization tailored to the characteristics of different growth stages of greenhouse strawberry, aiming to construct a strawberry detection model capable of accurately identifying multiple stages, such as flowering and growth periods, to meet the practical requirements of strawberry growth monitoring in greenhouse environments.
In this study, we used the greenhouse strawberry dataset we constructed to create an ASHM-YOLOv9 model with higher precision and a smaller volume by introducing modules such as adjustable kernel convolution, the SE network, Haar wavelet downsampling, and other modules. The remainder of this article is structured as follows: Section 2 presents the experimental data and methods. An analysis of the comparative experiments and ablation experiments is presented in Section 3. Finally, Section 4 presents the conclusions and analysis.
The main accomplishments of this research are as follows:
(a)
A strawberry image dataset consisting of 2682 images was established via digital camera photography and data augmentation methods, which included a total of 13,668 strawberry samples across three growth stages.
(b)
An ASHM-YOLOv9 model for multistage strawberry recognition based on YOLOv9 was introduced. This model incorporates modules such as Alterable Kernel Convolution (AKConv), Squeeze-and-Excitation (SE) Networks, and Haar Wavelet Downsampling (HWD). For the constructed dataset, the model attained an accuracy of 97.7%.
(c)
A comparative analysis was conducted on the constructed dataset against other mainstream algorithms in the YOLO series, which demonstrated that our model performed the best.

2. Research Area and Dataset

2.1. Research Area and Data Collection

The strawberry dataset used in this study was collected at the Ecological Agriculture Science and Technology Park of the China Academy of Agricultural Machinery, Changping District, Beijing (40°09′49″ N, 116°17′11″ E), as shown in Figure 1. Strawberries at various growth stages are cultivated in a greenhouse there. Images of the strawberry plants were collected in January 2023 in different rows within the greenhouse, with two shooting sessions conducted two weeks apart, both at noon. The greenhouse maintained controlled environmental conditions during the data collection period: the average temperature was 22 ± 3 °C during the day and 15 ± 2 °C at night, and the relative humidity was maintained at 60–70%. The two-week interval was specifically chosen to capture the natural progression of strawberry growth stages, as strawberries typically advance from flowering to fruit development and maturation within 14–21 days under greenhouse conditions. The interval ensured that we could document the complete transition between growth stages (R4–R6) while maintaining consistent environmental conditions. The acquisition devices were a HUAWEI Mate 40 Pro smartphone and a PENTAX KS2 SLR camera. The images had resolutions of 4096 × 3072 and 5408 × 3680 pixels and were stored in JPG format, as shown in Figure 2. The collected images encompass different angles, planting densities, and growth stages and are not restricted to a single strawberry variety.

2.2. Dataset Preprocessing

Based on the phenological stages and growth habits of strawberries, the annual phenological period can be divided into six stages (R1–R6): the reproductive stage, the floral bud differentiation stage, the dormancy stage, the flowering stage, the fruiting stage, and the ripening stage. The first three stages are less practical for image recognition and object detection [22]. Therefore, in this study, the last three phenological stages of strawberry plants were selected for experimentation. The characteristics of these three phenological stages are summarized in Table 1. LabelImg, a PyQt5-based annotation tool, was then used to label each image in the dataset to form a sample set. Each target was labeled with its smallest enclosing rectangular box, and samples with severe occlusion (greater than 90%) or severe blurring caused by long shooting distance were not labeled. The labels were saved as YOLO-format txt files and used as model input for training.
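The YOLO txt label format mentioned above stores one object per line: a class index followed by a normalized box center and size. As a minimal illustration (the helper name and example box are hypothetical, not from the paper), a pixel-space bounding box can be converted as follows:

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel-space box (xmin, ymin, xmax, ymax) to a YOLO-format
    label line: 'class cx cy w h', all coordinates normalized to [0, 1]."""
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2.0 / img_w   # normalized box center x
    cy = (ymin + ymax) / 2.0 / img_h   # normalized box center y
    w = (xmax - xmin) / img_w          # normalized box width
    h = (ymax - ymin) / img_h          # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a hypothetical mature-stage berry (class 2) on an 800 x 600 image
print(to_yolo_line(2, (100, 150, 300, 350), 800, 600))
```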

2.3. Data Augmentation

Through screening and exclusion of the original data, a dataset of 670 sample images was finally obtained. To ensure training efficiency, the original images were first downscaled, with all images resampled to 800 × 600 pixels. To improve the model's feature recognition ability, data augmentation was performed on the sample images: the original images were subjected to operations such as mirroring, noise addition, and exposure processing to increase the sample size. A total of 2682 processed images were obtained.
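The three augmentation operations described (mirroring, noise addition, and exposure processing) can be sketched with NumPy; the parameter values below are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def mirror(img):
    # Horizontal flip (mirroring) of an (H, W, C) image
    return img[:, ::-1, :]

def add_noise(img, sigma=10.0, seed=0):
    # Additive Gaussian noise, clipped back to the valid 8-bit range
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def adjust_exposure(img, gain=1.3):
    # Simple exposure change: scale intensities and clip
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

img = np.zeros((600, 800, 3), dtype=np.uint8)  # one resampled 800 x 600 frame
augmented = [mirror(img), add_noise(img), adjust_exposure(img)]
```

Each operation preserves the image shape, so the augmented copies can be labeled by transforming the original bounding boxes accordingly.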

2.4. Construction and Analysis of the Strawberry Dataset

After data augmentation, a total of 13,668 strawberry samples were ultimately obtained. Among them, 2855 samples were in the flower stage, 6083 samples were in the growth stage, and 4730 samples were in the mature stage. As shown in Figure 3, the main challenges in strawberry recognition are as follows:
Similarity in labels between the fruit development period and the fruit ripening period: As illustrated in Figure 3, the fruit sizes within the fruit development period and the fruit ripening stage are nearly identical, making it difficult to distinguish between these two stages during transitional phases.
Complex image backgrounds: Strawberries often grow in relatively complex environments, where issues such as occlusion between fruits, leaf obstruction, or overly intricate backgrounds can occur. The ability to increase the recognition of strawberry ripeness within such complex surroundings represents a crucial challenge.
Finally, the sample images were randomly split into a training collection of 2145 images, a validation collection of 268 images, and a test collection of 269 images at a ratio of 8:1:1. This split maintains proportional representation across growth stages, with each stage having ~80% samples in training, ~10% in validation, and ~10% in testing. The exact sample counts per stage are as follows: Flowering: 2284 training/285 validation/286 test; Growth: 4866 training/608 validation/609 test; Mature: 3784 training/473 validation/473 test.
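A per-stage 8:1:1 split like the one described can be sketched as follows; with the per-stage sample totals reported above, this floor-based split reproduces the stated counts (2284/285/286, 4866/608/609, and 3784/473/473). The function name and seed are illustrative assumptions:

```python
import random

def stratified_split(samples_by_stage, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split each stage's sample list 8:1:1 so every growth stage keeps
    proportional representation in the train/val/test collections."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for stage, items in samples_by_stage.items():
        items = items[:]
        rng.shuffle(items)
        n = len(items)
        n_train = int(n * ratios[0])   # floor of 80%
        n_val = int(n * ratios[1])     # floor of 10%
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to test
    return train, val, test

stages = {"flower": list(range(2855)), "growth": list(range(6083)),
          "mature": list(range(4730))}
train, val, test = stratified_split(stages)
print(len(train), len(val), len(test))  # -> 10934 1366 1368
```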

3. Methods

This section introduces the original YOLOv9 network as well as our improved ASHM-YOLOv9 network, with the aim of improving the detection accuracy of strawberry maturity. The ASHM-YOLOv9 network incorporates the SE attention mechanism, replaces certain convolution methods, and enhances the original model’s ADown downsampling algorithm with the HWD downsampling algorithm. Finally, the loss function is replaced. The workflow of this study is shown in Figure 4.

3.1. YOLOv9 Object Detection Model

The YOLOv9 object detection model is the newly introduced version of the YOLO series [23]. This approach can output directly while bypassing intermediate layers and addresses the problem of substantial information loss as input data pass through feature extraction and spatial transformation at each layer of a deep network. It introduces the concept of programmable gradient information (PGI) and designs a Generalized Efficient Layer Aggregation Network (GELAN) on the basis of gradient path planning. PGI mainly comprises a main branch, an auxiliary reversible branch, and multilevel auxiliary information; GELAN integrates the CSPNet and ELAN networks, which reduces computational cost and improves efficiency [24,25]. On the MS COCO dataset, comparisons between YOLOv9 and other advanced detectors (YOLO-MS [26], YOLOv7 [27], YOLOv8 [28], YOLOv5 [29], etc.) show that YOLOv9 outperforms the other models in computational efficiency, parameter utilization, and precision. The YOLOv9 family can be categorized by size into YOLOv9t, YOLOv9s, YOLOv9m, YOLOv9c, and YOLOv9e. For this study, we selected YOLOv9c, which achieves the highest performance on the MS COCO dataset, for our experiments [23].

3.2. ASHM-YOLOv9 Object Detection Model

To address the fact that the original YOLOv9 model does not effectively monitor the growth stages of strawberries, we improved it and propose the ASHM-YOLOv9 model. The architecture of the ASHM-YOLOv9 object detection model is illustrated in Figure 5. The black dotted frame represents the backbone of the main network branch; the orange dotted frame represents the neck of the main network branch; the green dotted frame represents the head of the main network branch; and the red dotted frame represents the auxiliary training branch of ASHM-YOLOv9, which takes part only in the training phase and is not used during inference. During training, the outputs of the main branch and the auxiliary branch are fed back to the third, fourth, and fifth convolutional layers of the main branch for regression. The red blocks mark the improvements made to the original YOLOv9 model, which has been refined in the following three aspects. The introduction of AKConv allows the convolution kernel size to be adjusted automatically according to the shape features of strawberries, effectively capturing their features at different stages and thereby improving identification efficiency [30].
The SE network is integrated to enhance the model's feature extraction and feature fusion abilities by combining spatial and channel information to construct comprehensive strawberry features, improving channel information extraction and overall network performance [31]. Haar wavelet downsampling is introduced to improve the ADown algorithm of YOLOv9 and further increase the precision of object detection [32]. In addition, this study uses the MPDIoU loss function [33] as a substitute for the CIoU function, which effectively improves the identification precision of the bounding box.

3.2.1. Alterable Kernel Convolution

Two deficiencies are ubiquitous in traditional convolution operations. First, the calculation is restricted to a specific local window, preventing it from retrieving information from other locations. Second, the sampling shape is a square of fixed size k × k, and as k increases, the number of parameters of the convolutional module grows rapidly. In practical applications, objects in different datasets vary in size and shape. Zhang [30] proposed Alterable Kernel Convolution (AKConv), which can utilize efficient convolution kernels with a specific number of parameters to extract features, increasing network performance while trimming down the model's parameters and reducing computational complexity. It was validated on the representative VOC [34] and COCO [35] datasets. AKConv has the following advantages: it permits the convolution kernel to possess an arbitrary number of parameters, making it adjustable in size and shape according to actual requirements and thus better adapted to target variations; it presents a novel algorithm for generating initial sampling coordinates, strengthening its adaptability to targets of diverse sizes; it enhances the precision of feature extraction by modifying the sampling positions of irregular convolution kernels according to the learned offsets; and it reduces the model parameters and computational overhead.
The AKConv model is presented in Figure 6. AKConv takes image input in the format (C, H, W), where C is the number of channels, H the height, and W the width, and N is the dimension of the convolution kernel. First, the offsets of the kernel are obtained through a Conv2d convolution operation. The revised coordinates are then obtained by adding the offsets to the original coordinates, and the geometry of the convolution kernel is sampled accordingly. Finally, the resampled features are reshaped, convolved, normalized, and activated by the SiLU function before output.
The principle is as follows. First, assume that a convolution sampling grid of size 3 × 3 exists. Let K represent the sampling grid, which can be expressed as (1):
K = {(−1, −1), (−1, 0), …, (0, 1), (1, 1)}
However, such a sampling grid is regular, as it is typically centered at the point (0, 0). In contrast, irregular sampling grids do not have a central point. To accommodate irregular kernel shapes, a solution in which the top-left corner is set as the sampling origin is proposed. The convolution operation at a sampling position S_n relative to the origin S_0 is defined as follows (2):
Conv(S_0) = λ(S_0 + S_n)
Here, λ represents the convolution parameters, which allow for the effective implementation of irregular convolution operations.
Unlike traditional convolution methods, AKConv can adjust the shape of the convolution kernel in a targeted manner on the basis of the shape characteristics of strawberries at different stages and determine the required number of parameters accordingly. It can automatically adjust the size of the convolution kernel during processing to effectively capture the characteristics of the flower, growth, and mature stages and thus improve identification efficiency. Therefore, we added AKConv to the convolutional module of the original model to form the RepNCSPELAN4-AKConv convolutional layer for experimentation.
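To illustrate the coordinate logic behind AKConv (a simplified sketch under our own assumptions, not the authors' implementation), initial sampling coordinates for an arbitrary parameter count can be laid out from the top-left origin, after which learned offsets shift each position as in Equation (2):

```python
import numpy as np

def initial_coords(num_params):
    """Initial sampling coordinates for an irregular kernel with an
    arbitrary number of parameters, using the top-left corner as origin
    (a simplified sketch of AKConv's coordinate generation)."""
    base = int(np.ceil(np.sqrt(num_params)))
    # Fill row by row until num_params positions are placed
    coords = [(i // base, i % base) for i in range(num_params)]
    return np.array(coords, dtype=np.float32)

def apply_offsets(coords, offsets):
    # Learned offsets shift each sampling position: S_0 + S_n + offset
    return coords + offsets

coords = initial_coords(5)   # a 5-parameter kernel has no square shape
shifted = apply_offsets(coords, np.full_like(coords, 0.5))
print(coords.shape, shifted.shape)
```

In the full module, the shifted coordinates are used to resample the feature map (e.g., via bilinear interpolation) before the reshape-convolve-normalize step.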

3.2.2. Squeeze-and-Excitation Networks

The core module of a convolutional neural network (CNN) is the convolution operator, which constructs informative features by integrating spatial and channel information; improving the extraction of channel information can therefore enhance network performance. To this end, Hu [31] focused on improving feature channels and proposed a new architectural unit, the squeeze-and-excitation (SE) network, with the aim of enhancing network quality by modeling the interdependence between feature channels. The SE module's structure is presented in Figure 7. First, a conventional convolution operation is performed on the feature map (h′, w′, B′) to obtain a new feature map (h, w, B). Global average pooling is then applied over the spatial dimensions, compressing the feature map into a 1 × 1 × B feature vector. This is followed by two fully connected layers with activations to obtain the channel weights, and the convolved feature map is multiplied by these weights to generate the final output. The module comprises two principal components: squeeze and excitation. The squeeze module performs global average pooling on the input feature map, reducing each channel to a single value and forming a global vector; the detailed operation is given in Equation (3). The excitation module uses the ReLU function [36] and the sigmoid activation function [37]: the vector is dimensionally reduced and then restored to generate a normalized weight vector, as shown in Equation (4). Finally, the original feature map is multiplied by the weights to selectively enhance or suppress the original features. Introducing SE networks into the model can improve identification precision without increasing the computations and parameters.
F_Squeeze = (1/(h × w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} X(i, j)
where Fsqueeze is the squeeze function; X is the original image; h is the height of the image; and w is the width of the image.
F_Excitation(f, α) = σ(g(f, α)) = σ(α_2 · δ(α_1 · f))
where σ is the sigmoid activation function; δ is the ReLU activation function; α1 and α2 represent the weights of the fully connected layer; and f is the aggregation feature of the original image after compression.
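Equations (3) and (4) can be condensed into a small NumPy sketch of an SE block; the channel-first layout, reduction ratio, and random weights below are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (B, h, w) feature map, following
    Equations (3) and (4): global average pool -> FC + ReLU -> FC +
    sigmoid -> channel-wise rescaling."""
    f = x.mean(axis=(1, 2))                    # squeeze: one value per channel
    s = sigmoid(w2 @ np.maximum(w1 @ f, 0.0))  # excitation weights in (0, 1)
    return x * s[:, None, None]                # reweight each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # 8 channels, 4 x 4 spatial map
w1 = rng.standard_normal((2, 8))     # reduction layer (ratio r = 4)
w2 = rng.standard_normal((8, 2))     # restoration layer
y = se_block(x, w1, w2)
print(y.shape)
```

Because the sigmoid weights lie in (0, 1), the block can only attenuate (or preserve) each channel's response, which is the selective enhancement/suppression described above.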

3.2.3. HWD—ADown Module

Deep learning neural networks usually use standard downsampling modules such as the Conv module, but these sampling methods often ignore boundaries and textures, resulting in information loss. To solve this issue, YOLOv9 introduces the ADown module [27], which replaces the traditional downsampling operation with a lightweight and flexible design that not only curtails the model size but also enhances the precision of object detection, providing an effective downsampling solution for real-time object detection. The Haar wavelet downsampling (HWD) block is frequently utilized as a replacement for convolution or pooling layers, can enhance the precision of object detection, and can be seamlessly integrated into YOLO-series frameworks [32]. Therefore, in this paper, we combine HWD with the ADown module to further enhance the precision of object detection.
The wavelet basis function and scale function of the 1-stage, one-dimensional Haar transform are defined as follows (5):
Φ_1(x) = (1/√2) Φ_{1,0}(x) + (1/√2) Φ_{1,1}(x)
Ψ_1(x) = (1/√2) Φ_{1,0}(x) − (1/√2) Φ_{1,1}(x)
Here, Φ i , j x is defined as (6):
Φ_{i,j}(x) = 2^{i/2} Φ_0(2^i x − j),  j = 0, 1, …, 2^i − 1
Here, the parameters i and j represent the order of the Haar basis function. Furthermore, Φ_{0,0}(x) is defined as (7):
Φ_{0,0}(x) = Φ_0(x) = 0 for x < 0;  1 for 0 ≤ x < 1;  0 for x ≥ 1
Hence, the 1-stage Haar transform can be expressed with the 0-stage Haar basis function (8):
$$\Phi_1(x) = \Phi_0(2x) + \Phi_0(2x - 1), \qquad \Psi_1(x) = \Phi_0(2x) - \Phi_0(2x - 1)$$
The main structure of the HWD is illustrated in Figure 8. It mainly includes two parts: a lossless feature encoding block and a feature learning encoding block. The lossless feature encoding block applies the wavelet transform to reduce the image’s spatial resolution and carry out feature transformation, while the feature learning block uses standard convolutional, normalization, and activation layers, mainly to extract discriminative features. Experiments show that the combined use of the two blocks can effectively improve the precision of object detection with a relatively small increase in computations. Therefore, this module was chosen to replace the original downsampling module in our experiments.
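To make the lossless encoding step concrete, the following NumPy sketch implements one level of the 2-D Haar transform as it is typically used in HWD: the spatial resolution is halved while the four subbands (one approximation, three details) are stacked along the channel axis, so no pixel information is discarded. The 1/2 normalization factor (which makes the transform orthonormal and energy-preserving) is an assumption of this sketch.

```python
# One level of the 2-D Haar transform as a downsampling step (HWD-style sketch).
import numpy as np

def haar_downsample(x):
    """x: (C, H, W) with even H, W -> (4C, H/2, W/2)."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]    # top-left, top-right pixels
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]    # bottom-left, bottom-right pixels
    ll = (a + b + c + d) / 2.0                   # low-frequency approximation
    lh = (a - b + c - d) / 2.0                   # horizontal detail
    hl = (a + b - c - d) / 2.0                   # vertical detail
    hh = (a - b - c + d) / 2.0                   # diagonal detail
    # Stack subbands along channels: spatial size halves, channels quadruple.
    return np.concatenate([ll, lh, hl, hh], axis=0)

x = np.arange(32, dtype=float).reshape(2, 4, 4)
y = haar_downsample(x)
print(y.shape)   # (8, 2, 2)
```

Because the transform is orthonormal, the total energy of the feature map is preserved exactly; the learnable convolution that follows in HWD can then decide which subbands to keep, instead of losing boundary and texture detail up front as strided convolution or pooling does.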

3.2.4. MPDIoU Loss Function

In object detection tasks, bounding box regression is widely applied as a crucial step in localization and detection. When the aspect ratio of the predicted box matches that of the annotated box but their actual dimensions differ, the predicted box and the actual box do not align, and traditional loss functions cannot be optimized effectively in such cases. A reasonable design of the loss function can therefore improve the precision of the model’s bounding boxes and, in turn, the detection precision. The default loss function is the CIoU [38], which uses a monotonic focusing mechanism that fails to balance the identification of simple samples against that of difficult samples. In addition, with CIoU, the precision of the bounding boxes for strawberry detection is not high. The CIoU calculation is shown in (9).
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^{2}(B_{gt}, B_{prd})}{C^{2}} - \beta V$$
where ρ2(Bgt, Bprd) is the squared Euclidean distance between the center points of the predicted bounding box and the ground-truth bounding box; C2 is the squared diagonal length of the minimum enclosing rectangle; β is an equilibrium parameter; and V measures the aspect-ratio consistency between the predicted box and the object box. This study introduces the minimum point distance-based bounding box regression loss function (MPDIoU) to improve the precision of box prediction [33]. The calculation can be simplified as shown in Equations (10)–(12).
$$d_1^{2} = (x_1^{B} - x_1^{A})^{2} + (y_1^{B} - y_1^{A})^{2}$$
$$d_2^{2} = (x_2^{B} - x_2^{A})^{2} + (y_2^{B} - y_2^{A})^{2}$$
$$\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^{2}}{w^{2} + h^{2}} - \frac{d_2^{2}}{w^{2} + h^{2}}$$
where d1 denotes the distance between the top-left corner of the predicted box and that of the actual box; d2 denotes the distance between their bottom-right corners; and w and h represent the width and height of the input image [33]. The experimental results indicate that replacing the CIoU loss function with the MPDIoU can improve the precision of the bounding boxes as well as the precision of multistage object detection for strawberries.
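The MPDIoU computation in Equations (10)–(12) can be sketched in plain Python. Boxes are given as (x1, y1, x2, y2) corner coordinates; following [33], w and h are taken here as the input image's width and height (an assumption of this sketch), and the example coordinates are illustrative.

```python
# Plain-Python sketch of Eqs. (10)-(12): MPDIoU for axis-aligned boxes.
def iou(a, b):
    # Intersection rectangle; empty overlap clamps to zero area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def mpdiou(pred, gt, w, h):
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    return iou(pred, gt) - d1_sq / (w**2 + h**2) - d2_sq / (w**2 + h**2)

# Identical boxes: both corner distances vanish, so MPDIoU equals IoU = 1.
print(mpdiou((10, 10, 50, 50), (10, 10, 50, 50), w=640, h=480))          # 1.0
# A prediction shifted 4 px right is penalized by the corner distances.
print(round(mpdiou((14, 10, 54, 50), (10, 10, 50, 50), w=640, h=480), 3))  # 0.818
```

Unlike CIoU, the two corner distances penalize center offset, size mismatch, and aspect-ratio mismatch simultaneously, which is why a predicted box with the correct aspect ratio but wrong size still receives a nonzero penalty.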

4. Experiments and Results

4.1. Experimental Environment and Parameter Configuration

Our experiments were carried out on Windows 11. The GPU was an RTX 4090D (24 GB), the CPU was an 18-vCPU AMD EPYC 9754 128-core processor (AMD, Santa Clara, CA, USA), and the memory was 60 GB. The programming language was Python 3.9.17, CUDA 11.8 was employed for GPU acceleration, and the PyTorch 2.0 framework was used for training. Table 2 outlines the specific training parameters. A batch size of 4 was selected because of GPU memory constraints, as larger batches (≥8) caused out-of-memory errors in preliminary tests. Cross-validation was not performed given the dataset size (13,668 samples).

4.2. Evaluation Indicators

Our experiment uses the confusion matrix, precision (P), recall (R), and mean average precision (mAP) to evaluate the models. Precision gauges the proportion of correctly classified positive samples among all samples deemed positive. Recall computes the ratio of correctly classified positive samples to the overall quantity of true positive samples. The AP evaluates the precision measure for a single class. mAP computes the average AP value among all classes, presenting a thorough evaluation of the object detection model’s efficacy. The specific calculation is presented in Equations (13)–(16):
$$\mathrm{Precision} = \frac{T_p}{T_p + F_p}$$
$$\mathrm{Recall} = \frac{T_p}{T_p + F_N}$$
$$\mathrm{AP} = \int_0^1 \mathrm{Precision}(\mathrm{Recall})\,\mathrm{d}\,\mathrm{Recall}$$
$$\mathrm{mAP} = \frac{1}{C}\sum_{i=1}^{C} \mathrm{AP}(i)$$
where Tp is the number of correctly predicted positive samples (true positives), Fp is the number of false-positive samples, FN is the number of false-negative samples, mAP50 stands for the mean average precision at an IoU threshold of 0.5, and mAP50-95 indicates the mean average precision averaged over IoU thresholds from 0.5 to 0.95. C represents the total number of classes.
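A small NumPy sketch of Equations (13)–(15) for a single class: detections sorted by confidence yield cumulative TP/FP counts, which trace out the precision-recall curve that AP integrates. The toy detections, scores, and ground-truth count below are illustrative.

```python
# NumPy sketch of per-class AP from Eqs. (13)-(15).
import numpy as np

def average_precision(scores, is_tp, n_gt):
    order = np.argsort(-np.asarray(scores))          # rank by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)           # Eq. (13) at each rank
    recall = cum_tp / n_gt                           # Eq. (14) at each rank
    # Eq. (15): integrate precision over recall (rectangle rule on recall steps).
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# Three detections of one class, two ground-truth objects in the image set.
ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[1, 0, 1], n_gt=2)
print(round(ap, 3))  # 0.833
```

mAP (Eq. 16) is then simply the mean of the per-class AP values; mAP50 scores a detection as a TP when its IoU with a ground-truth box exceeds 0.5, while mAP50-95 averages this over IoU thresholds from 0.5 to 0.95.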

4.3. Experimental Results

4.3.1. Training Results of the ASHM-YOLOv9 Model

The confusion matrix for the ASHM-YOLOv9 model training is presented in Figure 9A. The confusion matrix contrasts the predicted outcomes of the model with the actual outcomes to evaluate the model’s performance. The diagonal represents the classification precision for each category. The classification accuracies of the strawberry flower stage, growth stage, and mature stage were 0.97, 0.97, and 0.99, respectively, indicating that the model can effectively identify the three strawberry stages. Figure 9B shows the variation in each evaluation indicator versus the number of training iterations. The figure shows that training stabilizes after the number of iterations reaches 250. Precision, recall, and mAP50 stabilize after 100 iterations at above 97%, and mAP50-95 stabilizes after 250 iterations at above 90%. The final precision, recall, mAP50, and mAP50-95 of the ASHM-YOLOv9 model were 97.7%, 97.2%, 99.1%, and 90.7% (95% confidence intervals from repeated experiments: P = 97.7 ± 0.3%, R = 97.2 ± 0.4%, mAP50 = 99.1 ± 0.2%, mAP50-95 = 90.7 ± 0.6%), respectively.

4.3.2. Ablation Experiment

To explore the efficacy of each module within the improved algorithm, the P, R, mAP50, mAP50-95, model size, parameters, and floating-point calculations were evaluated against the original YOLOv9c. We performed ablation experiments on the various modules by combining augmentations and reductions. The outcomes are shown in Table 3.
Based on the ablation results in Table 3, an analysis of each module’s function and the experimental outcomes suggests that the deformable AKConv convolution kernels can adapt the selection and application of kernels to different image features. This avoids the conventional square convolution, reduces the computational load in specific situations, and consequently decreases the overall parameter count, model size, and computational effort while keeping the model’s precision and mAP50 almost unchanged. Specifically, this is manifested as a 0.2% increase in recall, a 0.5% increase in mAP50-95, a 14.0% decrease in parameters, a 13.9% decrease in model size, and a 9.9% decrease in floating-point computations. On the basis of AKConv, incorporating SE into the model strengthens the interdependencies between convolutional channels, allowing the network to recalibrate features: effective feature information is selectively emphasized while ineffective features are suppressed by leveraging global information. With the incorporation of SE, the model achieves a 0.2% increase in precision, a 1.4% increase in recall, a 0.2% increase in mAP50, and a 2.3% increase in mAP50-95, while the parameter quantity, model size, and computational burden remain almost the same as those of the original model. By introducing these two modules, we reduced the model size and computational load while achieving a modest overall improvement in accuracy. We therefore replaced the ADown downsampling algorithm in the model with the new HWD-ADown algorithm. This module is founded on the Haar wavelet transform: it reduces the spatial resolution of feature maps while preserving image information by substituting specific convolutional layers.
With this modification, we successfully increased mAP50-95 by 7.2% compared with the original model, accompanied by 0.6% growth in mAP50, a 2.7% increase in recall, and a 0.5% increase in precision, despite a slight increase in computational load. Finally, the loss function was replaced by MPDIoU. The P, R, mAP50, and mAP50-95 increased by 0.1%, 0.3%, 0.1%, and 0.2%, respectively, which effectively increased the performance of the original model.
In summary, compared with the original YOLOv9c model, the ASHM-YOLOv9 model shows a 0.6% increase in precision, a 3.0% increase in recall, a 0.7% increase in mAP50, and a 7.4% increase in mAP50-95. Moreover, it attains a 3.7% reduction in parameters, a 5.6% reduction in model size, and a 3.8% reduction in floating-point calculations. These outcomes illustrate that the proposed algorithm remarkably improves the precision and efficiency of multistage strawberry identification, highlighting its practical value.

4.4. Contrasting Model Performance Prior to and Following Enhancement

Table 3 also compares the comprehensive recognition performance of the benchmark model YOLOv9c and ASHM-YOLOv9: the overall precision, recall, mAP50, and mAP50-95 for strawberry recognition, as well as key parameters such as the number of model parameters, model size, and computational load. The experiments show that our ASHM-YOLOv9 achieves a precision of 97.7% for multistage strawberry recognition, a 0.6% improvement over the benchmark model YOLOv9c. The recall is 97.2%, an increase of 2.7%; the mAP50 is 99.1%, an improvement of 0.6%; and the mAP50-95 is 90.7%, an increase of 7.2%. The parameters are 58.29 MB, a reduction of 2.21 MB; the model size is 115.19 MB, a decrease of 7.21 MB; and the FLOPs are 253.3 G, a reduction of 10.6 G.
In addition, Table 4, Table 5, Table 6 and Table 7 show a comparison of precision data between the benchmark models YOLOv9c and ASHM-YOLOv9 across three stages: flowering, growing, and ripening. In contrast to the original model, our model shows improvements in precision during the flowering stage as follows: an increase of 0.6% in precision, 2.8% in recall, 0.9% in mAP50, and 8.1% in mAP50-95. In the growing stage, the improvements are as follows: 1.2% in precision, 3.6% in recall, 1.0% in mAP50, and 7.2% in mAP50-95. In the ripening stage, the improvements are 0.1% in precision, 2.5% in recall, no change in mAP50, and a 6.9% increase in mAP50-95. From the data comparison, it is evident that our precision metrics, with the exception of mAP50 in the ripening stage, which remains consistent with the original model, outperform the original model in all other aspects, particularly showing the most significant improvement during the growing stage. This comparison indicates that our ASHM-YOLOv9 not only leads the original model in terms of precision across all stages in the multistage recognition task for strawberries but also outperforms it in terms of the quantity of parameters and resource utilization effectiveness.

Evaluation of the Detection Performance of ASHM-YOLOv9

To portray the performance of the base models YOLOv9c and ASHM-YOLOv9 in strawberry recognition, we conducted experiments using the same complex scenes, and the results are presented in Figure 10. The red dashed boxes signify areas where the base model YOLOv9c made incorrect identifications, whereas the improved ASHM-YOLOv9 model recognized them correctly. The yellow arrows represent samples missed during recognition, and the red arrows indicate samples that were incorrectly identified. The filled rectangular boxes represent recognized labels and confidence levels, whereas the unfilled dashed rectangular boxes represent the areas of recognized objects. The varying shades of color represent strawberries at three different stages. In summary, based on the recognition findings, our improved ASHM-YOLOv9 model has more advantages in identifying strawberries at different stages. Critical error analysis reveals primary misclassification patterns, including stage-transition confusion where near-mature fruits with 70–80% coloration are misclassified as the growth stage (red arrows in Figure 10), heavy occlusion cases where fruits are more than 90% obscured (yellow arrows), and low-contrast lighting conditions that wash out color features (dashed boxes). This shows that our enhanced model is better suited to meet the needs of multistage strawberry recognition in agriculture. Figure 11 shows the F1 confidence curves of YOLOv9 (Figure 11a) and ASHM-YOLOv9 (Figure 11b). The horizontal axis represents the confidence level, and the vertical axis represents the F1 score, which integrates precision and recall. By comparison, the overall F1 score of ASHM-YOLOv9 reaches 0.98 and can be achieved at a confidence threshold of 0.265; the overall F1 score of YOLOv9 is 0.97, requiring a confidence threshold of 0.566. This finding indicates that ASHM-YOLOv9 has a better balance between confidence and performance. 
The curves of various classes (flower, growth, mature) also show that its single-class performance may be improved, and the overall performance is better than that of YOLOv9.

4.5. Comparison with Leading Mainstream Contemporary Models

To confirm the effectiveness of the proposed algorithm over other popular YOLO algorithms, we used the precision (P), recall (R), mAP50, and mAP50-95 as evaluation indicators. The ASHM-YOLOv9 model was subjected to comparative experiments with YOLOv5s, YOLOv7, YOLOv8n, and YOLOv9c in the same environment. The outcomes are displayed in Table 4, Table 5, Table 6 and Table 7.
As presented in Table 4, in the identification experiment for the strawberry flower stage, the P, R, mAP50, and mAP50-95 of the ASHM-YOLOv9 model were 97.9%, 96.0%, 98.9%, and 88.1%, respectively. Its P, mAP50, and mAP50-95 were greater than those of the other models, but YOLOv5s achieved a higher recall, reaching 97.6%. According to Table 5, in the identification experiment for the strawberry growth stage, the P, R, mAP50, and mAP50-95 of the ASHM-YOLOv9 model were 96.5%, 97.1%, 99.1%, and 91.6%, respectively, all better than those of the other algorithms. According to Table 6, in the identification experiment for the strawberry mature stage, the P, R, mAP50, and mAP50-95 of the ASHM-YOLOv9 model were 98.8%, 98.4%, 99.2%, and 92.4%, respectively. Its P, mAP50, and mAP50-95 values surpass those of the other models, whereas the recall of the YOLOv5s model is greater than that of our model, reaching 99.4%. Table 7 shows the data comparison for the whole stage. The ASHM-YOLOv9 model is optimal for all evaluation metrics except recall, which is 0.1% lower than that of YOLOv5s. The P, R, mAP50, and mAP50-95 of the ASHM-YOLOv9 model were 97.7%, 97.2%, 99.1%, and 90.7%, respectively.
On the basis of the above data, the ASHM-YOLOv9 model performed best overall, surpassing the other models with respect to P and mAP50. There was only a small gap in recall relative to the best-performing YOLOv5s, and its mAP50-95 far exceeds that of the other models. This finding implies that the enhanced ASHM-YOLOv9 model can yield better results in multistage strawberry identification.

4.6. Algorithm Validation

According to the outcomes of the comparative analysis and ablation experiments, the YOLOv5s, YOLOv8n, YOLOv9c, and ASHM-YOLOv9 models proposed in this study were selected for visualization comparison. Images under three backgrounds—normal background, image defocus, and fruit occlusion—were randomly selected for visualization. The detection box was labeled with its prediction category as well as its confidence level, and its detection results are shown in Figure 12. Group A is the comparison experiment under the normal environment, Group B is the comparison experiment under image defocus, and Group C is the comparison experiment under fruit occlusion.
The experimental outcomes demonstrate that the enhanced ASHM-YOLOv9 model achieves better identification outcomes. In each stage of strawberry growth, the ASHM-YOLOv9 model has a prominent function in the multistage identification of strawberry plants.

5. Discussion

5.1. Model Advantages

In this work, the YOLOv9 model was chosen as the base model for experiments addressing the multistage object detection problem of strawberry cultivation in greenhouses. Since its introduction in 2015, YOLO has evolved through multiple versions, becoming more mature and addressing earlier limitations. This progression has endowed the YOLO series of algorithms with numerous advantages in object detection: across several version updates, the balance between recognition speed and accuracy has significantly improved, its ability to detect small targets has become more refined, and its strong generalization ability allows it to be applied in various fields [39]. YOLOv9 is a recently proposed object detection method that addresses the information loss occurring layer by layer during information transmission. It has demonstrated outstanding performance in object detection tasks on the MS COCO dataset, surpassing previous real-time object detection methods, and its strong transferability allows it to be applied in various research fields [23]. For example, Gui et al. proposed an FS-YOLOv9 model based on YOLOv9 for detecting and identifying breast cancer. This model showed superior performance on both the original dataset and the breast cancer dataset, demonstrating significant improvements over previous models and providing practical assistance for the diagnosis of high-risk breast cancer patients [40]. Lu et al. proposed the MAR-YOLOv9 algorithm for detecting crop types in complex and variable agricultural environments; experiments showed that it can recognize fruits against complex backgrounds with accuracy superior to that of other models [41]. An [42] utilized YOLOv9 as the foundation for improvement in the field of smart urban traffic and proposed the GC-YOLOv9 model.
This model significantly enhances the perception capability and detection performance of the original model, surpassing existing technologies. It also holds potential value in the management of fire safety in public areas, the monitoring of forest fires, and intelligent security systems. In summary, YOLOv9, with its fast speed, high accuracy, and strong transferability, has become our base model.
Next, we improved the base YOLOv9 model from four different perspectives. On the basis of the unique shape characteristics of strawberries, we changed the traditional convolution kernel shape, moving away from conventional square kernels to kernels of arbitrary size that better fit the shape of strawberries. Through AKConv, we successfully reduced the sizes of the various model parameters, with a slight improvement in accuracy. Liu [43] utilized YOLOv8 to detect and classify Stropharia rugoso-annulata and verified in ablation experiments that incorporating the AKConv module substantially reduced the sizes of diverse parameters, which is consistent with our findings. We integrated the SE module into the original model, which learns global information to enhance features that are beneficial for recognition while suppressing those that are not. Tian [44] used YOLOv5 as a base model for recognizing aerial remote sensing images and added the SE module; their ablation experiments demonstrated that this module could improve recognition accuracy and other metrics, and correspondingly, we also successfully improved our model’s accuracy. The Haar wavelet transform is a signal processing approach that enhances recognition accuracy and precision by decomposing image information into low-frequency and high-frequency components and then combining them into new feature maps. K. Kumara [45] used a combination of YOLO and Haar features to automatically identify the license plates of high-speed vehicles, and the results indicated that the combination significantly outperformed either method used separately. Our experiments also confirmed this: the addition of wavelet downsampling caused minor growth in the number of parameters yet resulted in a remarkable increase in accuracy.
Peng [46] conducted research on the identification of tea tree buds via polarization and YOLOv5, and after the loss function was replaced with MPDIoU, a noticeable increase in recognition accuracy was observed. Our model also achieved improved accuracy after the original loss function was replaced.
Overall, this study not only demonstrated that the YOLOv9 model performs well in agriculture but also validated our modifications to the original model through ablation experiments, comparisons before and after model changes, and comparative experiments with other object detection algorithms. Our ASHM-YOLOv9 model has unique advantages in the multistage recognition of strawberries. Furthermore, this achievement provides a new approach and technical means for the identification and monitoring of crops in agriculture.

5.2. Limitations and Future Work

Although this study has achieved remarkable outcomes in the multistage detection and recognition of strawberries, there are still areas that need improvement. First, the model’s accuracy is not yet at the cutting-edge level that can be achieved; with ongoing updates to the YOLO series and the introduction of more advanced functional modules, it can be further optimized. Second, the operational load of the YOLOv9 model is not particularly low, raising questions about whether real-time monitoring results can be obtained during future agricultural research. Finding a reasonable way to reduce the model’s operational load, especially with particularly large amounts of data, is also an issue worth addressing. Our dataset has two potential biases: (1) collection was performed exclusively in January under consistent noon lighting, which may not represent seasonal variations in natural sunlight conditions; (2) all the samples came from homogeneous greenhouse environments with controlled humidity, potentially limiting their applicability to open-field agriculture with variable weather patterns. These factors could affect color-based maturity assessment accuracy. Deploying models prone to misclassification in field environments risks severe ecological and economic consequences. For instance, erroneous identification of strawberry ripeness can trigger premature harvesting, compromising fruit quality and shortening shelf life. Similarly, initiating irrigation during the flowering stage (misclassified as the growth stage) results in water resource wastage, a critical concern for greenhouses in arid regions. Therefore, addressing these issues is particularly important.
While the ASHM-YOLOv9 architecture shows promise for multistage crop detection, direct application to other species faces challenges: (1) crops with less distinct color transitions during ripening (e.g., cucumbers) may require additional spectral data; (2) plants with complex canopy structures (e.g., tomatoes) could increase occlusion-related errors; and (3) scaling to larger fields would necessitate drone-based image acquisition and corresponding algorithm adaptations. Future work should validate transfer learning approaches using minimal target crop samples [47]. For practical agricultural deployment, further validation across diverse strawberry cultivars and planting densities is essential. Variations in fruit morphology (e.g., size, color distribution) among cultivars and high-density planting systems may impact detection accuracy, requiring cultivar-specific model fine-tuning before large-scale implementation.

6. Conclusions

In this research, we present ASHM-YOLOv9, a multistage object detection model for strawberry plants. On the basis of YOLOv9, we introduce AKConv, which improves the computing speed and reduces the model size and the number of parameters. The addition of the SE network enhances the model’s feature selection and fusion capabilities. Moreover, the original model’s ADown downsampling algorithm is replaced with the improved HWD-ADown algorithm, which enhances identification precision and recall. The MPDIoU loss function addresses the issue where the predicted and annotated boxes have matching aspect ratios but differing dimensions, enabling better object border identification and further improving detection precision. Based on the above characteristics, the ASHM-YOLOv9 model, which has lower computational complexity, fewer parameters, a smaller model size, and higher detection precision than the original YOLOv9 model, was developed. By analyzing the experimental results, the following conclusions are drawn:
(1) In the self-built greenhouse strawberry dataset, under the same experimental conditions, the improved ASHM-YOLOv9 model increased the precision by 0.6% and the recall by 3.0% compared with the original YOLOv9c model; the mAP50 increased by 0.7%, the mAP50-95 increased by 7.4%, the number of parameters decreased by 3.7%, the model size decreased by 5.6%, and the floating-point calculations decreased by 3.8%, effectively improving the performance of the original model. Compared with the YOLOv5s, YOLOv7, YOLOv8n, and YOLOv9c models, the precision increases by 0.4%, 1.0%, 0.5%, and 0.6%, respectively; the recall increases by −0.1%, 2.2%, 2.1%, and 3.0%, respectively; the mAP50 increases by 0.3%, 2.6%, 0.7%, and 0.7%, respectively; and the mAP50-95 increases by 7.9%, 8.1%, 5.0%, and 7.4%, respectively.
(2) To verify the detection performance of the ASHM-YOLOv9 model, three sets of comparison experiments were carried out for visualization analysis. The results demonstrate that our model attains the highest identification precision in both the normal environment and the complex environment; fruit boundary identification also has a good effect.
This study focused on conventional strawberry varieties commonly available on commercial markets. To develop a more robust and generalizable detection system, future research will systematically expand the dataset to include diverse strawberry cultivars (white varieties, day-neutral types, alpine strawberries) and extend the ASHM-YOLOv9 framework to other greenhouse crops such as tomatoes, peppers, and small fruit crops through transfer learning approaches. Additionally, multi-environment validation across different greenhouse facilities with varying lighting and growing conditions will be conducted to ensure model robustness and practical applicability, ultimately establishing a comprehensive crop detection platform for diverse agricultural production systems. Beyond the technical achievements, this research contributes to addressing broader societal challenges in agricultural sustainability and food security. The automated phenotyping system promotes precision agriculture practices that optimize resource utilization while reducing environmental impact through targeted interventions based on accurate growth stage identification. However, the implementation of such technologies must consider ethical implications, including data privacy, equitable access for small-scale farmers, and potential employment transitions in agricultural sectors. Future deployment should prioritize inclusive development approaches that ensure technology benefits reach diverse farming communities while preserving traditional agricultural knowledge and supporting workforce adaptation to emerging precision agriculture paradigms.

Author Contributions

Conceptualization, writing-original draft preparation, Y.M.; validation, S.B.; software, W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62261038.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and data are available: https://github.com/Jiawen-Zheng/YOLOv9/tree/master (accessed on 19 July 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. De Lima, J.M.; Welter, P.D.; Dos Santos, M.F.S.; Kavcic, W.; Costa, B.M.; Fagherazzi, A.F.; Nerbass, F.R.; Kretzschmar, A.A.; Rufato, L.; Baruzzi, G. Planting density interferes with strawberry production efficiency in Southern Brazil. Agronomy 2021, 11, 408. [Google Scholar] [CrossRef]
  2. Zhu, J.X. Discussion on different growth and development stages Fertilization pointsof strawberry. Tieling Acad. Agric. Sci. 2013, 7, 35–46. [Google Scholar]
  3. Zhang, Y.; Zhang, K.; Yang, L.; Zhang, D.; Cui, T.; Yu, Y.; Hui, L. Design and simulation experiment of ridge planting strawberry picking manipulator. Comput. Electron. Agric. 2023, 208, 10769. [Google Scholar] [CrossRef]
  4. Cao, L.L.; Chen, Y.R.; Jin, Q.G. Lightweight strawberry instance segmentation on low-power devices for picking robots. Electronics 2023, 12, 3145. [Google Scholar] [CrossRef]
Figure 1. (a) Map of China. (b) Location of our study area.
Figure 2. Sample data of some strawberry images.
Figure 3. Density map showing the distribution of bounding-box size and position in the strawberry dataset. A darker red hue indicates more data in the corresponding area.
Figure 4. Research workflow for detecting strawberry plants at different growth stages via the ASHM-YOLOv9 model.
Figure 5. ASHM-YOLOv9 network structure.
Figure 6. Alterable Kernel Convolution Structure.
Figure 7. Structure of squeeze-and-excitation networks.
Figure 8. Haar wavelet downsampling structure.
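The Haar wavelet downsampling (HWD) module in Figure 8 halves spatial resolution with a 2D Haar transform instead of a strided convolution, so no pixel information is discarded. A minimal pure-Python sketch of the single-level 2D Haar decomposition it builds on (the actual module concatenates the four sub-bands along the channel axis and follows with a 1 × 1 convolution; names here are illustrative):

```python
def haar_downsample(img):
    """Single-level 2D Haar transform of a 2D list with even dimensions.

    Returns four half-resolution sub-bands (LL, LH, HL, HH); the HWD
    module stacks these as channels instead of discarding pixels.
    """
    h, w = len(img), len(img[0])
    assert h % 2 == 0 and w % 2 == 0, "dimensions must be even"
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        rll, rlh, rhl, rhh = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rll.append((a + b + c + d) / 2)  # low-pass: local average
            rlh.append((a - b + c - d) / 2)  # horizontal detail
            rhl.append((a + b - c - d) / 2)  # vertical detail
            rhh.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(rll)
        lh.append(rlh)
        hl.append(rhl)
        hh.append(rhh)
    return ll, lh, hl, hh
```

The 1/2 scaling keeps the transform orthonormal, so the original 2 × 2 patch is exactly recoverable from the four sub-band values.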
Figure 9. ASHM-YOLOv9 model training results: (A) confusion matrix of the model; (B) evaluation metric curves.
Figure 10. Recognition results of the baseline YOLOv9c model and ASHM-YOLOv9 in complex environments.
Figure 11. (a) F1 score of YOLOv9; (b) F1 score of ASHM-YOLOv9.
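The F1 score plotted in Figure 11 is the harmonic mean of precision and recall. As a quick check, the whole-stage precision (97.7%) and recall (97.2%) reported for ASHM-YOLOv9 give an F1 of about 0.974:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Whole-stage P/R of ASHM-YOLOv9 (Table 7)
print(round(f1_score(0.977, 0.972), 3))  # 0.974
```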
Figure 12. Verification results: (A) normal background; (B) image defocus; (C) fruit occlusion.
Table 1. Data Type Presentation.
| Stage | Label | Description | Examples in Datasets |
| --- | --- | --- | --- |
| R4 | Flower | The main characteristics of the strawberry flowering stage are white petals and yellow stamens. | (image) |
| R5 | Growth | The fruit is generally green; the colored area of the fruit in the image is less than 80% of the overall fruit area. | (image) |
| R6 | Mature | The colored area of the fruit in the image is greater than or equal to 80% of the overall fruit area. | (image) |
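Table 1's fruit-stage criterion is a simple threshold on the colored fraction of the fruit region (flowers are identified separately by petal and stamen appearance). A minimal sketch of that decision rule; the colored fraction would come from an upstream segmentation step, which is not shown here:

```python
def fruit_stage(colored_fraction):
    """Map the colored fraction of a fruit region to a growth-stage label.

    Follows Table 1: < 80% colored -> Growth (R5), >= 80% -> Mature (R6).
    """
    if not 0.0 <= colored_fraction <= 1.0:
        raise ValueError("colored_fraction must lie in [0, 1]")
    return "Mature" if colored_fraction >= 0.8 else "Growth"
```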
Table 2. Training parameter settings.
| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| epochs | 300 | workers | 4 |
| batch-size | 4 | patience | 100 |
| imgsz | 640 | close-mosaic | 0 |
| evolve | 300 | optimizer | SGD |
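For reproducibility, the Table 2 settings can be collected into a single configuration mapping. The key names below are illustrative; the exact flags accepted by a given YOLOv9 training script may differ:

```python
# Training hyperparameters from Table 2 (key names are illustrative;
# match them to the flags of the YOLOv9 implementation you use).
TRAIN_CFG = {
    "epochs": 300,        # total training epochs
    "batch_size": 4,      # images per batch
    "imgsz": 640,         # input resolution (pixels)
    "evolve": 300,        # hyperparameter-evolution iterations
    "workers": 4,         # dataloader worker processes
    "patience": 100,      # early-stopping patience (epochs)
    "close_mosaic": 0,    # epoch at which mosaic augmentation is disabled
    "optimizer": "SGD",   # stochastic gradient descent
}
```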
Table 3. Experimental results of the ablation experiments.
| AKConv | SE | HWD-ADown | MPDIoU | P/% | R/% | mAP50/% | mAP50-95/% | Parameters/M | Model Size/MB | FLOPs/G |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| × | × | × | × | 97.1 | 94.2 | 98.4 | 83.3 | 60.50 | 122.4 | 263.9 |
| ✓ | × | × | × | 97.0 | 94.4 | 98.4 | 83.8 | 52.06 | 105.4 | 237.7 |
| ✓ | ✓ | × | × | 97.2 | 95.6 | 98.6 | 85.6 | 52.09 | 105.6 | 237.8 |
| ✓ | ✓ | ✓ | × | 97.6 | 96.9 | 99.0 | 90.5 | 58.29 | 115.19 | 253.3 |
| ✓ | ✓ | ✓ | ✓ | 97.7 | 97.2 | 99.1 | 90.7 | 58.29 | 115.19 | 253.3 |
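The complexity savings in Table 3 can be read off as relative reductions from the baseline row to the full ASHM-YOLOv9 row; for example, the parameter count drops from 60.50 M to 58.29 M, roughly 3.7%:

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of a value relative to its baseline."""
    return (baseline - improved) / baseline * 100

# Parameter count: 60.50 M (YOLOv9c baseline) -> 58.29 M (ASHM-YOLOv9)
print(round(relative_reduction(60.50, 58.29), 1))  # 3.7
```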
Table 4. Experimental results for each model at the flower stage.
All values in %.
| Models | Precision | Recall | mAP50 | mAP50-95 |
| --- | --- | --- | --- | --- |
| YOLOv5s | 97.5 | 97.6 | 98.1 | 79.9 |
| YOLOv7 | 96.9 | 95.9 | 97.0 | 80.6 |
| YOLOv8n | 97.2 | 94.0 | 98.4 | 85.7 |
| YOLOv9c | 97.3 | 93.2 | 98.0 | 80.0 |
| ASHM-YOLOv9 | 97.9 | 96.0 | 98.9 | 88.1 |
Table 5. Experimental results for each model at the growth stage.
All values in %.
| Models | Precision | Recall | mAP50 | mAP50-95 |
| --- | --- | --- | --- | --- |
| YOLOv5s | 95.9 | 96.2 | 98.8 | 83.7 |
| YOLOv7 | 95.0 | 92.4 | 97.2 | 82.6 |
| YOLOv8n | 95.7 | 94.1 | 97.6 | 86.3 |
| YOLOv9c | 95.3 | 93.5 | 98.1 | 84.4 |
| ASHM-YOLOv9 | 96.5 | 97.1 | 99.1 | 91.6 |
Table 6. Experimental results for each model at the mature stage.
All values in %.
| Models | Precision | Recall | mAP50 | mAP50-95 |
| --- | --- | --- | --- | --- |
| YOLOv5s | 98.5 | 98.0 | 99.4 | 84.9 |
| YOLOv7 | 98.3 | 96.6 | 95.4 | 84.5 |
| YOLOv8n | 98.6 | 97.0 | 99.1 | 88.5 |
| YOLOv9c | 98.7 | 95.9 | 99.2 | 85.5 |
| ASHM-YOLOv9 | 98.8 | 98.4 | 99.2 | 92.4 |
Table 7. Experimental results for each model at the whole stage.
All values in %.
| Models | Precision | Recall | mAP50 | mAP50-95 |
| --- | --- | --- | --- | --- |
| YOLOv5s | 97.3 | 97.3 | 98.8 | 82.8 |
| YOLOv7 | 96.7 | 95.0 | 96.5 | 82.6 |
| YOLOv8n | 97.2 | 95.1 | 98.4 | 85.7 |
| YOLOv9c | 97.1 | 94.2 | 98.4 | 83.3 |
| ASHM-YOLOv9 | 97.7 | 97.2 | 99.1 | 90.7 |
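The whole-stage figures in Table 7 are consistent with a plain average of the three per-stage values from Tables 4–6; for example, ASHM-YOLOv9's per-stage mAP50-95 values (88.1, 91.6, and 92.4) average to the reported 90.7:

```python
def class_average(values):
    """Mean of per-class (here, per-stage) metric values."""
    return sum(values) / len(values)

# ASHM-YOLOv9 mAP50-95 at the flower, growth, and mature stages (Tables 4-6)
print(round(class_average([88.1, 91.6, 92.4]), 1))  # 90.7
```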
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mo, Y.; Bai, S.; Chen, W. ASHM-YOLOv9: A Detection Model for Strawberry in Greenhouses at Multiple Stages. Appl. Sci. 2025, 15, 8244. https://doi.org/10.3390/app15158244
