1. Introduction
Global population growth continues to drive an unprecedented demand for food, placing significant pressure on modern agricultural systems. According to [1], the global volume of food consumption has steadily increased since 2015 and is projected to exceed 300 million metric tons by 2026. As illustrated in Figure 1, this trend highlights the urgent need for scalable and efficient agricultural solutions that can meet rising consumption demands. Ensuring food security under these circumstances is paramount. It requires not only increasing productivity but also guaranteeing equitable access to sufficient, safe, and nutritious food, as outlined by [2].
Within this context, fruits like apples are a vital component of global diets and economies. European Union farmers alone produce over 11.5 million tons annually, surpassing other fruits such as oranges and watermelons [3]. However, traditional fruit harvesting methods remain labor-intensive, time-consuming, and prone to inefficiencies due to occlusions from foliage and inconsistent lighting. Manual localization often results in missed detections and lower productivity, especially when apples are partially hidden by leaves or branches. Moreover, harvesting efficiency is limited by the inability of human labor to operate continuously at scale.
Compounding these challenges, climate-related disasters such as extreme droughts, heatwaves, and unseasonal frosts have increasingly disrupted global crop yields, leading to food insecurity and economic loss, particularly in vulnerable agricultural systems [4]. As climate variability intensifies, the need for resilient, technology-driven approaches in agriculture becomes even more critical. In response, precision farming tools embedded with deep learning capabilities present viable solutions for real-time monitoring and adaptive interventions to mitigate climate-induced stresses, thereby enhancing yield resilience and supporting sustainable food production [5,6].
To address these challenges, the integration of advanced technologies such as automation coupled with artificial intelligence, machine learning, and deep learning has become essential for boosting agricultural productivity [7]. These innovations align with the objectives of Sustainable Development Goal 2, which promotes ending hunger and fostering sustainable agriculture through smarter resource management [8]. In particular, deep learning offers powerful capabilities for high-precision visual recognition through hierarchical feature learning, enabling the automated detection and classification of fruit in complex orchard environments.
Among various deep learning architectures, convolutional neural networks (CNNs) have proven especially effective in agricultural applications, including fruit maturity estimation, disease detection, and object segmentation [9]. CNNs use a series of convolutional and pooling layers to extract spatial patterns and identify features with minimal human intervention. In comparison to aerial drone platforms, ground-based RGB cameras provide higher resolution and lower-noise imagery. When mounted on a stable platform such as a tripod, these cameras can produce clearer images, making them more suitable for precise segmentation tasks like apple area mapping.
Despite their advantages, conventional CNNs often struggle to capture long-range dependencies and highlight the most relevant visual cues, particularly in cluttered scenes with occlusions or non-uniform lighting. To address these limitations, recent research has explored the use of attention mechanisms, in either channel or spatial variants, to guide the network toward extracting salient features of interest while suppressing irrelevant ones. Furthermore, architectural enhancements such as multi-scale processing, residual connections, and grouped convolutions have shown success in various domains [10,11,12], although their use in agricultural image segmentation remains limited.
This study introduces DABAMNet, a custom CNN architecture designed to enhance apple segmentation in orchard environments by integrating depthwise asymmetric bottleneck units with dual attention mechanisms. While DABAMNet builds on established components such as CBAM-based attention and bottleneck structures, its novelty lies in the strategic placement of attention modules within the network hierarchy, the use of a depthwise asymmetric structure for lightweight yet expressive encoding, and the unconventional combination of global max and min pooling operations within the attention block. These architectural refinements offer improved feature discrimination in agricultural image segmentation, particularly under real-world orchard complexities such as occlusion and lighting variation, as confirmed through empirical ablation studies.
To further enhance feature representation, DABAMNet integrates multiple pooling operations including global max pooling, average pooling, and min pooling. This combination allows the model to capture both dominant and subtle visual cues, improving robustness to occlusion, lighting variation, and background clutter. An extensive ablation study has been performed to confirm the effectiveness of this design. In short, this work presents a deep learning solution that addresses core limitations in fruit segmentation by leveraging architectural innovations and attention mechanisms. The proposed DABAMNet model is trained and tested on a publicly available apple orchard dataset and benchmarked against four state-of-the-art CNN baselines. Experimental results demonstrate significant improvements in segmentation accuracy and Intersection over Union (IoU), validating its potential for deployment in precision farming and real-time robotic harvesting systems.
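For readers unfamiliar with the two metrics named above, the following is a minimal sketch of pixel accuracy and Intersection over Union for binary masks. It assumes a two-class setting (1 = apple, 0 = background); the function names and the convention for two empty masks are illustrative choices, not taken from the paper's evaluation code.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    return float((pred == target).mean())

def iou(pred, target):
    """Intersection over Union for binary masks (1 = apple, 0 = background)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

# Toy 2 x 3 masks: 2 pixels overlap, 4 pixels are in the union.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(iou(pred, target))  # 0.5
print(pixel_accuracy(pred, target))
```

IoU ignores true-negative background pixels, which is why it is usually the stricter of the two measures when apples occupy a small part of the image.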
This paper is structured into five sections. Section 1 provides an introduction. Section 2 reviews deep learning methods applied in the agriculture sector. Section 3 outlines the complete methodology for the architecture of the proposed DABAMNet. Section 4 presents the results and discussion derived from using the proposed network. Finally, Section 5 offers a concise conclusion, summarizing the limitations of the proposed network and suggesting future directions.
2. Related Works
This section provides a structured overview of recent advances in deep learning-based agricultural image analysis. It highlights the strengths and limitations of core architectural models, explores multi-scale feature integration techniques, and discusses methods designed to mitigate the problems of overfitting. These insights support the development of effective and generalizable models for precision farming applications using remote sensing and ground-level imagery.
The integration of deep learning into agricultural image analysis has catalyzed a transformative shift in how key challenges such as classification, detection, and segmentation are addressed in precision farming. Given the critical importance of agricultural productivity to global food security and economic resilience, recent studies have increasingly explored advanced machine learning techniques to enhance the automation, scalability, and precision of crop monitoring systems. These approaches have shown significant promise in addressing issues related to environmental variability, resource constraints, and the need for real-time decision-making. Deep learning methods, particularly those based on CNNs, have demonstrated strong generalization capabilities across a wide variety of agricultural domains by learning hierarchical features directly from raw imagery.
2.1. Classification of Agricultural Products Using Deep Learning Methodology
In agricultural classification tasks, deep learning models are widely adopted to categorize fruits and crops based on ripeness, health condition, and disease severity. CNN-based classifiers, particularly those pre-trained on large-scale datasets such as ImageNet, have emerged as the preferred solution due to their ability to generalize across heterogeneous imaging conditions and object appearances.
Mimma et al. [13] conducted a comprehensive study using ResNet50 and VGG16 architectures for multi-class fruit classification involving eight fruit categories. The study reported high classification accuracy, underscoring the capability of deep CNNs to extract relevant texture, shape, and color features across different fruit morphologies. The authors also noted the effectiveness of transfer learning in mitigating data scarcity issues, which are common in agricultural domains. Similarly, the work in [14] developed a hybrid architecture combining AlexNet and a pre-trained VGG16 encoder for the classification of date fruits. Their method demonstrated improved feature extraction, especially for detecting subtle textural differences across ripeness stages.
Expanding upon these findings, Mahmood et al. [15] applied pre-trained VGG16 and AlexNet models for jujube fruit classification into unripe, ripe, and overripe categories. VGG16 achieved higher accuracy and stability during training, leading to its deployment in a fully automated harvesting and sorting system. The integration of classification results into downstream processing pipelines reflects the operational readiness of these architectures. In another application, the authors of [16] classified hazelnuts into five quality grades using similar CNN backbones. The model achieved superior performance over traditional rule-based approaches, demonstrating deep learning’s adaptability to post-harvest quality control tasks.
Beyond maturity and quality classification, deep learning has also been employed for plant disease detection and severity grading [17]. The authors of that study used a fine-tuned VGG16 model for apple disease classification, achieving precise identification of various infection types, even under heterogeneous background conditions. In [18], Arshaghi et al. applied a VGG19-based architecture for potato disease severity classification, highlighting the importance of deeper networks in capturing fine-grained pathological patterns. Complementarily, the researchers in [19] employed a ResNet50 model to identify cucumber leaf diseases, with notable improvements in detection sensitivity and specificity. Their study emphasized the model’s utility in field-based deployments, where disease symptoms may vary dynamically due to environmental stressors.
2.2. Fruit Detection in Precision Agriculture
Fruit detection is a cornerstone task in automated agriculture, facilitating yield estimation, robotic harvesting, and supply chain optimization. Object detection models based on the YOLO (You Only Look Once) architecture have become particularly prominent due to their real-time detection capabilities and robustness under variable environmental conditions. YOLO models detect objects by simultaneously regressing bounding box coordinates and object class probabilities, making them suitable for high-throughput agricultural tasks.
Zhou et al. [20] applied YOLOv7 for dragon fruit detection in orchard environments characterized by dense foliage, variable lighting, and occlusion. The model delivered robust performance across multiple collection angles and demonstrated sub-second inference times, confirming its viability for real-time deployment on embedded systems. YOLOv3 has also been optimized by integrating it with MobileNet in [21], resulting in a lightweight and efficient detection framework tailored for low-power devices such as UAVs and agricultural robots.
The authors in [22,23] compared YOLOv3, YOLOv4, and YOLOv5 for cherry detection, highlighting YOLOv5’s superior balance between accuracy and speed. Their studies also integrated condition classification capabilities, enabling robots not only to detect but also to assess fruit ripeness. On the other hand, Reddy and Aishwarya [24] conducted a comparative evaluation of YOLOv4 and YOLOv5 for fruit freshness classification, reporting that YOLOv5 consistently outperformed its predecessor in both precision and recall.
Using the same baseline models, the researchers in [25,26] implemented YOLOv4 across multiple fruit types including bananas, grapes, apples, mangoes, and pears, demonstrating the model’s flexibility and high accuracy in detecting diverse agricultural targets. Melnychenko et al. [27] focused on occlusion-aware apple detection using YOLOv5, while papers [28,29] leveraged the same model for red chili ripeness analysis. Separately, Yang and Wang [30] benchmarked four state-of-the-art models for green litchi detection and found YOLOv5-S to offer the best trade-off between detection accuracy and inference latency.
Deep learning has also been applied to palm agriculture in [31], which combined color-based feature extraction with a deep learning model to determine palm fruit maturity stages. An extension of this work by Jie et al. [32] deployed the Xception model to delineate palm plantation boundaries, showcasing the model’s high representational power in segmenting large-scale agricultural landscapes. Taken together, these studies confirm the YOLO framework’s dominance in agricultural object detection, driven by its modularity, high detection throughput, and adaptability to evolving field conditions.
2.3. Semantic Segmentation for Crop Monitoring and Yield Estimation
Semantic segmentation plays a pivotal role in precision agriculture by enabling fine-grained classification of vegetation, soil, and infrastructure at the pixel level. Segmentation maps produced by deep learning models are critical for applications such as plant phenotyping, weed detection, and biomass estimation. Encoder–decoder architectures, particularly those enhanced with attention mechanisms or multi-scale feature extractors, have been widely adopted to address the unique challenges posed by agricultural imagery.
The work in [33] proposed a DeepLabv3+ model with a ResNet50 encoder for segmenting fruits across variable scales and orientations. Their model achieved high mean Intersection over Union scores and demonstrated consistent boundary delineation even in visually complex scenes. A simpler model was considered in [34], which proposed an improved Fully Convolutional Network tailored for strawberry segmentation in cluttered environments. By integrating enhanced encoder–decoder pathways, their model effectively captured both fine-grained details and broader spatial context.
A more complex multiscale technique was introduced in [35], which enhanced the PSPNet architecture by embedding the Convolutional Block Attention Module (CBAM) for grape bunch segmentation. This attention-guided network achieved superior performance under non-uniform lighting and background variability. The authors then extended the application of PSPNet+CBAM to apple segmentation [10], where their model demonstrated resilience to occlusions and high structural complexity. These studies underscore the importance of attention modules in refining spatial and channel-wise feature representations.
Deb et al. tackled the challenge of overlapping leaf segmentation using LS-Net, a lightweight architecture designed for low-power edge devices [36]. Their model excelled in segmenting rosette plant structures with minimal false positives. A hybridized approach in [37] combined PSPNet with a ResNet50 backbone to accurately segment kiwi regions, achieving improvements in generalization and model interpretability. Conversely, in [38], the authors implemented an integrated segmentation framework for paddy field analysis, showing that model fusion and contextual aggregation substantially improve segmentation consistency in large-area imagery. Collectively, these segmentation approaches reveal a growing reliance on integrated and hybrid networks that combine high-resolution backbones with contextual modules and attention mechanisms. These designs are not only effective in complex agricultural landscapes but also adaptable to varying resolutions and sensor modalities. As segmentation accuracy becomes increasingly vital for downstream tasks such as yield modeling and disease prediction, deep learning continues to provide a robust foundation for scalable, automated, and interpretable agricultural analysis systems.
2.4. Architectural Innovations of DABAMNet
In contrast to existing attention-augmented segmentation architectures such as CBAM-Net [39,40] and BAM-based U-Nets [41,42], DABAMNet introduces a series of architectural innovations specifically designed for orchard segmentation tasks in complex agricultural environments.
The first core innovation is the integration of Dilated Asymmetric Bottleneck Units (DABou), which enable effective multi-scale feature extraction while maintaining computational efficiency. By employing asymmetric convolutions alongside dilated branches, DABou units expand the receptive field without degrading spatial resolution, an essential capability when dealing with high structural variability in orchard imagery.
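The receptive-field claim can be made concrete with the standard formula for the spatial extent of a dilated kernel, k + (k - 1)(r - 1). The sketch below evaluates it for a 3 × 3 kernel at the dilation rates listed for the DABou units in Section 3.2; the function name is illustrative.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3 x 3 kernel always has 9 weights per channel, but its footprint grows with r.
for rate in (1, 2, 4, 8, 16):
    size = effective_kernel_size(3, rate)
    print(f"rate={rate:2d} -> covers {size} x {size} pixels")
```

At rate 16 the same nine weights span a 33 × 33 window, which is how dilation widens context without adding parameters or reducing resolution.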
The second major contribution of DABAMNet is its selective placement of CBAM dual attention modules at an intermediate network depth, after DABou unit 5, where semantic abstraction and spatial granularity intersect. This strategic positioning enhances the model’s ability to simultaneously preserve object boundaries and integrate contextual semantics. Rather than deploying attention mechanisms uniformly or at early stages, as seen in prior models, DABAMNet’s placement is empirically optimized for improved spatial coherence in segmentation maps.
Third, DABAMNet introduces a modified Convolutional Block Attention Module (CBAM) that incorporates a dual pooling strategy, leveraging both global max and global min pooling operations within the channel and spatial attention branches. This enhancement allows the network to capture both prominent and subtle features, addressing common challenges in orchard segmentation such as occlusion, lighting variability, and fine-grained textural differences.
Unlike earlier approaches that often apply standard attention uniformly or in shallow layers, DABAMNet’s adaptive and fine-grained attention strategy contributes significantly to its performance, as validated through ablation experiments. These studies confirm that the proposed attention configuration enhances segmentation accuracy while maintaining robustness and computational tractability.
Beyond agricultural applications, the design of DABAMNet is informed by foundational advances in deep neural networks. The use of convolutional operations for hierarchical feature extraction draws on the seminal work of Liu et al. and Cruttwell et al., which established CNNs as the core architecture for visual recognition tasks [43,44]. Similarly, the incorporation of attention mechanisms is inspired by the work of Ayoub et al. and later generalized through the transformer framework by Wei et al., which emphasized the importance of selective focus in neural information processing [45,46]. The concept of multi-scale feature integration is also grounded in prior architectural designs such as the Feature Pyramid Network (FPN) [47] and Atrous Spatial Pyramid Pooling (ASPP) [48], both of which influenced the dilation strategies employed in the stacked DABou units within the DABAM blocks of DABAMNet.
Together, these innovations position DABAMNet as a purpose-built, attention-enhanced segmentation framework that balances high precision with practical efficiency and demonstrates potential for real-time use in automated orchard analysis.
3. Methodology
This section introduces DABAMNet, a novel deep learning architecture tailored for semantic segmentation in complex orchard environments. DABAMNet is composed of three core modules: the Initial Module for early-stage feature extraction, the Intermediate Module for multi-scale, attention-guided representation learning, and the Final Module for semantic projection and resolution recovery. Central to the architecture are two specialized submodules: the Dilated Asymmetric Bottleneck Unit (DABou) and the Convolutional Block Attention Module (CBAM), which are strategically embedded to enhance spatial fidelity and contextual sensitivity.
The design of DABAMNet is guided by three key objectives, each addressed by a specific module or mechanism within the architecture:
Spatial Preservation—Achieved through the Initial Module (M1), which minimizes early downsampling and preserves high-resolution spatial details essential for delineating object boundaries.
Contextual Enrichment—Realized in the Intermediate Module (M2) via the stacked DABou units within DABAM blocks, which progressively expand the receptive field using increasing dilation rates and asymmetric convolution paths.
Attention Calibration—Enabled by the integration of CBAM within the Intermediate Module (M2), placed after semantically rich DABou unit 5 to refine feature maps by emphasizing task-relevant information and suppressing background noise.
Together, these components form a modular encoder–decoder framework optimized for segmentation performance under the spatial and semantic challenges of orchard imagery. A comprehensive layer-wise breakdown of the architecture, including kernel configurations, filter dimensions, and downsampling operations, is presented in Table 1, which summarizes the internal composition of DABAMNet across all modules. The following subsections provide detailed descriptions of each module and their contributions to the overall network functionality.
3.1. Convolutional Block Attention Module (CBAM)
The Convolutional Block Attention Module (CBAM) is a lightweight, plug-and-play attention mechanism designed to improve the representational capacity of convolutional neural networks by sequentially applying channel and spatial attention. This two-stage refinement enables the network to focus on what and where to emphasize in the feature map, thereby enhancing its ability to discriminate fine-grained patterns in complex visual scenes such as apple orchards. Specifically, CBAM adaptively recalibrates feature responses by first modeling inter-channel relationships to emphasize semantically relevant channels, followed by spatial attention to localize salient regions.
3.1.1. Channel Attention Submodules
The channel attention submodule aims to identify what features are important by capturing inter-channel dependencies. Let $F \in \mathbb{R}^{C \times H \times W}$ denote the input feature map, where $C$, $H$, and $W$ are the number of channels, height, and width, respectively. To compute the channel attention map $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$, CBAM applies two spatial pooling operations, global max pooling and global min pooling, to generate two distinct channel descriptors, as defined in Equation (1):
$$F^{c}_{\max}(k) = \max_{i, j} F(k, i, j), \qquad F^{c}_{\min}(k) = \min_{i, j} F(k, i, j) \tag{1}$$
These descriptors are forwarded through a shared Multi-Layer Perceptron (MLP) with a bottleneck structure and reduction ratio r = 2, consisting of two fully connected layers with ReLU activation in between. The channel attention map is defined in Equation (2):
$$M_c(F) = \sigma\big(\mathrm{MLP}(F^{c}_{\max}) + \mathrm{MLP}(F^{c}_{\min})\big) \tag{2}$$
where $\sigma$ denotes the sigmoid activation function. The refined feature map is obtained via channel-wise multiplication, as defined in Equation (3):
$$F' = M_c(F) \otimes F \tag{3}$$
where $\otimes$ denotes element-wise multiplication broadcast across spatial dimensions.
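The channel attention steps described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the input feature map is called `F`, the two MLP outputs are summed before the sigmoid (as in the original CBAM formulation), biases are omitted, and the weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Max/min-pooled channel attention.

    F:  feature map of shape (C, H, W)
    W1: first MLP layer, shape (C // r, C)   (bottleneck with reduction ratio r)
    W2: second MLP layer, shape (C, C // r)
    Returns the attention vector (C,) and the refined feature map (C, H, W).
    """
    f_max = F.max(axis=(1, 2))  # global max pooling over H and W -> (C,)
    f_min = F.min(axis=(1, 2))  # global min pooling over H and W -> (C,)

    def shared_mlp(v):
        # Two fully connected layers with ReLU in between; same weights for both descriptors.
        return W2 @ np.maximum(W1 @ v, 0.0)

    # Assumption: descriptors are merged by summation before the sigmoid.
    m_c = sigmoid(shared_mlp(f_max) + shared_mlp(f_min))
    # Broadcast the per-channel weights across the spatial dimensions.
    return m_c, F * m_c[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 2  # r = 2 matches the reduction ratio stated in the text
F = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1   # placeholder weights
W2 = rng.standard_normal((C, C // r)) * 0.1   # placeholder weights
m_c, F_refined = channel_attention(F, W1, W2)
print(m_c.shape, F_refined.shape)  # (8,) (8, 5, 5)
```

Because min pooling responds to the weakest activation in each channel, it gives the MLP a second, complementary summary alongside the strongest response captured by max pooling.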
3.1.2. Spatial Attention Submodules
After refining features across channels, the spatial attention submodule focuses on identifying where to emphasize by capturing spatial correlations. Given the channel-refined feature map $F'$, CBAM again applies global max pooling and global min pooling, but this time along the channel axis, resulting in two spatial descriptors defined in Equation (4):
$$F^{s}_{\max}(i, j) = \max_{k} F'(k, i, j), \qquad F^{s}_{\min}(i, j) = \min_{k} F'(k, i, j) \tag{4}$$
These are concatenated along the channel axis and processed using a convolutional layer with a 7 × 7 kernel to produce the spatial attention map defined in Equation (5):
$$M_s(F') = \sigma\big(f^{7 \times 7}([F^{s}_{\max}; F^{s}_{\min}])\big) \tag{5}$$
where $\sigma$ is the sigmoid activation function, $f^{7 \times 7}$ denotes a convolution operation, and $[\cdot\,;\cdot]$ indicates channel-wise concatenation. The spatial attention is then applied via element-wise multiplication defined in Equation (6):
$$F'' = M_s(F') \otimes F' \tag{6}$$
The combined sequential application of the channel attention submodule and the spatial attention submodule allows CBAM to adaptively recalibrate both the semantic importance (channels) and positional relevance (spatial locations) of features. This is particularly beneficial for apple segmentation, where occlusion, lighting variation, and background clutter often challenge conventional feature extractors and degrade feature quality. The final output $F''$ contains features that are attentively refined in both the spatial and channel dimensions. Moreover, CBAM’s dual pooling strategy enhances its sensitivity to diverse feature activations, making it particularly effective for capturing subtle visual cues in complex orchard environments, as illustrated in Figure 2, which shows the overall architecture of the CBAM.
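A minimal NumPy sketch of the spatial attention stage follows. It assumes zero padding so the attention map keeps the input resolution, uses cross-correlation (the convention in deep learning frameworks), omits the bias term, and takes a random 7 × 7 kernel as a placeholder; none of these details are specified in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Single-output-channel convolution with zero padding.

    x: (C_in, H, W); kernel: (C_in, k, k) with odd k. Returns (H, W).
    """
    _, k, _ = kernel.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(F_c, kernel):
    """Max/min-pooled spatial attention on a channel-refined map F_c of shape (C, H, W)."""
    s_max = F_c.max(axis=0)            # max over channels -> (H, W)
    s_min = F_c.min(axis=0)            # min over channels -> (H, W)
    desc = np.stack([s_max, s_min])    # channel-wise concatenation -> (2, H, W)
    m_s = sigmoid(conv2d_same(desc, kernel))
    # Broadcast the per-pixel weights across all channels.
    return m_s, F_c * m_s[None, :, :]

rng = np.random.default_rng(0)
F_c = rng.standard_normal((8, 10, 10))
kernel = rng.standard_normal((2, 7, 7)) * 0.05  # placeholder 7 x 7 weights
m_s, F_out = spatial_attention(F_c, kernel)
print(m_s.shape, F_out.shape)  # (10, 10) (8, 10, 10)
```

The explicit loops keep the example readable; a practical implementation would rely on a framework convolution instead.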
In the proposed DABAMNet architecture, the CBAM is strategically integrated after DABou Unit 5 within the Intermediate Module (M2), specifically inside the second DABAM block. This placement ensures that the attention mechanism operates on feature representations that are semantically enriched and have undergone multiple levels of receptive field expansion. By refining the features after mid-to-deep level abstraction, CBAM enhances the model’s sensitivity to subtle visual cues while preserving contextual coherence, which ultimately contributes to more accurate segmentation outcomes.
3.2. Dilated Asymmetric Bottleneck Unit (DABou)
The Dilated Asymmetric Bottleneck Unit (DABou) serves as the fundamental building unit of the DABAM blocks within the Intermediate Module (M2), forming the core of the DABAMNet architecture. Each DABou unit is designed to extract multi-scale spatial and semantic information efficiently by combining depthwise separable convolutions, dilated convolutions, and residual learning.
Inspired by Li et al., 2020, the DABou unit employs dilation and a parameter-efficient design to enlarge the receptive field without significantly increasing computational complexity [49]. The unit processes an input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the number of channels, height, and width, respectively. This input undergoes the following sequence of operations, designed to extract both local and contextual features:
1. Initial Convolution: The input feature map is normalized and activated using Batch Normalization and PReLU, followed by a standard 3 × 3 convolution with a fixed number of output filters (32 or 64, depending on the stage). This initializes the bottleneck structure and prepares the feature map for further filtering, producing an intermediate representation $F_0$, as defined in Equation (7):
$$F_0 = \mathrm{Conv}_{3 \times 3}\big(\delta(\mathrm{BN}(X))\big) \tag{7}$$
where $\delta$ denotes the PReLU activation function and $\mathrm{BN}$ denotes Batch Normalization.
2. Dual-Branch Processing: The output $F_0$ is split into two branches to separately capture local and dilated contextual features:
Branch 1 (Local Context): Applies a depthwise 3 × 3 convolution with no dilation followed by a 1 × 1 pointwise convolution. This branch extracts fine spatial features via standard depthwise separable convolution followed by a pointwise projection, as defined in Equation (8):
$$B_1 = W_p * \big(W_d * F_0\big) \tag{8}$$
Branch 2 (Dilated Context): Applies a depthwise 3 × 3 convolution with dilation rate $r \in \{2, 4, 8, 16\}$, selected based on the DABou unit’s depth in the Intermediate Module, followed by a 1 × 1 convolution. This branch expands the receptive field, allowing the network to learn global and contextual semantics effectively, as defined in Equation (9):
$$B_2 = W_p * \big(W_d^{(r)} *_r F_0\big) \tag{9}$$
where $W_d$ is the depthwise kernel (no dilation); $W_d^{(r)}$ is the dilated depthwise kernel with rate $r$; and $W_p$ is the shared pointwise kernel (per branch).
3. Feature Fusion: The two branches are combined through element-wise addition to form a composite feature representation. This fusion combines both high-resolution spatial details and long-range semantic dependencies into a unified representation, which is then followed by Batch Normalization and PReLU activation, as defined in Equation (10):
$$F_{\mathrm{fuse}} = \delta\big(\mathrm{BN}(B_1 + B_2)\big) \tag{10}$$
4. Final Projection and Residual Connection: To refine the fused features, a final 1 × 1 pointwise convolution is applied, followed by batch normalization and PReLU activation. A residual connection is then added by summing the output with the initial projection $F_0$, enhancing gradient flow and stabilizing the learning process, as defined in Equation (11):
$$\mathrm{DABou}_r(X) = \delta\big(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(F_{\mathrm{fuse}}))\big) + F_0 \tag{11}$$
where $r$ denotes the dilation rate used in the DABou unit.
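The four steps above can be traced in a compact NumPy sketch. It is a reading of the prose under several assumptions rather than the authors' code: Batch Normalization is replaced by per-channel standardization of a single sample, PReLU uses a fixed slope of 0.25, each branch has its own pointwise kernel, biases are omitted, and the channel counts and weights are small random placeholders chosen for speed.

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU with a fixed slope (a learned parameter in a real network)."""
    return np.where(x > 0, x, alpha * x)

def batch_norm(x, eps=1e-5):
    """Stand-in for Batch Normalization: per-channel standardization of one sample."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv3x3(x, w):
    """Standard 3 x 3 convolution, zero padding. x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    H, W = x.shape[1:]
    out = np.zeros((w.shape[0], H, W))
    for di in range(3):
        for dj in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, di, dj], xp[:, di:di + H, dj:dj + W])
    return out

def depthwise3x3(x, w, rate=1):
    """Depthwise 3 x 3 convolution with dilation `rate`. x: (C, H, W); w: (C, 3, 3)."""
    xp = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    H, W = x.shape[1:]
    out = np.zeros(x.shape)
    for di in range(3):
        for dj in range(3):
            tap = xp[:, di * rate:di * rate + H, dj * rate:dj * rate + W]
            out += w[:, di, dj, None, None] * tap
    return out

def pointwise(x, w):
    """1 x 1 convolution. x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def dabou_unit(x, p, rate):
    f0 = conv3x3(prelu(batch_norm(x)), p["w_in"])                       # step 1
    b1 = pointwise(depthwise3x3(f0, p["w_dw"], rate=1), p["w_pw1"])     # step 2, local branch
    b2 = pointwise(depthwise3x3(f0, p["w_dw_dil"], rate), p["w_pw2"])   # step 2, dilated branch
    fused = prelu(batch_norm(b1 + b2))                                  # step 3
    return prelu(batch_norm(pointwise(fused, p["w_out"]))) + f0         # step 4, residual

rng = np.random.default_rng(0)
c_in, c, height, width = 4, 8, 12, 12  # placeholder sizes, far smaller than 32 or 64 filters
params = {
    "w_in": rng.standard_normal((c, c_in, 3, 3)) * 0.1,
    "w_dw": rng.standard_normal((c, 3, 3)) * 0.1,
    "w_dw_dil": rng.standard_normal((c, 3, 3)) * 0.1,
    "w_pw1": rng.standard_normal((c, c)) * 0.1,
    "w_pw2": rng.standard_normal((c, c)) * 0.1,
    "w_out": rng.standard_normal((c, c)) * 0.1,
}
x = rng.standard_normal((c_in, height, width))
y = dabou_unit(x, params, rate=4)
print(y.shape)  # (8, 12, 12)
```

Note that the residual sum in step 4 only works because the final projection keeps the same channel count as the initial convolution, and padding by the dilation rate keeps the spatial size unchanged.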
3.3. The Proposed Network: DABAMNet
DABAMNet is a modular deep learning architecture specifically designed for robust semantic segmentation of apple orchard scenes. It aims to capture both fine-grained spatial details and high-level contextual semantics through a carefully structured encoder–decoder design. The architecture is composed of three primary modules: the Initial Module (M1) for early feature extraction, the Intermediate Module (M2) for hierarchical and attention-enhanced multiscale learning, and the Final Module (M3) for dense semantic prediction. This design facilitates efficient feature reuse, adaptive receptive field expansion, and attention-guided refinement.
The overall motivation for DABAMNet stems from the limitations observed in conventional CNN-based segmentation models, which often struggle to balance spatial detail preservation with deep contextual understanding, particularly in cluttered and heterogeneous orchard environments. DABAMNet addresses these challenges through three key strategies: (1) early-stage downsampling coupled with spatially sensitive convolutional layers to retain boundary details, (2) stacked Depthwise Asymmetric Bottleneck (DABou) units with progressively increasing dilation rates to enhance multiscale contextual awareness without excessive parameter growth, and (3) the integration of a Convolutional Block Attention Module (CBAM) to dynamically recalibrate feature maps along both channel and spatial dimensions.
The encoder consists of the Initial and Intermediate Modules. The Initial Module (M1) transforms the raw RGB input into compact yet informative features using a combination of downsampling and residual convolutional operations. The Intermediate Module (M2) is composed of two DABAM blocks, each containing multiple DABou units with varying dilation rates and a CBAM positioned strategically to refine feature saliency. This hierarchical structure enables the network to learn progressively more abstract and semantically rich representations. Finally, the decoder is represented by the Final Module (M3), which projects the encoded features into class scores using a 1 × 1 convolution and restores the spatial resolution via bilinear interpolation. This allows DABAMNet to output pixel-level segmentation maps aligned with the original image dimensions.
3.3.1. Initial Module (M1)
The Initial Module of DABAMNet serves as the entry point of the network, transforming the raw RGB input image into an informative feature representation suitable for deeper semantic processing. It performs early-stage downsampling while preserving low-level spatial patterns critical for boundary-aware segmentation.
Let the input image be denoted as $I \in \mathbb{R}^{3 \times H \times W}$, where $H$ and $W$ represent the height and width of the input image, respectively. The processing in M1 proceeds as follows:
Thus, the overall transformation performed by the Initial Module is expressed in Equation (15):
This encoded representation serves as the input to the subsequent intermediate module for deeper context-aware feature learning.
3.3.2. Intermediate Module (M2)
The Intermediate Module of DABAMNet receives its input feature map from the Initial Module (M1) and is composed of two sequential DABAM blocks. Each block consists of multiple stacked DABou units that facilitate hierarchical feature enrichment. These blocks are specifically designed to progressively expand the receptive field through increasing dilation rates, while preserving spatial resolution and maintaining parameter efficiency.
The first DABAM block comprises three DABou units, each with the same dilation rate and 32 filters per unit. Its output is concatenated with the block’s input and an earlier shallow skip feature. A dual-path downsampling follows, using strided convolution and max pooling to reduce spatial resolution while retaining detail.
The second DABAM block includes six DABou units with 64 filters, arranged in three pairs that each share a dilation rate: DABou units 4–5, 6–7, and 8–9. After the fifth unit, a CBAM is inserted to enhance attention over high-level features. The output is concatenated with the block’s input and a deeper skip connection, followed by normalization and activation.
To reflect the multiple uses of DABou units in each DABAM block, the operation is formally defined in Equation (16). The complete DABou operations within the first and second DABAM blocks are defined in Equation (17), whose notation covers the following:
sequential composition;
concatenation;
the dual-path downsampling;
the operation applied after a specified stage;
the resized skip connections from shallow layers.
Finally, the output of the Intermediate Module M2 is formulated in Equation (18).
This layered architecture enables multi-scale feature learning while retaining strong spatial-semantic representations critical for segmentation in complex orchard imagery.
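The effect of stacking dilated units on the receptive field can be illustrated with a short calculation. The following is a minimal sketch, assuming each DABou unit is reduced to a single stride-1 dilated convolution with a 3-tap kernel; the function name and both dilation schedules are hypothetical illustrations and are not the rates used in DABAMNet.

```python
def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field (pixels along one axis) of stride-1 dilated
    convolutions applied one after another."""
    field = 1
    for d in dilations:
        # A k-tap kernel with dilation d spans (k - 1) * d + 1 pixels,
        # so each layer widens the field by (k - 1) * d.
        field += (kernel_size - 1) * d
    return field


# Hypothetical schedules chosen only for illustration (assumption).
uniform_block = [2, 2, 2]                 # three units, one shared rate
increasing_block = [4, 4, 8, 8, 16, 16]   # three pairs, rate grows per pair

print(stacked_receptive_field(uniform_block))     # 13
print(stacked_receptive_field(increasing_block))  # 113
```

Under these assumptions, the receptive field grows linearly with the sum of the dilation rates while the number of weights per unit stays fixed, which is the trade-off the block design relies on.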
3.3.3. Final Module (M3)
The Final Module of DABAMNet serves as the classifier head that transforms the encoded features into a semantic segmentation map. It takes the output from the Intermediate Module M2, which contains both spatial and semantic information, and performs class-wise projection and resolution restoration to produce dense predictions at the original image size.
Let the input to this module be the feature map produced by M2, described by its number of feature channels and its downsampled spatial resolution. The operations in M3 are formulated in Equation (19), whose terms denote the following:
the weights of a 1 × 1 convolutional layer that projects the features from the input channels to the output classes;
the bilinear interpolation that restores the original resolution;
the function that generates the class probabilities for each pixel;
the final semantic segmentation prediction;
the dilation rate of the DABou unit.
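The sequence of operations in the Final Module can be sketched in plain Python. This is an illustrative sketch only: it assumes a softmax produces the per-pixel class probabilities, uses corner-aligned sampling for the bilinear interpolation, and operates on a toy feature map. All function names, the weights, and the input values are hypothetical.

```python
import math


def conv1x1(features, weights):
    """Per-pixel projection: features[h][w][C] times weights[C][K] -> scores[h][w][K]."""
    num_classes = len(weights[0])
    return [[[sum(px[c] * weights[c][k] for c in range(len(px)))
              for k in range(num_classes)]
             for px in row]
            for row in features]


def bilinear_resize(grid, out_h, out_w):
    """Bilinear interpolation of grid[h][w][K] to out_h x out_w (corner-aligned)."""
    in_h, in_w, depth = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            row.append([
                (1 - fy) * ((1 - fx) * grid[y0][x0][k] + fx * grid[y0][x1][k])
                + fy * ((1 - fx) * grid[y1][x0][k] + fx * grid[y1][x1][k])
                for k in range(depth)])
        out.append(row)
    return out


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment(features, weights, out_h, out_w):
    """Project to class scores, upsample, then take per-pixel probabilities and labels."""
    scores = bilinear_resize(conv1x1(features, weights), out_h, out_w)
    probs = [[softmax(px) for px in row] for row in scores]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    return probs, labels


# Toy 2 x 2 feature map with two channels and two classes (background, apple).
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 0.0], [0.0, 1.0]]]
weights = [[2.0, 0.0],   # channel 0 supports class 0
           [0.0, 2.0]]   # channel 1 supports class 1
probs, labels = segment(features, weights, 4, 4)
print(labels)
```

Projecting to class scores before upsampling keeps the interpolation cost proportional to the number of classes rather than the number of feature channels.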
By integrating the three core modules, Initial Module M1, Intermediate Module M2, and Final Module M3, the complete forward operation of DABAMNet is expressed as a nested function composition in Equation (20):
$$\mathrm{DABAMNet}(X) = M_3\big(M_2\big(M_1(X)\big)\big) \qquad (20)$$
where
$X$ is the input RGB image;
$M_1$ is the Initial Module, which extracts low-level spatial and contextual features;
$M_2$ is the Intermediate Module, composed of stacked DABAM blocks with multiscale attention refinement that enhances semantic richness and the receptive field;
$M_3$ is the Final Module, which projects the learned features into class probabilities through convolution and upsampling.
This formulation emphasizes the modular and progressive design of DABAMNet, wherein hierarchical representations are successively refined from low-level textures to high-level semantic understanding, as illustrated in Figure 3, which depicts the overall network architecture of the proposed model.
4. Results and Discussion
This section presents a comprehensive analysis of the experimental results obtained using the proposed DABAMNet model. It begins by detailing the dataset used for training and evaluation, followed by an overview of the evaluation metrics adopted to quantify segmentation performance. The experimental setup is then described to ensure reproducibility and clarity of the training pipeline. An ablation study is conducted to investigate the contribution of key architectural components, including the placement of attention modules and the impact of different dilation strategies. Finally, the performance of DABAMNet is compared to several established segmentation models across key metrics, highlighting its effectiveness in segmenting apples under complex orchard environments. The findings in this section provide insights into the design choices and practical implications of DABAMNet for real-world agricultural applications.
4.1. Dataset
To evaluate the performance of the proposed DABAMNet architecture, this study utilizes an apple segmentation dataset introduced by [50], developed by the University of Minnesota Research Center. The original dataset was acquired using a standard consumer-grade device, a Samsung Galaxy S4 mobile phone camera, highlighting the practicality of applying computer vision models to low-cost, field-deployable hardware. Data collection was performed by recording videos while walking along apple tree rows at approximately 1 m per second. The camera was held horizontally to face the tree canopy laterally, minimizing motion blur and capturing varying fruit orientations.
From the recorded video streams, image frames were sampled for annotation. While the original study extracted samples every fifth frame, this work adopts a sparser sampling strategy by selecting every 30th frame, effectively reducing redundancy and emphasizing visual diversity. A total of ten video sequences were collected from six distinct tree rows, from which 670 images were randomly selected and manually annotated for use in this study. These images were partitioned using a five-fold cross-validation strategy, where each fold maintains a 4:1 ratio (536 training and 134 testing images) to facilitate supervised learning and model evaluation. The dataset captures a broad range of visual variability. Apples appear in multiple color variations, including green, red, orange, and hybrid tones, and are situated at varying distances from the camera. Illumination conditions vary significantly due to the recordings being conducted at various times of the day, resulting in a variety of lighting angles and shadow patterns. This diversity presents both a realistic and challenging setting for image segmentation tasks.
Each selected image was annotated by trained human labelers using high-resolution polygon masks to delineate individual apple instances with precision. The annotation process, which required approximately 30 min per image, followed a rigorous quality assurance workflow. All masks were subject to a secondary verification process to ensure accuracy and consistency across the dataset. The final annotations were saved in Portable Network Graphics (PNG) format with a resolution of 720 × 1280 pixels, providing a detailed ground truth for evaluating model performance.
Figure 4 illustrates sample images from the dataset, capturing both the front-facing and receding views of the tree rows. This realistic acquisition setting, combined with careful annotation, makes the dataset a strong benchmark for assessing the generalization capability of segmentation models in unconstrained agricultural environments.
4.2. Evaluation Metrics
To assess the performance of the proposed DABAMNet model and its baseline counterparts, two standard evaluation metrics are employed: Accuracy (ACR) and Intersection over Union (IoU). These metrics provide complementary insights into the predictive reliability and spatial localization quality of the segmentation models.
Accuracy (ACR) measures the overall correctness of classification by evaluating both true positive (TP) and true negative (TN) predictions relative to all prediction outcomes. It reflects the proportion of correctly classified pixels (both foreground and background) over the total number of pixels, as expressed in Equation (21):
$$\mathrm{ACR} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (21)$$
Here, FP and FN represent false positive and false negative predictions, respectively. A high accuracy score indicates that the model effectively distinguishes between the apple (foreground) and non-apple (background) regions.
Intersection over Union (IoU), also known as the Jaccard Index, is used to quantify the overlap between the predicted segmentation mask and the ground truth. It is calculated as the ratio of the intersection area to the union area of the predicted and actual masks, as shown in Equation (22):
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (22)$$
Unlike accuracy, IoU focuses solely on foreground prediction quality and does not consider true negatives. It is particularly well-suited for evaluating segmentation tasks where precise delineation of target objects (e.g., apples) is critical.
Both metrics rely on the standard confusion matrix components: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Misclassifications occur when the predicted label does not match the ground truth, for example, when a pixel predicted as positive is actually negative (false positive), or when an actual positive pixel is missed (false negative). Conversely, correct matches between predictions and ground truth labels yield true positives and true negatives.
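The two metrics can be computed directly from pixel-wise confusion counts. The sketch below is illustrative: the helper names and the 4 × 4 toy masks are hypothetical, and returning an IoU of 1.0 when both masks are empty is an assumed convention rather than something specified in this study.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN over two binary masks of equal shape (1 = apple)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 0 and t == 0:
                tn += 1
            elif p == 1 and t == 0:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all pixels classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def iou(tp, fp, fn):
    """Foreground overlap; true negatives are ignored."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # empty-mask convention (assumption)


# Toy masks: two apple pixels in the ground truth, one found, one missed,
# and one background pixel wrongly predicted as apple.
truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]

tp, tn, fp, fn = confusion_counts(pred, truth)
print(accuracy(tp, tn, fp, fn))  # 0.875
print(iou(tp, fp, fn))
```

In this toy case the accuracy stays high because background pixels dominate, while the IoU drops to one third, which illustrates why the two metrics are reported together.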
Additionally, the number of model parameters, encompassing both trainable and non-trainable weights, is reported as an auxiliary metric. This provides insight into the computational cost and efficiency of each CNN model under evaluation. Parameter counts are particularly relevant in real-time or resource-constrained deployment scenarios, such as embedded agricultural robotics. Together, these metrics offer a comprehensive assessment of segmentation accuracy, spatial alignment, and model complexity.
4.3. Experimental Setup
To evaluate the effectiveness of the proposed DABAMNet model, four state-of-the-art CNN architectures—PSPNet, SegNet, FC-DenseNet, and DabNet—were selected as benchmarks, based on their relevance to recent agricultural computer vision studies. All five models, including DABAMNet, were implemented in Python 3.9.10 using the Keras-TensorFlow 2.10.1 library on the Kaggle platform. Their hyperparameters were optimized individually to achieve the best possible classification and segmentation performance. Each model was trained and tested using the complete set of 670 annotated apple images, consistently sampled across all five folds to ensure balanced and comprehensive evaluation. The primary criterion for optimizing the hyperparameters was the convergence of the loss function during training and testing. A convergence threshold of 0.1 was established for the loss value, which all models were required to meet. A grid search strategy was used to explore the hyperparameter space, with the optimal settings summarized in
Table 2. Each model achieved successful convergence within the training and validation datasets using these parameters.
DABAMNet training was guided by the categorical cross-entropy loss function with accuracy as the evaluation metric. The Nadam optimizer, configured with a fixed learning rate, was applied consistently across all models. No data augmentation techniques were applied, apart from uniform image resizing to match the input dimensions required by each model. In the case of DABAMNet, the original images (720 × 1280 pixels) were resized to 256 × 512 pixels. Batch sizes were determined based on model size and hardware constraints. All experiments were conducted on a single NVIDIA RTX 2080 Ti GPU. For local testing, DABAMNet was also validated on an Intel i5-8250U processor with a 1.80 GHz clock rate, where a batch size of 9 images was feasible. The total number of parameters in DABAMNet was 20,603,811, of which 20,592,871 were trainable and 10,940 were non-trainable.
To mitigate sampling bias and reduce overfitting, a five-fold cross-validation strategy was employed. In each fold, the dataset was divided into training and testing subsets using a consistent 4:1 ratio. This yielded 536 training images and 134 testing images per fold. This validation approach ensures balanced evaluation and reliable performance metrics across diverse image samples.
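The fold arithmetic can be checked with a small index-based sketch. It is a minimal illustration, assuming a seeded random shuffle and contiguous fold slices; the function name and the seed are hypothetical and do not describe the exact partition used in the experiments.

```python
import random


def five_fold_splits(num_images=670, folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold.

    Assumes num_images is divisible by folds, as is the case for 670 and 5.
    """
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    fold_size = num_images // folds       # 134 images per test fold
    for k in range(folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test


for fold, (train, test) in enumerate(five_fold_splits(), start=1):
    print(fold, len(train), len(test))  # each fold: 536 training, 134 testing
```

Each image appears in exactly one test fold, so every annotated image contributes to evaluation once across the five runs.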
4.4. Ablation Study
To identify the optimal configuration of the proposed DABAMNet model, an ablation study was conducted focusing on three key design factors: (1) the placement of the Convolutional Block Attention Module (CBAM) within the DABAMNet architecture, (2) the channel reduction ratio used in CBAM’s channel attention submodule, and (3) the global pooling strategy employed within CBAM. The objective was to evaluate how each of these factors affects segmentation accuracy and computational efficiency when localizing apples in complex orchard scenes. CBAM was selected for its ability to enhance CNN-based feature extraction by sequentially applying the channel attention submodule and the spatial attention submodule, allowing the network to prioritize both relevant feature maps and spatial regions. Within the channel attention submodule, the reduction ratio is applied in the internal Multi-Layer Perceptron (MLP) to compress and then restore the channel dimensions. This operation regulates model complexity, promotes generalization, and controls redundancy. It provides a balance between expressive power and computational cost. Such capabilities are especially advantageous for addressing challenges like occlusion, variable lighting, and background clutter in fruit segmentation tasks.
To further explore the optimal placement of the Convolutional Block Attention Module (CBAM) within the DABAMNet architecture, CBAM was inserted into three distinct locations. Each of these locations represents a distinct level of feature abstraction:
After DABou unit 5: Captures low-level to mid-level patterns beyond basic edges and textures, supporting small object detection.
After DABou unit 7: Extracts more complex structural information and helps interpret spatial context across the image.
After DABou unit 9: Encodes highly abstract representations essential for differentiating object classes and conditions.
This experimental design enabled the identification of the most suitable feature abstraction level for integrating attention mechanisms.
The reduction ratio in the channel attention submodule was varied to evaluate how dimensionality compression influences the network’s ability to prioritize prominent features. Ratios of 2, 4, and 8 were evaluated:
r = 2: Retains detailed channel information but increases computational cost.
r = 4: Balances performance and efficiency by moderately reducing channel dimensions.
r = 8: Enhances computational efficiency but may result in the loss of finer feature details.
The objective was to identify a reduction setting that maintains high segmentation performance while avoiding unnecessary computational overhead.
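The cost side of this trade-off can be made concrete by counting the weights of the channel attention MLP for each ratio. The sketch below assumes two bias-free dense layers and an input of 64 channels; both are illustrative assumptions, and the channel count seen by CBAM in DABAMNet may differ.

```python
def channel_mlp_weights(channels, reduction):
    """Weights in a bias-free MLP mapping channels -> channels // reduction -> channels."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


CHANNELS = 64  # assumed channel count, for illustration only

for r in (2, 4, 8):
    print(r, channel_mlp_weights(CHANNELS, r))
# 2 4096
# 4 2048
# 8 1024
```

Halving the ratio doubles the MLP size, but under these assumptions the totals remain small next to the roughly 20.6 million parameters of the full network.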
Finally, various global pooling strategies were evaluated within the attention module to determine their effect on model performance. Global Max Pooling, Global Average Pooling, and Global Min Pooling were applied individually and in combinations to assess their influence on attention precision and segmentation accuracy:
Global Max Pooling: Highlights the most dominant activations, directing attention toward the most salient and discriminative features.
Global Average Pooling: Captures overall contextual patterns across the feature map, promoting stability and generalization.
Global Min Pooling: Identifies minimally activated regions, which can suppress background noise and filter out irrelevant or redundant information.
This exploration aimed to identify the pooling configuration that most effectively supports attention learning in complex orchard environments.
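The pooling alternatives can be compared on a toy feature map. The sketch below follows the general CBAM channel attention pattern (each pooled descriptor passes through a shared MLP, the outputs are summed, and a sigmoid yields one weight per channel). It assumes the max–min variant is fused in the same way; the function names, weights, and feature values are hypothetical.

```python
import math


def global_pool(fmap, mode):
    """Reduce fmap[c][h][w] to one descriptor per channel."""
    reducers = {"max": max, "min": min, "avg": lambda v: sum(v) / len(v)}
    reduce_fn = reducers[mode]
    return [reduce_fn([x for row in channel for x in row]) for channel in fmap]


def shared_mlp(vec, w1, w2):
    """Bias-free two-layer MLP with ReLU; w1[j] feeds hidden unit j, w2[i] feeds output i."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, unit))) for unit in w1]
    return [sum(h * w for h, w in zip(hidden, unit)) for unit in w2]


def channel_attention(fmap, w1, w2, modes=("max", "min")):
    """Sum the MLP outputs of the selected pooled descriptors, then apply a sigmoid."""
    summed = [0.0] * len(fmap)
    for mode in modes:
        out = shared_mlp(global_pool(fmap, mode), w1, w2)
        summed = [s + o for s, o in zip(summed, out)]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]


# Two channels of size 2 x 2 (illustrative values).
fmap = [[[0.0, 1.0], [2.0, 3.0]],
        [[-1.0, 0.5], [0.5, 4.0]]]
w1 = [[0.5, 0.5]]      # 2 channels -> 1 hidden unit (reduction ratio 2)
w2 = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channels

print(global_pool(fmap, "max"))   # [3.0, 4.0]
print(global_pool(fmap, "min"))   # [0.0, -1.0]
print(global_pool(fmap, "avg"))   # [1.5, 1.0]
print(channel_attention(fmap, w1, w2))
```

Passing `modes=("max", "avg")` instead reproduces the pooling pair of the original CBAM design, so the same function can express each combination examined in the ablation.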
This ablation study, as shown in
Table 3, demonstrated that the highest Intersection over Union (IoU) score of 0.7291 was achieved when the Convolutional Block Attention Module (CBAM) was integrated after DABou unit 5, using a channel reduction ratio of 2 in CBAM’s channel attention submodule and a dual pooling strategy combining global max pooling and global min pooling in both the channel attention submodule and the spatial attention submodule of CBAM. This configuration underscores the importance of applying attention at an intermediate stage of the network, where low-level textures and structures remain informative and higher-level semantic features begin to form. At this depth, the network can effectively benefit from attention mechanisms that refine feature representations by selectively amplifying relevant cues and suppressing noise. The low reduction ratio preserves detailed channel relationships within the attention submodule, while the combined pooling operations ensure that both dominant and subtle activations are captured across spatial and channel dimensions. Collectively, these design choices contribute to improved segmentation performance in complex orchard environments.
Placing CBAM after DABou unit 5 yielded superior performance compared to placements after DABou unit 7 and after DABou unit 9 due to its strategic position at an intermediate depth, where both low-level textures and emerging high-level semantics coexist. At this stage, the network retains rich spatial details essential for segmenting small, irregularly shaped objects like apples, while beginning to integrate broader contextual information. In contrast, applying CBAM after deeper layers (after DABou unit 7 and after DABou unit 9) limits its effectiveness, as feature maps at these stages are more abstract and spatially compressed. This abstraction, while useful for classification, often results in a loss of fine-grained spatial cues critical for precise localization. Consequently, CBAM has less spatial detail to refine, reducing its impact. Integrating attention earlier allows the model to enhance feature representation when both spatial precision and semantic cues are still accessible, resulting in more accurate segmentation.
Using a channel reduction ratio of 2 in CBAM’s channel attention submodule proved more effective than higher ratios of 4 and 8 due to its ability to preserve richer feature information during the dimensionality reduction process. A lower reduction ratio retains a larger proportion of the original channel descriptors when passing through the Multi-Layer Perceptron (MLP), thereby maintaining more nuanced inter-channel dependencies. This leads to finer attention weighting and more precise emphasis on informative feature channels. In contrast, higher reduction ratios compress the feature space more aggressively, which, although computationally efficient, risks discarding subtle but important discriminative signals. Such losses can be detrimental in segmentation tasks involving small or visually complex objects like apples, where detailed preservation is essential. The ratio of 2 offers a balanced trade-off, ensuring sufficient feature expressiveness while maintaining manageable computational complexity.
The dual pooling strategy combining global max pooling and global min pooling in both the channel and spatial attention submodules of CBAM demonstrated superior performance compared to other combinations involving global average pooling. This configuration effectively captures both the most salient and the least activated features within the feature maps. Global max pooling emphasizes the most prominent activations, which are crucial for identifying strong object signals, while global min pooling highlights underrepresented or suppressed features that may correspond to subtle object boundaries or background noise. Together, they provide a complementary view that enhances feature discrimination and robustness. In contrast, strategies that include global average pooling tend to smooth out activation differences, potentially diluting the contrast between critical and non-informative regions. Even when average pooling is combined with max or min pooling, its averaging nature can obscure sharp feature responses or fail to emphasize weak but relevant signals. Therefore, the combination of max and min pooling provides a more dynamic and discriminative attention mechanism, leading to improved segmentation precision, as confirmed in
Table 3.
In summary, the ablation study systematically evaluated key design choices within the CBAM-integrated DABAMNet architecture, focusing on attention module placement, channel reduction ratio, and pooling strategies. The results, as shown in
Table 3, demonstrate that inserting CBAM after DABou unit 5, using a channel reduction ratio of 2, and employing a dual pooling strategy combining global max pooling and global min pooling in both the channel attention submodule and the spatial attention submodule yields the highest segmentation accuracy. This configuration achieves an effective balance between feature expressiveness and computational efficiency, allowing the network to enhance relevant spatial and semantic cues while suppressing background noise. These findings not only validate the proposed architectural design but also offer theoretical insights into the optimal integration of attention mechanisms for robust and precise apple segmentation under real-world conditions.
4.5. Discussion on DABAMNet and Its Benchmarked Models’ Performance
The performance of the proposed DABAMNet model was evaluated against four widely recognized semantic segmentation architectures: PSPNet, SegNet, FC-DenseNet, and DabNet. To further contextualize DABAMNet’s effectiveness, three additional attention-based models from recent literature were also included in the comparison:
DANet + U-Net, based on Fu et al. (2019) [51], implemented with a U-Net backbone replacing the original ResNet encoder, while retaining the dual attention block that combines positional and channel attention at the bottleneck [52].
U-Net + CBAM, adapted from Li et al. (2025) [53], incorporating the CBAM (without multiscale processing) before each max pooling operation in the encoder to enhance spatial and channel-level feature extraction.
DeepLabV3+ + Spatial Attention, inspired by Liu and He (2021) [54], using a lighter Xception41 encoder in place of Xception65 and integrating spatial attention within the ASPP module as described in the original study.
These models were selected based on their diverse architectural paradigms and proven effectiveness in agricultural image analysis. To ensure a fair comparison, all models were trained and tested on the same apple orchard dataset using standardized preprocessing procedures, identical hyperparameter settings, and consistent evaluation metrics. The comparative analysis focused on two primary indicators of segmentation quality: Overall Accuracy and Intersection over Union (IoU). A summary of the performance outcomes, along with network complexities, is presented in Table 4.
Compared with the baseline models without explicit attention mechanisms, DABAMNet achieved the highest performance, with an Accuracy of 0.9813 and an Intersection over Union of 0.7291. This reflects the model’s strength in both accurate class prediction and fine-grained boundary delineation. SegNet recorded a comparable Accuracy of 0.9809, but its lower Intersection over Union of 0.6896 reveals limitations in boundary localization. This shortcoming may be attributed to its reliance on pooling indices during upsampling, which compromises spatial detail recovery. PSPNet achieved an Accuracy of 0.9680, the lowest in this group, and an Intersection over Union of 0.6903, leveraging pyramid pooling for contextual understanding. However, it exhibits limited precision in reconstructing fine spatial structures. FC-DenseNet, despite being the most lightweight model with 14.7 million parameters, recorded the lowest Intersection over Union in this group at 0.5674, with an Accuracy of 0.9740. This decline is likely due to over-compression of spatial information in its densely connected architecture. DabNet, designed for computational efficiency, performed better with an Accuracy of 0.9806 and an Intersection over Union of 0.7156, but still trailed DABAMNet, particularly in its capacity to preserve detail in complex regions.
Within the group of attention-enhanced models, DABAMNet remained the top performer. The DANet + U-Net configuration, which introduces dual attention modules at the bottleneck of a U-Net backbone, attained an Accuracy of 0.9785 and an Intersection over Union of 0.5897. While it enhances contextual encoding, the relatively shallow backbone may limit its expressive power. The U-Net + CBAM variant, which inserts the CBAM before each max pooling operation, achieved an Accuracy of 0.9756 and an Intersection over Union of 0.4956. This suggests that applying attention at early encoding stages, without multiscale integration, yields only modest benefit. DeepLabV3+ + Spatial Attention, adapted with a reduced Xception41 encoder, recorded the lowest Accuracy (0.9752) and Intersection over Union (0.4876) in this group, despite its high parameter count of 128 million. The absence of channel attention and substantial spatial downsampling may explain this performance drop. In contrast, DABAMNet, with only 20.6 million parameters, achieved the highest Accuracy and Intersection over Union among all evaluated models. Its multi-scale design and the placement of attention at an intermediate depth contributed to this outcome, demonstrating a balance between efficiency and effectiveness in segmenting complex orchard scenes.
Figure 5 illustrates sample segmentation masks generated by DABAMNet, showcasing its ability to delineate apples under varied environmental conditions. The model effectively distinguishes apples from the background, even in challenging scenarios involving occlusion, overlapping fruit, or variable lighting conditions. These findings underscore the effectiveness of DABAMNet’s architectural innovations, particularly the integration of attention modules and depthwise asymmetric bottlenecks for enriched multi-scale feature encoding. Overall, the results validate DABAMNet as a promising architecture for high-precision segmentation tasks in complex orchard environments.
DABAMNet’s performance advantage can be attributed to its integration of dual attention mechanisms within the Convolutional Block Attention Module (CBAM): the channel attention submodule and the spatial attention submodule. These components collaboratively guide the network’s focus toward the visual features most relevant to the segmentation task, which is particularly beneficial for binary classification tasks involving object and non-object categories, such as apple versus background. The channel attention submodule selectively emphasizes meaningful feature maps, while the spatial attention submodule highlights critical spatial regions within those maps. Together, they enable the model to more accurately localize apple regions and suppress irrelevant background patterns, thereby improving segmentation precision.
In addition, DABAMNet adopts a dual pooling strategy within CBAM, combining global max pooling and global min pooling in both the channel attention submodule and the spatial attention submodule. Global max pooling extracts dominant discriminative features by selecting the most activated values, whereas global min pooling captures more subtle, often overlooked information. This hybrid approach increases the model’s sensitivity to diverse contextual cues, enhancing generalization in complex orchard scenes where apples may be partially occluded, affected by variable lighting, or visually similar to the background. The global pooling combination strengthens attention effectiveness and improves segmentation performance under challenging real-world conditions.
The baseline models are competent but lack DABAMNet’s combination of intermediate-depth attention and max–min pooling. As a result, their performance, particularly in terms of Intersection over Union, remains lower. This highlights the value of these attention-based enhancements in segmentation tasks that require fine spatial resolution.
Figure 6 presents a comparative analysis of training accuracy and loss across eight segmentation models. Among these, DABAMNet achieved the highest accuracy of 0.9813 with a relatively low loss value of 0.0648, reflecting both high predictive correctness and efficient learning. Classical architectures such as SegNet and DabNet followed closely with accuracies of 0.9809 and 0.9806 and corresponding loss values of 0.0829 and 0.0582. DeepLabV3+ with spatial attention recorded the lowest training loss at 0.0328, suggesting strong convergence; however, its accuracy of 0.9752 remained lower than that of DABAMNet. FC-DenseNet and DANet integrated with U-Net demonstrated moderate results, with accuracies of 0.9740 and 0.9785 and losses of 0.1610 and 0.0763, respectively. PSPNet showed the highest loss value of 0.4130, despite achieving a reasonable accuracy of 0.9680, suggesting potential issues such as poor optimization or limited spatial representation. U-Net enhanced with CBAM yielded a competitive accuracy of 0.9756 but suffered from an abnormally high loss of 27.7922, likely resulting from unstable training dynamics or inadequate normalization mechanisms. To maintain consistency in analysis and uphold the validity of the metric range, excessively high loss values are capped at 1.0000.
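The capping step amounts to clipping each reported loss at 1.0. The short sketch below applies it to the loss values quoted above; the dictionary layout and variable names are illustrative.

```python
# Training losses as reported in the text.
reported_loss = {
    "DABAMNet": 0.0648,
    "SegNet": 0.0829,
    "DabNet": 0.0582,
    "DeepLabV3+ + Spatial Attention": 0.0328,
    "FC-DenseNet": 0.1610,
    "DANet + U-Net": 0.0763,
    "PSPNet": 0.4130,
    "U-Net + CBAM": 27.7922,
}

LOSS_CAP = 1.0
capped_loss = {name: min(loss, LOSS_CAP) for name, loss in reported_loss.items()}

# List the models whose loss was changed by the cap.
changed = [name for name in reported_loss if capped_loss[name] != reported_loss[name]]
print(changed)  # ['U-Net + CBAM']
```

Only the U-Net + CBAM value is affected, so the cap changes how that single outlier is displayed without altering the ranking of the remaining models.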
It is also important to emphasize the role of both accuracy and Intersection over Union in evaluating segmentation models. Accuracy provides an overall measure of classification correctness, while Intersection over Union is critical for assessing segmentation quality, especially in applications requiring precise boundary delineation. DABAMNet’s Intersection over Union score of 0.7291 substantially exceeds that of DeepLabV3+ with spatial attention, which reached only 0.4876. This result supports the advantage of DABAMNet in capturing fine spatial structures and contextual dependencies. Such capabilities are essential in agricultural applications like yield estimation and orchard monitoring, where segmentation errors can have direct operational consequences. These findings position DABAMNet as a reliable, high-performing solution for precision agriculture, capable of achieving accurate and fine-grained segmentation under real-world conditions.
5. Conclusions and Future Work
This study presented DABAMNet, an attention-enhanced convolutional neural network tailored for high-precision apple segmentation in visually complex orchard environments. While incorporating established modules such as CBAM and bottleneck designs, DABAMNet introduces three architectural innovations: (1) the integration of depthwise asymmetric bottleneck units for efficient yet expressive feature encoding; (2) the strategic placement of dual attention modules at intermediate layers to balance semantic abstraction with spatial precision; and (3) the novel use of max–min pooling within the attention mechanism to capture both dominant and subtle visual cues. These innovations collectively enable DABAMNet to outperform the benchmarked CNN-based segmentation models in both accuracy and Intersection-over-Union (IoU). Its advantage has been validated through ablation studies and benchmarking against four established segmentation networks and three attention-based models, demonstrating strong potential for real-world deployment in precision agriculture and autonomous harvesting systems.
The empirical evaluation revealed that the optimal configuration for DABAMNet involves applying the CBAM attention mechanism after DABou unit 5, using a channel reduction ratio of 2, and adopting a dual-pooling strategy combining global max and global min pooling operations. This setup consistently delivered superior segmentation performance across multiple evaluation metrics, confirming the effectiveness of targeted attention refinement in deep convolutional architectures.
However, broader considerations should be acknowledged. The current model has been trained and tested exclusively on apple orchard imagery, and its direct applicability to other fruit types or agricultural environments with different visual characteristics remains untested. As such, DABAMNet’s effectiveness may diminish when applied to crops that vary significantly in color, texture, canopy structure, or background complexity. This limitation highlights the need for further investigation into model generalization and cross-domain adaptability. Future work will include cross-fruit evaluations (e.g., citrus, mango, grapes) to assess the transferability of DABAMNet across varying fruit morphologies and canopy structures.
Moreover, the model’s performance is closely tied to the availability of high-quality annotated datasets, which are essential for supervised training but costly and time-consuming to produce, especially in agricultural settings with complex occlusion and illumination conditions. This reliance on detailed manual labeling presents a significant barrier to deploying similar solutions across diverse crop types or geographic regions. To address this, semi-supervised learning or weak supervision may be explored to reduce the dependence on fully annotated data, thereby enabling more scalable and cost-effective deployment of segmentation models in real-world agricultural scenarios.
The proposed framework also carries a computational cost. While DABAMNet achieves high segmentation accuracy through the integration of attention modules and deep bottleneck structures, these enhancements increase architectural complexity, leading to longer training times and higher computational demands. These constraints may limit deployment on edge devices or in scenarios that require real-time inference. Potential mitigation strategies include the adoption of lightweight attention mechanisms and pruning techniques, which aim to reduce inference time and resource consumption while preserving segmentation performance. Such improvements are especially relevant for edge computing environments, such as agricultural robots and drone-based monitoring systems, where real-time processing under hardware constraints is essential. Addressing computational efficiency will therefore be a key direction in advancing the model’s practical applicability across diverse agricultural settings. Furthermore, although DABAMNet performs well under typical orchard conditions, its accuracy may decline in cases of severe occlusion or heavy background clutter.
Future research may address these challenges through several avenues. First, lightweight attention mechanisms such as Efficient Channel Attention [55] or Tiny Attention [56] could be incorporated to reduce computational complexity while maintaining segmentation performance. In parallel, model compression strategies, including sparse-subnetwork approaches motivated by the Lottery Ticket Hypothesis [57] and structured pruning techniques [58], could be explored to reduce computational overhead without sacrificing accuracy, thereby improving the model’s suitability for edge computing and mobile platforms such as drones or agricultural robots. Second, expanding the training dataset to include a broader range of challenging conditions, such as extreme occlusion, varied illumination, and clutter, could further enhance the model’s generalization capability. Third, DABAMNet could be integrated into real-time automated fruit harvesting pipelines to support large-scale agricultural operations. Fourth, incorporating multi-scale feature extraction techniques, as suggested in [59], may further improve its ability to detect and segment apples across a variety of spatial resolutions [60].
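As one illustration of what structured pruning could look like in practice, the sketch below ranks the filters of a single convolutional layer by their L1 norm and keeps only the strongest ones. This is a common magnitude-based criterion and is shown here only as an assumption about how such pruning might be approached; it is not a procedure evaluated in this study, and the function name `prune_filters_l1`, the `keep_ratio` parameter, and the toy weights are hypothetical.

```python
import numpy as np


def prune_filters_l1(weight, keep_ratio):
    """Keep the filters with the largest L1 norm.

    weight:     convolution weights, shape (out_channels, in_channels, kH, kW)
    keep_ratio: fraction of output filters to retain, in (0, 1]
    Returns the pruned weight tensor and the indices of the kept filters.
    """
    n_out = weight.shape[0]
    n_keep = max(1, int(round(n_out * keep_ratio)))
    # L1 norm of each output filter over its input channels and kernel window.
    l1 = np.abs(weight).reshape(n_out, -1).sum(axis=1)
    # Indices of the n_keep largest norms, restored to their original order.
    keep = np.sort(np.argsort(l1)[-n_keep:])
    return weight[keep], keep


# Toy layer with four 3x3 filters over three input channels (hypothetical values).
toy_weight = np.stack([np.full((3, 3, 3), v) for v in (0.1, 2.0, 0.5, 1.0)])
pruned, kept = prune_filters_l1(toy_weight, keep_ratio=0.5)
print(pruned.shape, kept)
```

In a full network, removing output filters from one layer also requires removing the matching input channels of the next layer, and the pruned model is typically fine-tuned afterwards to recover accuracy.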
Beyond these directions, future research may also explore domain adaptation techniques to improve robustness across new crops and field conditions. Overall, the findings of this study underscore the value of carefully integrating architectural components and optimizing attention placement to advance deep learning solutions for agricultural image analysis.