An Attention-Enhanced Bottleneck Network for Apple Segmentation in Orchard Environments

Jelas, Imran Md; Maluazi, Nur Alia Sofia; Zulkifley, Mohd Asyraf

doi:10.3390/agriculture15171802

by Imran Md Jelas, Nur Alia Sofia Maluazi and Mohd Asyraf Zulkifley *

Department of Electrical, Electronic and Systems Engineering, Faculty of Engineering and Built Environment, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(17), 1802; https://doi.org/10.3390/agriculture15171802

Submission received: 8 July 2025 / Revised: 30 July 2025 / Accepted: 22 August 2025 / Published: 23 August 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

As global food demand continues to rise, conventional agricultural practices face increasing difficulty in sustainably meeting production requirements. In response, deep learning-driven automated systems have emerged as promising solutions for enhancing precision farming. Nevertheless, accurate fruit segmentation remains a significant challenge in orchard environments due to factors such as occlusion, background clutter, and varying lighting conditions. This study proposes the Depthwise Asymmetric Bottleneck with Attention Mechanism Network (DABAMNet), an advanced convolutional neural network (CNN) architecture composed of multiple Depthwise Asymmetric Bottleneck Units (DABou), specifically designed to improve apple segmentation in RGB imagery. The model incorporates the Convolutional Block Attention Module (CBAM), a dual attention mechanism that enhances channel and spatial feature discrimination by adaptively emphasizing salient information while suppressing irrelevant content. Furthermore, the CBAM attention module employs multiple global pooling strategies to enrich feature representation across varying spatial resolutions. Through comprehensive ablation studies, the optimal configuration was identified as early CBAM placement after DABou unit 5, using a reduction ratio of 2 and combined global max-min pooling, which significantly improved segmentation accuracy. DABAMNet achieved an accuracy of 0.9813 and an Intersection over Union (IoU) of 0.7291, outperforming four state-of-the-art CNN benchmarks. These results demonstrate the model’s robustness in complex agricultural scenes and its potential for real-time deployment in fruit detection and harvesting systems. Overall, these findings underscore the value of attention-based architectures for agricultural image segmentation and pave the way for broader applications in sustainable crop monitoring systems.

Keywords: apple segmentation; deep learning; attention mechanism; precision agriculture; convolutional neural networks

1. Introduction

Global population growth continues to drive an unprecedented demand for food, placing significant pressure on modern agricultural systems. According to [1], the global volume of food consumption has steadily increased since 2015 and is projected to exceed 300 million metric tons by 2026. As illustrated in Figure 1, this trend highlights the urgent need for scalable and efficient agricultural solutions that can meet rising consumption demands. Ensuring food security under these circumstances is paramount. It requires not only increasing productivity but also guaranteeing equitable access to sufficient, safe, and nutritious food, as outlined by [2].

Within this context, fruits like apples are a vital component of global diets and economies. European Union farmers alone produce over 11.5 million tons annually, surpassing other fruits such as oranges and watermelons [3]. However, traditional fruit harvesting methods remain labor-intensive, time-consuming, and prone to inefficiencies due to occlusions from foliage and inconsistent lighting. Manual localization often results in missed detections and lower productivity, especially when apples are partially hidden by leaves or branches. Moreover, harvesting efficiency is limited by the inability of human labor to operate continuously at scale.

Compounding these challenges, climate-related disasters such as extreme droughts, heatwaves, and unseasonal frosts have increasingly disrupted global crop yields, leading to food insecurity and economic loss, particularly in vulnerable agricultural systems [4]. As climate variability intensifies, the need for resilient, technology-driven approaches in agriculture becomes even more critical. In response, precision farming tools embedded with deep learning capabilities present viable solutions for real-time monitoring and adaptive interventions to mitigate climate-induced stresses, thereby enhancing yield resilience and supporting sustainable food production [5,6].

To address these challenges, the integration of advanced technologies such as automation coupled with artificial intelligence, machine learning, and deep learning has become essential for boosting agricultural productivity [7]. These innovations align with the objectives of Sustainable Development Goal 2, which promotes ending hunger and fostering sustainable agriculture through smarter resource management [8]. In particular, deep learning offers powerful capabilities for high-precision visual recognition through hierarchical feature learning, enabling the automated detection and classification of fruit in complex orchard environments.

Among various deep learning architectures, convolutional neural networks (CNNs) have proven especially effective in agricultural applications, including fruit maturity estimation, disease detection, and object segmentation [9]. CNNs use a series of convolutional and pooling layers to extract spatial patterns and identify features with minimal human intervention. In comparison to aerial drone platforms, ground-based RGB cameras provide higher resolution and lower-noise imagery. When mounted on a stable platform such as a tripod, these cameras can produce clearer images, making them more suitable for precise segmentation tasks like apple area mapping.

Despite their advantages, conventional CNNs often struggle to capture long-range dependencies and highlight the most relevant visual cues, particularly in cluttered scenes with occlusions or non-uniform lighting. To address these limitations, recent research has explored attention mechanisms, in either their channel or spatial variants, to guide the network toward extracting salient features of interest while suppressing irrelevant ones. Furthermore, architectural enhancements such as multi-scale processing, residual connections, and grouped convolutions have shown success in various domains [10,11,12], although their use in agricultural image segmentation remains limited.

This study introduces DABAMNet, a custom CNN architecture designed to enhance apple segmentation in orchard environments by integrating depthwise asymmetric bottleneck units with dual attention mechanisms. While DABAMNet builds on established components such as CBAM-based attention and bottleneck structures, its novelty lies in the strategic placement of attention modules within the network hierarchy, the use of a depthwise asymmetric structure for lightweight yet expressive encoding, and the unconventional combination of global max and min pooling operations within the attention block. These architectural refinements offer improved feature discrimination in agricultural image segmentation, particularly under real-world orchard complexities such as occlusion and lighting variation, as confirmed through empirical ablation studies.

To further enhance feature representation, DABAMNet integrates multiple pooling operations including global max pooling, average pooling, and min pooling. This combination allows the model to capture both dominant and subtle visual cues, improving robustness to occlusion, lighting variation, and background clutter. An extensive ablation study has been performed to confirm the effectiveness of this design. In short, this work presents a deep learning solution that addresses core limitations in fruit segmentation by leveraging architectural innovations and attention mechanisms. The proposed DABAMNet model is trained and tested on a publicly available apple orchard dataset and benchmarked against four state-of-the-art CNN baselines. Experimental results demonstrate significant improvements in segmentation accuracy and Intersection over Union (IoU), validating its potential for deployment in precision farming and real-time robotic harvesting systems.

This paper is structured into five sections. Section 1 provides an introduction. Section 2 reviews deep learning methods applied in the agricultural sector. Section 3 outlines the complete methodology and architecture of the proposed DABAMNet. Section 4 presents the results and discussion derived from using the proposed network. Finally, Section 5 offers a concise conclusion, summarizing the limitations of the proposed network and suggesting future directions.

2. Related Works

This section provides a structured overview of recent advances in deep learning-based agricultural image analysis. It highlights the strengths and limitations of core architectural models, explores multi-scale feature integration techniques, and discusses methods designed to mitigate the problems of overfitting. These insights support the development of effective and generalizable models for precision farming applications using remote sensing and ground-level imagery.

The integration of deep learning into agricultural image analysis has catalyzed a transformative shift in how key challenges such as classification, detection, and segmentation are addressed in precision farming. Given the critical importance of agricultural productivity to global food security and economic resilience, recent studies have increasingly explored advanced machine learning techniques to enhance the automation, scalability, and precision of crop monitoring systems. These approaches have shown significant promise in addressing issues related to environmental variability, resource constraints, and the need for real-time decision-making. Deep learning methods, particularly those based on CNNs, have demonstrated strong generalization capabilities across a wide variety of agricultural domains by learning hierarchical features directly from raw imagery.

2.1. Classification of Agricultural Products Using Deep Learning Methodology

In agricultural classification tasks, deep learning models are widely adopted to categorize fruits and crops based on ripeness, health condition, and disease severity. CNN-based classifiers, particularly those pre-trained on large-scale datasets such as ImageNet, have emerged as the preferred solution due to their ability to generalize across heterogeneous imaging conditions and object appearances.

Mimma et al. [13] conducted a comprehensive study using ResNet50 and VGG16 architectures for multi-class fruit classification involving eight fruit categories. The study reported high classification accuracy, underscoring the capability of deep CNNs to extract relevant texture, shape, and color features across different fruit morphologies. The authors also noted the effectiveness of transfer learning in mitigating data scarcity issues, which are common in agricultural domains. Similarly, the work in [14] developed a hybrid architecture combining AlexNet and a pre-trained VGG16 encoder for the classification of date fruits. Their method demonstrated improved feature extraction, especially for detecting subtle textural differences across ripeness stages.

Expanding upon these findings, Mahmood et al. [15] applied pre-trained VGG16 and AlexNet models for jujube fruit classification into unripe, ripe, and overripe categories. VGG16 achieved higher accuracy and stability during training, leading to its deployment in a fully automated harvesting and sorting system. The integration of classification results into downstream processing pipelines reflects the operational readiness of these architectures. In another application, the authors of [16] classified hazelnuts into five quality grades using similar CNN backbones. The model achieved superior performance over traditional rule-based approaches, demonstrating deep learning’s adaptability to post-harvest quality control tasks.

Beyond maturity and quality classification, deep learning has also been employed for plant disease detection and severity grading [17]. The authors utilized a fine-tuned VGG16 model for apple disease classification, achieving precise identification of various infection types, even under heterogeneous background conditions. In [18], Arshaghi et al. applied a VGG19-based architecture for potato disease severity classification, highlighting the importance of deeper networks in capturing fine-grained pathological patterns. Complementarily, the researchers in [19] employed a ResNet50 model to identify cucumber leaf diseases, with notable improvements in detection sensitivity and specificity. Their study emphasized the model’s utility in field-based deployments, where disease symptoms may vary dynamically due to environmental stressors.

2.2. Fruit Detection in Precision Agriculture

Fruit detection is a cornerstone task in automated agriculture, facilitating yield estimation, robotic harvesting, and supply chain optimization. Object detection models based on the YOLO (You Only Look Once) architecture have become particularly prominent due to their real-time detection capabilities and robustness under variable environmental conditions. YOLO models detect objects by simultaneously regressing bounding box coordinates and object class probabilities, making them suitable for high-throughput agricultural tasks.

Zhou et al. [20] applied YOLOv7 for dragon fruit detection in orchard environments characterized by dense foliage, variable lighting, and occlusion. The model delivered robust performance across multiple collection angles and demonstrated sub-second inference times, confirming its viability for real-time deployment on embedded systems. YOLOv3 has also been optimized by integrating it with MobileNet [21], resulting in a lightweight and efficient detection framework tailored for low-power devices such as UAVs and agricultural robots.

The authors in [22,23] compared YOLOv3, YOLOv4, and YOLOv5 for cherry detection, highlighting YOLOv5’s superior balance between accuracy and speed. Their studies also integrated condition classification capabilities, enabling robots not only to detect but also to assess fruit ripeness. On the other hand, Reddy and Aishwarya [24] conducted a comparative evaluation of YOLOv4 and YOLOv5 for fruit freshness classification, reporting that YOLOv5 consistently outperformed its predecessor in both precision and recall.

Using the same baseline models, the researchers in [25,26] implemented YOLOv4 across multiple fruit types including bananas, grapes, apples, mangoes, and pears, demonstrating the model’s flexibility and high accuracy in detecting diverse agricultural targets. Melnychenko et al. [27] focused on occlusion-aware apple detection using YOLOv5, while papers [28,29] leveraged the same model for red chili ripeness analysis. Uniquely, Yang and Wang [30] benchmarked four state-of-the-art models for green litchi detection and found YOLOv5-S to offer the best trade-off between detection accuracy and inference latency.

Deep learning has also been applied to palm agriculture: the work in [31] combined color-based feature extraction with a deep learning model to determine palm fruit maturity stages. Extending this work, Jie et al. [32] deployed the Xception model to delineate palm plantation boundaries, showcasing the model’s high representational power in segmenting large-scale agricultural landscapes. The aggregation of these studies confirms the YOLO framework’s dominance in agricultural object detection, driven by its modularity, high detection throughput, and adaptability to evolving field conditions.

2.3. Semantic Segmentation for Crop Monitoring and Yield Estimation

Semantic segmentation plays a pivotal role in precision agriculture by enabling fine-grained classification of vegetation, soil, and infrastructure at the pixel level. Segmentation maps produced by deep learning models are critical for applications such as plant phenotyping, weed detection, and biomass estimation. Encoder–decoder architectures, particularly those enhanced with attention mechanisms or multi-scale feature extractors, have been widely adopted to address the unique challenges posed by agricultural imagery.

The work in [33] proposed a DeepLabv3+ model with a ResNet50 encoder for segmenting fruits across variable scales and orientations. Their model achieved high mean Intersection over Union scores and demonstrated consistent boundary delineation even in visually complex scenes. A simpler model was considered in [34], which proposed an improved Fully Convolutional Network tailored for strawberry segmentation in cluttered environments. By integrating enhanced encoder–decoder pathways, their model effectively captured both fine-grained details and broader spatial context.

A more complex multiscale technique was introduced in [35] by enhancing the PSPNet architecture through Convolutional Block Attention Module (CBAM) embedding for grape bunch segmentation. This attention-guided network achieved superior performance under non-uniform lighting and background variability. They extended the application of PSPNet+CBAM to apple segmentation [10], where their model demonstrated resilience to occlusions and high structural complexity. These studies underscore the importance of attention modules in refining spatial and channel-wise feature representations.

Deb et al. tackled the challenge of overlapping leaf segmentation using LS-Net, a lightweight architecture designed for low-power edge devices [36]. Their model excelled in segmenting rosette plant structures with minimal false positives. A hybridized approach in [37] combined PSPNet with a ResNet50 backbone to accurately segment kiwi regions, achieving improvements in generalization and model interpretability. Conversely, in [38], the authors implemented an integrated segmentation framework for paddy field analysis, showing that model fusion and contextual aggregation substantially improve segmentation consistency in large-area imagery. Collectively, these segmentation approaches reveal a growing reliance on integrated and hybrid networks that combine high-resolution backbones with contextual modules and attention mechanisms. These designs are not only effective in complex agricultural landscapes but also adaptable to varying resolutions and sensor modalities. As segmentation accuracy becomes increasingly vital for downstream tasks such as yield modeling and disease prediction, deep learning continues to provide a robust foundation for scalable, automated, and interpretable agricultural analysis systems.

2.4. Architectural Innovations of DABAMNet

In contrast to existing attention-augmented segmentation architectures such as CBAM-Net [39,40] and BAM-based U-Nets [41,42], DABAMNet introduces a series of architectural innovations specifically designed for orchard segmentation tasks in complex agricultural environments.

The first core innovation is the integration of Dilated Asymmetric Bottleneck Units (DABou), which enable effective multi-scale feature extraction while maintaining computational efficiency. By employing asymmetric convolutions alongside dilated branches, DABou units expand the receptive field without degrading spatial resolution, an essential capability when dealing with high structural variability in orchard imagery.
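As a rough illustration of how dilation widens the receptive field without adding weights, the span of a single k × k kernel with dilation rate r along one axis is k + (k − 1)(r − 1), which is the standard relation for dilated convolutions. The short Python sketch below evaluates it for 3 × 3 kernels and the dilation rates {2, 4, 8, 16} listed for the DABou units in Section 3.2. The function name is illustrative and is not taken from the authors' implementation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated kernel along a single axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Rate 1 is the undilated case; the remaining rates are those listed for DABou.
for rate in (1, 2, 4, 8, 16):
    span = effective_kernel_size(3, rate)
    print(f"dilation={rate:2d} -> covers {span} x {span} pixels with 9 weights per channel")
```

At rate 16 a single 3 × 3 depthwise kernel already spans 33 × 33 pixels, while still using nine weights per channel. This is the sense in which the receptive field grows without extra parameters or loss of spatial resolution.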

The second major contribution of DABAMNet is its selective placement of CBAM dual attention modules at an intermediate network depth, after DABou unit 5, where semantic abstraction and spatial granularity intersect. This strategic positioning enhances the model’s ability to simultaneously preserve object boundaries and integrate contextual semantics. Rather than deploying attention mechanisms uniformly or at early stages, as seen in prior models, DABAMNet’s placement is empirically optimized for improved spatial coherence in segmentation maps.

Third, DABAMNet introduces a modified Convolutional Block Attention Module (CBAM) that incorporates a dual pooling strategy, leveraging both global max and global min pooling operations within the channel and spatial attention branches. This enhancement allows the network to capture both prominent and subtle features, addressing common challenges in orchard segmentation such as occlusion, lighting variability, and fine-grained textural differences.

Unlike earlier approaches that often apply standard attention uniformly or in shallow layers, DABAMNet’s adaptive and fine-grained attention strategy contributes significantly to its performance, as validated through ablation experiments. These studies confirm that the proposed attention configuration enhances segmentation accuracy while maintaining robustness and computational tractability.

Beyond agricultural applications, the design of DABAMNet is informed by foundational advances in deep neural networks. The use of convolutional operations for hierarchical feature extraction draws on the seminal work of Liu et al. and Cruttwell et al., which established CNNs as the core architecture for visual recognition tasks [43,44]. Similarly, the incorporation of attention mechanisms is inspired by the work of Ayoub et al. and later generalized through the transformer framework by Wei et al., which emphasized the importance of selective focus in neural information processing [45,46]. The concept of multi-scale feature integration is also grounded in prior architectural designs such as the Feature Pyramid Network (FPN) [47] and Atrous Spatial Pyramid Pooling (ASPP) [48], both of which influenced the dilation strategies employed in the stacked DABou units within the DABAM blocks of DABAMNet.

Together, these innovations position DABAMNet as a purpose-built, attention-enhanced segmentation framework that balances high precision with practical efficiency and demonstrates potential for real-time use in automated orchard analysis.

3. Methodology

This section introduces DABAMNet, a novel deep learning architecture tailored for semantic segmentation in complex orchard environments. DABAMNet is composed of three core modules: the Initial Module for early-stage feature extraction, the Intermediate Module for multi-scale, attention-guided representation learning, and the Final Module for semantic projection and resolution recovery. Central to the architecture are two specialized submodules, the Dilated Asymmetric Bottleneck Unit (DABou) and the Convolutional Block Attention Module (CBAM), which are strategically embedded to enhance spatial fidelity and contextual sensitivity.

The design of DABAMNet is guided by three key objectives, each addressed by a specific module or mechanism within the architecture:

Spatial Preservation—Achieved through the Initial Module (M₁), which minimizes early downsampling and preserves high-resolution spatial details essential for delineating object boundaries.
Contextual Enrichment—Realized in the Intermediate Module (M₂) via the stacked DABou units within DABAM blocks, which progressively expand the receptive field using increasing dilation rates and asymmetric convolution paths.
Attention Calibration—Enabled by the integration of CBAM within the Intermediate Module (M₂), placed after semantically rich DABou unit 5 to refine feature maps by emphasizing task-relevant information and suppressing background noise.

Together, these components form a modular encoder–decoder framework optimized for segmentation performance under the spatial and semantic challenges of orchard imagery. A comprehensive layer-wise breakdown of the architecture including kernel configurations, filter dimensions, and downsampling operations is presented in Table 1, which summarizes the internal composition of DABAMNet across all modules. The following subsections provide detailed descriptions of each module and their contributions to the overall network functionality.

3.1. Convolutional Block Attention Module (CBAM)

The Convolutional Block Attention Module (CBAM) is a lightweight, plug-and-play attention mechanism designed to improve the representational capacity of convolutional neural networks by sequentially applying channel and spatial attention. This two-stage refinement enables the network to focus on what and where to emphasize in the feature map, thereby enhancing its ability to discriminate fine-grained patterns in complex visual scenes such as apple orchards. Specifically, CBAM adaptively recalibrates feature responses by first modeling inter-channel relationships to emphasize semantically relevant channels, followed by spatial attention to localize salient regions.

3.1.1. Channel Attention Submodules

The channel attention submodule aims to identify what features are important by capturing inter-channel dependencies. Let $F \in \mathbb{R}^{C \times H \times W}$ denote the input feature map, where $C$, $H$, and $W$ are the number of channels, height, and width, respectively. To compute the channel attention map $M_{c} \in \mathbb{R}^{C \times 1 \times 1}$, CBAM applies two spatial pooling operations, global max pooling and global min pooling, to generate two distinct channel descriptors, as defined in Equation (1):

$F_{\mathrm{max}}^{\mathrm{ch}} = \mathrm{MaxPool}_{\mathrm{spatial}}(F), \quad F_{\mathrm{min}}^{\mathrm{ch}} = \mathrm{MinPool}_{\mathrm{spatial}}(F)$

(1)

These descriptors are forwarded through a shared Multi-Layer Perceptron (MLP) with a bottleneck structure and reduction ratio r = 2, consisting of two fully connected layers with ReLU activation in between. The channel attention map is defined in Equation (2):

$M_{c}(F) = \sigma\left(\mathrm{MLP}(F_{\mathrm{max}}^{\mathrm{ch}}) + \mathrm{MLP}(F_{\mathrm{min}}^{\mathrm{ch}})\right)$

(2)

where $\sigma$ denotes the sigmoid activation function. The refined feature map is obtained via channel-wise multiplication, as defined in Equation (3):

$F_{c} = M_{c}(F) \odot F$

(3)

where $\odot$ denotes element-wise multiplication broadcast across spatial dimensions.
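Equations (1) to (3) can be traced end to end with a minimal sketch in plain Python. It assumes nested lists in place of tensors, random weights, and an MLP without bias terms. The function and variable names are illustrative and do not come from the authors' code. Only the max-min pooling, the shared bottleneck MLP with reduction ratio r = 2, and the sigmoid gating follow the text.

```python
import math
import random

def channel_attention(F, W1, W2):
    """Max-min channel attention following Equations (1)-(3).

    F  : feature map as nested lists with shape [C][H][W]
    W1 : first MLP layer, shape [C // r][C] (bias omitted, an assumption)
    W2 : second MLP layer, shape [C][C // r]
    Returns (M_c, F_c): the C attention weights and the reweighted map.
    """
    # Equation (1): global max and min pooling over the spatial axes.
    f_max = [max(max(row) for row in ch) for ch in F]
    f_min = [min(min(row) for row in ch) for ch in F]

    def mlp(v):
        # Shared bottleneck MLP: C -> C/r -> C with ReLU in between.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in W1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

    # Equation (2): add both MLP outputs, then apply the sigmoid.
    M_c = [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(mlp(f_max), mlp(f_min))]

    # Equation (3): broadcast each channel weight over its H x W plane.
    F_c = [[[M_c[c] * x for x in row] for row in ch] for c, ch in enumerate(F)]
    return M_c, F_c

# Toy input: 4 channels of 3 x 3 values, reduction ratio r = 2 as in the text.
random.seed(0)
C, H, W, r = 4, 3, 3, 2
F = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
W1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
W2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
M_c, F_c = channel_attention(F, W1, W2)
print([round(m, 3) for m in M_c])  # one weight in (0, 1) per channel
```

With r = 2 the hidden layer keeps half of the channels, so the two weight matrices together hold C² parameters.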

3.1.2. Spatial Attention Submodules

After refining features across channels, the spatial attention submodule focuses on identifying where to emphasize by capturing spatial correlations. Given the channel-refined feature map $F_{c}$, CBAM again applies global max pooling and global min pooling, but this time along the channel axis, resulting in two spatial descriptors defined in Equation (4):

$F_{\mathrm{max}}^{\mathrm{sp}} = \mathrm{MaxPool}_{\mathrm{channel}}(F_{c}), \quad F_{\mathrm{min}}^{\mathrm{sp}} = \mathrm{MinPool}_{\mathrm{channel}}(F_{c})$

(4)

These are concatenated along the channel axis and processed using a convolutional layer with a 7 × 7 kernel to produce the spatial attention map defined in Equation (5):

$M_{s}(F_{c}) = \sigma\left(f^{7 \times 7}\left([F_{\mathrm{max}}^{\mathrm{sp}}; F_{\mathrm{min}}^{\mathrm{sp}}]\right)\right)$

(5)

where $f^{7 \times 7}$ denotes a convolution operation with a 7 × 7 kernel and $[\cdot; \cdot]$ indicates channel-wise concatenation. The spatial attention is then applied via element-wise multiplication, as defined in Equation (6):

$F_{\mathrm{CBAM}} = M_{s}(F_{c}) \odot F_{c}$

(6)
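The spatial branch in Equations (4) to (6) can be sketched the same way in plain Python. The sketch assumes nested lists instead of tensors, random kernel weights, no bias, and zero padding so that the attention map keeps the input's height and width. None of these details is stated in the text, and the names are illustrative.

```python
import math
import random

def spatial_attention(F_c, kernel):
    """Max-min spatial attention following Equations (4)-(6).

    F_c    : channel-refined feature map, nested lists of shape [C][H][W]
    kernel : convolution weights of shape [2][k][k] (k = 7 in the text)
    Returns (M_s, F_out): the H x W attention map and the refined features.
    """
    C, H, W = len(F_c), len(F_c[0]), len(F_c[0][0])
    k = len(kernel[0])
    pad = k // 2

    # Equation (4): pool along the channel axis at every pixel.
    pooled = [
        [[max(F_c[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)],
        [[min(F_c[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)],
    ]

    # Equation (5): k x k cross-correlation over the stacked max/min maps,
    # with zero padding (an assumption) so the output stays H x W.
    M_s = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:
                            acc += kernel[p][di][dj] * pooled[p][y][x]
            M_s[i][j] = 1.0 / (1.0 + math.exp(-acc))

    # Equation (6): apply the same spatial weight to every channel.
    F_out = [[[M_s[i][j] * F_c[c][i][j] for j in range(W)] for i in range(H)]
             for c in range(C)]
    return M_s, F_out

# Toy input: 3 channels of 5 x 6 values and a random 7 x 7 kernel.
random.seed(0)
C, H, W, k = 3, 5, 6, 7
F_c = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
kernel = [[[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)] for _ in range(2)]
M_s, F_out = spatial_attention(F_c, kernel)
print(len(M_s), len(M_s[0]))  # 5 6
```

The attention map has a single channel, so every feature channel at a given pixel is scaled by the same weight. This complements the per-channel weights produced by the channel branch.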

The sequential application of the channel attention submodule and the spatial attention submodule allows CBAM to adaptively recalibrate both the semantic importance (channels) and positional relevance (spatial locations) of features. This is particularly beneficial for apple segmentation, where occlusion, lighting variation, and background clutter often challenge conventional feature extractors and degrade feature quality. The final output $F_{\mathrm{CBAM}}$ contains features that are refined along both the spatial and channel dimensions. Moreover, CBAM’s dual pooling strategy enhances its sensitivity to diverse feature activations, making it particularly effective for capturing subtle visual cues in complex orchard environments. Figure 2 shows the overall architecture of the CBAM.

In the proposed DABAMNet architecture, the CBAM is strategically integrated after DABou unit 5 within the Intermediate Module $M_{2}$, specifically inside the second DABAM block. This placement ensures that the attention mechanism operates on feature representations that are semantically enriched and have undergone multiple levels of receptive field expansion. The sequential application of channel and spatial attention allows CBAM to adaptively recalibrate both the semantic importance (channels) and the spatial relevance (positions) of features. This is particularly advantageous in the context of apple segmentation, where occlusion, illumination variability, and background noise can obscure discriminative patterns. By refining the features after mid-to-deep level abstraction, CBAM enhances the model’s sensitivity to subtle visual cues while preserving contextual coherence, which ultimately contributes to more accurate segmentation outcomes.

3.2. Dilated Asymmetric Bottleneck Unit (DABou)

The Dilated Asymmetric Bottleneck Unit (DABou) serves as the fundamental building unit of the DABAM blocks within the Intermediate Module ($M_{2}$), forming the core of the DABAMNet architecture. Each DABou unit is designed to extract multi-scale spatial and semantic information efficiently by combining depthwise separable convolutions, dilated convolutions, and residual learning.

Inspired by Li et al. [49], the DABou unit employs dilation and a parameter-efficient design to enlarge the receptive field without significantly increasing computational complexity. The unit processes an input feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the number of channels, height, and width, respectively. This input undergoes the following sequence of operations, designed to extract both local and contextual features:

1.: Initial Convolution: The input feature map is normalized and activated using Batch Normalization and PReLU, followed by a standard 3 × 3 convolution with a fixed number of output filters (32 or 64, depending on the stage). This initializes the bottleneck structure and prepares the feature map for further filtering, producing an intermediate representation F₀, as defined in Equation (7):

$F_{0} = \sigma\left(\mathrm{BN}(W_{\mathrm{init}}^{3 \times 3} * F)\right)$

(7)

where $σ$ denotes the PReLU activation function.

2.: Dual-Branch Processing: The output F₀ is split into two branches to separately capture local and dilated contextual features:

Branch 1 (Local Context): Applies a depthwise 3 × 3 convolution with no dilation, followed by a 1 × 1 pointwise convolution. This depthwise separable pair extracts fine spatial features, as defined in Equation (8):

$F_{\mathrm{local}} = \sigma\left(\mathrm{BN}(W_{\mathrm{pw}}^{1 \times 1} * \mathrm{BN}(W_{\mathrm{dw}}^{3 \times 3} * F_{0}))\right)$

(8)

Branch 2 (Dilated Context): Applies a dilated depthwise 3 × 3 convolution with dilation rate r ∈ {2, 4, 8, 16}, selected based on the DABou unit’s depth in the Intermediate Module, followed by a 1 × 1 pointwise convolution. This branch expands the receptive field, allowing the network to learn global and contextual semantics effectively, as defined in Equation (9):

$F_{\mathrm{dilated}} = \sigma\left(\mathrm{BN}(W_{\mathrm{pw}}^{1 \times 1} * \mathrm{BN}(W_{\mathrm{dil}}^{3 \times 3, r} * F_{0}))\right)$

(9)

where

$W_{\mathrm{dw}}^{3 \times 3}$ is the depthwise kernel (no dilation);
$W_{\mathrm{dil}}^{3 \times 3, r}$ is the dilated depthwise kernel with rate $r$;
$W_{\mathrm{pw}}^{1 \times 1}$ is the shared pointwise kernel (per branch).

3. Feature Fusion: The two branches are combined through element-wise addition to form a composite feature representation. This fusion combines both high-resolution spatial details and long-range semantic dependencies into a unified representation, which is then followed by Batch Normalization and PReLU activation, as defined in Equation (10):

$F_{f u s e} = σ (B N (F_{l o c a l} + F_{d i l a t e d}))$

(10)

4. Final Projection and Residual Connection: To refine the fused features, a final 1 × 1 pointwise convolution is applied, followed by batch normalization and PReLU activation. A residual connection is then added by summing the output with the initial projection F₀, enhancing gradient flow and stabilizing the learning process, as defined in Equation (11):

$F_{D A B o u} (F; r_{i}) = F_{0} + σ (B N (W_{p w}^{1 \times 1} * F_{f u s e}))$

(11)

where $r_{i}$ denotes the dilation rate used in the DABou unit.
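
To make the effect of the dilation rates and of the depthwise design more concrete, the short sketch below computes the spatial extent covered by a dilated 3 × 3 kernel and compares the weight count of a standard 3 × 3 convolution with that of a depthwise 3 × 3 convolution followed by a 1 × 1 pointwise convolution. It is an illustration only: the function names are hypothetical, the input and output channel counts are assumed equal, and biases and Batch Normalization parameters are ignored.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def standard_conv_weights(channels: int, kernel_size: int = 3) -> int:
    """Weights of a standard k x k convolution with C input and C output channels."""
    return kernel_size * kernel_size * channels * channels


def separable_conv_weights(channels: int, kernel_size: int = 3) -> int:
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * channels  # one k x k filter per channel
    pointwise = channels * channels                   # 1 x 1 mixing across channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Dilation rates used by the DABou units, plus r = 1 (no dilation) for reference.
    for rate in (1, 2, 4, 8, 16):
        print(f"r = {rate:2d}: effective kernel extent = {effective_kernel_size(3, rate)}")
    # Filter counts used in the two DABAM blocks.
    for channels in (32, 64):
        print(channels, standard_conv_weights(channels), separable_conv_weights(channels))
```

Under these assumptions, a rate of 16 stretches the 3 × 3 kernel over a 33 × 33 window while the number of weights per kernel is unchanged, which is the sense in which dilation enlarges the receptive field without added parameters.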

3.3. The Proposed Network: DABAMNet

DABAMNet is a modular deep learning architecture specifically designed for robust semantic segmentation of apple orchard scenes. It aims to capture both fine-grained spatial details and high-level contextual semantics through a carefully structured encoder–decoder design. The architecture is composed of three primary modules: the Initial Module (M₁) for early feature extraction, the Intermediate Module (M₂) for hierarchical and attention-enhanced multiscale learning, and the Final Module (M₃) for dense semantic prediction. This design facilitates efficient feature reuse, adaptive receptive field expansion, and attention-guided refinement.

The overall motivation for DABAMNet stems from the limitations observed in conventional CNN-based segmentation models, which often struggle to balance spatial detail preservation with deep contextual understanding, particularly in cluttered and heterogeneous orchard environments. DABAMNet addresses these challenges through three key strategies: (1) early-stage downsampling coupled with spatially sensitive convolutional layers to retain boundary details, (2) stacked Depthwise Asymmetric Bottleneck (DABou) units with progressively increasing dilation rates to enhance multiscale contextual awareness without excessive parameter growth, and (3) the integration of a Convolutional Block Attention Module (CBAM) to dynamically recalibrate feature maps along both channel and spatial dimensions.

The encoder consists of the Initial and Intermediate Modules. The Initial Module (M₁) transforms the raw RGB input into compact yet informative features using a combination of downsampling and residual convolutional operations. The Intermediate Module (M₂) is composed of two DABAM blocks, each containing multiple DABou units with varying dilation rates and a CBAM positioned strategically to refine feature saliency. This hierarchical structure enables the network to learn progressively more abstract and semantically rich representations. Finally, the decoder is represented by the Final Module (M₃), which projects the encoded features into class scores using a 1 × 1 convolution and restores the spatial resolution via bilinear interpolation. This allows DABAMNet to output pixel-level segmentation maps aligned with the original image dimensions.

3.3.1. Initial Module ( $M_{1}$ )

The Initial Module $M_{1}$ of DABAMNet serves as the entry point of the network, transforming the raw RGB input image into an informative feature representation suitable for deeper semantic processing. It performs early-stage downsampling while preserving low-level spatial patterns critical for boundary-aware segmentation.

Let the input image be denoted as $I \in R^{3 \times H \times W}$, where $H$ and $W$ represent the height and width of the input image, respectively. The processing in $M_{1}$ proceeds as follows:

Initial Downsampling (Stride-2 Convolution): This operation, with stride $s = 2$, reduces the spatial resolution from $H \times W$ to $\frac{H}{2} \times \frac{W}{2}$, producing 32 feature maps, as defined in Equation (12):

$F_{1} = σ (B N (W_{1}^{3 \times 3, s = 2} * I)), W_{1}^{3 \times 3} \in R^{3 \times 3 \times 3 \times 32}$

(12)

where $*$ denotes the convolution operator, $B N$ is batch normalization, and $σ$ is the PReLU activation function.
Feature Refinement (Three Stacked Convolutions): Each convolution maintains the spatial dimensions while increasing the network’s capacity to encode rich low-level texture features, as defined in Equation (13):

$\begin{matrix} F_{2} & = σ (B N (W_{2}^{3 \times 3} * F_{1})) \\ F_{3} & = σ (B N (W_{3}^{3 \times 3} * F_{2})) \\ F_{4} & = σ (B N (W_{4}^{3 \times 3} * F_{3})) \end{matrix}$

(13)
Second Downsampling (Another Stride-2 Convolution): This further reduces the spatial size by half, yielding a compact but informative representation, as defined in Equation (14):

$F_{o u t} = σ (B N (W_{5}^{3 \times 3, s = 2} * F_{4})), F_{o u t} \in R^{32 \times \frac{H}{4} \times \frac{W}{4}}$

(14)

Thus, the overall transformation performed by the Initial Module is expressed in Equation (15):

$M_{1} (I) = F_{o u t}$

(15)

This encoded representation $M_{1} (I)$ serves as the input to the subsequent Intermediate Module for deeper context-aware feature learning.
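
The shape bookkeeping of Equations (12)–(14) can be traced with a few lines of Python. This is a minimal sketch rather than the authors' implementation: it assumes "same" padding, so that a convolution with stride $s$ maps a side length $n$ to $\lceil n / s \rceil$, and it assumes that all five convolutions output 32 feature maps. The helper names are hypothetical.

```python
import math


def same_conv_output(size: int, stride: int) -> int:
    """Output side length of a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)


def initial_module_shapes(height: int, width: int, filters: int = 32):
    """Return the (channels, height, width) shape after each step of M1."""
    shapes = [("input", (3, height, width))]
    # (name, stride) of the five 3 x 3 convolutions in Equations (12)-(14).
    layers = [("F1", 2), ("F2", 1), ("F3", 1), ("F4", 1), ("F_out", 2)]
    for name, stride in layers:
        height = same_conv_output(height, stride)
        width = same_conv_output(width, stride)
        shapes.append((name, (filters, height, width)))
    return shapes


if __name__ == "__main__":
    # 256 x 512 is the resized input resolution reported for DABAMNet.
    for name, shape in initial_module_shapes(256, 512):
        print(f"{name:>5}: {shape}")
```

For a 256 × 512 input, the trace ends at 32 × 64 × 128, i.e., one quarter of the input resolution along each spatial axis.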

3.3.2. Intermediate Module ( $M_{2}$ )

The Intermediate Module $M_{2}$ of DABAMNet receives its input feature map from the Initial Module $M_{1}$, and is composed of two sequential blocks, ${D A B A M}_{1}$ and ${D A B A M}_{2}$. Each block consists of multiple stacked DABou units that facilitate hierarchical feature enrichment. These blocks are specifically designed to progressively expand the receptive field through increasing dilation rates, while preserving spatial resolution and maintaining parameter efficiency.

The ${D A B A M}_{1}$ block comprises three DABou units, each with a dilation rate of $(2, 2)$ and 32 filters per unit. Its output is concatenated with the block’s input and an earlier shallow skip feature $S_{1}$. A dual-path downsampling follows, using strided convolution and max pooling to reduce spatial resolution while retaining detail.
The ${D A B A M}_{2}$ block includes six DABou units with 64 filters: DABou units 4–5 use dilation $(4, 4)$, DABou units 6–7 use dilation $(8, 8)$, and DABou units 8–9 use dilation $(16, 16)$. After the fifth unit, a CBAM is inserted to enhance attention over high-level features. The output is concatenated with the block’s input and a deeper skip connection $S_{2}$, followed by normalization and activation.

To reflect the multiple uses of DABou units in each DABAM block, the operation is formally defined in Equation (16):

$F_{D A B o u}^{(i)} = F_{D A B o u} (F; r_{i})$

(16)

where

$F$ is the input tensor;
$r_{i}$ is the dilation rate of the $i$-th DABou unit.

The complete DABou operations within the ${D A B A M}_{1}$ and ${D A B A M}_{2}$ blocks are defined in Equation (17):

$\begin{matrix} {D A B A M}_{1} (F) & = D o w n ([F_{D A B o u}^{(3)} ° F_{D A B o u}^{(2)} ° F_{D A B o u}^{(1)} (F), F, S_{1}]) \\ {D A B A M}_{2} (F) & = [F_{D A B o u}^{(9)} ° \dots ° F_{D A B o u}^{(6)} ° F_{C B A M} ° F_{D A B o u}^{(5)} ° F_{D A B o u}^{(4)} (F), F, S_{2}] \end{matrix}$

(17)

where

$°$ denotes sequential composition;
$[\cdot, \cdot, \cdot]$ denotes concatenation;
$D o w n$ represents the dual-path downsampling;
$F_{C B A M}$ is the CBAM operation, applied after $F_{D A B o u}^{(5)}$;
$S_{1}$ and $S_{2}$ are resized skip connections from shallow layers.

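Finally, the output of $M_{2}$ is formulated in Equation (18), in which ${D A B A M}_{1}$ is applied first to the output $F_{o u t}$ of the Initial Module and ${D A B A M}_{2}$ is applied to the result:

$M_{2} = σ (B N ({D A B A M}_{2} ({D A B A M}_{1} (F_{o u t}))))$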

(18)

This layered architecture enables multi-scale feature learning while retaining strong spatial-semantic representations critical for segmentation in complex orchard imagery.
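
As a compact summary of the ordering described above, the sketch below lists the operations of the Intermediate Module in execution order, with the filter count and dilation rate of each DABou unit and the CBAM insertion point exposed as a parameter (units 5, 7, and 9 are the placements examined in the ablation study). The function and the tuple layout are hypothetical and describe only the layer plan, not the layers themselves.

```python
def intermediate_module_plan(cbam_after_unit: int = 5):
    """List the operations of M2 in execution order as (name, filters, dilation)."""
    if cbam_after_unit not in (5, 7, 9):
        raise ValueError("CBAM placement is assumed to follow DABou unit 5, 7, or 9")
    # (unit index, filters, dilation rate) for DABAM_1 (units 1-3) and DABAM_2 (units 4-9).
    units = [(1, 32, 2), (2, 32, 2), (3, 32, 2),
             (4, 64, 4), (5, 64, 4), (6, 64, 8),
             (7, 64, 8), (8, 64, 16), (9, 64, 16)]
    plan = []
    for index, filters, dilation in units:
        plan.append((f"DABou{index}", filters, dilation))
        if index == 3:
            # End of DABAM_1: concatenate with the block input and S1, then downsample.
            plan.append(("Concat[F, S1] + Down", None, None))
        if index == cbam_after_unit:
            plan.append(("CBAM", filters, None))
    # End of DABAM_2: concatenate with the block input and S2, then BN + PReLU.
    plan.append(("Concat[F, S2] + BN + PReLU", None, None))
    return plan


if __name__ == "__main__":
    for name, filters, dilation in intermediate_module_plan():
        print(f"{name:<28} filters={filters} dilation={dilation}")
```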

3.3.3. Final Module ( $M_{3}$ )

The Final Module $M_{3}$ of DABAMNet serves as the classifier head that transforms the encoded features into a semantic segmentation map. It takes the output from the Intermediate Module $M_{2}$, which contains both spatial and semantic information, and performs class-wise projection and resolution restoration to produce dense predictions at the original image size.

Let the input to this module be $F \in R^{C \times \frac{H}{4} \times \frac{W}{4}}$, where $C$ denotes the number of feature channels and $\frac{H}{4} \times \frac{W}{4}$ is the downsampled spatial resolution. The operations in $M_{3}$ are formulated in Equation (19):

$\begin{matrix} F_{l o g i t s} & = W^{1 \times 1} * F \in R^{N \times \frac{H}{4} \times \frac{W}{4}} \\ F_{u p s a m p l e d} & = R e s i z e (F_{l o g i t s}, (H, W)) \in R^{N \times H \times W} \\ M_{3} & = S o f t m a x (F_{u p s a m p l e d}) \end{matrix}$

(19)

where

$W^{1 \times 1}$ denotes the weights of a $1 \times 1$ convolutional layer that projects the features from $C$ to $N$ output classes;
$R e s i z e$ performs bilinear interpolation to restore the original resolution;
$S o f t m a x$ generates the class probabilities for each pixel;
$P = M_{3} \in R^{N \times H \times W}$ represents the final semantic segmentation prediction;
$r_{i}$ is the dilation rate of the $i$-th DABou unit.
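
The last step of Equation (19), the per-pixel Softmax followed by selection of the most probable class, can be illustrated with plain Python. This is a sketch under stated assumptions: the logits are stored as a nested list indexed as height, width, class (chosen for readability, whereas the equations use a class-first layout), the bilinear resize is omitted, and the function names are hypothetical.

```python
import math


def pixel_softmax(logits):
    """Softmax over the class axis for a single pixel (numerically stabilised)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_mask(logit_map):
    """Turn an H x W grid of per-class logits into an H x W grid of class indices."""
    mask = []
    for row in logit_map:
        mask_row = []
        for logits in row:
            probabilities = pixel_softmax(logits)
            mask_row.append(probabilities.index(max(probabilities)))
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    # One row of two pixels with N = 2 classes (0 = background, 1 = apple).
    toy_logits = [[[2.0, -1.0], [0.1, 3.0]]]
    print(predict_mask(toy_logits))
```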

By integrating the three core modules, Initial Module $M_{1}$, Intermediate Module $M_{2}$, and Final Module $M_{3}$, the complete forward operation of DABAMNet is expressed as a nested function composition in Equation (20):

$F_{D A B A M N e t} = M_{3} (M_{2} (M_{1} (I)))$

(20)

where

$I \in R^{3 \times H \times W}$ is the input RGB image;
$M_{1}$ is the Initial Module, responsible for early extraction of low-level spatial and contextual features;
$M_{2}$ is the Intermediate Module composed of stacked DABAM blocks with multiscale attention refinement enhancing semantic richness and receptive field;
$M_{3}$ is the Final Module projecting the learned features into class probabilities through convolution and upsampling.

This formulation emphasizes the modular and progressive design of DABAMNet, wherein hierarchical representations are successively refined from low-level textures to high-level semantic understanding, as illustrated in Figure 3, which depicts the overall network architecture of the proposed model.

4. Results and Discussion

This section presents a comprehensive analysis of the experimental results obtained using the proposed DABAMNet model. It begins by detailing the dataset used for training and evaluation, followed by an overview of the evaluation metrics adopted to quantify segmentation performance. The experimental setup is then described to ensure reproducibility and clarity of the training pipeline. An ablation study is conducted to investigate the contribution of key architectural components, including the placement of attention modules and the impact of different dilation strategies. Finally, the performance of DABAMNet is compared to several established segmentation models across key metrics, highlighting its effectiveness in segmenting apples under complex orchard environments. The findings in this section provide insights into the design choices and practical implications of DABAMNet for real-world agricultural applications.

4.1. Dataset

To evaluate the performance of the proposed DABAMNet architecture, this study utilizes an apple segmentation dataset introduced by [50], developed by the University of Minnesota Research Center. The original dataset was acquired using a standard consumer-grade device, a Samsung Galaxy S4 mobile phone camera, highlighting the practicality of applying computer vision models to low-cost, field-deployable hardware. Data collection was performed by recording videos while walking along apple tree rows at approximately 1 m per second. The camera was held horizontally to face the tree canopy laterally, minimizing motion blur and capturing varying fruit orientations.

From the recorded video streams, image frames were sampled for annotation. While the original study extracted samples every fifth frame, this work adopts a sparser sampling strategy by selecting every 30th frame, effectively reducing redundancy and emphasizing visual diversity. A total of ten video sequences were collected from six distinct tree rows, from which 670 images were randomly selected and manually annotated for use in this study. These images were partitioned using a five-fold cross-validation strategy, where each fold maintains a 4:1 ratio (536 training and 134 testing images) to facilitate supervised learning and model evaluation. The dataset captures a broad range of visual variability. Apples appear in multiple color variations, including green, red, orange, and hybrid tones, and are situated at varying distances from the camera. Illumination conditions vary significantly due to the recordings being conducted at various times of the day, resulting in a variety of lighting angles and shadow patterns. This diversity presents both a realistic and challenging setting for image segmentation tasks.

Each selected image was annotated by trained human labelers using high-resolution polygon masks to delineate individual apple instances with precision. The annotation process, which required approximately 30 min per image, followed a rigorous quality assurance workflow. All masks were subject to a secondary verification process to ensure accuracy and consistency across the dataset. The final annotations were saved in Portable Network Graphics (PNG) format with a resolution of 720 × 1280 pixels, providing a detailed ground truth for evaluating model performance.

Figure 4 illustrates sample images from the dataset, capturing both the front-facing and receding views of the tree rows. This realistic acquisition setting, combined with careful annotation, makes the dataset a strong benchmark for assessing the generalization capability of segmentation models in unconstrained agricultural environments.

4.2. Evaluation Metrics

To assess the performance of the proposed DABAMNet model and its baseline counterparts, two standard evaluation metrics are employed: Accuracy (ACR) and Intersection over Union (IoU). These metrics provide complementary insights into the predictive reliability and spatial localization quality of the segmentation models.

Accuracy (ACR) measures the overall correctness of classification by evaluating both true positive ($T_{+ v e}$) and true negative ($T_{- v e}$) predictions relative to all prediction outcomes. It reflects the proportion of correctly classified pixels (both foreground and background) over the total number of pixels, as expressed in Equation (21):

$A C R = \frac{T_{+ v e} + T_{- v e}}{T_{+ v e} + T_{- v e} + F_{+ v e} + F_{- v e}}$

(21)

Here, $F_{+ v e}$ and $F_{- v e}$ represent false positive and false negative predictions, respectively. A high accuracy score indicates that the model effectively distinguishes between the apple (foreground) and non-apple (background) regions.

Intersection over Union (IoU), also known as the Jaccard Index, is used to quantify the overlap between the predicted segmentation mask and the ground truth. It is calculated as the ratio of the intersection area to the union area of the predicted and actual masks, as shown in Equation (22):

$I o U = \frac{T_{+ v e}}{T_{+ v e} + F_{+ v e} + F_{- v e}}$

(22)

Unlike accuracy, IoU focuses solely on foreground prediction quality and does not consider true negatives. It is particularly well-suited for evaluating segmentation tasks where precise delineation of target objects (e.g., apples) is critical.

Both metrics rely on the standard confusion matrix components: true positives ($T_{+ v e}$), true negatives ($T_{- v e}$), false positives ($F_{+ v e}$), and false negatives ($F_{- v e}$). Misclassifications occur when the predicted label does not match the ground truth, for example, when a pixel predicted as positive is actually negative (false positive), or when a pixel that is actually positive is predicted as negative (false negative). Conversely, correct matches between predictions and ground truth labels yield true positives and true negatives.
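
Equations (21) and (22) can be checked on a toy example. The sketch below counts the four confusion matrix components for a pair of binary masks (1 = apple, 0 = background) and evaluates both metrics. The function names and the toy masks are illustrative, and returning an IoU of 1.0 when neither mask contains foreground is a convention adopted here, not one stated in the study.

```python
def confusion_counts(predicted, truth):
    """Count (TP, TN, FP, FN) for two binary masks given as nested lists."""
    tp = tn = fp = fn = 0
    for predicted_row, truth_row in zip(predicted, truth):
        for p, t in zip(predicted_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 0 and t == 0:
                tn += 1
            elif p == 1 and t == 0:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Equation (21): correctly classified pixels over all pixels."""
    return (tp + tn) / (tp + tn + fp + fn)


def iou(tp, fp, fn):
    """Equation (22): intersection over union of the foreground class."""
    denominator = tp + fp + fn
    return tp / denominator if denominator else 1.0  # empty-mask convention (assumed)


if __name__ == "__main__":
    truth = [[0, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]
    predicted = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
    tp, tn, fp, fn = confusion_counts(predicted, truth)
    print(accuracy(tp, tn, fp, fn), iou(tp, fp, fn))
```

In this toy case, 16 of the 20 pixels are true negatives, so the accuracy is 0.9 while the IoU is only 0.5. The same effect explains why accuracy values on background-dominated orchard images sit much closer to 1 than the corresponding IoU values.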

Additionally, the number of model parameters, encompassing both trainable and non-trainable weights, is reported as an auxiliary metric. This provides insight into the computational cost and efficiency of each CNN model under evaluation. Parameter counts are particularly relevant in real-time or resource-constrained deployment scenarios, such as embedded agricultural robotics. Together, these metrics offer a comprehensive assessment of segmentation accuracy, spatial alignment, and model complexity.

4.3. Experimental Setup

To evaluate the effectiveness of the proposed DABAMNet model, four state-of-the-art CNN architectures—PSPNet, SegNet, FC-DenseNet, and DabNet—were selected as benchmarks, based on their relevance to recent agricultural computer vision studies. All five models, including DABAMNet, were implemented in Python 3.9.10 using the Keras-TensorFlow 2.10.1 library on the Kaggle platform. Their hyperparameters were optimized individually to achieve the best possible classification and segmentation performance. Each model was trained and tested using the complete set of 670 annotated apple images, consistently sampled across all five folds to ensure balanced and comprehensive evaluation. The primary criterion for optimizing the hyperparameters was the convergence of the loss function during training and testing. A convergence threshold of 0.1 was established for the loss value, which all models were required to meet. A grid search strategy was used to explore the hyperparameter space, with the optimal settings summarized in Table 2. Each model achieved successful convergence within the training and validation datasets using these parameters.

DABAMNet training was guided by the categorical cross-entropy loss function with accuracy as the evaluation metric. The Nadam optimizer, configured with a fixed learning rate, was applied consistently across all models. No data augmentation techniques were applied, apart from uniform image resizing to match the input dimensions required by each model. In the case of DABAMNet, the original images (720 × 1280 pixels) were resized to 256 × 512 pixels. Batch sizes were determined based on model size and hardware constraints. All experiments were conducted on a single NVIDIA RTX 2080 Ti GPU. For local testing, DABAMNet was also validated on an Intel i5-8250U processor with a 1.80 GHz clock rate, where a batch size of 9 images was feasible. The total number of parameters in DABAMNet was 20,603,811, of which 20,592,871 were trainable and 10,940 were non-trainable.

To mitigate sampling bias and reduce overfitting, a five-fold cross-validation strategy was employed. In each fold, the dataset was divided into training and testing subsets using a consistent 4:1 ratio. This yielded 536 training images and 134 testing images per fold. This validation approach ensures balanced evaluation and reliable performance metrics across diverse image samples.
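
The fold sizes quoted above follow directly from partitioning 670 images into five disjoint test sets. A minimal sketch of such a partition is given below; the shuffling seed and the function name are assumptions for illustration and do not reproduce the actual split used in the experiments.

```python
import random


def five_fold_splits(num_images: int = 670, num_folds: int = 5, seed: int = 0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    fold_size = num_images // num_folds
    splits = []
    for fold in range(num_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder when num_images is not divisible.
        stop = start + fold_size if fold < num_folds - 1 else num_images
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for fold, (train, test) in enumerate(five_fold_splits(), start=1):
        print(f"fold {fold}: {len(train)} training images, {len(test)} testing images")
```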

4.4. Ablation Study

To identify the optimal configuration of the proposed DABAMNet model, an ablation study was conducted focusing on three key design factors: (1) the placement of the Convolutional Block Attention Module (CBAM) within the DABAMNet architecture, (2) the channel reduction ratio $(r)$ used in CBAM’s channel attention submodule, and (3) the global pooling strategy employed within CBAM. The objective was to evaluate how each of these factors affects segmentation accuracy and computational efficiency when localizing apples in complex orchard scenes. CBAM was selected for its ability to enhance CNN-based feature extraction by sequentially applying the channel attention submodule and the spatial attention submodule, allowing the network to prioritize both relevant feature maps and spatial regions. Within the channel attention submodule, the reduction ratio $(r)$ is applied in the internal Multi-Layer Perceptron (MLP) to compress and then restore the channel dimensions. This operation regulates model complexity, promotes generalization, and controls redundancy. It provides a balance between expressive power and computational cost. Such capabilities are especially advantageous for addressing challenges like occlusion, variable lighting, and background clutter in fruit segmentation tasks.

To further explore the optimal placement of the Convolutional Block Attention Module (CBAM) within the DABAMNet architecture, CBAM was inserted into three distinct locations. Each of these locations represents a distinct level of feature abstraction:

After DABou unit 5: Captures low-level to mid-level patterns beyond basic edges and textures, supporting small object detection.
After DABou unit 7: Extracts more complex structural information and helps interpret spatial context across the image.
After DABou unit 9: Encodes highly abstract representations essential for differentiating object classes and conditions.

This experimental design enabled the identification of the most suitable feature abstraction level for integrating attention mechanisms.

The reduction ratio $(r)$ in the channel attention submodule was varied to evaluate how dimensionality compression influences the network’s ability to prioritize prominent features. Ratios of 2, 4, and 8 were evaluated:

r = 2: Retains detailed channel information but increases computational cost.
r = 4: Balances performance and efficiency by moderately reducing channel dimensions.
r = 8: Enhances computational efficiency but may result in the loss of finer feature details.

The objective was to identify a reduction setting that maintains high segmentation performance while avoiding unnecessary computational overhead.
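
The cost side of this trade-off can be quantified for the channel attention MLP. The sketch below assumes the two-layer, bias-free shared MLP of the original CBAM formulation (C to C/r, then C/r back to C) and a channel count of 64, matching the filter count of the block in which CBAM is inserted. The function name is hypothetical.

```python
def channel_mlp_weights(channels: int, reduction_ratio: int) -> int:
    """Weights of a bias-free two-layer MLP: C -> C/r -> C."""
    hidden = channels // reduction_ratio  # width of the compressed descriptor
    return channels * hidden + hidden * channels


if __name__ == "__main__":
    for ratio in (2, 4, 8):
        hidden = 64 // ratio
        print(f"r = {ratio}: hidden width {hidden}, weights {channel_mlp_weights(64, ratio)}")
```

Under these assumptions, halving the reduction ratio doubles the MLP weight count, but even at $r = 2$ the MLP holds only a few thousand weights, which is negligible relative to the roughly 20.6 million parameters of the full network.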

Finally, various global pooling strategies were evaluated within the attention module to determine their effect on model performance. Global Max Pooling, Global Average Pooling, and Global Min Pooling were applied individually and in combinations to assess their influence on attention precision and segmentation accuracy:

Global Max Pooling: Highlights the most dominant activations, directing attention toward the most salient and discriminative features.
Global Average Pooling: Captures overall contextual patterns across the feature map, promoting stability and generalization.
Global Min Pooling: Identifies minimally activated regions, which can suppress background noise and filter out irrelevant or redundant information.

This exploration aimed to identify the pooling configuration that most effectively supports attention learning in complex orchard environments.
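
To illustrate how two pooled descriptors can drive channel attention, the following sketch computes channel weights from global max pooling and global min pooling. It assumes the fusion scheme of the original CBAM formulation, in which each pooled descriptor passes through a shared two-layer MLP and the two outputs are summed before a sigmoid, with min pooling taking the place of average pooling. The source does not specify the exact fusion used in DABAMNet, so this is one plausible reading. All names and the toy weights are hypothetical.

```python
import math


def global_pool(feature_map, reducer):
    """Apply `reducer` (e.g., max or min) over all spatial positions of each channel."""
    return [reducer(value for row in channel for value in row) for channel in feature_map]


def shared_mlp(descriptor, w_down, w_up):
    """Bias-free two-layer MLP: C -> C/r with ReLU, then C/r -> C."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, descriptor))) for row in w_down]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_up]


def channel_attention_max_min(feature_map, w_down, w_up):
    """Per-channel weights in (0, 1) from summed max-pooled and min-pooled branches."""
    from_max = shared_mlp(global_pool(feature_map, max), w_down, w_up)
    from_min = shared_mlp(global_pool(feature_map, min), w_down, w_up)
    return [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(from_max, from_min)]


if __name__ == "__main__":
    # Two channels of size 2 x 2, reduction ratio r = 2 (hidden width 1).
    features = [[[1.0, 3.0], [0.0, 2.0]], [[-1.0, 0.5], [0.0, 4.0]]]
    w_down = [[0.5, 0.5]]    # shape (C/r, C)
    w_up = [[1.0], [-1.0]]   # shape (C, C/r)
    print(channel_attention_max_min(features, w_down, w_up))
```

The resulting weights would then rescale each channel of the feature map before the spatial attention submodule is applied.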

This ablation study, as shown in Table 3, demonstrated that the highest Intersection over Union (IoU) score of 0.7291 was achieved when the Convolutional Block Attention Module (CBAM) was integrated after DABou unit 5, using a channel reduction ratio of 2 in CBAM’s channel attention submodule and a dual pooling strategy combining global max pooling and global min pooling in both the channel attention submodule and the spatial attention submodule of CBAM. This configuration underscores the importance of applying attention at an intermediate stage of the network, where low-level textures and structures remain informative and higher-level semantic features begin to form. At this depth, the network can effectively benefit from attention mechanisms that refine feature representations by selectively amplifying relevant cues and suppressing noise. The low reduction ratio preserves detailed channel relationships within the attention submodule, while the combined pooling operations ensure that both dominant and subtle activations are captured across spatial and channel dimensions. Collectively, these design choices contribute to improved segmentation performance in complex orchard environments.

Placing CBAM after DABou unit 5 yielded superior performance compared to placements after DABou unit 7 and after DABou unit 9 due to its strategic position at an intermediate depth, where both low-level textures and emerging high-level semantics coexist. At this stage, the network retains rich spatial details essential for segmenting small, irregularly shaped objects like apples, while beginning to integrate broader contextual information. In contrast, applying CBAM after deeper layers (after DABou unit 7 and after DABou unit 9) limits its effectiveness, as feature maps at these stages are more abstract and spatially compressed. This abstraction, while useful for classification, often results in a loss of fine-grained spatial cues critical for precise localization. Consequently, CBAM has less spatial detail to refine, reducing its impact. Integrating attention earlier allows the model to enhance feature representation when both spatial precision and semantic cues are still accessible, resulting in more accurate segmentation.

Using a channel reduction ratio of 2 in CBAM’s channel attention submodule proved more effective than higher ratios of 4 and 8 due to its ability to preserve richer feature information during the dimensionality reduction process. A lower reduction ratio retains a larger proportion of the original channel descriptors when passing through the Multi-Layer Perceptron (MLP), thereby maintaining more nuanced inter-channel dependencies. This leads to finer attention weighting and more precise emphasis on informative feature channels. In contrast, higher reduction ratios compress the feature space more aggressively, which, although computationally efficient, risks discarding subtle but important discriminative signals. Such losses can be detrimental in segmentation tasks involving small or visually complex objects like apples, where detailed preservation is essential. The ratio of 2 offers a balanced trade-off, ensuring sufficient feature expressiveness while maintaining manageable computational complexity.

The dual pooling strategy combining global max pooling and global min pooling in both the channel and spatial attention submodules of CBAM demonstrated superior performance compared to other combinations involving global average pooling. This configuration effectively captures both the most salient and the least activated features within the feature maps. Global max pooling emphasizes the most prominent activations, which are crucial for identifying strong object signals, while global min pooling highlights underrepresented or suppressed features that may correspond to subtle object boundaries or background noise. Together, they provide a complementary view that enhances feature discrimination and robustness. In contrast, strategies that include global average pooling tend to smooth out activation differences, potentially diluting the contrast between critical and non-informative regions. Even when average pooling is combined with max or min pooling, its averaging nature can obscure sharp feature responses or fail to emphasize weak but relevant signals. Therefore, the combination of max and min pooling provides a more dynamic and discriminative attention mechanism, leading to improved segmentation precision, as confirmed in Table 3.

In summary, the ablation study systematically evaluated key design choices within the CBAM-integrated DABAMNet architecture, focusing on attention module placement, channel reduction ratio, and pooling strategies. The results, as shown in Table 3, demonstrate that inserting CBAM after DABou unit 5, using a channel reduction ratio of 2, and employing a dual pooling strategy combining global max pooling and global min pooling in both the channel attention submodule and the spatial attention submodule yields the highest segmentation accuracy. This configuration achieves an effective balance between feature expressiveness and computational efficiency, allowing the network to enhance relevant spatial and semantic cues while suppressing background noise. These findings not only validate the proposed architectural design but also offer theoretical insights into the optimal integration of attention mechanisms for robust and precise apple segmentation under real-world conditions.

4.5. Discussion on DABAMNet and Its Benchmarked Models’ Performance

The performance of the proposed DABAMNet model was evaluated against four widely recognized semantic segmentation architectures: PSPNet, SegNet, FC-DenseNet, and DabNet. To further contextualize DABAMNet’s effectiveness, three additional attention-based models from recent literature were also included in the comparison:

DANet + U-Net, based on Fu et al. (2019) [51], implemented with a U-Net backbone replacing the original ResNet encoder, while retaining the dual attention block that combines positional and channel attention at the bottleneck [52].
U-Net + CBAM, adapted from Li et al. (2025) [53], incorporating the CBAM—without multiscale processing—before each max pooling operation in the encoder to enhance spatial and channel-level feature extraction.
DeepLabV3+ + Spatial Attention, inspired by Liu and He (2021) [54], using a lighter Xception41 encoder in place of Xception65 and integrating spatial attention within the ASPP module as described in the original study.

These models were selected based on their diverse architectural paradigms and proven effectiveness in agricultural image analysis. To ensure a fair comparison, all models were trained and tested on the same apple orchard dataset using standardized preprocessing procedures, identical hyperparameter settings, and consistent evaluation metrics. The comparative analysis focused on two primary indicators of segmentation quality: Overall Accuracy and Intersection over Union (IoU). A summary of the performance outcomes, along with network complexities, is presented in Table 4.

Compared with the baseline models without explicit attention mechanisms, DABAMNet achieved the highest performance, with an Accuracy of 0.9813 and an Intersection over Union of 0.7291. This reflects the model’s strength in both accurate class prediction and fine-grained boundary delineation. SegNet recorded a comparable Accuracy of 0.9809, but its lower Intersection over Union of 0.6896 reveals limitations in boundary localization. This shortcoming may be attributed to its reliance on pooled indices during upsampling, which compromises spatial detail recovery. PSPNet achieved an Accuracy of 0.9680 and an Intersection over Union of 0.6903, leveraging pyramid pooling for contextual understanding. However, it exhibits limited precision in reconstructing fine spatial structures. FC-DenseNet, despite being the most lightweight model with 14.7 million parameters, recorded the lowest Intersection over Union in this group at 0.5674, with an Accuracy of 0.9740. This decline is likely due to over-compression of spatial information in its densely connected architecture. DabNet, designed for computational efficiency, performed better with an Accuracy of 0.9806 and an Intersection over Union of 0.7156, but still trailed DABAMNet, particularly in its capacity to preserve detail in complex regions.

Within the group of attention-enhanced models, DABAMNet remained the top performer. The DANet plus U-Net configuration, which introduces dual attention modules at the bottleneck of a U-Net backbone, attained an Accuracy of 0.9785 and an Intersection over Union of 0.5897. While it enhances contextual encoding, the relatively shallow backbone limits its expressive power. The U-Net plus CBAM variant, which inserts the CBAM before each max pooling operation, achieved an Accuracy of 0.9756 and an Intersection over Union of 0.4956. This suggests that applying attention at early encoding stages, without multiscale integration, yields only modest benefit. DeepLabV3+ plus Spatial Attention, adapted with a reduced Xception41 encoder, recorded the lowest scores in this group, with an Accuracy of 0.9752 and an Intersection over Union of 0.4876, despite its high parameter count of 128 million. The absence of channel attention and substantial spatial downsampling may explain this performance drop. In contrast, DABAMNet, with only 20.6 million parameters, achieved the highest performance metrics among all evaluated models. Its multi-scale design and strategic placement of attention at an intermediate depth (after DABou unit 5) contributed to this outcome, demonstrating a balance between efficiency and effectiveness in segmenting complex orchard scenes.

Figure 5 illustrates sample segmentation masks generated by DABAMNet, showcasing its ability to delineate apples under varied environmental conditions. The model effectively distinguishes apples from the background, even in challenging scenarios involving occlusion, overlapping fruit, or variable lighting conditions. These findings underscore the effectiveness of DABAMNet’s architectural innovations, particularly the integration of attention modules and depthwise asymmetric bottlenecks for enriched multi-scale feature encoding. Overall, the results validate DABAMNet as a promising architecture for high-precision segmentation tasks in complex orchard environments.

DABAMNet’s performance advantage can be attributed to its integration of dual attention mechanisms within the Convolutional Block Attention Module (CBAM): the channel attention submodule and the spatial attention submodule. These components collaboratively guide the network’s focus toward the visual features most relevant to the segmentation task, which is particularly beneficial for binary tasks involving object and non-object categories, such as apple versus background. The channel attention submodule selectively emphasizes meaningful feature maps, while the spatial attention submodule highlights critical spatial regions within those maps. Together, they enable the model to more accurately localize apple regions and suppress irrelevant background patterns, thereby improving segmentation precision.

In addition, DABAMNet adopts a dual pooling strategy within CBAM, combining global max pooling and global min pooling in both the channel attention submodule and the spatial attention submodule. Global max pooling extracts dominant discriminative features by selecting the most activated values, whereas global min pooling captures more subtle, often overlooked information. This hybrid approach increases the model’s sensitivity to diverse contextual cues, enhancing generalization in complex orchard scenes where apples may be partially occluded, affected by variable lighting, or visually similar to the background. The global pooling combination strengthens attention effectiveness and improves segmentation performance under challenging real-world conditions.
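The max-min pooling idea in the channel attention submodule can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: it assumes the max-pooled and min-pooled descriptors pass through one shared two-layer MLP and are summed before the sigmoid, as in the original CBAM design. The weights `w1` and `w2`, the toy tensor sizes, and all function names are hypothetical.

```python
import math
import random


def global_pool(feature_map, reducer):
    """Reduce each channel (a list of rows) to a single scalar."""
    return [reducer(v for row in channel for v in row) for channel in feature_map]


def dense(weights, vector):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def shared_mlp(vector, w1, w2):
    """Two-layer bottleneck MLP with ReLU; the hidden width is C // r."""
    hidden = [max(0.0, h) for h in dense(w1, vector)]
    return dense(w2, hidden)


def channel_attention(feature_map, w1, w2):
    """Per-channel weights in (0, 1) from max- and min-pooled descriptors."""
    max_desc = global_pool(feature_map, max)
    min_desc = global_pool(feature_map, min)
    # Assumption: both descriptors share one MLP and are summed (CBAM-style).
    logits = [a + b for a, b in zip(shared_mlp(max_desc, w1, w2),
                                    shared_mlp(min_desc, w1, w2))]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


random.seed(0)
C, H, W, r = 4, 3, 3, 2  # toy sizes; r = 2 matches the reported reduction ratio
feature_map = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)]
               for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]  # C -> C/r
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]  # C/r -> C

weights = channel_attention(feature_map, w1, w2)
# Rescale every channel by its attention weight.
refined = [[[v * s for v in row] for row in channel]
           for channel, s in zip(feature_map, weights)]
```

With a reduction ratio of 2, the hidden layer keeps half of the channels, so the bottleneck is mild compared with larger ratios such as 8.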

The baseline models are competent but lack the advanced attention and pooling mechanisms employed in DABAMNet. As a result, their performance, particularly in terms of Intersection over Union, remains lower. This highlights the value of attention-based enhancements in segmentation tasks that require fine spatial resolution.

Figure 6 presents a comparative analysis of training accuracy and loss across eight segmentation models. Among these, DABAMNet achieved the highest accuracy of 0.9813 with a relatively low loss value of 0.0648, reflecting both high predictive correctness and efficient learning. Classical architectures such as SegNet and DabNet followed closely with accuracies of 0.9809 and 0.9806 and corresponding loss values of 0.0829 and 0.0582. DeepLabV3+ with spatial attention recorded the lowest training loss at 0.0328, suggesting strong convergence; however, its accuracy of 0.9752 remained lower than that of DABAMNet. FC-DenseNet and DANet integrated with U-Net demonstrated moderate results, with accuracies of 0.9740 and 0.9785 and losses of 0.1610 and 0.0763, respectively. PSPNet showed the second-highest loss value of 0.4130, despite achieving a reasonable accuracy of 0.9680, suggesting potential issues such as poor optimization or limited spatial representation. U-Net enhanced with CBAM yielded a competitive accuracy of 0.9756 but suffered from an abnormally high loss of 27.7922, the highest of all models, likely resulting from unstable training dynamics or inadequate normalization mechanisms. To maintain consistency in analysis and uphold the validity of the metric range, excessively high loss values are capped at 1.0000.
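The capping step described above amounts to clipping each loss at an upper bound. The following minimal sketch uses the loss values quoted in the text; the variable names are hypothetical, and it assumes the cap is applied only when the values are compared or plotted, not during training.

```python
# Final training losses as reported in the text above.
losses = {
    "DABAMNet": 0.0648,
    "SegNet": 0.0829,
    "DabNet": 0.0582,
    "DeepLabV3+ + Spatial Attention": 0.0328,
    "FC-DenseNet": 0.1610,
    "DANet + U-Net": 0.0763,
    "PSPNet": 0.4130,
    "U-Net + CBAM": 27.7922,
}

LOSS_CAP = 1.0  # upper bound stated in the text

# Clip each loss at the cap and record which models were affected.
capped = {model: min(loss, LOSS_CAP) for model, loss in losses.items()}
was_capped = [model for model, loss in losses.items() if loss > LOSS_CAP]
```

Only the U-Net plus CBAM value exceeds the cap, so every other model keeps its reported loss.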

It is also important to emphasize the role of both accuracy and Intersection over Union in evaluating segmentation models. Accuracy provides an overall measure of classification correctness, while Intersection over Union is critical for assessing segmentation quality, especially in applications requiring precise boundary delineation. DABAMNet’s Intersection over Union score of 0.7291 significantly outperforms that of DeepLabV3+ with spatial attention, which reached only 0.4876. This result confirms the superiority of DABAMNet in capturing fine spatial structures and contextual dependencies. Such capabilities are essential in agricultural applications like yield estimation and orchard monitoring, where segmentation errors can have direct operational consequences. These findings position DABAMNet as a reliable, high-performing solution for precision agriculture, capable of achieving accurate and fine-grained segmentation under real-world conditions.
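A small worked example may help show why Accuracy alone can look strong while Intersection over Union stays modest when apples occupy few pixels. The sketch below uses a hypothetical 100-pixel image and computes IoU for the apple class only, which is an assumption; the paper may average over classes.

```python
def accuracy_and_iou(pred, truth):
    """Pixel accuracy and foreground IoU for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    accuracy = (tp + tn) / len(truth)
    union = tp + fp + fn
    iou = tp / union if union else 1.0  # both masks empty: perfect overlap
    return accuracy, iou


# Toy image: 5 apple pixels (1) and 95 background pixels (0).
truth = [1] * 5 + [0] * 95
# Hypothetical prediction: 3 of the 5 apple pixels found, plus 1 false alarm.
pred = [1, 1, 1, 0, 0] + [1] + [0] * 94

acc, iou = accuracy_and_iou(pred, truth)
print(f"accuracy={acc:.2f}, IoU={iou:.2f}")  # prints: accuracy=0.97, IoU=0.50
```

Because the 94 correctly labeled background pixels dominate the accuracy, three misclassified pixels barely move it, whereas the same errors halve the IoU.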

5. Conclusions and Future Work

This study presented DABAMNet, an attention-enhanced convolutional neural network tailored for high-precision apple segmentation in visually complex orchard environments. While incorporating established modules such as CBAM and bottleneck designs, DABAMNet introduces three architectural innovations: (1) the integration of depthwise asymmetric bottleneck units for efficient yet expressive feature encoding; (2) the strategic placement of dual attention modules at intermediate layers to balance semantic abstraction with spatial precision; and (3) the novel use of max–min pooling within the attention mechanism to capture both dominant and subtle visual cues. These innovations collectively empower DABAMNet to outperform existing CNN-based segmentation models in both accuracy and Intersection-over-Union (IoU). Its superiority has been validated through ablation studies and benchmarking against four state-of-the-art networks, demonstrating strong potential for real-world deployment in precision agriculture and autonomous harvesting systems.

The empirical evaluation revealed that the optimal configuration for DABAMNet involves applying the CBAM attention mechanism after DABou unit 5, using a channel reduction ratio of 2, and adopting a dual-pooling strategy combining global max and global min pooling operations. This setup delivered the highest Intersection over Union in the ablation study (0.7291), confirming the effectiveness of targeted attention refinement in deep convolutional architectures.

However, broader considerations should be acknowledged. The current model has been trained and tested exclusively on apple orchard imagery, and its direct applicability to other fruit types or agricultural environments with different visual characteristics remains untested. As such, DABAMNet’s effectiveness may diminish when applied to crops that vary significantly in color, texture, canopy structure, or background complexity. This limitation highlights the need for further investigation into model generalization and cross-domain adaptability. Future work will include cross-fruit evaluations (e.g., citrus, mango, grapes) to assess the transferability of DABAMNet across varying fruit morphologies and canopy structures.

Moreover, the model’s performance is closely tied to the availability of high-quality annotated datasets, which are essential for supervised training but costly and time-consuming to produce, especially in agricultural settings with complex occlusion and illumination conditions. This reliance on detailed manual labeling presents a significant barrier to deploying similar solutions across diverse crop types or geographic regions. To address this, semi-supervised learning or weak supervision may be explored to reduce the dependence on fully annotated data, thereby enabling more scalable and cost-effective deployment of segmentation models in real-world agricultural scenarios.

The proposed framework also has computational limitations. The integration of attention modules and deep bottleneck structures increases architectural complexity, leading to longer training times and higher computational demands. These constraints may limit deployment on edge devices or in scenarios that require real-time inference. Potential strategies to mitigate these challenges include the adoption of lightweight attention mechanisms and pruning techniques, which aim to reduce inference time and resource consumption while preserving segmentation performance. These improvements are especially relevant for deployment in edge computing environments, such as agricultural robots and drone-based monitoring systems, where real-time processing under hardware constraints is essential. Addressing computational efficiency will be a key direction in advancing the model’s practical applicability across diverse agricultural settings. Furthermore, although DABAMNet performs well under typical orchard conditions, its accuracy may decline in cases of severe occlusion or heavy background clutter.

Future research may address these challenges through several avenues. First, lightweight attention mechanisms such as Efficient Channel Attention [55] or Tiny Attention [56] could be incorporated to reduce computational complexity while maintaining segmentation performance. In parallel, model compression strategies including the Lottery Ticket Hypothesis [57] and structured pruning techniques [58] could be explored to reduce computational overhead without sacrificing accuracy, thereby improving the model’s suitability for edge computing and mobile platforms such as drones or agricultural robots. Second, expanding the training dataset to include a broader range of challenging conditions, such as extreme occlusion, varied illumination, and clutter, could further enhance the model’s generalization capability. Third, DABAMNet could be integrated into real-time automated fruit harvesting pipelines to support large-scale agricultural operations. Fourth, incorporating multi-scale feature extraction techniques, as suggested in [59], may further improve its ability to detect and segment apples across a variety of spatial resolutions [60].

Finally, domain adaptation techniques may be explored to improve robustness across new crops and field conditions. Overall, the findings of this study underscore the value of carefully integrating architectural components and optimizing attention placement to advance deep learning solutions for agricultural image analysis.

Author Contributions

Conceptualization, I.M.J., N.A.S.M. and M.A.Z.; methodology, I.M.J., N.A.S.M. and M.A.Z.; formal analysis, I.M.J., N.A.S.M. and M.A.Z.; writing—original draft preparation, I.M.J., N.A.S.M. and M.A.Z.; writing—review and editing, I.M.J., N.A.S.M. and M.A.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by APNIC Foundation through ISIF-Asia Grant (KK-2022-018 & M-202205-01161) and Universiti Kebangsaan Malaysia through Dana Padanan Kolaborasi—Geran Pembiayaan Sepadan (DPK-GPS-JORDAN-2024-021).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available from https://rsn.umn.edu/projects/orchard-monitoring/minneapple (accessed on 9 October 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DABAMNet	Depthwise Asymmetric Bottleneck with Attention Mechanism Network
CNN	Convolutional neural network
CBAM	Convolutional Block Attention Module
DABou	Depthwise Asymmetric Bottleneck Unit
DABAM	Depthwise Asymmetric Bottleneck Attention Module
ACR	Accuracy
IoU	Intersection over Union

References

1. Statista Research Department Worldwide. Volume of Food Consumption 2015–2027. Available online: https://www.statista.com/forecasts/1298375/volume-food-consumption-worldwide (accessed on 7 July 2025).
2. Food and Agriculture Organization of the United Nations (FAO). The State of Food Security and Nutrition in the World 2021. Available online: https://www.fao.org/3/cb4474en/cb4474en.pdf (accessed on 7 July 2025).
3. Messe Berlin GmbH. Fruit Logistica 2022 Statistics Handbook; Messe Berlin GmbH: Berlin, Germany, 2022; Available online: https://messe-berlinprod-media.e-spirit.cloud/a1df0db7-5587-490f-b68d-2d8767a5500f/fruit-logistica/downloads-alle-sprachen/auf-einen-blick/european_statistics_handbook_2022.pdf (accessed on 21 August 2025).
4. Piedra-Bonilla, E.B.; Da Cunha, D.A.; Braga, M.J.; Oliveira, L.R. Extreme Weather Events and Crop Diversification: Climate Change Adaptation in Brazil. Mitig. Adapt. Strateg. Glob. Change 2025, 30, 28.
5. Shah, W.U.H.; Lu, Y.; Liu, J.; Rehman, A.; Yasmeen, R. The Impact of Climate Change and Production Technology Heterogeneity on China’s Agricultural Total Factor Productivity and Production Efficiency. Sci. Total Environ. 2024, 907, 168027.
6. Albahar, M. A Survey on Deep Learning and Its Impact on Agriculture: Challenges and Opportunities. Agriculture 2023, 13, 540.
7. Akkem, Y.; Biswas, S.K.; Varanasi, A. Smart Farming Using Artificial Intelligence: A Review. Eng. Appl. Artif. Intell. 2023, 120, 105899.
8. Bizikova, L.; Jungcurt, S.; McDougal, K.; Tyler, S. How Can Agricultural Interventions Enhance Contribution to Food Security and SDG 2.1? Glob. Food Secur. 2020, 26, 100450.
9. Dai, D. An Introduction of CNN: Models and Training on Neural Network Models. In Proceedings of the 2021 International Conference on Big Data, Artificial Intelligence and Risk Management, ICBAR, Shanghai, China, 5–7 November 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 135–138.
10. Zulkifley, M.A.; Moubark, A.M.; Saputro, A.H.; Abdani, S.R. Automated Apple Recognition System Using Semantic Segmentation Networks with Group and Shuffle Operators. Agriculture 2022, 12, 756.
11. Elizar, E.; Zulkifley, M.A.; Muharar, R.; Zaman, M.H.M.; Mustaza, S.M. A Review on Multiscale-Deep-Learning Applications. Sensors 2022, 22, 7384.
12. Zampokas, G.; Mariolis, I.; Giakoumis, D.; Tzovaras, D. Residual Cascade CNN for Detection of Spatially Relevant Objects in Agriculture: The Grape-Stem Paradigm. In Computer Vision Systems; Christensen, H.I., Corke, P., Detry, R., Weibel, J.-B., Vincze, M., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 159–168.
13. Mimma, N.E.A.; Ahmed, S.; Rahman, T.; Khan, R. Fruits Classification and Detection Application Using Deep Learning. Sci. Program. 2022, 2022, 4194874.
14. Altaheri, H.; Alsulaiman, M.; Muhammad, G. Date Fruit Classification for Robotic Harvesting in a Natural Environment Using Deep Learning. IEEE Access 2019, 7, 117115–117133.
15. Mahmood, A.; Singh, S.K.; Tiwari, A.K. Pre-Trained Deep Learning-Based Classification of Jujube Fruits According to Their Maturity Level. Neural Comput. Appl. 2022, 34, 13925–13935.
16. Erbaş, N.; Çinarer, G.; Kiliç, K. Classification of Hazelnuts According to Their Quality Using Deep Learning Algorithms. Czech J. Food Sci. 2022, 40, 240–248.
17. Agarwal, M.; Kaliyar, R.K.; Gupta, S.K. Differential Evolution Based Compression of CNN for Apple Fruit Disease Classification. In Proceedings of the 5th International Conference on Inventive Computation Technologies, ICICT 2022, Lalitpur, Nepal, 20–22 July 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 76–82.
18. Arshaghi, A.; Ashourian, M.; Ghabeli, L. Potato Diseases Detection and Classification Using Deep Learning Methods. Multimed. Tools Appl. 2023, 82, 5725–5742.
19. Singh, M.K.; Kumar, A. Cucumber Leaf Disease Detection and Classification Using a Deep Convolutional Neural Network. J. Inf. Technol. Manag. 2023, 15, 94–110.
20. Zhou, J.; Zhang, Y.; Wang, J. A Dragon Fruit Picking Detection Method Based on YOLOv7 and PSP-Ellipse. Sensors 2023, 23, 3803.
21. Li, X.; Qin, Y.; Wang, F.; Guo, F.; Yeow, J.T.W. Pitaya Detection in Orchards Using the MobileNet-YOLO Model. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 6274–6278.
22. Gai, R.; Li, M.; Chen, N. Cherry Detection Algorithm Based on Improved YOLOv5s Network. In Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Haikou, China, 20–22 December 2021; pp. 2097–2103.
23. Gai, R.; Chen, N.; Yuan, H. A Detection Algorithm for Cherry Fruits Based on the Improved YOLO-v4 Model. Neural Comput. Appl. 2023, 35, 13895–13906.
24. Reddy, M.S.S.A.; Aishwarya, N. A Deep Learning Approach to Identify Fresh and Stale Fruits and Vegetables with YOLO. In Proceedings of the IEEE International Conference on Advances in Electronics, Communication, Computing and Intelligent Information Systems, ICAECIS 2023, Bangalore, India, 19–21 April 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 606–610.
25. Yi, C.; Wu, W.; Yang, L.; Jia, R. Research on Fruit Recognition Method Based on Improved YOLOv4 Algorithm. In Proceedings of the 2023 IEEE 2nd International Conference on Electrical Engineering, Big Data and Algorithms, EEBDA 2023, Changchun, China, 24–26 February 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 1892–1901.
26. Wu, Y.; Yang, Y.; Wang, X.; Cui, J.; Li, X. Fig Fruit Recognition Method Based on YOLO v4 Deep Learning. In Proceedings of ECTI-CON 2021, 18th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology: Smart Electrical System and Technology, Chiang Mai, Thailand, 19–22 May 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 303–306.
27. Melnychenko, O.; Savenko, O.; Radiuk, P. Apple Detection with Occlusions Using Modified YOLOv5-V1. In Proceedings of the IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS, Dortmund, Germany, 7–9 September 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 107–112.
28. Pichhika, H.C.; Subudhi, P. Detection of Multi-Varieties of On-Tree Mangoes Using MangoYOLO5. In Proceedings of the 2023 11th International Symposium on Electronic Systems Devices and Computing, ESDC 2023, Sri City, India, 4–6 May 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023.
29. Ram, P.P.V.S.; Yaswanth, K.V.S.; Kamepalli, S.; Sankar, B.S.; Madupalli, M. Deep Learning Model YOLOv5 for Red Chilies Detection from Chilly Crop Images. In Proceedings of the 2023 IEEE 8th International Conference for Convergence in Technology, I2CT, Lonavla, India, 7–9 April 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023.
30. Yang, M.; Wang, C. Automatic Detection of Green Litchi Based on YOLOV3, YOLOV4, YOLOV5-S and YOLOV5-X Deep Learning Algorithm. In Proceedings of the 2023 4th International Conference on Computer Engineering and Application, ICCEA, Hangzhou, China, 7–9 April 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 865–868.
31. Mamat, N.; Othman, M.F.; Abdulghafor, R.; Alwan, A.A.; Gulzar, Y. Enhancing Image Annotation Technique of Fruit Classification Using a Deep Learning Approach. Sustainability 2023, 15, 901.
32. Jie, B.X.; Zulkifley, M.A.; Mohamed, N.A. Remote Sensing Approach to Oil Palm Plantations Detection Using Xception. In Proceedings of the 2020 11th IEEE Control and System Graduate Research Colloquium (ICSGRC), Shah Alam, Malaysia, 8 August 2020; pp. 38–42.
33. Fujinaga, T.; Nakanishi, T. Semantic Segmentation of Strawberry Plants Using DeepLabV3+ for Small Agricultural Robot. In Proceedings of the 2023 IEEE/SICE International Symposium on System Integration, SII, Atlanta, GA, USA, 17–20 January 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023.
34. Ilyas, T.; Kim, H. A Deep Learning Based Approach for Strawberry Yield Prediction via Semantic Graphics. In Proceedings of the International Conference on Control, Automation and Systems, Jeju, Republic of Korea, 12–15 October 2021; pp. 1835–1841.
35. Chen, S.; Song, Y.; Su, J.; Fang, Y.; Shen, L.; Mi, Z.; Su, B. Segmentation of Field Grape Bunches via an Improved Pyramid Scene Parsing Network. Int. J. Agric. Biol. Eng. 2021, 14, 185–194.
36. Deb, M.; Garai, A.; Das, A.; Dhal, K.G. LS-Net: A Convolutional Neural Network for Leaf Segmentation of Rosette Plants. Neural Comput. Appl. 2022, 34, 18511–18524.
37. Deng, J.; Niu, Z.; Zhang, X.; Zhang, J.; Pan, S.; Mu, H. Kiwifruit Vine Extraction Based on Low Altitude UAV Remote Sensing and Deep Semantic Segmentation. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence and Computer Applications, ICAICA, Dalian, China, 28–30 June 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 843–846.
38. Abdani, S.R.; Zulkifley, M.A.; Siham, M.N.; Zanal Abiddin, N.; Abdul Aziz, N.A. Paddy Fields Segmentation Using Fully Convolutional Network with Pyramid Pooling Module. In Proceedings of the 2020 IEEE 5th International Symposium on Telecommunication Technologies (ISTT), Shah Alam, Malaysia, 9–11 November 2020; pp. 30–34.
39. Xue, H.; Sun, Y.; Chen, J.; Tian, H.; Liu, Z.; Shen, M.; Liu, L. CAT-CBAM-Net: An Automatic Scoring Method for Sow Body Condition Based on CNN and Transformer. Sensors 2023, 23, 7919.
40. Zhang, Y.; Zhang, Q.; Gu, J.; Yu, H.; Xu, D. U-Net Model Based on CBAM Attention Mechanism for Coronary Angiography Segmentation. J. Mech. Med. Biol. 2024, 24, 2440062.
41. Ren, Z.; Li, L.; Chen, B.; Ning, Z.; Jia, Z.; Weng, H. A Modified U-Net with Dilated Convolution and CBAM. In Proceedings of the 2023 2nd International Conference on Robotics, Artificial Intelligence and Intelligent Control, RAIIC, Mianyang, China, 11–13 August 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 359–364.
42. Zulaikha Beevi, S.; Harish Kumar, P.; Harish, S.; Sabari Sundar, A.R. Liver Tumor Segmentation Using CBAM-U-NET. In Proceedings of International Conference on Intelligent Vision and Computing (ICIVC 2023); Saha, A.K., Sharma, H., Prasad, M., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 1–11.
43. Liu, J.; Xing, T.; Wang, X. Optimizing AlexNet for Accurate Tree Species Classification via Multi-Branch Architecture and Mixed-Domain Attention. Complex Intell. Syst. 2025, 11, 244.
44. Cruttwell, G.S.H.; Gavranović, B.; Ghani, N.; Wilson, P.; Zanasi, F. Categorical Foundations of Gradient-Based Learning. In Programming Languages and Systems; Sergey, I., Ed.; Springer International Publishing: Cham, Switzerland, 2022; pp. 1–28.
45. Ayoub, S.; Gulzar, Y.; Reegu, F.A.; Turaev, S. Generating Image Captions Using Bahdanau Attention Mechanism and Transfer Learning. Symmetry 2022, 14, 2681.
46. Wei, X.; Wang, G.; Schmalz, B.; Hagan, D.F.T.; Duan, Z. Evaluate Transformer Model and Self-Attention Mechanism in the Yangtze River Basin Runoff Prediction. J. Hydrol. Reg. Stud. 2023, 47, 101438.
47. Chen, Z.; Ji, H.; Zhang, Y.; Zhu, Z.; Li, Y. High-Resolution Feature Pyramid Network for Small Object Detection on Drone View. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 475–489.
48. Sunandini, G.; Sivanpillai, R.; Sowmya, V.; Variyar, V.V.S. Significance of Atrous Spatial Pyramid Pooling (ASPP) in Deeplabv3+ for Water Body Segmentation. In Proceedings of the 2023 10th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 23–24 March 2023; pp. 744–749.
49. Li, G.; Jiang, S.; Yun, I.; Kim, J.; Kim, J. Depth-Wise Asymmetric Bottleneck with Point-Wise Aggregation Decoder for Real-Time Semantic Segmentation in Urban Scenes. IEEE Access 2020, 8, 27495–27506.
50. Hani, N.; Roy, P.; Isler, V. MinneApple: A Benchmark Dataset for Apple Detection and Segmentation. IEEE Robot. Autom. Lett. 2020, 5, 852–858.
51. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3141–3149.
52. Rajamani, K.; Gowda, S.D.; Tej, V.N.; Rajamani, S.T. Deformable Attention (DANet) for Semantic Image Segmentation. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, Glasgow, UK, 11–15 July 2022; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2022; pp. 3781–3784.
53. Li, W.; Wu, J.; Chen, H.; Wang, Y.; Jia, Y.; Gui, G. UNet Combined with Attention Mechanism Method for Extracting Flood Submerged Range. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6588–6597.
54. Liu, R.; He, D. Semantic Segmentation Based on Deeplabv3+ and Attention Mechanism. In Proceedings of IMCEC 2021, IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference, Chongqing, China, 18–20 June 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 255–259.
55. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
56. Patil, P.S.; Holambe, R.S.; Waghmare, L.M. An Attention Augmented Convolution-Based Tiny-Residual UNet for Road Extraction. IEEE Trans. Artif. Intell. 2024, 5, 3951–3964.
57. Schlake, G.S.; Hüwel, J.D.; Berns, F.; Beecks, C. Evaluating the Lottery Ticket Hypothesis to Sparsify Neural Networks for Time Series Classification. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), Kuala Lumpur, Malaysia, 9 May 2022; pp. 70–73.
58. Shao, T.; Shin, D. Structured Pruning for Deep Convolutional Neural Networks via Adaptive Sparsity Regularization. In Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022; pp. 982–987.
59. Wang, D.; Cao, W.; Zhang, F.; Li, Z.; Xu, S.; Wu, X. A Review of Deep Learning in Multiscale Agricultural Sensing. Remote Sens. 2022, 14, 559.
60. Abdani, S.R.; Zulkifley, M.A.; Mamat, M. U-Net with Spatial Pyramid Pooling Module for Segmenting Oil Palm Plantations. In Proceedings of the 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Malaysia, 26–27 September 2020.

Figure 1. Global volume of food consumption from 2015 to projected 2026 [1].

Figure 2. The overall network architecture of CBAM.

Figure 3. The overall network architecture of DABAMNet.

Figure 4. Samples of apple images from the dataset.

Figure 5. Example segmentation results produced by DABAMNet.

Figure 6. Accuracy and loss comparison of baseline and DABAMNet during training.

Table 1. The overall layer design of the DABAMNet architecture.

No.	Module	Layer Name	Output Size	Number of Filters	Kernel Size	Remarks
1	$M_{1}$	Dual Downsampling	128 × 256	32	3 × 3
2	$M_{1}$	Conv 1	128 × 256	32	3 × 3
3	$M_{1}$	Conv 2	128 × 256	32	3 × 3
4	$M_{1}$	Conv 3	128 × 256	32	3 × 3
5	$M_{2}$	Dual Downsampling	64 × 128	32	3 × 3	DABAM Block 1
6	$M_{2}$	DABou unit 1	64 × 128	32	3 × 3	dilation rate = 2
7	$M_{2}$	DABou unit 2	64 × 128	32	3 × 3	dilation rate = 2
8	$M_{2}$	DABou unit 3	64 × 128	32	3 × 3	dilation rate = 2
9	$M_{2}$	Dual Downsampling	32 × 64	64	3 × 3	DABAM Block 2
10	$M_{2}$	DABou unit 4	32 × 64	64	3 × 3	dilation rate = 4
11	$M_{2}$	DABou unit 5	32 × 64	64	3 × 3	dilation rate = 4
12	$M_{2}$	CBAM	32 × 64	64	3 × 3	r = 2 (Multi-ratio MLP)
13	$M_{2}$	DABou unit 6	32 × 64	64	3 × 3	dilation rate = 8
14	$M_{2}$	DABou unit 7	32 × 64	64	3 × 3	dilation rate = 8
15	$M_{2}$	DABou unit 8	32 × 64	64	3 × 3	dilation rate = 16
16	$M_{2}$	DABou unit 9	32 × 64	64	3 × 3	dilation rate = 16
17	$M_{3}$	Conv (1 × 1 Projection)	32 × 64	64	1 × 1
18	$M_{3}$	Upsample (Resize)	256 × 512	-	-

Note: All Dual Downsampling operations apply a parallel combination of a strided convolution (stride = 2) and a max-pooling operation (stride = 2) across two branches, followed by feature concatenation.
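The Dual Downsampling note can be illustrated with a small pure-Python sketch. It is a simplified illustration under several assumptions not stated in the table: a single-channel input, zero padding of 1 for the 3 × 3 convolution, a 2 × 2 pooling window, and hand-picked filter values. The function names are hypothetical.

```python
def strided_conv3x3(image, kernel, stride=2):
    """3x3 convolution with zero padding of 1 and the given stride."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:  # outside pixels count as zero
                        acc += kernel[di][dj] * image[y][x]
            row.append(acc)
        out.append(row)
    return out


def max_pool2x2(image, stride=2):
    """2x2 max pooling with the given stride (assumes even height and width)."""
    h, w = len(image), len(image[0])
    return [[max(image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, w - 1, stride)]
            for i in range(0, h - 1, stride)]


def dual_downsample(image, kernels):
    """Run both branches in parallel and concatenate along the channel axis."""
    conv_branch = [strided_conv3x3(image, k) for k in kernels]
    pool_branch = [max_pool2x2(image)]
    return conv_branch + pool_branch


image = [[float(8 * r + c) for c in range(8)] for r in range(4)]  # toy 4 x 8 input
kernels = [
    [[1 / 9] * 3 for _ in range(3)],       # hypothetical averaging filter
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],     # hypothetical identity filter
]
out = dual_downsample(image, kernels)  # 3 channels, each 2 x 4
```

Both branches halve the spatial resolution, so their outputs can be concatenated directly, which matches the halving of output sizes between consecutive Dual Downsampling rows in Table 1.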

Table 2. Hyperparameter configurations used for training the proposed DABAMNet model.

Hyperparameter	Value
Batch size	9
Training epochs	50
Backpropagation method	Nadam optimizer
Input image size	256 × 512 pixels
Optimizer learning rate	0.001
Optimizer momentum	0.9
Label encoding format	One-hot encoded
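The settings in Table 2 can be collected in a small configuration object, and the one-hot label format can be shown on a toy mask. This is a sketch with hypothetical field names; it assumes two classes (background and apple) and that 256 × 512 denotes height × width, neither of which is stated explicitly in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings copied from Table 2 (field names are hypothetical)."""
    batch_size: int = 9
    epochs: int = 50
    optimizer: str = "nadam"
    learning_rate: float = 0.001
    momentum: float = 0.9
    input_height: int = 256   # assumption: 256 is the height
    input_width: int = 512    # assumption: 512 is the width
    num_classes: int = 2      # assumption: background and apple


def one_hot(mask, num_classes):
    """Convert a 2D mask of integer class ids into per-pixel one-hot vectors."""
    return [[[1 if cls == k else 0 for k in range(num_classes)] for cls in row]
            for row in mask]


config = TrainConfig()
mask = [[0, 1],
        [1, 0]]  # toy 2 x 2 mask: 1 marks apple pixels
encoded = one_hot(mask, config.num_classes)
```

Each pixel becomes a vector with a single 1 at its class index, so background is `[1, 0]` and apple is `[0, 1]` in this toy setup.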

Table 3. IoU scores of the ablation study across different CBAM placements, channel reduction ratios ($r$), and global pooling combinations.

Reduction Ratio (r)	Global Pooling Combination	After DABou Unit 5	After DABou Unit 7	After DABou Unit 9
2	Average + Max	0.7100	0.5531	0.7223
	Average + Max + Min	0.7224	0.5521	0.6228
	Max + Min	0.7291	0.7134	0.6423
	Average + Max + Min	0.5892	0.5397	0.7275
4	Average + Max	0.6815	0.5954	0.6412
	Average + Max + Min	0.6826	0.6068	0.7194
	Max + Min	0.6721	0.7148	0.6425
	Average + Max + Min	0.6237	0.5567	0.5644
8	Average + Max	0.7252	0.6785	0.5947
	Average + Max + Min	0.6582	0.6319	0.6701
	Max + Min	0.7223	0.5379	0.7213
	Average + Max + Min	0.7180	0.5330	0.6820

The highest IoU achieved is 0.7291 (r = 2, Max + Min pooling, CBAM after DABou unit 5).

Table 4. Performance comparison of DABAMNet and benchmark models based on Accuracy and IoU.

Model	Accuracy	IoU	Total Parameters	Trainable Parameters
PSPNet	0.9680	0.6903	27,896,000	27,838,400
SegNet	0.9809	0.6896	29,460,042	29,444,166
FC-DenseNet	0.9740	0.5674	14,729,860	14,594,658
DabNet	0.9806	0.7156	20,172,354	20,163,066
DABAMNet	0.9813	0.7291	20,603,811	20,592,871
DANet + U-Net	0.9785	0.5897	27,888,067	27,888,067
U-Net + CBAM	0.9756	0.4956	31,208,186	31,208,186
DeepLabV3+ + Spatial Attention	0.9752	0.4876	128,133,144	127,937,064

DABAMNet achieved the highest Accuracy (0.9813) and IoU (0.7291) across all models.
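The trade-off between model size and segmentation quality in Table 4 can be checked with a few lines of arithmetic. The sketch below only restates the table values; the derived quantities (relative size, parameter difference, IoU gain) are illustrative summaries and are not reported by the authors.

```python
# (accuracy, IoU, total parameters) copied from Table 4.
results = {
    "PSPNet": (0.9680, 0.6903, 27_896_000),
    "SegNet": (0.9809, 0.6896, 29_460_042),
    "FC-DenseNet": (0.9740, 0.5674, 14_729_860),
    "DabNet": (0.9806, 0.7156, 20_172_354),
    "DABAMNet": (0.9813, 0.7291, 20_603_811),
    "DANet + U-Net": (0.9785, 0.5897, 27_888_067),
    "U-Net + CBAM": (0.9756, 0.4956, 31_208_186),
    "DeepLabV3+ + Spatial Attention": (0.9752, 0.4876, 128_133_144),
}

best_iou = max(results, key=lambda m: results[m][1])
smallest = min(results, key=lambda m: results[m][2])

# Size of every model relative to DABAMNet.
dab_params = results["DABAMNet"][2]
relative_size = {m: round(v[2] / dab_params, 2) for m, v in results.items()}

# Difference between DABAMNet and DabNet in parameters and IoU.
extra_params = dab_params - results["DabNet"][2]
iou_gain = results["DABAMNet"][1] - results["DabNet"][1]
```

On these numbers, DABAMNet has 431,457 more parameters than DabNet (about 2%) for an IoU gain of roughly 0.0135, and the DeepLabV3+ variant is about 6.22 times larger than DABAMNet.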

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jelas, I.M.; Maluazi, N.A.S.; Zulkifley, M.A. An Attention-Enhanced Bottleneck Network for Apple Segmentation in Orchard Environments. Agriculture 2025, 15, 1802. https://doi.org/10.3390/agriculture15171802

