1. Introduction
Tobacco holds significant economic value. Although its harmful effects on human health are widely recognized, it remains a major cash crop globally, especially in southwestern China, where it is a vital source of income for many farmers. Ensuring the healthy development of tobacco cultivation is thus essential for improving farmers’ livelihoods and promoting regional agricultural economies. However, tobacco plants are frequently affected by various diseases, which severely impact yield and quality. Traditional disease diagnosis relies heavily on human expertise, which is inefficient and often leads to overly generalized treatment strategies. For instance, farmers tend to spray pesticides over large areas, resulting in excessive pesticide use and serious ecological pollution. In contrast, computer vision techniques enable faster and more accurate identification of tobacco leaf diseases, providing scientific support for disease prevention and control decisions [1].
In recent years, image classification techniques have been widely applied to tobacco leaf disease recognition with notable progress. For example, Lin et al. [2] proposed a Meta-Baseline-based few-shot learning (FSL) method, which enhances feature representation through cascaded multi-scale fusion and channel attention mechanisms, alleviating data scarcity issues. The method achieved 61.24% and 77.43% accuracy on single-plant 1-shot and 5-shot tasks, and 82.52% and 92.83% on multi-plant tasks. Subsequently, Lin et al. [3] introduced instance embedding and task adaptation techniques, achieving 66.04% 5-way 1-shot accuracy on the PlantVillage dataset and 45.5% and 56.5% accuracy under two TLA dataset settings. Additionally, frequency-domain features have been integrated into FSL frameworks to improve generalization and feature expression under limited data conditions [4].
Deep learning models have also shown excellent performance in crop disease diagnosis. Mohanty et al. [5] trained CNNs on over 54,000 leaf images captured by smartphones, achieving 99.35% accuracy and demonstrating strong scalability and device independence. Ferentinos [6] trained CNNs on 87,848 images covering 58 plant disease classes, reaching 99.53% accuracy. Too et al. evaluated various CNN architectures and found DenseNet to perform best, with 99.75% accuracy, owing to its parameter efficiency and resistance to overfitting. Sun et al. [7] used ResNet-101 to grade aphid damage on tobacco leaves, significantly improving accuracy with data augmentation and a three-tier classification strategy. Sladojevic et al. [8] proposed a deep CNN achieving 96.3% average accuracy in single-class plant disease classification. Bharali et al. [9] demonstrated that a lightweight model trained on only 1400 images can still achieve 96.6% accuracy using Keras and TensorFlow. In UAV-based recognition, Thimmegowda et al. [10] combined HOF, MBH, and optimized HOG features with PCA-based selection, achieving 95% accuracy and 92% sensitivity.
For object detection, Lin et al. [4] proposed an improved YOLOX-Tiny network with Hierarchical Mixed-scale Units (HMUs) in the neck module, enhancing cross-channel interaction and achieving 80.56% accuracy for tobacco brown spot under natural conditions. Zhang et al. [11] used deep residual networks and k-means-optimized anchor boxes to boost tomato disease detection accuracy by 2.71%. Sadi Uysal et al. [12] applied ResNet-18 and Class Activation Maps (CAMs) for lesion localization, performing well in small-object detection. Iwano et al. [13] built a Hierarchical Object Detection and Recognition Framework (HODRF), effectively reducing false positives and improving small-lesion recognition. Other works include a YOLOv5 variant with BiFPN and Shuffle Attention for pine wilt detection in UAV images [14], and a Mask R-CNN variant with COT for accurate segmentation of overlapping tobacco leaves [15].
While image classification determines disease presence, it cannot localize lesions. Object detection offers bounding boxes but fails to delineate lesion contours accurately, especially for multi-scale small objects, owing to downsampling errors. Despite the impressive performance of classification and detection models in plant disease diagnosis, their applicability to small-lesion segmentation remains limited. Classification models, by design, provide only global predictions without spatial localization, making them unsuitable for scenarios where precise lesion boundaries are required, such as disease monitoring or progression analysis. Object detection methods, while more spatially aware, often suffer significant performance degradation on small targets for several reasons. First, multi-stage downsampling in convolutional backbones reduces the spatial resolution of feature maps, leading to the loss of fine-grained lesion details. Second, fixed-size anchor boxes and IoU-based assignment strategies make it difficult to generate high-quality proposals for tiny, irregularly shaped lesions. Third, when lesions are densely distributed or partially occluded, which is common on tobacco leaves, detection models may misidentify overlapping lesions as a single entity or miss them entirely.
Furthermore, many detection frameworks are optimized for instance-level recognition rather than pixel-level delineation, limiting their capacity to accurately segment small, scattered disease regions. In contrast, semantic segmentation assigns labels at the pixel level, achieving precise lesion localization, which makes it particularly suitable when multiple small diseased areas appear on a single leaf. These limitations motivate the adoption of semantic segmentation models with refined attention mechanisms that can preserve high-resolution information and focus selectively on subtle lesion features. Our proposed AFMA and DA modules address precisely these challenges by enhancing feature representation at multiple scales and guiding the model’s attention to meaningful spatial and channel-wise cues.
Semantic segmentation provides fine-grained, spatially accurate information by assigning semantic labels to each pixel, making it ideal for identifying both disease type and lesion distribution. For instance, Chen et al. [16] proposed MD-UNet, which integrates multi-scale convolution and dense residual dilated convolution modules, achieving effective segmentation of tobacco and other plant diseases. Zhang et al. [17] improved Mask R-CNN with feature fusion and hybrid attention to handle occlusions and blurred edges, boosting mIoU by 11.10%. Ou et al. [18] designed a segmentation network combining CBAM and skip connections, achieving 64.99% mIoU on tobacco disease segmentation. Chen et al. [19] further incorporated attention into MD-UNet, achieving 84.93% IoU, though the model distinguishes only between healthy and unhealthy regions.
Semantic segmentation, a fundamental task in computer vision, aims to assign semantic categories to each pixel. Despite advancements in deep learning, small-object segmentation remains challenging, especially for tobacco leaf disease recognition in agriculture. Lesions are often irregularly shaped, have fuzzy edges, and occupy a small portion of the image, complicating detection and segmentation.
Recent architectures, such as FCN [20], SegFormer [21], Mask2Former [22], and PointRend [23], have advanced semantic segmentation. Meanwhile, baseline models like UNet [24], DeepLab [25,26,27], and HRNet [28] remain widely used due to their unique strengths. UNet, with its simple structure, detail retention, and strong small-object performance, is especially suited for plant disease segmentation.
To address the limitations of traditional architectures in small-object processing, this study explores a hybrid attention-enhanced UNet framework that integrates cross feature-map attention (AFMA) and dual-attention (DA) modules. This design improves fine-grained feature extraction while maintaining low computational overhead, significantly enhancing small-object segmentation performance.
This study focuses on the identification of small target diseases. Small target diseases refer to lesion areas smaller than 32 × 32 pixels in the image, with blurred contours that are easily confused with the background. They often appear as spots or small irregular patches, typically representing the early stage of disease, which carries important warning significance. If not identified promptly, these lesions can spread into large-scale infections, causing severe damage. Semantic segmentation can accurately delineate disease boundaries at the pixel level, providing finer spatial information than rough localization and thereby supporting subsequent disease assessment and precise control measures. Therefore, the precise segmentation of small target diseases holds both scientific significance and practical application value.
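To make the size criterion concrete, the following minimal sketch counts lesion regions below the 32 × 32 (1024-pixel) threshold in a binary mask; the function name and the use of OpenCV connected-component analysis are illustrative choices rather than part of our pipeline.

```python
import cv2
import numpy as np

# Absolute-scale threshold: lesions under 32 x 32 = 1024 pixels count as "small".
SMALL_AREA_PX = 32 * 32

def count_small_lesions(mask: np.ndarray) -> int:
    """Count connected lesion regions smaller than 32 x 32 pixels.

    mask: binary uint8 array (0 = background, 255 = lesion).
    """
    num_labels, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    areas = stats[1:, cv2.CC_STAT_AREA]  # skip label 0 (background)
    return int((areas < SMALL_AREA_PX).sum())
```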
Considering the small-object nature of tobacco leaf diseases, we propose a novel method combining AFMA and DA attention mechanisms to improve segmentation accuracy for minor lesions.
The main contributions of this work are summarized as follows:
Image stitching training strategy: We introduce a data augmentation method based on image stitching by randomly combining tobacco leaf images with different diseases. This alleviates pixel-level class imbalance, simulates co-occurrence scenarios, and enhances dataset diversity and representativeness.
Dual attention mechanism design: A hierarchical feature enhancement framework is constructed by integrating two complementary attention modules. The DA module captures fine-grained local features with minimal parameters, while the AFMA module enhances cross-scale feature representation. Their synergy significantly improves minor lesion segmentation.
Lightweight attention integration: The proposed attention modules are lightweight and plug-and-play, making them easy to embed into mainstream segmentation networks such as UNet, DeepLab, and HRNet. They deliver performance gains with minimal computational cost.
2. Materials and Methods
This section describes the data sources, preprocessing steps, baseline network, attention modules, and overall network architecture, enabling the reproducibility of our work.
2.1. Dataset and Preprocessing
The dataset used in this study was collected between 2021 and 2024 from various regions in Yunnan Province, including Dali, Qujing, and Honghe. It consists of images of tobacco leaves infected by various diseases under natural field conditions. To ensure diversity and representativeness, the samples cover different growth stages and environmental conditions.
The main disease types include frog-eye leaf spot, climate spot, and wildfire disease (see Figure 1). As depicted in Figure 2, data collection from 2021 to 2024 yielded a total of 133 images of frog-eye leaf spot, 96 images of climate spot, and 128 images of wildfire disease.
Images were captured using a Canon EOS 700D digital camera (Canon Inc., Tokyo, Japan) under natural daylight. A reflective umbrella was used to minimize the effects of strong sunlight and shadows. The camera was consistently positioned at approximately a 90-degree angle to the leaf surface. Image collection mainly took place on sunny or slightly overcast days. Images that were out of focus, had strong reflections, or contained occlusions were excluded.
Before training, all images were resized to a standardized resolution suitable for network input (e.g., 512 × 512 pixels). Semantic annotations were assisted by the Segment Anything tool (see Figure 3) to reduce labeling subjectivity. In addition, image stitching was performed, as illustrated in Figure 4. A fixed random seed was set for all procedures to ensure reproducibility of the experiments.
Image stitching aims to create a composite sample of size 512 × 1536 pixels by combining three randomly selected images of different diseases (frog-eye leaf spot, climate spot, and wildfire), thereby increasing data diversity, reducing the risk of overfitting, and simulating the coexistence of multiple diseases. The specific steps are as follows: each image is resized to 512 × 512 pixels, random horizontal flipping is optionally applied to increase variation, and the three images are then stacked vertically from top to bottom, producing a stitched image of 512 pixels (width) × 1536 pixels (height). Finally, the processed dataset was split into training, validation, and test sets at a ratio of 8:1:1, i.e., 8000 samples for training, 1000 for validation, and 1000 for testing.
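The stitching procedure can be sketched as follows, assuming three RGB disease images on disk; the file names are placeholders, and in practice the corresponding label masks must be stitched with exactly the same flips and layout.

```python
import random
from PIL import Image

random.seed(42)  # fixed seed, per Section 2.1, for reproducibility

def splice_three(paths):
    """Resize three disease images to 512 x 512, optionally flip each
    horizontally, and stack them vertically into a 512 x 1536 composite."""
    tiles = []
    for p in paths:
        img = Image.open(p).convert("RGB").resize((512, 512))
        if random.random() < 0.5:  # optional horizontal flip
            img = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
        tiles.append(img)
    composite = Image.new("RGB", (512, 1536))  # (width, height)
    for i, tile in enumerate(tiles):
        composite.paste(tile, (0, 512 * i))    # top-to-bottom stacking
    return composite

# e.g., splice_three(["frog_eye.jpg", "climate_spot.jpg", "wildfire.jpg"])
```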
2.2. Selection of Baseline Networks
To evaluate and compare the performance of state-of-the-art semantic segmentation networks on the tobacco leaf disease dataset, we selected three widely used models: UNet, DeepLab, and HRNet. UNet employs a symmetric encoder-decoder architecture with skip connections that effectively fuse shallow spatial details with deep semantic features. It is particularly well-suited for segmenting small lesion areas, as it can better preserve boundary and texture information. In contrast, DeepLab utilizes atrous convolutions and conditional random fields (CRF) to capture multi-scale contextual information and refine boundaries. However, its relatively large downsampling rate may cause loss of details for small or blurry targets. HRNet maintains high-resolution feature representations and is suitable for large-scale scenes, but its high computational complexity makes it less effective for recognizing sparsely distributed small lesions.
Furthermore, Table 1, which compares the throughput and inference latency (ms) of the three baseline networks, further demonstrates the advantages of the UNet baseline network.
After a comprehensive comparison, we selected UNet as the baseline model due to its stable performance, simple architecture, and superior results in small-object segmentation tasks.
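For reference, Table-1-style latency and throughput figures can be obtained with a routine like the sketch below; the warm-up count, iteration count, batch size, and 512 × 512 input are assumptions rather than our exact measurement protocol.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, input_size=(1, 3, 512, 512), warmup=10, iters=100):
    """Return (mean latency in ms, throughput in images/s) for one model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):            # warm-up stabilizes GPU clocks/caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()       # flush queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / iters * 1000.0
    return latency_ms, input_size[0] * 1000.0 / latency_ms
```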
2.3. Dual-Attention Module (DA)
The Dual-Attention (DA) module [29] integrates both spatial (positional) and channel attention mechanisms to enhance feature representation. By simultaneously emphasizing critical spatial locations and channel-wise features, the DA module (as shown in Figure 5) improves the network’s ability to localize and distinguish small and complex target regions. This module consists of the Position Attention Module (PAM, see Figure 6) and the Channel Attention Module (CAM, see Figure 7).
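A condensed PyTorch sketch of the two branches is given below. It follows the published DANet formulation, but the 1/8 channel reduction, zero-initialized fusion scalars, and simplified CAM softmax are conventions of that reference design, not values tuned in this work.

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position attention: every location aggregates features from all
    other locations, weighted by pairwise similarity."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)   # 1/8 channel reduction, as in DANet
        self.k = nn.Conv2d(c, c // 8, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion scalar

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # B x HW x C/8
        k = self.k(x).flatten(2)                        # B x C/8 x HW
        attn = torch.softmax(q @ k, dim=-1)             # B x HW x HW
        v = self.v(x).flatten(2)                        # B x C x HW
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

class CAM(nn.Module):
    """Channel attention: channel maps are reweighted by inter-channel
    similarity, sharpening class-specific responses."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                      # B x C x HW
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)   # B x C x C
        out = (attn @ f).view(b, c, h, w)
        return self.gamma * out + x

class DA(nn.Module):
    """Dual attention: element-wise sum of the two branches."""
    def __init__(self, c):
        super().__init__()
        self.pam, self.cam = PAM(c), CAM()

    def forward(self, x):
        return self.pam(x) + self.cam(x)
```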
UNet: In this study, we integrate Dual Attention (DA) modules into a UNet-based image segmentation framework to enhance multi-scale feature representation capabilities. We adopt ResNet50 as the encoder backbone network and embed DANet modules at multiple feature levels (i.e., after the second to fifth ResNet modules). The feature maps output from these levels (denoted as feat2 to feat5) are first optimized by the corresponding DANet modules (DANet2 to DANet5) before being passed to the upsampling path. This design enables the model to capture long-range contextual dependencies and focus on information-rich regions in both spatial and channel dimensions.
DeepLab: We add Dual Attention (DA) modules to improve feature extraction at different levels. The ResNet50 backbone provides two feature maps: one with detailed spatial information and another with strong semantic information. We apply DA modules to both maps—DANet3 for the detailed map and DANet5 for the semantic map. These modules use spatial and channel attention to highlight useful parts and reduce noise.
HRNet: We add Dual Attention (DA) modules to the HRNet’s high-resolution multi-branch structure to improve semantic features at specific resolutions. We use HRNetV2, which keeps multiple feature streams at different resolutions. From the final stage, four feature maps are created. We enhance the 3rd and 4th branches (medium and low resolution) using DA modules (DA2 and DA3).
This module is introduced specifically to address the challenge of inadequate attention on small target areas, improving segmentation accuracy in such cases.
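As a sketch of the UNet integration described above (reusing the DA class from the previous listing), the torchvision ResNet50 stage outputs, with their standard 256/512/1024/2048 channel widths, stand in for feat2–feat5.

```python
import torch.nn as nn
from torchvision.models import resnet50

class DAEncoder(nn.Module):
    """ResNet50 encoder whose stage outputs (feat2..feat5) are refined by
    DA modules before being passed to the UNet upsampling path."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        # Channel widths of the standard ResNet50 stages 2-5.
        self.da = nn.ModuleList(DA(c) for c in (256, 512, 1024, 2048))

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage, da in zip(self.stages, self.da):
            x = stage(x)
            feats.append(da(x))   # attention-refined feat2..feat5
        return feats              # consumed by the decoder skip connections
```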
2.4. Cross Feature-Map Attention (AFMA)
The Cross Feature-Map Attention (AFMA) module [30] establishes cross-scale context modeling by aligning and modulating multi-scale feature attention maps. During the decoding phase, AFMA adaptively recalibrates features from different scales to better capture spatial dependencies and enhance semantic consistency. Figure 8 illustrates the modulation process of the AFMA module within the decoder.
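For intuition only, the sketch below shows one plausible wiring of patch-level cross feature-map attention; it is a simplification rather than the reference implementation of [30], and the patch size, embedding width, and additive recalibration are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureMapAttention(nn.Module):
    """Illustrative cross-scale attention: patches of a shallow,
    high-resolution map attend to positions of a deeper map, and the
    aggregated result additively recalibrates the deeper feature."""
    def __init__(self, c_shallow, c_deep, patch=16, dim=64):
        super().__init__()
        self.patch = patch
        self.q = nn.Conv2d(c_shallow, dim, kernel_size=patch, stride=patch)
        self.k = nn.Conv2d(c_deep, dim, kernel_size=1)
        self.scale = dim ** -0.5

    def forward(self, shallow, deep):
        q = self.q(shallow).flatten(2).transpose(1, 2)      # B x Nq x dim
        k = self.k(deep).flatten(2)                         # B x dim x Nk
        attn = torch.softmax((q @ k) * self.scale, dim=-1)  # B x Nq x Nk
        v = deep.flatten(2).transpose(1, 2)                 # B x Nk x C_deep
        out = (attn @ v).transpose(1, 2)                    # B x C_deep x Nq
        hq = shallow.shape[2] // self.patch
        wq = shallow.shape[3] // self.patch
        out = out.view(out.size(0), -1, hq, wq)
        # Upsample the patch-level response and add it as a residual.
        return deep + F.interpolate(out, size=deep.shape[2:],
                                    mode="bilinear", align_corners=False)
```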
2.5. Small Object Segmentation Strategies
Small object segmentation poses unique challenges due to limited pixel representation and weak contrast against complex backgrounds. Definitions vary: the relative-scale definition [31] describes small objects as having a median area ratio between 0.08% and 0.58% of the image, while the absolute-scale definition treats them as smaller than 32 × 32 pixels (as per the MS COCO standard).
To tackle this, multi-branch architectures have shown promising results. For example, Fact-Seg [32] introduces foreground activation and semantic refinement branches, GSCNN [33] utilizes a shape stream with gated convolution, and MLDA-Net [34] incorporates self-supervised depth learning and attention modules.
Inspired by these methods, we design a multi-branch attention module to enhance the segmentation accuracy of small lesions in tobacco leaf images.
2.6. Attention Mechanism Fusion in Semantic Segmentation
Attention mechanisms play a crucial role in emphasizing informative features and suppressing irrelevant ones. The channel attention mechanism, exemplified by SE-Net [35], enhances feature discrimination across channels. However, its application to high-resolution inputs can lead to increased computational cost and overfitting.
The position attention mechanism (PAM), as used in DANet [36], models long-range spatial dependencies and enhances global context aggregation. Despite its effectiveness, PAM incurs quadratic complexity with respect to image size, which may limit its real-time applicability.
To balance detail retention and efficiency, this study integrates cross feature-map attention and dual-attention mechanisms, drawing inspiration from DANet [37] and HMANet [38]. This combination enhances both local sensitivity and global awareness, enabling better segmentation of small and irregular lesions with minimal additional overhead.
2.7. Network Architecture (UNet+AFMA+DA)
The overall network architecture integrates the baseline UNet with the AFMA and DA modules to leverage their complementary strengths. The DA module is embedded within the encoder to boost attention on spatial and channel dimensions, while the AFMA module refines multi-scale feature fusion in the decoder.
Figure 9 presents the full network diagram, highlighting the interaction and placement of the modules.
The combined model contains approximately X million parameters, balancing model complexity and performance gains.
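Putting the pieces together, the following sketch assembles the DAEncoder and CrossFeatureMapAttention listings from earlier subsections into a UNet-style network; the decoder channel widths, the single AFMA placement before the classification head, and the four-class output are illustrative assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsample the decoder state, concatenate a DA-refined skip, fuse."""
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_in + c_skip, c_out, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                          align_corners=False)
        return self.fuse(torch.cat([x, skip], dim=1))

class UNetAFMADA(nn.Module):
    def __init__(self, n_classes=4):    # e.g., 3 diseases + background
        super().__init__()
        self.encoder = DAEncoder()                       # DA in the encoder
        self.up4 = UpBlock(2048, 1024, 512)
        self.up3 = UpBlock(512, 512, 256)
        self.up2 = UpBlock(256, 256, 128)
        self.afma = CrossFeatureMapAttention(256, 128)   # AFMA in the decoder
        self.head = nn.Conv2d(128, n_classes, 1)

    def forward(self, x):
        f2, f3, f4, f5 = self.encoder(x)
        d = self.up4(f5, f4)
        d = self.up3(d, f3)
        d = self.up2(d, f2)
        d = self.afma(f2, d)    # cross-scale recalibration before the head
        return F.interpolate(self.head(d), size=x.shape[2:],
                             mode="bilinear", align_corners=False)
```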
2.8. Loss Function
The overall training loss consists of three terms: (1) Cross Entropy Loss (ce), which measures the difference between the model’s predictions and the true labels to ensure pixel-level classification accuracy; (2) AFMA loss, which minimizes the discrepancy between the AFMA features learned by the model and the ideal AFMA features, thereby improving the accuracy and consistency of multi-scale feature modeling; (3) Focal Loss, which addresses class imbalance by focusing training on hard, misclassified examples.
By incorporating AFMA loss, the network is encouraged to achieve better AFMA attention scores during training, aiding in the segmentation of small target objects. The total loss is given by Equation (1):

$L_{\mathrm{total}} = L_{\mathrm{ce}} + L_{\mathrm{AFMA}} + L_{\mathrm{focal}}$ (1)
The cross-entropy loss function introduces category weights $W$ to address imbalance in the number of samples between categories. The cross-entropy loss is computed as follows (Equation (2)):

$L_{\mathrm{ce}} = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i} \log \frac{\exp(x_{i,y_i})}{\sum_{j=1}^{C} \exp(x_{i,j})}$ (2)

where $N$ is the number of samples, $w_{y_i}$ is the weight for the true class $y_i$ to balance class imbalance, $C$ is the number of classes, $x_{i,y_i}$ is the predicted score for the true class of sample $i$, and $x_{i,j}$ is the predicted score for class $j$ of sample $i$.
AFMA loss is calculated using the mean square error (MSE) between the predicted attention map and the ground truth attention map. Its formula is given by Equation (3):

$L_{\mathrm{AFMA}} = \frac{1}{K H W}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(a_{k}^{(i)}(h,w) - \hat{a}_{k}^{(i)}(h,w)\right)^{2}$ (3)

where $K$ is the number of categories, $H$ and $W$ are the height and width of the feature map, $a_{k}^{(i)}(h,w)$ is the predicted attention value at location $(h,w)$ in the $k$-th channel of the $i$-th layer feature map, and $\hat{a}_{k}^{(i)}(h,w)$ is the corresponding ground truth attention value.
Focal Loss is designed to mitigate class imbalance by reducing the loss contribution from easy examples, thus focusing training on harder, misclassified examples. The formula is shown in Equation (4):

$L_{\mathrm{focal}} = -\alpha_{t}\,(1 - p_{t})^{\gamma}\log(p_{t})$ (4)

where $y_i$ denotes the ground truth label for class $i$, and $p_i$ is the predicted probability for class $i$. Define $p_t = p_i$ when $y_i = 1$, and $p_t = 1 - p_i$ when $y_i = 0$. The weighting factor $\alpha_t$ balances class importance: $\alpha_t = \alpha$ if $y_i = 1$, and $\alpha_t = 1 - \alpha$ if $y_i = 0$ (typically $\alpha = 0.25$). The focusing parameter $\gamma$ controls the rate at which easy examples are down-weighted, commonly set to $\gamma = 2$.
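Under the definitions above, the three-term objective can be sketched in PyTorch as follows; the class weights, $\alpha = 0.25$, and $\gamma = 2$ follow the text, while the equal weighting of the three terms and the single shared $\alpha$ are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, target, afma_pred, afma_gt, class_weights,
               alpha=0.25, gamma=2.0):
    """L_total = L_ce + L_AFMA + L_focal, per Equations (1)-(4).

    logits: B x C x H x W raw scores; target: B x H x W class indices.
    """
    # (2) Class-weighted cross-entropy over pixels.
    ce = F.cross_entropy(logits, target, weight=class_weights)
    # (3) MSE between predicted and ideal AFMA attention maps.
    afma = F.mse_loss(afma_pred, afma_gt)
    # (4) Focal term: down-weight easy pixels via (1 - p_t)^gamma.
    # A single alpha is used here as a simplification of the per-class alpha_t.
    log_pt = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    focal = (-alpha * (1.0 - pt) ** gamma * log_pt).mean()
    return ce + afma + focal
```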
4. Discussion
To enrich data diversity and simulate the co-occurrence of multiple disease types, we employed an image splicing strategy to construct composite samples. While effective in exposing the model to complex patterns, this method may introduce artifacts such as lighting inconsistencies, sharp edges, and unnatural transitions, potentially affecting model generalization. To mitigate these issues, we normalized brightness and contrast before stitching and applied augmentations (e.g., rotation, flipping, Gaussian blur) to reduce overfitting. Additionally, the attention mechanisms (AFMA and DA) help the model focus on relevant lesion features while suppressing noise. Experimental results show stable performance on both stitched and original images, indicating minimal impact from potential artifacts.
Experimental results indicate that using AFMA alone led to a decline in wildfire disease identification performance (IoU decreased from 43.77% to 37.88%). This is attributed to its difficulty in handling lesions with variable morphology or ambiguous developmental stages, often resulting in the neglect of edges or misdirected attention towards healthy regions. Conversely, using DA alone also significantly reduced performance (IoU dropped by 30.11%), as its inherent focus on small lesions caused it to overlook larger ones, potentially suppressing the response to large targets and leading to misclassification, blurred boundaries, and intensified competition among smaller categories. However, the combination of AFMA and DA achieved complementary advantages: AFMA’s mapping of lesion-size relationships effectively alleviated DA’s suppression of large targets, significantly improving overall segmentation performance. The UNet+AFMA+DA model achieved 46.58% IoU on wildfire lesions and 54.79% overall mIoU, outperforming models using either module individually.
Regarding the issue of dataset size, data scarcity has long been a major constraint on the widespread application of deep learning and related methods in tobacco disease identification. In addition to further data collection and accumulation, techniques such as image splicing, image fusion, and image generation can help mitigate data-related limitations, particularly in cases involving multiple coexisting diseases. In summary, while AFMA and DA individually have limitations, their combination delivers the best performance for recognizing various tobacco leaf disease classes.