Article

SymbioMamba: An Efficient Dual-Stream State-Space Framework for Real-Time Maize Disease and Yield Analysis on UAV Platforms

1 China Agricultural University, Beijing 100083, China
2 National School of Development, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Agriculture 2026, 16(7), 801; https://doi.org/10.3390/agriculture16070801
Submission received: 9 February 2026 / Revised: 24 March 2026 / Accepted: 2 April 2026 / Published: 3 April 2026
(This article belongs to the Special Issue Smart Sensor-Based Systems for Crop Monitoring)

Abstract

In UAV (unmanned aerial vehicle)-enabled precision agriculture, achieving high-accuracy disease diagnosis and yield estimation simultaneously on resource-constrained edge devices remains a significant challenge. Existing solutions are commonly hindered by conflicts in visual feature scales, the absence of explicit agronomic causal logic, and the trade-off between lightweight design and global modeling capability. To address these challenges, a heterogeneous dual-stream state-space framework termed SymbioMamba is proposed. The proposed framework incorporates three key innovations: first, a heterogeneous dual-stream encoder is constructed, in which a micro-texture stream captures high-frequency disease details while a macro-context-scan stream models field-scale biomass continuity; second, a pathology–biomass collaborative interaction (PBCI) module is designed to explicitly inject the biological prior—disease stress leading to yield reduction—into the feature space; third, a topology-aligning cross-architecture distillation (TACAD) paradigm is introduced to transfer global knowledge from a heavyweight teacher to a lightweight student. Experimental results from a maize UAV dataset comprising 12,074 annotated image patches demonstrate that SymbioMamba achieves 89.4% mAP@0.5 and an R² of 0.915. Compared to the industry-standard YOLOv11, the framework improves mAP@0.5:0.95 by 2.4% while reducing the parameter count to 6.2 M—a 50% decrease relative to monolithic state-space baselines. Furthermore, yield prediction error is significantly reduced to an RMSE of 485.6 kg/ha. With a compact model size of 6.2 M parameters and 2.4 G FLOPs, SymbioMamba attains an inference speed of 38.2 FPS on the NVIDIA Jetson AGX Orin platform, providing a high-performance, real-time solution for intelligent agricultural phenotypic analysis.

1. Introduction

Under the dual pressures of intensifying global climate change and continuous population growth, achieving stable crop production and efficient management under resource constraints has emerged as a critical challenge for agricultural systems worldwide [1]. Precision agriculture has been recognized as a key technological pathway to address this challenge, with its core objective centered on the timely identification of stress factors affecting crop growth through high spatiotemporal resolution phenotypic sensing, as well as the reliable assessment of final yield outcomes. Among various remote sensing platforms, unmanned aerial vehicles (UAVs) have increasingly become a pivotal bridge between field-scale observation and intelligent agricultural decision-making, owing to their flexible deployment, centimeter-level spatial resolution, and multi-sensor integration capability [2,3]. In contrast to satellite remote sensing, which is constrained by revisit cycles and cloud occlusion, and ground-based surveys, which suffer from low efficiency and potential crop disturbance, UAVs enable rapid acquisition of canopy imagery at critical growth stages, thereby providing high-value data support for early disease diagnosis and yield-potential evaluation [4].
With the deep integration of computer vision techniques into agricultural remote sensing, UAV-based crop analysis methods have undergone a paradigm shift from handcrafted features to deep-learning-driven approaches. Early studies predominantly relied on manually designed texture or color features combined with traditional classifiers for disease identification; however, their generalization capability was limited in complex field environments [5]. Subsequently, convolutional neural networks (CNNs), benefiting from strong modeling capacity for local textures and morphological structures, substantially improved disease detection accuracy and became the dominant solution for related tasks [6]. Nevertheless, the inherently local receptive fields of CNNs hinder their ability to effectively capture spatial dependencies at the field scale. To address this limitation, Vision Transformers and their variants have been introduced into agricultural remote sensing scenarios. By leveraging self-attention mechanisms to model global dependencies, these architectures have demonstrated advantages in biomass distribution modeling and yield-estimation tasks [7].
In practical UAV operational scenarios, models are required not only to achieve high accuracy but also to satisfy stringent constraints on computational resources, energy consumption, and inference latency for edge deployment. To this end, extensive efforts have been conducted at multiple levels. At the task level, multi-task learning has been employed to jointly perform disease detection and yield prediction within a single forward inference pass [8]. At the network architecture level, lightweight designs such as MobileNet and YOLO-Nano have significantly reduced computational complexity through depthwise separable convolutions [9]. At the model compression level, knowledge distillation has emerged as a mainstream strategy, enhancing the performance of lightweight models by transferring implicit knowledge from teacher networks [10]. Meanwhile, the recent emergence of visual state-space models (VSSMs), particularly architectures represented by Mamba, has introduced new possibilities for efficient processing of high-resolution remote sensing imagery by enabling long-range dependency modeling with linear computational complexity [11].
Despite progress along these individual dimensions, there are still three critical challenges to the unified integration and deployment of such approaches on UAV edge platforms for collaborative disease perception and yield estimation, as shown in Figure 1. First, at the level of structural challenges in multi-task visual modeling, the early diagnosis of Northern Corn Leaf Blight (NCLB) relies heavily on high-frequency, fine-grained local texture features to identify necrotic lesions, whereas maize yield estimation requires modeling low-frequency, global contextual information associated with field-scale biomass distribution. This discrepancy in scale and frequency renders a single backbone network insufficient for simultaneously accommodating both tasks: lightweight CNNs lack global perception capability, while Transformer- or SSM-based architectures often fail to meet the computational constraints of edge devices [12]. Second, at the level of multi-objective agricultural decision complexity, existing multi-task learning methods predominantly adopt simple feature concatenation or shared-backbone strategies, while neglecting the intrinsic agronomic causal relationship whereby NCLB necrotic stress impairs maize photosynthesis and consequently induces productivity loss during the critical grain-filling stage. Such oversimplification often leads to feature competition and negative transfer, thereby undermining the biological consistency of prediction results [13]. Third, in terms of the trade-off between high-resolution sensing and edge computing, existing lightweight designs frequently rely on aggressive downsampling to secure inference speed, resulting in substantial loss of subtle lesion information.
Furthermore, while Transformers offer robust global modeling, the quadratic computational complexity inherent in their self-attention mechanism poses a prohibitive bottleneck when processing high-resolution UAV imagery, driving the research focus toward more efficient architectures like Mamba. However, conventional knowledge distillation methods are primarily tailored for homogeneous networks and struggle to effectively align feature distributions or preserve long-range contextual information—such as the spatial continuity of maize biomass—when faced with pronounced structural discrepancies between these globally expressive teacher models and student models constrained to local receptive fields [12].
The overarching goal of this study is to move beyond the conventional paradigm of independent multi-task learning by developing SymbioMamba, a first-of-its-kind causality-aware symbiotic framework for real-time UAV-based maize sensing. Unlike previous architectures that rely on homogeneous backbones—which often struggle to reconcile the fine-grained local textures required for disease diagnosis with the long-range global context needed for yield estimation—SymbioMamba introduces a fundamental architectural decoupling. The nomenclature, a portmanteau of “Symbiosis” and “Mamba,” signifies our departure from loose multi-task branching toward a mutually beneficial synergy between disparate modeling paradigms and interdependent biological processes. Specifically, we propose a heterogeneous dual-stream encoder that captures multi-scale features through distinct mathematical lenses, a pathology–biomass collaborative interaction (PBCI) module that transforms statistical correlations into agronomic causal logic, and a topology-aligned distillation paradigm to compress global state-space knowledge into edge-ready structures. By integrating CNN-based local perception and Mamba-based global scanning, this study provides a scalable technical paradigm that shifts agricultural UAVs from passive data collectors to active, biologically consistent decision-making systems.
The main contributions of this work are summarized as follows:
  • A heterogeneous dual-stream encoding framework is proposed, which, unlike standard homogeneous backbones, utilizes architectural asymmetry to fundamentally resolve the scale conflict between microscopic lesion recognition and macroscopic biomass continuity features.
  • A pathology–biomass collaborative interaction (PBCI) mechanism is designed to replace traditional black-box multi-task heads. By explicitly embedding the agronomic prior that “disease stress suppresses yield” into the gating logic, the framework ensures that predictions are not merely numerically accurate but biologically consistent.
  • A topology-aligned cross-architecture knowledge distillation paradigm is introduced. This method addresses the foundational challenge of transferring knowledge across mathematically distinct domains—specifically from global state-space Mamba representations to lightweight local convolutions—by aligning the underlying feature manifold topology.
  • Comprehensive experimental validation on field-scale UAV datasets and the NVIDIA Jetson AGX Orin platform demonstrates that SymbioMamba outperforms conventional SOTA architectures in both predictive precision and deployment efficiency, establishing a new benchmark for causality-driven precision agriculture.

2. Related Work

2.1. UAV-Based Deep Learning Methods for Crop Phenotyping

With the widespread adoption of UAVs in precision agriculture, crop phenotyping analysis methods specifically designed for aerial viewpoints and field-scale scenarios have gradually become a major research focus [14]. Unlike ground-based close-range imagery, UAV-acquired crop images exhibit pronounced scale heterogeneity. On the one hand, variations in flight altitude and sensor resolution constraints render early-stage disease lesions extremely small at the pixel level; on the other hand, yield assessment and growth monitoring rely on macroscopic biomass distribution patterns spanning crop rows and entire fields [15,16]. This coexistence of “ultra-fine local details and long-range global dependencies” poses substantial challenges for unified modeling [17]. To address local-scale issues, existing studies have enhanced small-object detection capability through multi-scale feature fusion or state-space modeling [18]. Meanwhile, to meet macroscopic modeling requirements, multimodal and contextual fusion approaches have gained increasing attention [19]. Nevertheless, the majority of existing frameworks are constrained to isolated-task objectives, effectively decoupling disease perception from growth analysis. This disjointed modeling paradigm fails to account for the intrinsic regulatory influence of disease spatial patterns on biomass accumulation, which significantly limits the potential for models to inform and calibrate yield predictions through pathological context.

2.2. Evolution of Visual Backbone Networks

The design of visual backbone networks largely determines the upper bound of the representational capacity attainable in UAV-based agricultural image analysis. CNNs, owing to their strong inductive bias toward local textures and morphological structures, have long dominated crop disease recognition tasks [20,21]. Nevertheless, the limited receptive fields of CNNs inherently restrict their ability to capture field-scale spatial dependencies [22]. In recent years, state-space modules (SSMs), particularly those using the Mamba architecture, have emerged as a promising alternative for global visual modeling [23,24]. Through selective scanning mechanisms, Mamba achieves long-range dependency modeling while maintaining linear computational complexity [25]. This property has rapidly attracted attention in the remote sensing community [26]. However, existing Mamba-based remote sensing models are primarily optimized for highly structured, large-scale satellite imagery and lack the capacity to model the unstructured, high-frequency texture details required for early-stage plant disease detection [27].

2.3. Efficient Computation and Knowledge Transfer for UAV Edge Deployment

Deploying deep learning models on battery-powered UAV edge devices, such as the NVIDIA Jetson AGX Orin, imposes stringent requirements on computational efficiency and energy consumption [28]. Existing studies have predominantly explored two complementary directions: efficient architecture design [29] and model compression [30]. At the architectural level, efforts have focused on constructing compact network structures with low FLOPs [31]. An alternative pathway is knowledge distillation (KD), which enhances lightweight student models by transferring implicit knowledge from heavyweight teacher models [32]. However, existing architecture design often sacrifices fine-grained representations of small disease lesions due to aggressive downsampling and lightweight operations [33], while model compression and knowledge distillation methods struggle to effectively transfer global and sequential modeling capabilities across heterogeneous architectures without incurring additional inference costs [34].

3. Materials and Methods

3.1. Dataset Acquisition

Maize (Zea mays L.) was selected as the target crop to evaluate the robustness of the proposed SymbioMamba framework under heterogeneous agronomic conditions. Field experiments were conducted in two representative agricultural regions in Inner Mongolia, China, from January 2024 to November 2025. The first experimental site was located in the Hetao Irrigation District of Bayannur City (40°41′ N, 107°08′ E), characterized by flat terrain and stable irrigation conditions that are favorable for high-yield maize production. The second site was situated in Jungar Banner (39°51′ N, 111°13′ E), a rainfed hilly region frequently exposed to water stress, resulting in pronounced variability in crop growth and disease pressure. The data acquisition process is shown in Figure 2.
UAV-based RGB imagery was acquired using a DJI Matrice 300 RTK platform, as depicted in Figure 2, with its comprehensive technical specifications detailed in Table 1. The aircraft was equipped with a Zenmuse H20T gimbal camera, capturing images at a resolution of 5184 × 3888 pixels. All flight missions were conducted in August during the maize grain-filling stage, a critical phenological period in which leaf disease severity is closely associated with final yield formation. To ensure consistent spatial resolution across different locations and flight missions, a terrain-following flight strategy was implemented. The flight altitude was dynamically adjusted between 30 m and 50 m relative to the ground elevation to maintain a fixed ground sampling distance (GSD) of 1.5 cm/pixel. Regarding aerial coverage, the flight missions were executed with a frontal overlap of 80% and a side overlap of 70% at a constant cruising speed of 5 m/s, ensuring sufficient feature redundancy for high-quality reconstruction. Following data acquisition, the raw images were processed through a standardized Structure-from-Motion (SfM) workflow using DJI Terra software. This photogrammetric pipeline involved image alignment via feature-point matching, optimization of camera intrinsics using RTK-based GNSS metadata, dense point cloud generation, and final orthomosaic stitching. This rigorous standardization was essential for the reliable capture of fine lesion textures associated with early disease development across heterogeneous agronomic environments.
Northern corn leaf blight (NCLB) was identified as the dominant foliar disease observed at both sites and was therefore selected as the target disease for fine-grained severity analysis. Five disease severity grades were defined by experienced plant pathologists based on the percentage of infected leaf area—quantified as the ratio of the total necrotic lesion area to the total leaf area within the frame—as shown in Figure 3. To ensure objective and consistent classification, experts performed visual estimations calibrated against Standard Area Diagrams (SADs) specifically developed for NCLB [35]. Using the LabelMe annotation tool, bounding boxes were manually delineated on image patches and assigned corresponding severity labels based on these calibrated expert assessments, providing high-quality supervision for the disease perception pipeline [36]. Concurrently with UAV image acquisition, destructive sampling was performed within 200 designated subplots (2 m × 2 m). Maize ears were harvested, threshed, and oven-dried to constant weight, after which subplot yields were normalized to kg/ha and used as regression targets. Following a rigorous quality control (QC) procedure to eliminate outliers caused by UAV motion blur or sampling inconsistencies, 150 high-quality subplot samples were finalized for the yield prediction experiments and the subsequent comparative analysis.
The final dataset comprised 12,074 annotated image patches covering all disease severity levels. To bolster the model’s generalization capability across heterogeneous environments, the field-collected UAV imagery was supplemented with a curated selection of publicly available maize leaf samples sourced from open-access agricultural repositories. This integration was specifically intended to augment the distributional diversity of pathological morphologies and ambient illumination variances, ensuring that the SymbioMamba framework remains robust against the domain shifts inherent in unstructured field conditions. Statistical analysis revealed a clear negative correlation between disease severity and mean yield, with higher NCLB pressure consistently associated with reduced biomass accumulation. The observed pathology–yield coupling provides strong empirical evidence for modeling the collaborative interaction between disease perception and biomass estimation within the proposed SymbioMamba framework, as summarized in Table 2.

3.2. Data Preprocessing and Augmentation

The raw data acquired by the UAV consisted of high-resolution orthomosaic imagery, mathematically denoted as $X_{raw} \in \mathbb{R}^{H_{raw} \times W_{raw} \times C}$, where $H_{raw}$ and $W_{raw}$ represent the height and width of the original image (typically exceeding 5000 pixels), and $C = 3$ corresponds to the RGB channels. Since directly processing such high-dimensional tensors on edge devices is computationally infeasible, a sequential preprocessing pipeline comprising tiling, stochastic augmentation, and normalization was implemented.
First, to bridge the substantial scale gap between gigapixel-level orthomosaics and the receptive field of the model while preserving the topological integrity of microscopic disease lesions, an overlapping sliding-window strategy was adopted. Let $L$ denote the target patch size (fixed at 640), and let $\eta \in [0, 1)$ denote the overlap ratio (set to 0.2). The scanning stride $S$ was defined as $S = L \cdot (1 - \eta)$. The dataset $\mathcal{X} = \{x^{(i,j)}\}$ was constructed by extracting image patches, where the top-left coordinate $(u_i, v_j)$ of the $(i,j)$-th patch was computed as follows [37]:
$$u_i = i \cdot S, \quad v_j = j \cdot S, \quad \text{s.t.} \quad u_i + L \leq W_{raw}, \;\; v_j + L \leq H_{raw}.$$
Each extracted image patch $x^{(i,j)} \in \mathbb{R}^{L \times L \times 3}$ ensured that disease lesions located near grid boundaries were fully captured in at least one view, thereby preventing information loss caused by truncation.
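Under the stated settings ($L = 640$, $\eta = 0.2$, hence $S = 512$), the tiling grid can be sketched as follows; `tile_coords` is a hypothetical helper written for illustration, not part of the authors' released code, and it drops border strips in which a full patch no longer fits.

```python
def tile_coords(w_raw, h_raw, patch=640, overlap=0.2):
    """Top-left corners (u, v) of overlapping sliding-window patches.

    Stride S = L * (1 - eta); a patch is kept only if it lies fully
    inside the orthomosaic (u + L <= W_raw, v + L <= H_raw).
    """
    stride = int(patch * (1 - overlap))      # S = 512 for L = 640, eta = 0.2
    us = range(0, w_raw - patch + 1, stride)
    vs = range(0, h_raw - patch + 1, stride)
    return [(u, v) for v in vs for u in us]

# A 5184 x 3888 Zenmuse H20T frame yields a 9 x 7 grid of 640 x 640 patches.
coords = tile_coords(5184, 3888)
```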
Subsequently, to improve model generalization under unstructured environmental variations such as changing solar angles and cloud-induced shadows, a composite augmentation function was constructed. In particular, a photometric distortion operator $\mathcal{T}_{photo}$ was introduced. For an input RGB image patch $x_{rgb}$, a nonlinear mapping $\psi$ was applied to transform it into the HSV color space, yielding $x_{hsv} = [h, s, v]^{T}$. To encourage the model to focus on structured lesion patterns rather than superficial color cues, random noise $\delta = [\delta_h, \delta_s, \delta_v]^{T}$ sampled from uniform distributions $\delta_k \sim U(-\beta_k, \beta_k)$ was injected. The perturbed vector $x'_{hsv}$ was computed as [38]
$$x'_{hsv} = \mathrm{clip}(x_{hsv} + \delta, 0, 1),$$
where $\beta_h = 0.015$, $\beta_s = 0.7$, and $\beta_v = 0.4$ denote the specific perturbation bounds for hue, saturation, and value, respectively, which were determined empirically to optimize model robustness against illumination variance.
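A minimal NumPy sketch of this perturbation is given below, assuming the patch has already been mapped to HSV and scaled to $[0, 1]$; `photometric_jitter` is an illustrative name, not an identifier from the paper.

```python
import numpy as np

def photometric_jitter(x_hsv, rng, bounds=(0.015, 0.7, 0.4)):
    """Inject per-channel uniform noise delta_k ~ U(-beta_k, beta_k)
    into an HSV patch, then clip back to the valid [0, 1] range."""
    beta = np.asarray(bounds)
    delta = rng.uniform(-beta, beta)         # one offset per H, S, V channel
    return np.clip(x_hsv + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
patch = rng.random((640, 640, 3))            # stand-in HSV patch in [0, 1)
aug = photometric_jitter(patch, rng)
```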
In addition, to specifically address the small-object disappearance issue and to enhance the global contextual modeling capability of the Mamba stream, Mosaic composition $\Phi$ was employed as a geometric regularization strategy. Let $\{x_1, x_2, x_3, x_4\}$ denote four randomly sampled training image patches. A composite canvas $X_{mosaic} \in \mathbb{R}^{2L \times 2L \times 3}$ was defined, along with a random center point $(x_c, y_c)$ sampled from a uniform distribution $U(\frac{L}{2}, \frac{3L}{2})$. The composition operation stitched cropped regions of the four images relative to this center point [39]:
$$X_{mosaic} = \Phi(x_1, x_2, x_3, x_4; x_c, y_c).$$
This operation implicitly introduced complex background transitions and multi-scale targets within a single input, serving as a strong regularization term for the state-space model.
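One plausible realization of $\Phi$ is sketched below, assuming crop sizes are clamped to $L$ so that no padding is needed (the paper does not specify the exact cropping rule); the toy call with $L = 4$ is purely for illustration.

```python
import numpy as np

def mosaic(p1, p2, p3, p4, center, L=640):
    """Stitch crops of four L x L patches onto a 2L x 2L canvas around a
    random center (xc, yc); each quadrant takes the corner of one patch."""
    xc, yc = center
    canvas = np.zeros((2 * L, 2 * L, 3), dtype=p1.dtype)
    h1, w1 = min(yc, L), min(xc, L)                  # top-left quadrant
    canvas[yc - h1:yc, xc - w1:xc] = p1[L - h1:, L - w1:]
    h2, w2 = min(yc, L), min(2 * L - xc, L)          # top-right quadrant
    canvas[yc - h2:yc, xc:xc + w2] = p2[L - h2:, :w2]
    h3, w3 = min(2 * L - yc, L), min(xc, L)          # bottom-left quadrant
    canvas[yc:yc + h3, xc - w3:xc] = p3[:h3, L - w3:]
    h4, w4 = min(2 * L - yc, L), min(2 * L - xc, L)  # bottom-right quadrant
    canvas[yc:yc + h4, xc:xc + w4] = p4[:h4, :w4]
    return canvas

# Toy example: four constant-valued 4 x 4 tiles, center at the midpoint.
tiles = [np.full((4, 4, 3), v, dtype=float) for v in (1.0, 2.0, 3.0, 4.0)]
demo = mosaic(*tiles, center=(4, 4), L=4)
```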
Finally, to accelerate gradient convergence during training, the augmented tensor $x_{aug}$ was normalized using the channel-wise mean $\mu_{img}$ and standard deviation $\sigma_{img}$ of the ImageNet dataset. The normalized input $\hat{x}$ was derived as [40]
$$\hat{x}_c = \frac{x_{aug,c} - \mu_{img,c}}{\sigma_{img,c}}, \quad c \in \{R, G, B\}.$$
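This step reduces to a per-channel affine map using the widely adopted ImageNet statistics (the paper cites the dataset but does not list the constants; the values below are the standard published ones):

```python
import numpy as np

# Channel-wise ImageNet statistics (RGB order), the de-facto defaults.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(x_aug):
    """x_hat_c = (x_c - mu_c) / sigma_c, broadcast over H x W x 3."""
    return (x_aug - IMAGENET_MEAN) / IMAGENET_STD

# Sanity check: an image equal to the channel means maps to all zeros.
flat = np.broadcast_to(IMAGENET_MEAN, (640, 640, 3))
zeroed = normalize(flat)
```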

3.3. The SymbioMamba Framework

3.3.1. Overall Architecture

The SymbioMamba framework is engineered to overcome the primary hurdles of UAV-based precision agriculture: structural challenges in multi-task visual modeling arising from local–global scale conflicts, multi-objective agricultural decision complexity regarding disease–yield causal logic, and the efficiency requirements of high-resolution sensing and edge computing. By unifying a heterogeneous dual-stream topology with a cross-architecture distillation paradigm, the system achieves an optimal balance between predictive accuracy and real-time inference efficiency.
The architecture initiates with a shared convolutional stem that bifurcates into a micro-texture (CNN) stream for capturing fine-grained lesion details and a macro-context-scan (Mamba) stream for modeling global biomass continuity. These pathways are integrated via the Pathology–Biomass Collaborative Interaction (PBCI) module, which leverages a health-aware gating mechanism to calibrate yield predictions based on pathological stress. Finally, the Topology-Aligning Cross-Architecture Distillation (TACAD) strategy transfers global contextual knowledge from a heavyweight teacher to the student model, ensuring high performance on resource-constrained edge devices without increasing inference-time computational costs.

3.3.2. Stem Layer: Visual Embedding

Before entering the decoupled dual-stream backbone networks, the raw high-resolution UAV imagery must be mapped from pixel space into a latent feature space. To mitigate the aliasing effects and information loss commonly induced by non-overlapping patch projection in standard Transformers, a lightweight convolutional stem module was adopted. This module functions as a soft visual tokenizer, preserving local spatial continuity while performing initial downsampling. Formally, given an input image $X_{in} \in \mathbb{R}^{H \times W \times 3}$, where $H$, $W$, and 3 denote the image height, width, and number of RGB channels, respectively, the stem module projects the image into a lower-resolution feature map through a sequence of overlapping convolution and normalization operations. This process is formulated as follows:
$$F_0 = \mathrm{LayerNorm}(\mathcal{F}_{stem}(X_{in})).$$
Here, $\mathcal{F}_{stem}(\cdot)$ denotes a structure composed of two stacked $3 \times 3$ convolutional layers with specific stride configurations, followed by GELU activation functions. The resulting output $F_0 \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$ serves as the initial feature representation for the subsequent dual-stream networks, where $C$ denotes the initial embedding dimension. This fourfold hierarchical downsampling substantially reduces the computational burden of subsequent stages while retaining critical low-level texture information.
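The fourfold downsampling follows arithmetically from two stacked stride-2 convolutions. The sketch below assumes kernel 3, stride 2, padding 1 for both layers, a common configuration that the text does not state explicitly:

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

h = 640                  # side length of a tiled UAV patch
h1 = conv_out(h)         # after the first 3x3, stride-2 convolution
h2 = conv_out(h1)        # after the second: H / 4, matching F_0's resolution
```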

3.4. Heterogeneous Dual-Stream Encoder

To resolve the conflict between high-frequency local detail capture required for disease detection and low-frequency global context modeling required for biomass estimation, a heterogeneous dual-stream encoder was designed. This module decomposes the visual processing pipeline into two dedicated pathways: a convolution-based micro-texture stream and a state-space-model-based macro-context-scan stream.

3.4.1. Micro-Texture Stream

The primary objective of the micro-texture stream is to preserve the fine-grained morphological characteristics of crop lesions, such as lesion boundaries, color gradients, and texture patterns. These features are critical for early-stage disease diagnosis but tend to degrade in deep semantic networks. By leveraging the inherent translation invariance and local inductive bias of convolutional neural networks, this stream is dedicated to extracting locally discriminative features from the initial embeddings, as shown in Figure 4.
To balance feature extraction capability with the strict computational constraints of UAV edge devices, this stream is constructed using stacked inverted residual blocks (IRBs). Unlike standard convolutions, IRBs decouple spatial filtering and channel mixing through depthwise separable convolutions, thereby significantly reducing parameter count and floating-point operations (FLOPs). Formally, taking the visual embedding $F_0$ produced by the stem layer as input, the micro-texture stream applies a transformation function $\mathcal{F}_{cnn}(\cdot)$. Let $F_{cnn}$ denote the output feature map of this stream. The processing of the $i$-th inverted residual block can be mathematically expressed as
$$\hat{F}_i = \sigma(\mathrm{BN}(\mathrm{PW}(F_{i-1}))),$$
$$F_i = F_{i-1} + \mathrm{BN}(\mathrm{PW}(\sigma(\mathrm{BN}(\mathrm{DW}(\hat{F}_i))))).$$
Here, $F_{i-1}$ and $F_i$ denote the input and output features of the $i$-th block, respectively, with $F_0$ serving as the initial input. The operator $\mathrm{PW}(\cdot)$ denotes pointwise convolution for channel expansion or projection, $\mathrm{DW}(\cdot)$ denotes depthwise convolution ($3 \times 3$) for spatial feature extraction, $\mathrm{BN}(\cdot)$ represents batch normalization, and $\sigma$ denotes a nonlinear activation function (e.g., SiLU). After processing through $N$ stacked blocks, the final output of the micro-texture stream is given by $F_{cnn} = F_N \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$, which encapsulates rich high-frequency local information, providing a solid foundation for subsequent interaction modules to identify subtle disease targets.
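The parameter savings can be made concrete with a small counting sketch. The expansion ratio $t = 4$ below is an assumed (MobileNetV2-style) value, not one reported in the paper, and biases and BN parameters are omitted:

```python
def standard_conv_params(c, k=3):
    """Weights of a dense 3x3 convolution mapping C -> C channels."""
    return k * k * c * c

def irb_params(c, t=4, k=3):
    """Inverted residual block: PW expand (C -> tC), 3x3 depthwise conv
    on tC channels, PW project (tC -> C); biases/BN omitted."""
    expand = c * (t * c)          # pointwise expansion
    depthwise = k * k * (t * c)   # one 3x3 filter per expanded channel
    project = (t * c) * c         # pointwise projection
    return expand + depthwise + project

c = 64
dense = standard_conv_params(c)   # 9 * C^2
light = irb_params(c)             # 2t * C^2 + 9t * C
```

The depthwise term grows only linearly in channel width, so the gap widens at larger $C$ even though the block sees a 4x-expanded intermediate width.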

3.4.2. Macro-Context-Scan Stream

Operating in parallel with the micro-texture stream, the macro-context-scan stream, as shown in Figure 5, focuses on characterizing the continuity and spatial heterogeneity of crop growth at the field scale. From a high-altitude UAV perspective, yield formation is not determined by isolated leaf conditions but is jointly governed by biomass distribution density and canopy uniformity across the entire field. Although Transformers are capable of modeling global dependencies, their quadratic computational complexity poses a significant bottleneck when processing ultra-high-resolution orthomosaic imagery acquired by UAVs. To address this limitation, a Mamba-based architecture was introduced to exploit its linear complexity property for efficient field-scale context modeling on edge devices.
The core component of this stream is the visual state-space module (VSSM). Unlike CNNs, which model texture patterns within local receptive fields, a VSSM interprets visual features as continuous signal sequences unfolded along spatial dimensions, thereby explicitly characterizing the continuity and evolutionary trends in crop growth at the field scale. Formally, let the input feature map from the stem layer be denoted as $F_0 \in \mathbb{R}^{H \times W \times C}$. This feature map is rearranged along a predefined scanning path into a visual token sequence $X \in \mathbb{R}^{L \times C}$ of length $L = H \times W$. In state-space modeling, each channel is treated as an independent one-dimensional dynamical system, whose evolution is governed by a continuous-time linear time-invariant (LTI) system. Specifically, the hidden state $h(t) \in \mathbb{R}^{N}$ encodes global contextual information across spatial locations, and its dynamics are defined by the following differential equation [25]:
$$\dot{h}(t) = A h(t) + B x(t), \quad y(t) = C h(t),$$
where A R N × N is a learnable state transition matrix that characterizes the evolution of global spatial dependencies, and B R N × 1 and C R 1 × N are the input and output projection matrices, respectively, which inject local visual features into the state space and map hidden states back to feature responses. These parameters are learned end-to-end via backpropagation and are shared across all spatial locations at the same layer, ensuring both parameter efficiency and training stability. Since UAV imagery is discretely sampled, the continuous-time system must be discretized for efficient computation. A zero-order hold (ZOH) method is adopted, where the discrete time step Δ controls the scale of state updates. Unlike fixed-step configurations, Δ is adaptively generated from the input features to enhance the model’s ability to accommodate varying spatial change rates. Specifically, Δ is obtained by applying a lightweight linear projection to the input feature x ( t ) followed by a nonlinear activation, enabling data-dependent dynamic time-scale modeling. Under this formulation, the continuous system can be equivalently expressed in the following discrete form [41]:
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right) \cdot \Delta B,
where A ¯ and B ¯ denote the discretized state transition and input projection matrices, respectively. Notably, these discrete parameters are not independently learned but are analytically derived from the continuous parameters A , B , and the adaptive time scale Δ , thereby avoiding additional parameter overhead. In discrete form, the hidden state can be efficiently updated through the following recurrence [41]:
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t.
This recursive structure enables long-range spatial dependency modeling with linear computational complexity and minimal memory consumption, making it particularly suitable for capturing crop growth context across rows and fields in high-resolution UAV orthomosaic imagery.
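As a concrete illustration, the ZOH discretization and the recurrence above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' released code: the diagonal state matrix (under which the ZOH formula reduces elementwise) and the function names are our assumptions, following common Mamba-style implementations.

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order-hold step for a diagonal state matrix (Mamba-style):
    A_bar = exp(delta*A); B_bar = (delta*A)^-1 (exp(delta*A) - I) * delta*B,
    which reduces elementwise to (exp(delta*A) - 1) / A * B."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def ssm_scan(x, A_diag, B, C, delta):
    """Scan one channel's token sequence x (length L) with a per-token
    adaptive time step delta[t], returning y_t = C h_t."""
    h = np.zeros_like(A_diag)               # hidden state h_0 = 0
    y = np.empty_like(x)
    for t, xt in enumerate(x):
        A_bar, B_bar = zoh_discretize(A_diag, B, delta[t])
        h = A_bar * h + B_bar * xt          # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = C @ h                        # output projection y_t = C h_t
    return y
```

In practice this recurrence is evaluated with a hardware-friendly parallel scan; the explicit loop above only makes the linear-time update visible.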
Considering the non-causal nature of field environments, where disease stress often propagates radially and spatial variations in soil moisture and nutrient conditions are isotropic, a simple unidirectional scan is insufficient to capture complex spatial correlations. To address this issue, a two-dimensional selective scanning (SS2D) mechanism is introduced. The input feature F 0 is unfolded along four scanning paths: from top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right. This multi-path scanning strategy emulates the omnidirectional perspective adopted by agronomists during field inspection, ensuring that disease centers located at arbitrary positions can be accurately contextualized through directional information from multiple spatial orientations.
Finally, context-enhanced features from the four scanning directions are mapped back to the original two-dimensional space and fused. Specifically, let { F ^ k } k = 1 4 denote the direction-aware feature maps obtained from the four scanning paths, where F ^ k R H × W × C . These features are first concatenated along the channel dimension to form an aggregated tensor:
F_{\mathrm{cat}} = \mathrm{Concat}\left(\hat{F}_1, \hat{F}_2, \hat{F}_3, \hat{F}_4\right) \in \mathbb{R}^{H \times W \times 4C}.
Subsequently, a 1 × 1 convolutional layer ϕ ( · ) is applied to perform channel compression and direction-adaptive weighting, thereby integrating multi-directional contextual information into a unified global context representation without altering the spatial resolution:
F_{ssm} = \phi\left(F_{\mathrm{cat}}\right) \in \mathbb{R}^{H \times W \times C}.
During training, the 1 × 1 convolution ϕ ( · ) automatically adjusts the relative contributions of features from different scanning directions through end-to-end optimization, enabling the model to adaptively emphasize context propagation paths that are more discriminative for the given spatial structure and task requirements.
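The four-path unfolding, inverse mapping, and 1×1 fusion can be sketched as follows. This is an illustrative NumPy sketch under our own naming: `causal_op` stands in for the per-direction selective scan (identity here), and `W_fuse` plays the role of the 1×1 convolution ϕ written as a per-pixel linear map.

```python
import numpy as np

def scan_paths(F):
    """Unfold (H, W, C) along four scan orders:
    TL->BR, BR->TL, TR->BL, BL->TR."""
    H, W, C = F.shape
    return [
        F.reshape(H * W, C),                  # top-left -> bottom-right
        F.reshape(H * W, C)[::-1],            # bottom-right -> top-left
        F[:, ::-1].reshape(H * W, C),         # top-right -> bottom-left
        F[:, ::-1].reshape(H * W, C)[::-1],   # bottom-left -> top-right
    ]

def ss2d_fuse(F, W_fuse, causal_op):
    """Apply causal_op along each path, undo each reordering,
    concatenate on channels, then fuse with a 1x1 projection."""
    H, W_, C = F.shape
    outs = []
    for k, seq in enumerate(scan_paths(F)):
        y = causal_op(seq)                    # stand-in for the SSM scan
        if k in (0, 1):
            if k == 1:
                y = y[::-1]                   # undo sequence reversal
            y2d = y.reshape(H, W_, C)
        else:
            if k == 3:
                y = y[::-1]
            y2d = y.reshape(H, W_, C)[:, ::-1]  # undo column flip
        outs.append(y2d)
    F_cat = np.concatenate(outs, axis=-1)     # (H, W, 4C)
    return F_cat @ W_fuse                     # 1x1 conv == per-pixel matmul
```

With `causal_op` set to the identity and `W_fuse` averaging the four copies, the fusion reduces to the input, which makes the path bookkeeping easy to verify.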

3.5. Pathology–Biomass Collaborative Interaction

Existing multi-task frameworks typically treat disease detection and yield prediction as parallel and independent tasks, relying only on simple feature concatenation for fusion. Such designs neglect a fundamental agronomic causal relationship: pathogen stress disrupts photosynthesis and directly leads to losses in biomass accumulation. To bridge this logical gap, the pathology–biomass collaborative interaction (PBCI) module, as shown in Figure 6, is introduced. This module embeds biological prior knowledge into the feature space and leverages microscopic disease cues to dynamically calibrate macroscopic yield representations.
Formally, let F c n n R H 4 × W 4 × C denote the micro-texture features extracted by the CNN stream, and let F s s m R H 4 × W 4 × C denote the macro-context features captured by the Mamba stream. The objective of PBCI is to generate a calibrated feature map F c a l i b r a t e d , in which biomass signals in diseased regions are adaptively suppressed according to infection severity. This process consists of two stages: health-aware gating and causal calibration.
First, the disease-discriminative micro-texture features F c n n are transformed into a spatial health gate, which characterizes the crop health status at each spatial location and its potential contribution to yield accumulation. To this end, a lightweight health-gate generator G ( · ) is designed using a bottleneck convolutional structure that compresses and remaps high-dimensional texture features while preserving spatial resolution. Specifically, G ( · ) consists of a 1 × 1 pointwise convolution for channel compression, followed by a 3 × 3 depthwise separable convolution to aggregate local spatial context, and another 1 × 1 pointwise convolution to project features into a single-channel response map. A Sigmoid activation function is then applied to normalize the output to the range [ 0 , 1 ] , yielding the spatial health gate M h e a l t h :
M_{health} = \sigma\left(G\left(F_{cnn}\right)\right) \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 1}.
Here, values of M_{health}(i, j) close to 1 indicate healthy regions with high yield potential, whereas values close to 0 correspond to severely infected regions whose biomass accumulation should be penalized. The health gate is then applied to the global context feature F_{ssm} via element-wise multiplication (Hadamard product, denoted by ⊙). To prevent complete loss of background contextual information in diseased areas, a residual connection from the original F_{ssm} is introduced:
F_{calibrated} = F_{ssm} + \alpha_{scale} \cdot \left(F_{ssm} \odot M_{health}\right),
where α s c a l e is a learnable scaling factor initialized to 0, allowing the model to gradually learn the optimal calibration strength. Through this mechanism, the PBCI module explicitly models the negative causal effect of disease on yield, enforcing biologically consistent predictions from the regression head. The calibrated feature F c a l i b r a t e d is subsequently fed into the yield regression head, while the original F c n n is delivered to the disease detection head.
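The health-aware gating and residual calibration can be sketched as follows. This is a hedged NumPy illustration, not the authors' implementation: the 3×3 depthwise step of the gate generator G(·) is omitted for brevity, and the weight names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pbci_calibrate(F_cnn, F_ssm, W_in, W_out, alpha=0.0):
    """Health-aware gating + causal calibration.
    G(.): 1x1 conv (channel compression) -> [3x3 depthwise, omitted here]
          -> 1x1 conv to a single channel; sigmoid yields the gate.
    alpha is the learnable scale, initialized to 0."""
    z = np.maximum(F_cnn @ W_in, 0.0)        # pointwise conv + ReLU
    gate = sigmoid(z @ W_out)                # (H, W, 1) spatial health gate
    F_cal = F_ssm + alpha * (F_ssm * gate)   # residual causal calibration
    return F_cal, gate
```

Note that with `alpha = 0` (its initialization) the module is an identity on F_ssm, so training can ramp up the calibration strength gradually, as described above.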
Building on the gating formulation above, and to more fully encode the agronomic causality that disease stress suppresses biomass accumulation by impairing photosynthesis, the PBCI module is extended with structured dual-branch enhancement and causal gating mechanisms, through which disease information is explicitly injected into the yield representation process.
In the biomass branch, a bidirectional Mamba module is introduced to further strengthen the long-range dependency modeling capability of F s s m along spatial dimensions. This module aggregates both forward and backward state-space information to mitigate directional bias introduced by unidirectional scanning. Its output is added to the input feature through a residual connection, followed by layer normalization and a feedforward multilayer perceptron (MLP) block for nonlinear transformation, resulting in an enhanced biomass representation:
F'_{ssm} = \mathrm{MLP}\left(\mathrm{LN}\left(F_{ssm} + \mathrm{BiMamba}\left(F_{ssm}\right)\right)\right).
Meanwhile, in the pathology branch, a linear attention mechanism is applied to F c n n to enhance discriminative responses to key lesion regions while maintaining computational efficiency. Similarly, this attention module is followed by a residual connection, layer normalization, and an MLP block to obtain an enhanced pathology representation:
F'_{cnn} = \mathrm{MLP}\left(\mathrm{LN}\left(F_{cnn} + \mathrm{LinAttn}\left(F_{cnn}\right)\right)\right).
Based on these refined features, a causal calibration mechanism is performed. First, a health gate M h e a l t h is generated by applying a Sigmoid activation function σ ( · ) to the refined pathology features, quantifying the probability of crop health at each spatial location:
M_{health} = \sigma\left(F'_{cnn}\right) \in (0, 1).
Finally, to model yield loss induced by disease stress, the refined biomass feature F s s m is modulated by the health gate via element-wise multiplication (⊙), effectively suppressing biomass signals in diseased regions. To preserve baseline biomass information, the modulated signal is added back to the refined feature through a residual connection, producing the final calibrated output F c a l i b r a t e d :
F_{calibrated} = F'_{ssm} + \left(F'_{ssm} \odot M_{health}\right).
This output is subsequently fed into the yield regression head, ensuring that yield prediction is explicitly conditioned on pathological status, while the enhanced pathology feature F c n n is utilized for the disease detection task.
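The residual/LN/MLP wrapper shared by both branches, followed by the causal calibration, can be sketched as below. This is illustrative NumPy only: `op` stands in for BiMamba (biomass branch) or linear attention (pathology branch), whose internals are omitted, and all weight names are our own.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel (last) dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def enhance(F, op, W1, W2):
    """F' = MLP(LN(F + op(F))): the wrapper used by both PBCI branches.
    op is a stand-in for BiMamba or LinAttn."""
    h = layer_norm(F + op(F))
    return np.maximum(h @ W1, 0.0) @ W2      # 2-layer MLP (ReLU)

def causal_calibrate(F_ssm_e, F_cnn_e):
    """Gate the enhanced biomass feature by the health probability:
    M_health = sigmoid(F'_cnn); F_cal = F'_ssm + F'_ssm * M_health."""
    gate = 1.0 / (1.0 + np.exp(-F_cnn_e))
    return F_ssm_e + F_ssm_e * gate
```

Because the gate lies in (0, 1), the calibrated feature always stays between the baseline F'_ssm and 2·F'_ssm, preserving biomass information while modulating it by health status.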

3.6. Task-Specific Decoupled Prediction Heads

After feature enhancement and calibration by the PBCI module, the processed representations are routed to two task-specific decoupled prediction heads. This design ensures that each task exploits the most relevant feature abstractions, with the detection task emphasizing local morphological details and the yield estimation task leveraging globally calibrated contextual information.
For the disease detection head, a lightweight decoupled head architecture analogous to that of single-stage detectors is adopted. It consists of two parallel 1 × 1 convolutional branches that operate on the refined micro-texture features F c n n to predict the class probability P c l s and bounding box coordinates B r e g for each anchor, respectively:
\hat{Y}_{det} = \left\{ H_{cls}\left(F'_{cnn}\right),\; H_{reg}\left(F'_{cnn}\right) \right\}.
For the yield regression head, the input is the biomass-calibrated feature map F c a l i b r a t e d . Since yield is a field-level metric, global average pooling (GAP) is first applied to collapse the spatial dimensions, followed by a multilayer perceptron to regress the final yield value y ^ y i e l d :
\hat{y}_{yield} = \mathrm{MLP}\left(\mathrm{GAP}\left(F_{calibrated}\right)\right).
This decoupled output mechanism ensures that yield prediction is explicitly conditioned on disease severity through the upstream PBCI module, while the detection task maintains a focused sensitivity to discriminative local texture patterns.
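A compact sketch of the two decoupled heads (hypothetical weights; the 1×1 convolutions of the detection head are written as per-location matrix products):

```python
import numpy as np

def det_head(F_cnn, W_cls, W_reg):
    """Two parallel 1x1 conv branches: per-location class logits
    and bounding-box regression outputs."""
    return F_cnn @ W_cls, F_cnn @ W_reg

def yield_head(F_cal, W1, b1, W2, b2):
    """Global average pooling over spatial dims, then a 2-layer
    MLP regressing a single field-level yield value."""
    g = F_cal.mean(axis=(0, 1))              # GAP: (H, W, C) -> (C,)
    h = np.maximum(g @ W1 + b1, 0.0)
    return float(h @ W2 + b2)
```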

3.7. Topology-Aligning Cross-Architecture Distillation

To bridge the representational gap between a globally expressive but computationally expensive teacher and a lightweight student suitable for edge deployment, we propose a topology-aligning cross-architecture distillation strategy. In this setting, a pre-trained VMamba model serves as the teacher, while SymbioMamba acts as the student. Unlike conventional distillation approaches that focus solely on logits or homogeneous feature spaces, our method explicitly aligns intermediate representations across heterogeneous architectures, enabling effective knowledge transfer from state-space models to convolutional streams.
At the feature level, hierarchical representations from the teacher network are aligned with the micro-texture (CNN) stream of the student. Injecting global context into a stream specifically designed for local texture extraction is motivated by the principle of contextual disambiguation. Purely local convolutional networks are inherently susceptible to visual ambiguity, as non-pathological artifacts (e.g., senescent leaf tips or soil background) often exhibit texture patterns similar to disease lesions. Owing to its global receptive field, the VMamba teacher captures holistic semantic and biomass-related context across the entire field. By aligning the student’s local features with the teacher’s global representations, the teacher’s knowledge acts as a form of semantic regularization, suppressing spurious background activations and reinforcing pathology-consistent texture responses.
Formally, let F T l R H l × W l × C T and F S , c n n l R H l × W l × C S denote the feature maps at stage l of the teacher and the student CNN streams, respectively. To address channel dimensionality discrepancies ( C T C S ) and manifold mismatch, a lightweight contextual alignment projector ϕ ( · ) is introduced. The topology alignment loss jointly enforces consistency in both magnitude and direction:
\mathcal{L}_{topo} = \sum_{l} \left( \left\| F_T^{l} - \phi\left(F_{S,cnn}^{l}\right) \right\|_2^2 + \lambda_{cos} \cdot \left( 1 - \mathrm{CosSim}\left(F_T^{l}, \phi\left(F_{S,cnn}^{l}\right)\right) \right) \right).
Here, the mean squared error term enforces magnitude alignment, while cosine similarity encourages directional consistency of feature vectors. This constraint enables the student’s CNN stream to implicitly acquire context awareness without altering its architectural efficiency or increasing inference cost.
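The per-stage alignment loss can be sketched as follows. Two caveats: the reduction (sum vs. mean) is not specified in the text, so mean reduction is our assumption, and the projector ϕ is written as a plain channel-matching linear map.

```python
import numpy as np

def topo_loss(F_t, F_s, W_proj, lam_cos=1.0, eps=1e-8):
    """MSE (magnitude alignment) + 1 - cosine similarity (direction
    alignment) between teacher features and projected student features.
    F_t: (H, W, C_T); F_s: (H, W, C_S); W_proj: (C_S, C_T) plays phi."""
    P = F_s @ W_proj                          # phi(F_s): match channel dims
    mse = np.mean((F_t - P) ** 2)
    a = F_t.reshape(-1, F_t.shape[-1])
    b = P.reshape(-1, P.shape[-1])
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps)
    return mse + lam_cos * np.mean(1.0 - cos)
```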
Beyond intermediate feature alignment, high-level decision knowledge is transferred through decoupled logit distillation at the task-specific heads. Given the dual-task nature of the framework, distillation is applied separately to disease detection and yield regression. Specifically, softened class probabilities are matched using Kullback–Leibler divergence for detection, while regression outputs are aligned using mean squared error:
\mathcal{L}_{logit} = \mathcal{L}_{KL}\left(\hat{Y}_{det}^{S} / \tau,\; \hat{Y}_{det}^{T} / \tau\right) + \mathcal{L}_{MSE}\left(\hat{y}_{yield}^{S},\; \hat{y}_{yield}^{T}\right),
where τ denotes the temperature parameter controlling the softness of the probability distributions.
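A minimal sketch of this decoupled distillation loss (the common τ² gradient-scaling factor of standard KD is omitted here to match the equation as written; whether the paper applies it is not stated):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(logits_s, logits_t, y_s, y_t, tau=3.0):
    """KL divergence between temperature-softened class distributions
    (detection) plus MSE between yield regressions."""
    p_t = softmax(logits_t / tau)
    log_p_s = np.log(softmax(logits_s / tau) + 1e-12)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1).mean()
    mse = np.mean((np.asarray(y_s) - np.asarray(y_t)) ** 2)
    return kl + mse
```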
The overall training objective integrates ground-truth supervision with the proposed distillation regularization:
\mathcal{L}_{total} = \lambda_{det} \mathcal{L}_{det} + \lambda_{reg} \mathcal{L}_{reg} + \alpha \mathcal{L}_{topo} + \beta \mathcal{L}_{logit}.
Here, L d e t consists of classification and bounding box regression losses (e.g., Focal Loss and GIoU), while L r e g represents the yield regression loss (e.g., Smooth L1). The weighting coefficients λ d e t , λ r e g , α , and β balance task supervision and distillation objectives. Through this unified optimization, the student network effectively inherits the teacher’s global structural knowledge while retaining the efficiency required for real-time UAV edge deployment.

3.8. Evaluation Metrics

To comprehensively evaluate the performance of the SymbioMamba framework, a multi-dimensional evaluation protocol was adopted, covering disease detection accuracy, yield-prediction precision, and edge-inference efficiency. For the disease severity classification task, Precision (P), Recall (R), F1-score (F1), mAP@0.5, and mAP@0.5:0.95 were employed as evaluation metrics. First, to quantify the localization accuracy of predicted bounding boxes, the intersection over union (IoU) metric was introduced. Let B_{pred} and B_{gt} denote the predicted and ground-truth bounding boxes, respectively. IoU is defined as the ratio between the area of their intersection and the area of their union:
\mathrm{IoU} = \frac{\mathrm{Area}\left(B_{pred} \cap B_{gt}\right)}{\mathrm{Area}\left(B_{pred} \cup B_{gt}\right)}.
Based on a specified IoU threshold, predictions are categorized into true positives (TPs), false positives (FPs), and false negatives (FNs). Precision measures the proportion of correctly predicted positive samples among all predicted positives, while Recall measures the proportion of actual positive samples that are correctly identified. To further synthesize the balance between Precision and Recall—particularly in the presence of natural class imbalance among disease grades in field conditions—the F1-score is calculated as the harmonic mean of these two metrics:
P = \frac{TP}{TP + FP},
R = \frac{TP}{TP + FN},
F1 = \frac{2 \cdot P \cdot R}{P + R}.
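These definitions translate directly into code; the sketch below assumes axis-aligned boxes in corner form (x1, y1, x2, y2):

```python
def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def prf1(tp, fp, fn):
    """Precision, Recall, and F1 from TP/FP/FN counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```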
To evaluate the overall performance across all C classes (with C = 5 in this study), the mean Average Precision (mAP) was computed. The Average Precision (AP) for the i-th class was obtained by calculating the area under the Precision–Recall curve P ( R ) . This study reports mAP@0.5 computed at a fixed IoU threshold of 0.5, as well as mAP@0.5:0.95, which averages mAP values over IoU thresholds ranging from 0.5 to 0.95, with a step size of 0.05:
mAP = \frac{1}{C} \sum_{i=1}^{C} \int_{0}^{1} P_i\left(R_i\right)\, dR_i,
mAP@0.5{:}0.95 = \frac{1}{10} \sum_{k=0}^{9} mAP@\left(0.5 + 0.05k\right).
For the yield regression task, three statistical metrics were used to assess the agreement between predicted yield y ^ i and ground truth y i : the coefficient of determination ( R 2 ), root mean square error (RMSE), and mean absolute error (MAE). The R 2 metric indicates goodness of fit by representing the proportion of variance in the dependent variable explained by the model:
R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}.
RMSE and MAE quantify the magnitude of prediction errors, with RMSE being more sensitive to outliers:
RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2},
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|.
Here, N denotes the total number of test samples, and y ¯ represents the mean of the observed values.
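The three regression metrics can be computed in a few lines of NumPy:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Return (R^2, RMSE, MAE) for observed y and predicted y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    mae = np.mean(np.abs(y - y_hat))
    return r2, rmse, mae
```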
To verify the feasibility of deploying the model on a UAV edge platform (NVIDIA Jetson AGX Orin), three efficiency-related metrics were adopted:
  • Parameter count (Params): the total number of learnable weights in the model, measured in millions (M).
  • Floating-point operations (FLOPs): the computational complexity of the model, measured in billions of floating-point operations (G).
  • Frames per second (FPS): the actual inference speed measured on the target hardware, where FPS > 30 is considered to satisfy real-time performance requirements.
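A minimal timing harness for the FPS metric might look as follows (a generic sketch, not the authors' benchmarking script; on GPU one would additionally synchronize the device, e.g. `torch.cuda.synchronize()`, before reading the clock):

```python
import time

def measure_fps(infer, n_warmup=20, n_iter=100):
    """Run infer() n_warmup times to stabilize clocks/caches,
    then time n_iter calls and return frames per second."""
    for _ in range(n_warmup):
        infer()
    t0 = time.perf_counter()
    for _ in range(n_iter):
        infer()
    return n_iter / (time.perf_counter() - t0)
```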

4. Results and Discussion

4.1. Experiment Settings

4.1.1. Implementation Details

To ensure reproducibility and fair comparison, all algorithms were implemented using the PyTorch 2.1.0 deep learning framework under a CUDA 11.8 environment. The dataset was randomly divided into training, validation, and test sets with a ratio of 7:1:2. This strict partitioning strategy ensures that performance evaluation on unseen data reliably reflects the generalization capability of the model in real agricultural scenarios.
Regarding network configuration, the SymbioMamba framework first applies a shared convolutional stem layer to perform initial feature embedding on UAV images of size 640 × 640 × 3 . The stem layer consists of two stacked 3 × 3 convolutional layers with a stride of 2, effectively downsampling the feature map to 160 × 160 and mapping the initial channel dimension to C = 96 . The selection of C = 96 follows the architectural conventions of established hierarchical vision backbones, as it provides sufficient representative capacity for capturing fine-grained pathological textures while maintaining a computational footprint compatible with real-time inference on edge devices. The embedded features are then forwarded to the heterogeneous dual-stream encoder. Specifically, the micro-texture stream is composed of four stages of stacked inverted residual blocks. To balance local lesion texture representation capability and model complexity, the channel expansion ratio within each block is uniformly fixed at γ = 4 . In parallel, the macro-context-scan stream is implemented based on the Mamba architecture, employing a two-dimensional selective scanning mechanism to model field-scale global spatial dependencies. The internal state-space model is configured with a hidden state dimension of d s t a t e = 16 , a local one-dimensional convolution kernel size of k = 4 , and an expansion factor of E = 2 . Outputs from the two streams are strictly aligned in both spatial resolution ( 20 × 20 ) and channel dimension ( C = 768 ).
Within the subsequent PBCI module, simple feature concatenation is discarded in favor of deeper feature refinement strategies. In the biomass branch, a bidirectional Mamba module is introduced to eliminate scanning-direction bias, while the pathology branch integrates a linear attention mechanism to efficiently aggregate key discriminative features. Both branches are equipped with standard Layer Normalization and feedforward MLP blocks to enhance nonlinear representational capacity. The health-aware gate is directly generated from the refined pathology features via Sigmoid activation, enabling spatial health probability modeling and causal calibration without introducing additional parameter overhead. Within the biomass branch of the PBCI module, the state-space parameters are configured with a hidden state dimension of d s t a t e = 16 and a local convolution kernel size of d c o n v = 4 , while maintaining an expansion factor of E = 2 to balance global dependency modeling and computational cost. The pathology branch is configured with eight attention heads, each with a head dimension of d h e a d = 96 . Both branches employ feedforward MLP blocks with an expansion ratio of γ m l p = 4 , along with standard Layer Normalization, to further enhance nonlinear expressiveness.
Model training was conducted on a high-performance server running Ubuntu 22.04 LTS and equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB), using an end-to-end multi-task joint optimization strategy. The total loss function L t o t a l was formulated to simultaneously supervise disease localization and yield estimation, defined as a weighted combination of the detection loss L d e t (incorporating Focal Loss for classification and GIoU loss for bounding-box regression) and the regression loss L r e g (Smooth L1 loss for yield prediction). To maintain an unbiased optimization trajectory and avoid the empirical complexity of manual hyperparameter tuning, the weighting coefficients were standardized at an equal magnitude ( λ d e t = λ r e g = 1.0 ) and remained fixed throughout all training phases. This static weighting strategy was adopted to ensure a balanced gradient flow between the convolutional micro-texture stream and the state-space macro-context stream, allowing both tasks to reach convergence at a consistent rate without the instability often introduced by dynamic weighting schemes.
The AdamW optimizer was adopted with momentum parameters set to β 1 = 0.9 and β 2 = 0.999 , and a weight decay of 0.05 . Learning-rate scheduling followed a cosine annealing strategy, with an initial learning rate of 1 × 10 3 , which decayed to 1 × 10 5 over 300 epochs, including a linear warm-up phase during the first five epochs. The training batch size was fixed at 32, and a composite data augmentation strategy was applied, including random cropping, Mosaic augmentation with probability p = 1.0 , HSV color perturbation (H: 0.015, S: 0.7, V: 0.4), and random flipping. For experiments incorporating TACAD, the teacher model parameters were kept frozen, and the distillation hyperparameters were set as follows: temperature τ = 3.0 , topological alignment weight α = 0.5 , and logit distillation weight β = 1.0 .
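The learning-rate schedule described above (five-epoch linear warm-up, then cosine decay from 1e-3 to 1e-5 over 300 epochs) can be sketched as follows; the exact endpoint conventions are our assumption, and framework schedulers differ in off-by-one details:

```python
import math

def lr_at(epoch, warmup=5, total=300, lr_max=1e-3, lr_min=1e-5):
    """Per-epoch learning rate: linear warm-up for the first `warmup`
    epochs, then cosine annealing from lr_max down to lr_min."""
    if epoch < warmup:
        return lr_max * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)    # progress in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```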
Finally, to validate the applicability of the model in practical UAV operational scenarios, runtime performance on edge devices was extensively evaluated. All inference speed and latency measurements were conducted on an NVIDIA Jetson AGX Orin (64 GB) embedded platform. To ensure the reproducibility and comparative fairness of the results, the device was locked in the MAXN (60 W) high-performance power mode. This configuration eliminates computational jitter caused by dynamic frequency scaling, thereby providing a consistent hardware baseline for benchmarking. Furthermore, all non-essential background processes were disabled to isolate the model’s intrinsic computational footprint from operating system overhead. By utilizing FP32 single-device inference and strictly avoiding test-time augmentation (TTA) or model ensemble strategies, we ensured that the reported metrics reflect the raw, real-time processing capability required for autonomous aerial navigation. This rigorous protocol prioritizes the low-latency demands of field-scale monitoring, where the delayed decision-making inherent in multi-pass inference strategies would be impractical for high-speed UAV flight.

4.1.2. Baselines

To systematically evaluate the performance of the SymbioMamba framework and substantiate its architectural innovations, a curated set of state-of-the-art (SOTA) models was selected as comparative baselines. The selection criteria focused on three dimensions: architectural representative capacity, real-time edge feasibility, and relevance to the “disease–yield” collaborative task. These baselines encompass the primary evolutionary stages of computer vision backbones and specialized agricultural sensing frameworks, providing a rigorous benchmark for our dual-stream SSM approach.
  • Task-Specific Agricultural Baselines: YOLOv11 [42] and YOLO-Mamba [43] serve as the primary benchmarks for disease perception, representing high-speed edge detection and contemporary SSM-based detection, respectively. Additionally, AgriTransformer [44] is included as a specialized baseline that utilizes self-attention mechanisms tailored for agricultural scene understanding, allowing us to evaluate SymbioMamba against models specifically optimized for crop phenotypic variations.
  • Agricultural Yield Regression Baselines: For yield estimation, we selected ConvLSTM [45] and TasselNetV3 [46]. These represent two divergent paradigms in agronomic modeling: the former focuses on spatiotemporal gating for growth dynamics, while the latter represents the SOTA in morphological counting and density-based regression. Their inclusion allows us to demonstrate how SymbioMamba bridges the gap between purely structural modeling and stress-aware yield calibration.
  • Architectural Evolution Baselines: To verify whether the observed performance gains stem from our symbiotic logic rather than mere backbone capacity, we included a spectrum of general-purpose encoders. This includes the convolution-centered ConvNeXt [47] and the mobile-optimized EfficientNet-Lite4 [48], which represents the pinnacle of NAS-driven CNN efficiency for edge devices. We also compared against the self-attention-driven Swin Transformer [49], the hybrid MobileViT [50]—which integrates local convolutions with global Transformers for mobile deployment—and the pure SSM-based VMamba [51].
A critical aspect of our experimental protocol was the enforcement of hyperparameter fairness to eliminate the "straw man" effect. Rather than relying solely on default configurations, which may be suboptimal for specific UAV-based agricultural datasets, we conducted a localized grid search for each baseline to determine optimal learning rates and weight decays. All models were standardized to an identical input resolution ( 640 × 640 ) and trained using the same data augmentation pipeline and hardware environment. Furthermore, to isolate the efficacy of the feature encoders, all baselines were equipped with the same task-specific heads and loss functions as SymbioMamba. This ensures that any measured variance in detection mAP or yield R² is strictly attributable to the model's intrinsic ability to represent and interact with heterogeneous agricultural features.

4.2. Performance Comparison on Disease Detection Task Across Different Models

To quantitatively evaluate the overall performance of the SymbioMamba framework for edge-side disease detection, a systematic comparison was conducted against representative state-of-the-art (SOTA) methods. The experiments were designed to verify whether the heterogeneous dual-stream architecture can achieve an optimal balance between detection accuracy and inference efficiency. Table 3 summarizes the quantitative results of all models on the collected maize disease dataset.
As shown in Table 3 and Figure 7 and Figure 8, SymbioMamba demonstrates a pronounced Pareto-optimal balance between detection accuracy and edge-deployment efficiency. In terms of accuracy, the proposed method achieves the highest mAP@0.5 (89.4%) and the highest score on the most stringent metric, mAP@0.5:0.95 (65.8%). Compared with heavyweight or domain-specific backbones such as Swin Transformer, VMamba, and AgriTransformer, mAP@0.5 is improved by 3.3%, 2.5%, and 1.9%, respectively. This observation indicates that while attention-based models such as AgriTransformer excel at capturing agricultural scene dependencies, they often struggle with the high-frequency local texture extraction required for fine-grained lesion grading, a deficiency effectively compensated for by the micro-texture stream in SymbioMamba. Furthermore, compared with the industry-standard YOLOv11 and the Mamba-based YOLO-Mamba, improvements of 2.4% and 1.7% in mAP@0.5:0.95 are achieved, respectively. This gain is primarily attributed to the health-aware gating mechanism introduced by the PBCI module, which leverages global contextual information to suppress false positives in complex farmland backgrounds, thereby significantly improving localization quality under high IoU thresholds. In terms of inference efficiency, SymbioMamba successfully overcomes the conventional trade-off between accuracy and speed. Although EfficientNet-Lite4 and MobileViT v3 achieve higher throughputs (44.5 FPS and 39.2 FPS, respectively) due to their streamlined convolutional or hybrid designs, their mAP@0.5:0.95 scores remain limited to 52.6% and 55.4%, a substantial performance gap. In contrast, SymbioMamba delivers an accuracy improvement of over 10% relative to these lightweight baselines while maintaining a highly competitive real-time speed of 38.2 FPS. Notably, compared with YOLO-Mamba, our heterogeneous dual-stream design reduces the parameter count by approximately 51.6% (6.2 M vs. 12.8 M) while achieving superior mAP. This result suggests that, rather than embedding state-space models into deep monolithic networks, the proposed parallel dual-stream strategy more effectively exploits the complementary strengths of CNN-based texture perception and Mamba-based long-range modeling. Ultimately, the performance profile of SymbioMamba fully satisfies the real-time requirements of UAV operations, offering a robust solution for next-generation intelligent agricultural machinery.

4.3. Performance Comparison on Yield-Prediction Task Across Different Models

Beyond disease perception accuracy, the precision of yield prediction directly determines the practical value of agronomic decision-making. In this subsection, SymbioMamba is compared with mainstream visual backbone networks and agriculture-specific regression models. The evaluation focuses on the capability of different models to capture field-scale biomass distribution and to leverage pathological information for yield bias correction. Quantitative comparison results are summarized in Table 4.
As summarized in Table 4, SymbioMamba establishes a new performance benchmark in the maize yield regression task, achieving a superior coefficient of determination ( R 2 = 0.915 ) and a notable reduction in prediction error ( R M S E = 485.6 kg/ha). To elucidate the technical drivers behind these empirical gains, we analyze the model’s advantages through two lenses: (1) the enhancement of global context modeling via state-space scanning, and (2) the integration of agronomic causal reasoning through pathology–yield interaction. First, regarding global context modeling, convolution-based ConvNeXt and lightweight MobileViT exhibit relatively limited performance ( R 2 of 0.845 and 0.812, respectively), confirming that purely convolutional architectures are often constrained by local receptive fields and struggle to integrate biomass continuity features spanning entire fields. In contrast, Swin Transformer and VMamba successfully capture long-range dependencies, achieving R 2 values above 0.89. Notably, SymbioMamba surpasses these by using a dual-stream architecture that preserves leaf-level micro-texture cues—such as chlorophyll density patterns—closely related to biomass accumulation, offering superior fitting accuracy over homogeneous Mamba models. Second, SymbioMamba demonstrates a clear advantage in agronomic consistency. While specialized models like TasselNetV3 ( R 2 = 0.868 ) focus on morphological counting, they neglect the yield reduction induced by disease stress. ConvLSTM, although capable of spatiotemporal modeling, suffers from high computational overhead (12.6 G FLOPs and 8.5 FPS), failing edge deployment requirements. Conversely, SymbioMamba explicitly performs causal calibration through the PBCI module, leveraging disease detection features to adjust biomass representations. This interaction mechanism—where more severe disease logically corresponds to lower predicted yield—enables more accurate estimation with only 6.2 M parameters. 
Ultimately, SymbioMamba overcomes the limitation of conventional vision models that observe biomass while ignoring stress, providing a solution that simultaneously optimizes accuracy, biological consistency, and computational efficiency.
As shown in Figure 9, SymbioMamba consistently outperforms the baseline models in predicting maize yield, evidenced by the tight clustering of data points along the y = x line and the narrower 95% confidence ellipses compared to models such as MobileViT and ConvNeXt. Specifically, SymbioMamba achieves the highest coefficient of determination (R² = 0.915) and the lowest root mean square error (RMSE = 485.6 kg/ha), demonstrating its superior capability in capturing complex yield variations. While Transformer-based models such as Swin Transformer (R² = 0.892) and state-space models such as VMamba (R² = 0.901) also show strong performance, SymbioMamba’s integration of disease-aware calibration allows for more accurate predictions, particularly in lower yield ranges where stress impact is significant. This validates the effectiveness of the dual-stream architecture in leveraging both local pathology features and global biomass context for precise yield estimation.
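For reference, the two regression metrics used throughout this comparison can be computed as follows. This is a minimal stdlib sketch; the plot-level yield values below are illustrative and are not taken from the study dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target (here kg/ha)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative plot-level yields (kg/ha); not actual study data.
y_true = [9200.0, 8400.0, 7100.0, 9800.0, 6500.0]
y_pred = [9050.0, 8550.0, 7400.0, 9600.0, 6900.0]
print(round(rmse(y_true, y_pred), 1), round(r_squared(y_true, y_pred), 3))
```

Both quantities are scale-dependent in opposite ways: RMSE carries the physical units of yield, whereas R² normalizes the residual error by the variance of the ground truth, which is why the two are reported together.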
To further enhance interpretability and practical applicability in agricultural production, this study introduced a heatmap visualization method based on the integration of micro-plot spatial structures and model-predicted results, as shown in Figure 10. By mapping the yield results predicted by the model to the spatial distribution area of the corresponding sampling units, a discrete spatial heatmap with clear agronomic significance is formed. As qualitatively observed in the comparative results, SymbioMamba exhibits a superior ability to mirror the “Actual” yield distribution across the entire field trial compared to baselines. Specifically, while models like ConvNeXt and MobileViT frequently misclassify the “Mid-High” yield segments as “Avg” or “Mid-Low” in diseased plots due to a lack of global context, SymbioMamba maintains high fidelity in representing these subtle transitions. This visualization effectively reveals the characteristics of spatial differentiation in yield at the plot scale—specifically reconstructing local low-yield areas induced by disease stress—while demonstrating the model’s capability to capture yield fluctuations within fine-grained plots. The sharp preservation of high-yield clusters alongside disease-induced low-yield zones intuitively validates the effectiveness of the PBCI-driven causal calibration, thereby significantly strengthening the reference value of the model’s output for actual precision field management.

4.4. Ablation Studies

To thoroughly investigate the contribution of each core component within the SymbioMamba framework, a series of ablation experiments were conducted under identical experimental settings. The complete model was compared against several variants with specific modules removed, with a particular focus on evaluating the complementarity of the heterogeneous dual-stream architecture, the interaction effectiveness of the PBCI module, and the performance gains introduced by the TACAD distillation strategy. Quantitative results are summarized in Table 5 and Table 6, respectively.
Impact of heterogeneous dual-stream architecture. The complementarity between CNN and Mamba was first examined by decoupling the dual-stream structure. As shown in Table 5 and Table 6, SymbioMamba w/o Macro Stream (retaining only the Micro Stream) achieves reasonable disease detection performance by exploiting the local inductive bias of CNNs (mAP = 87.5%), yet performs poorly on yield prediction (R² drops to 0.821). This result indicates that local texture features alone are insufficient to capture field-scale biomass distribution patterns. Conversely, SymbioMamba w/o Micro Stream (retaining only the Macro Stream) benefits from Mamba’s long-sequence modeling capability and exhibits strong yield estimation performance, but suffers a substantial decline in disease detection accuracy (mAP decreases to 82.3%). This observation confirms that purely state-space model-based architectures tend to lose critical high-frequency edge information when handling pixel-level small lesions due to the absence of local convolutional operations. The complete model effectively integrates the strengths of both streams, demonstrating that assigning CNNs to “perceive lesion morphology” and Mamba to “understand field context” constitutes an optimal architecture for collaborative perception.
Effectiveness of PBCI mechanism. To evaluate the value of the PBCI module, PBCI was replaced with naive feature concatenation, yielding the variant SymbioMamba w/o PBCI. The results indicate that removing PBCI causes R² to decrease from 0.915 to 0.891, accompanied by an RMSE increase of 59.7 kg/ha. This pronounced degradation demonstrates that simple feature concatenation is insufficient for modeling the intrinsic relationship between disease and yield. The health-aware gating and causal calibration mechanisms embedded in PBCI successfully inject biological priors (i.e., disease stress leads to yield reduction) into the feature space, enabling the model to dynamically adjust yield predictions based on detected disease severity. In addition, a modest improvement in mAP (+0.8%) suggests that explicit biomass context also helps suppress false detections in non-crop background regions (e.g., bare soil).
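The health-aware gating inside PBCI can be illustrated with a minimal stand-alone sketch. This is not the published implementation: the scalar severity encoding, the gate parameters, and the feature shapes are our assumptions, intended only to show how a learned gate can encode the prior that disease stress suppresses biomass features.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pbci_gate(biomass_feat, disease_severity, w=1.5, b=0.0):
    """
    Health-aware gating (simplified sketch; w and b would be learned).
    disease_severity in [0, 1]: 0 = healthy canopy, 1 = fully infected.
    The gate is close to 1 when healthy and decays toward 0 with severity,
    attenuating biomass features so that more severe disease maps to a
    lower yield representation.
    """
    gate = sigmoid(w * (1.0 - 2.0 * disease_severity) + b)
    return [gate * f for f in biomass_feat]

feat = [0.8, 1.2, 0.5]
healthy = pbci_gate(feat, disease_severity=0.0)
infected = pbci_gate(feat, disease_severity=0.9)
print(healthy, infected)  # gated features shrink monotonically with severity
```

In contrast to naive concatenation, which leaves the regressor to discover the disease-yield relation statistically, a multiplicative gate of this kind enforces the suppression direction by construction.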
Contribution of TACAD. Finally, the contribution of the TACAD strategy was assessed. SymbioMamba w/o TACAD represents the model trained without teacher guidance. Although this variant already benefits from an effective architectural design, its performance remains constrained by the limited capacity of the lightweight backbone. After introducing distillation from the heavyweight VMamba teacher, the full SymbioMamba model achieves consistent performance gains without increasing inference-time parameter count (Params remain unchanged): mAP improves by 0.3% and R² increases by 0.007. As depicted in Figure 11, the training dynamics further corroborate this advantage. Specifically, subplot (a) illustrates that the full model converges to a lower training loss with reduced volatility compared to the student model alone. Furthermore, the zoomed-in views in subplots (b) and (c) reveal that teacher guidance not only consistently elevates the upper bounds of both disease detection accuracy and yield prediction fitness but also enhances training stability in the final epochs. These results indicate that TACAD successfully transfers global “dark knowledge” from the teacher network to the student, enabling the lightweight model to implicitly acquire deeper feature abstraction capability.
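The soft-label component of such cross-architecture distillation can be sketched with the standard temperature-scaled KL term (Hinton-style). The topology-aligning part of TACAD is specific to this paper and is not reproduced here; the temperature and logits below are illustrative.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_kl(student_logits, teacher_logits, T=2.0):
    """
    Temperature-scaled KL(teacher || student): the generic soft-label
    distillation loss, multiplied by T^2 so its gradient magnitude stays
    comparable to the hard-label loss as T grows.
    """
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

teacher = [4.0, 1.0, 0.5]
aligned = [3.8, 1.1, 0.4]   # student close to the teacher
diverged = [0.5, 4.0, 1.0]  # student far from the teacher
print(distill_kl(aligned, teacher), distill_kl(diverged, teacher))
```

The loss is zero when student and teacher distributions coincide and grows as they diverge, which is what allows the teacher's "dark knowledge" (relative class similarities) to shape the student without changing its parameter count at inference time.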

4.5. Robustness and Cross-Scenario Adaptability Analysis

To verify the generalizability of SymbioMamba across diverse pathological and environmental conditions, we conducted a “Stress-Test” evaluation. The dataset was expanded to include 1500 additional UAV images covering two further maize diseases: Common Rust and Banded Leaf Blight. Furthermore, to quantify the impact of real-world deployment challenges, we curated three specific interference subsets: (1) Heavy Weed Occlusion, where maize leaves are partially obscured by dense inter-row weeds; (2) Rainy/Low-Contrast, consisting of images captured under light rain or heavy overcast conditions with significant droplet noise and reduced luminance; and (3) Over-Exposure, featuring high-noon sunlight causing severe specular reflection on the leaf cuticle. All robustness tests were executed on the NVIDIA Jetson AGX Orin platform to evaluate the joint stability of accuracy and inference speed. Table 7 presents the performance of SymbioMamba compared to the most competitive baselines (YOLOv11, VMamba, and AgriTransformer) under these challenging conditions.
The experimental data indicate that SymbioMamba maintains high diagnostic reliability even when facing unprecedented pathological textures and environmental noise. In the Common Rust and Banded Leaf Blight tasks, our model outperformed AgriTransformer by 1.2% and 2.1% mAP, respectively. This suggests that the dual-stream architecture’s ability to decouple local textures is particularly effective for Rust’s small pustules and the large, irregular “water-soaked” spots characteristic of Banded Blight. Crucially, under complex environmental interference (Weeds/Rain), most baselines experienced an accuracy degradation of over 8–10%. However, SymbioMamba exhibited superior resilience, with the mAP dropping by only 6.0% (from 89.4% to 83.4%). This robustness is primarily attributable to the Mamba-based macro-context stream, which effectively maintains long-range spatial continuity in low-contrast rainy scenes, and the PBCI module, which filters out weed-induced false positives by enforcing biological causal priors. In terms of yield estimation, R² under environmental stress remained at a high level of 0.886, significantly outperforming VMamba (0.825) and confirming that our causality-aware interaction provides a “buffer” against visual noise. Despite the increased complexity of the scenes, the inference speed remained stable at 38.2 FPS, proving that the lightweight topology of SymbioMamba is robust not only in accuracy but also in operational throughput for real-world autonomous UAV missions.

4.6. Edge Deployment and Field Application Validation

To bridge the gap between high-accuracy deep learning models and practical agricultural production requirements, and to validate the real-time processing capability of SymbioMamba under resource-constrained conditions, the trained model was deployed on an onboard UAV edge computing platform (NVIDIA Jetson AGX Orin, 64 GB), and an integrated intelligent monitoring software system was developed.
In the practical deployment pipeline, as illustrated in Figure 12, the trained weights in PyTorch format were first exported to the Open Neural Network Exchange (ONNX) universal intermediate representation—an open-source ecosystem that provides a hardware-agnostic format to facilitate model interoperability across diverse deep learning frameworks [52]. This conversion allowed the model to be subsequently accelerated using the NVIDIA TensorRT inference engine with FP16 half-precision quantization to maximize edge-side throughput. The software interaction interface was developed based on the PyQt5 framework, with OpenCV integrated for real-time preprocessing and postprocessing of video streams. The system was configured to simultaneously handle two data streams: one stream was used to display the RGB video feedback with overlaid detection bounding boxes in real time, while the other stream continuously accumulated field-scale spatial information in the background. Notably, the software interface showcased in Figure 12 displays the real-time spatial distribution of disease severity captured during an actual field flight mission, illustrating the system’s capability to map pathology at the field scale.
To enable actionable agronomic decision-making, a hierarchical warning logic was designed on top of the model inference outputs. Unlike simple single-frame detection or localized surveys via mobile cameras, the UAV platform enables continuous, high-throughput spatial sensing across the entire field. A spatiotemporal smoothing strategy was adopted to suppress false alarms caused by UAV vibration or transient illumination changes. Specifically, prediction results from consecutive K frames (set to K = 5 in this study) were aggregated using weighted integration to improve decision stability and reliability. For disease alerts, a region was marked as a high-priority hotspot in the geographic information system (GIS) only when it was consistently identified as infected with NCLB across multiple consecutive observations and when the disease area ratio persistently exceeded a predefined threshold (e.g., reaching Grade 2 or higher). This strategy effectively reduces false positives induced by short-term noise or local occlusions and provides precise spatial guidance for subsequent targeted spraying operations by plant protection UAVs.
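The K-frame weighted aggregation described above can be sketched as follows. The paper specifies K = 5 and consecutive-observation consistency, but not the exact weights or area-ratio threshold; the linear weighting and the 0.15 threshold below are our assumptions for illustration.

```python
from collections import deque

class HotspotFilter:
    """Spatiotemporal smoothing sketch for the hierarchical warning logic."""

    def __init__(self, k=5, area_threshold=0.15):
        self.k = k
        self.area_threshold = area_threshold  # assumed Grade-2 area ratio
        self.history = deque(maxlen=k)        # per-frame disease area ratios

    def update(self, area_ratio):
        """Push one frame's NCLB area ratio; return True if a hotspot alert fires."""
        self.history.append(area_ratio)
        if len(self.history) < self.k:
            return False  # not enough consecutive observations yet
        # Linearly increasing weights favour the most recent frames.
        weights = list(range(1, self.k + 1))
        smoothed = sum(w * r for w, r in zip(weights, self.history)) / sum(weights)
        # Require every frame in the window to exceed the threshold as well,
        # suppressing single-frame spikes from vibration or transient glare.
        return smoothed > self.area_threshold and min(self.history) > self.area_threshold

f = HotspotFilter(k=5, area_threshold=0.15)
readings = [0.20, 0.02, 0.22, 0.21, 0.19, 0.23, 0.20, 0.21]
alerts = [f.update(r) for r in readings]
print(alerts)  # an alert fires only once five consecutive frames all exceed the threshold
```

Note how the isolated dropout at frame 2 (0.02, e.g. a momentary occlusion) delays the alert until a full clean window has accumulated, which is exactly the false-positive suppression behaviour the warning logic targets.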
During multiple flight tests conducted at the Inner Mongolia experimental site, the edge-side system demonstrated high operational stability and minimal impact on aircraft endurance. The NVIDIA Jetson AGX Orin module, operating under the MAXN power mode, maintained a power draw of approximately 45–55 W during active multi-task inference. Given the high battery capacity of the DJI Matrice 300 RTK platform, this additional computational load resulted in a reduction in flight time of less than 8%, ensuring that total operational endurance remained above 32 min per set of batteries. To bridge the discrepancy between the native camera resolution (5184 × 3888) and the model’s input dimension, an aspect ratio-preserving resizing strategy was employed. Specifically, the original frames were downsampled to 640 × 480 and subsequently integrated into a 640 × 640 tensor via symmetric gray-padding (letterboxing). To counteract the potential for motion blur and leaf instability caused by rotor wash at a cruising speed of 5 m/s, the Zenmuse H20T camera was configured with a high electronic shutter speed (1/1000 s or faster). Furthermore, at the designated flight altitude of 35 m, the downward airflow from the rotors is significantly dissipated before reaching the maize canopy, and the 1.5 cm/pixel ground sampling distance (GSD) provides sufficient spatial detail for identifying the characteristic elliptical lesions of Northern Corn Leaf Blight (NCLB), which typically span several centimeters in length. As shown in the system status panel (Figure 12), the SymbioMamba framework maintained a sustainable resource footprint, with a GPU load of 78% and an average inference speed of 38.2 FPS.
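The resizing strategy above reduces to simple geometry. A minimal sketch follows (the function name and return convention are ours; actual pixel resampling, e.g. via OpenCV, is omitted):

```python
def letterbox_dims(src_w, src_h, dst=640):
    """
    Aspect ratio-preserving resize plus symmetric gray padding (letterbox).
    Returns the resized content size and per-side padding needed to fill
    a dst x dst input tensor.
    """
    scale = min(dst / src_w, dst / src_h)   # never upscale one axis past dst
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = dst - new_w, dst - new_h
    # Symmetric split; any odd remainder goes to the right/bottom edge.
    left, top = pad_x // 2, pad_y // 2
    right, bottom = pad_x - left, pad_y - top
    return (new_w, new_h), (left, top, right, bottom)

# Native Zenmuse H20T frame (5184 x 3888, 4:3 aspect) into a 640 x 640 tensor:
size, pads = letterbox_dims(5184, 3888)
print(size, pads)  # (640, 480) content with 80 px of gray padding above and below
```

Because 5184:3888 is exactly 4:3, the scaled content lands precisely on 640 × 480, matching the pipeline described in the text; non-4:3 sensors would simply yield different padding splits.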
These results indicate that, by combining high-speed shutter control with the efficient dual-stream architecture of SymbioMamba, complex “disease–yield” collaborative perception can be successfully realized on embedded devices without compromising the physical stability or operational efficiency of the UAV.
To evaluate the model’s generalization capability beyond the specific conditions of the primary study site, we conducted a cross-dataset validation by incorporating 426 high-resolution maize leaf images from multiple public agricultural repositories (e.g., PlantVillage and open-access UAV datasets from different geographic regions). These samples encompass a wide variety of sensor types, including CMOS sensors from different manufacturers, and diverse ambient lighting conditions across multiple growing seasons. As summarized in Table 8, the SymbioMamba framework maintained high diagnostic accuracy on this external data, achieving an F1-score of 86.4% and an mAP@0.5 of 87.2%. This stability indicates that the heterogeneous dual-stream architecture, particularly the micro-texture stream, effectively extracts invariant pathological features that are robust to cross-regional and cross-sensor variations.
Furthermore, we conducted a rigorous hardware benchmark on the NVIDIA Jetson AGX Orin (64 GB) to address energy consumption and latency under realistic UAV operating conditions. During continuous field flight missions at a cruising speed of 5 m/s, the system’s power draw was measured using an external power analyzer. The results, detailed in Table 8, show that the system consumes 45–55 W during peak multi-task inference in MAXN mode. This energy footprint accounts for less than 8% of the total battery capacity of an enterprise-level UAV, ensuring that flight endurance is maintained above 90% of the manufacturer’s rated specification. Latency tests revealed an end-to-end processing time of 26.2 ms per frame, ensuring that the decision-making pipeline remains synchronized with the high-speed spatial sensing requirements of real-time disease mapping.
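The reported 26.2 ms per-frame latency is consistent with the 38.2 FPS figure (1000/26.2 ≈ 38.2). A minimal stdlib harness of the kind typically used for such measurements is sketched below; all names are ours, and in real use infer_fn would wrap the TensorRT engine invocation rather than the dummy workload shown here.

```python
import time

def benchmark_fps(infer_fn, frame, warmup=10, iters=100):
    """Measure mean end-to-end latency (ms) and the derived throughput (FPS)."""
    for _ in range(warmup):          # discard cold-start iterations
        infer_fn(frame)
    t0 = time.perf_counter()
    for _ in range(iters):
        infer_fn(frame)
    latency_ms = (time.perf_counter() - t0) / iters * 1000.0
    return latency_ms, 1000.0 / latency_ms

# Dummy stand-in workload; a deployment harness would call the TensorRT engine.
lat, fps = benchmark_fps(lambda f: sum(f), list(range(1000)))
print(f"{lat:.3f} ms/frame -> {fps:.1f} FPS")
```

Warmup iterations matter on embedded GPUs, where clock ramp-up and allocator initialization can otherwise dominate the first few frames and distort the mean.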

4.7. Discussion

Existing UAV-based visual studies on maize generally follow two main technical paradigms. One line of research focuses on disease detection or segmentation, typically adopting YOLO-based or lightweight CNN architectures and enhancing the recognition of small-scale lesions through multi-scale feature fusion. However, disease perception is often treated as an end task, with limited integration into yield-related representation and decision-making processes [53,54]. The other line targets yield estimation or yield proxy modeling (e.g., ears, tassels, or canopy structure), relying on CNNs, ConvLSTM, or Transformer-based models to capture global growth patterns. These approaches, however, are usually insensitive to disease stress and tend to overestimate yield in regions with severe infection [55,56]. In contrast, disease perception and yield calibration are treated in this study as a collaborative problem with a clear causal direction. Through a heterogeneous dual-stream architecture, the conflict between local texture discrimination and global biomass modeling is explicitly decoupled at the system level. Furthermore, the PBCI module embeds the agronomic prior that “disease stress suppresses biomass accumulation” directly into the feature interaction process. As a result, yield prediction is improved not only in numerical accuracy but also in biological consistency, which is particularly suitable for maize scenarios characterized by high planting density and spatially continuous disease diffusion.
Beyond the architectural advantages, the practical deployment of SymbioMamba considers several critical real-world constraints to ensure its field applicability. From an economic standpoint, the reliance on standard consumer-grade RGB sensors instead of specialized hyperspectral or LiDAR systems significantly reduces the hardware threshold for farmers, providing a cost-feasible solution for precision management. To address sensor calibration requirements across diverse lighting conditions, we employ a standardized preprocessing pipeline that includes photometric normalization and aspect ratio-preserving resizing, ensuring consistent feature extraction regardless of the time of day or camera model used. Furthermore, the framework’s robustness is inherently bolstered by the PBCI module, which acts as a physiological “filter” to handle agronomic variability; it ensures that localized noise—such as soil interference or shadows—is suppressed by the causal logic that requires pathological evidence to justify yield reductions. While UAV operational constraints like battery life and flight speed were optimized via TensorRT acceleration to maintain high-frequency throughput, future iterations could further enhance robustness by incorporating self-supervised domain adaptation to handle unprecedented cross-regional environmental shifts.
While the current validation focuses on a maize UAV dataset, the core architecture of SymbioMamba is designed with inherent generalizability. The heterogeneous dual-stream encoder establishes a universal phenotypic template—“local disease cues + global growth context”—that is theoretically transferable to staple crops such as rice, wheat, soybean, and cotton. To successfully adapt this framework to other species, several key adaptations are recommended. Primarily, the micro-texture stream should be fine-tuned or re-trained on crop-specific pathological datasets to accommodate the distinct morphological signatures of different diseases, such as the linear rust patterns in wheat or the elliptical blast spots in rice. Moreover, the learnable parameters within the PBCI health-gating mechanism must be recalibrated through end-to-end training to reflect the unique physiological stress-response curves and biomass-to-yield conversion rates of the target crop. Notably, in scenarios involving compound environmental stresses, the framework could be further augmented by integrating multispectral indices or phenological data into the macro-context-scan stream. By leveraging the cross-architecture knowledge transfer provided by TACAD, these adaptations can be implemented without compromising the real-time inference efficiency required for cross-region UAV deployment.
Limitations and future work. Several limitations remain to be addressed in future research. First, yield is inherently a plot-level variable, and the ground-truth acquisition process is subject to measurement noise and management variability, which may constrain the upper bound of regression performance. Second, under different crop varieties and management conditions, the relationship between disease and yield may exhibit more complex nonlinear or threshold effects. The current health-gating-based causal calibration primarily captures a suppression trend and may be further refined under extreme or compound stress scenarios. Future work may consider incorporating crop phenological information, environmental variables, or multispectral data to further enhance model adaptability to complex agricultural conditions and improve robustness across crops.

5. Conclusions

This study introduces SymbioMamba, a novel symbiotic framework that fundamentally shifts UAV-based maize phenotypic analysis from independent task modeling toward a causality-aware integrated paradigm. The core innovation resides in the heterogeneous dual-stream encoder, which architecturally resolves the deep-seated scale conflict between microscopic lesion textures and macroscopic biomass continuity. By explicitly embedding the agronomic prior that “pathological stress suppresses yield” into the latent feature space, the proposed PBCI module transforms statistical correlations into biologically consistent causal logic. This methodological shift allowed us to quantify the nonlinear yield loss thresholds, revealing that while early-stage infections (Grade 1–2) result in marginal losses (3.5–8.2%), severe progression (Grade 4) triggers an accelerated decline exceeding 27.4%. Beyond its structural novelty, the framework leverages the TACAD strategy to bridge the representation gap between Mamba-based global scanning and lightweight convolutional structures, achieving a Pareto-optimal trade-off on edge-side hardware. Experimental results substantiate that SymbioMamba not only sets new benchmarks for accuracy (89.4% mAP and R² = 0.915) but also establishes a resilient operational boundary under extreme environmental stresses, such as high-density weed occlusion and rainy conditions (maintaining R² = 0.886). Ultimately, this work provides a scalable and interpretable technical template for the transition of agricultural UAVs toward active, real-time decision-making in next-generation intelligent farming systems.

Author Contributions

Conceptualization, Z.W., Y.W., B.Z. and Y.S.; data curation, P.G. and H.Y.; formal analysis, X.Y.; funding acquisition, Y.S.; investigation, X.Y.; methodology, Z.W., Y.W. and B.Z.; project administration, Y.S.; resources, P.G. and H.Y.; software, Z.W., Y.W. and B.Z.; supervision, Y.S.; validation, X.Y.; visualization, P.G. and H.Y.; writing—original draft, Z.W., Y.W., B.Z., X.Y., P.G., H.Y. and Y.S. Z.W., Y.W. and B.Z. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cole, M.B.; Augustin, M.A.; Robertson, M.J.; Manners, J.M. The science of food security. NPJ Sci. Food 2018, 2, 14. [Google Scholar] [CrossRef] [PubMed]
  2. Primicerio, J.; Di Gennaro, S.F.; Fiorillo, E.; Genesio, L.; Lugato, E.; Matese, A.; Vaccari, F.P. A flexible unmanned aerial vehicle for precision agriculture. Precis. Agric. 2012, 13, 517–523. [Google Scholar] [CrossRef]
  3. Toscano, F.; Fiorentino, C.; Capece, N.; Erra, U.; Travascia, D.; Scopa, A.; Drosos, M.; D’Antonio, P. Unmanned aerial vehicle for precision agriculture: A review. IEEE Access 2024, 12, 69188–69205. [Google Scholar] [CrossRef]
  4. Velusamy, P.; Rajendran, S.; Mahendran, R.K.; Naseer, S.; Shafiq, M.; Choi, J.G. Unmanned Aerial Vehicles (UAV) in precision agriculture: Applications and challenges. Energies 2021, 15, 217. [Google Scholar] [CrossRef]
  5. Chriki, A.; Touati, H.; Snoussi, H.; Kamoun, F. Deep learning and handcrafted features for one-class anomaly detection in UAV video. Multimed. Tools Appl. 2021, 80, 2599–2620. [Google Scholar] [CrossRef]
  6. Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Deep learning techniques to classify agricultural crops through UAV imagery: A review. Neural Comput. Appl. 2022, 34, 9511–9536. [Google Scholar] [CrossRef]
  7. MirhoseiniNejad, S.M.; Abbasi-Moghadam, D.; Sharifi, A. ConvLSTM-ViT: A deep neural network for crop yield prediction using Earth observations and remotely sensed data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17489–17502. [Google Scholar] [CrossRef]
  8. Senarathna, J.I. Enhancing Rice Production Through a Multi-Task Neural Network Framework: A Smart Agricultural Solution for Growth Monitoring, Disease Detection, and Yield Prediction. Preprints 2025. [Google Scholar] [CrossRef]
  9. Alirezazadeh, P.; Schirrmann, M.; Stolzenburg, F. A comparative analysis of deep learning methods for weed classification of high-resolution UAV images. J. Plant Dis. Prot. 2024, 131, 227–236. [Google Scholar] [CrossRef]
  10. Castellano, G.; De Marinis, P.; Vessio, G. Applying knowledge distillation to improve weed mapping with drones. In Proceedings of the 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), Warsaw, Poland, 17–20 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 393–400. [Google Scholar]
  11. Zhao, M.; Wang, D.; Zhang, G.; Cao, W.; Xu, S.; Li, Z.; Liu, X. Evaluating Maize Emergence Quality with Multi-task YOLO11-Mamba and UAV-RGB Remote Sensing. Smart Agric. Technol. 2025, 12, 101351. [Google Scholar] [CrossRef]
  12. Mittal, P. A comprehensive survey of deep learning-based lightweight object detection models for edge devices. Artif. Intell. Rev. 2024, 57, 242. [Google Scholar] [CrossRef]
  13. Liu, S.; Liang, Y.; Gitter, A. Loss-balanced task weighting to reduce negative transfer in multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Palo Alto, CA, USA, 2019; Volume 33, pp. 9977–9978. [Google Scholar]
  14. Tahir, M.N.; Lan, Y.; Zhang, Y.; Wenjiang, H.; Wang, Y.; Naqvi, S.M.Z.A. Application of unmanned aerial vehicles in precision agriculture. In Precision Agriculture; Elsevier: San Diego, CA, USA, 2023; pp. 55–70. [Google Scholar]
  15. Bai, X.; Liu, P.; Cao, Z.; Lu, H.; Xiong, H.; Yang, A.; Cai, Z.; Wang, J.; Yao, J. Rice plant counting, locating, and sizing method based on high-throughput UAV RGB images. Plant Phenomics 2023, 5, 0020. [Google Scholar] [CrossRef] [PubMed]
  16. Mora, J.J.; Selvaraj, M.G.; Alvarez, C.I.; Safari, N.; Blomme, G. From pixels to plant health: Accurate detection of banana Xanthomonas wilt in complex African landscapes using high-resolution UAV images and deep learning. Discov. Appl. Sci. 2024, 6, 377. [Google Scholar] [CrossRef]
  17. Lyu, Y.; Han, X.; Wang, P.; Shin, J.Y.; Ju, M.W. Unmanned aerial vehicle-based rgb imaging and lightweight deep learning for downy mildew detection in kimchi cabbage. Remote Sens. 2025, 17, 2388. [Google Scholar] [CrossRef]
  18. You, S.; Li, B.; Chen, Y.; Ren, Z.; Liu, Y.; Wu, Q.; Tao, J.; Zhang, Z.; Zhang, C.; Xue, F.; et al. Rose-Mamba-YOLO: An enhanced framework for efficient and accurate greenhouse rose monitoring. Front. Plant Sci. 2025, 16, 1607582. [Google Scholar] [CrossRef]
  19. Sharma, V.; Patel, V.K.; Sahu, Y.; Vyas, M. AgriTransformer: Synergizing Robust Deep Transformer Model for Intelligent Farming. In Communication and Intelligent Systems, Proceedings of ICCIS 2024; Springer: Singapore, 2024; pp. 397–406. [Google Scholar]
  20. Bhuvaneswari, P.; Srilatha, K.; Kavya, A.; Anitha, T.; Sasikala, V. Enhanced Plant Pest Classification by Leveraging CBAM Attention in ResNet-9. In Proceedings of the 2025 Fourth International Conference on Smart Technologies, Communication and Robotics (STCR), Sathyamangalam, India, 9–10 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
  21. Ray, R.K.; Chakravarty, S.; Dash, S.; Ghosh, A.; Mohanty, S.N.; Chirra, V.R.R.; Ayouni, S.; Khan, M.I. Precision pest management in agriculture using Inception V3 and EfficientNet B4: A deep learning approach for crop protection. Inf. Process. Agric. 2025, 13, 142–161. [Google Scholar] [CrossRef]
  22. Barman, U.; Sarma, P.; Rahman, M.; Deka, V.; Lahkar, S.; Sharma, V.; Saikia, M.J. Vit-SmartAgri: Vision transformer and smartphone-based plant disease detection for smart agriculture. Agronomy 2024, 14, 327. [Google Scholar] [CrossRef]
  23. Duc, C.M.; Fukui, H. SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models. arXiv 2025, arXiv:2502.00435. [Google Scholar] [CrossRef]
  24. Zhang, Q.; Zhang, X.; Quan, C.; Zhao, T.; Huo, W.; Huang, Y. Mamba-STFM: A Mamba-Based Spatiotemporal Fusion Method for Remote Sensing Images. Remote Sens. 2025, 17, 2135. [Google Scholar] [CrossRef]
  25. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  26. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
  27. Cao, Y.; Liu, C.; Wu, Z.; Zhang, L.; Yang, L. Remote sensing image segmentation using vision mamba and multi-scale multi-frequency feature fusion. Remote Sens. 2025, 17, 1390. [Google Scholar] [CrossRef]
  28. Li, D.; Sun, J.; Liu, Y. Hierarchical semantic alignment heterogeneous knowledge distillation model for smart agriculture crop leaf disease recognition. Expert Syst. Appl. 2025, 296, 129100. [Google Scholar] [CrossRef]
  29. Chowdhury, R.H.; Ahmed, S. MangoLeafViT: Leveraging Lightweight Vision Transformer with Runtime Augmentation for Efficient Mango Leaf Disease Classification. In Proceedings of the 2024 27th International Conference on Computer and Information Technology (ICCIT); IEEE: Piscataway, NJ, USA, 2024; pp. 699–704. [Google Scholar]
  30. Yang, Z.; Li, Z.; Zeng, A.; Li, Z.; Yuan, C.; Li, Y. Vitkd: Feature-based knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 1379–1388. [Google Scholar]
  31. Patel, D.J.; Patel, P.S.; Patel, T.J.; Viradiya, M.D.; Patel, J.B.; Garg, D. Real-Time Object Detection and Recognition on Jetson Nano. In ICT Analysis and Applications, Proceedings of ICT4SD 2024; Springer: Singapore, 2024; pp. 349–360. [Google Scholar]
  32. Espejo-Garcia, B.; Güldenring, R.; Nalpantidis, L.; Fountas, S. Foundation vision models in agriculture: DINOv2, LoRA and knowledge distillation for disease and weed identification. Comput. Electron. Agric. 2025, 239, 110900. [Google Scholar] [CrossRef]
  33. He, J.; Jiang, J.; Zhang, C. A Survey of Lightweight Methods for Object Detection Networks. Array 2025, 29, 100589. [Google Scholar] [CrossRef]
  34. Cheng, S.; Das, S.; Qu, S.; Ballan, L. KD-Mamba: Selective state space models with knowledge distillation for trajectory prediction. Comput. Vis. Image Underst. 2025, 261, 104499. [Google Scholar] [CrossRef]
  35. Sanabria-Velazquez, A.D.; Enciso-Maldonado, G.A.; Maidana-Ojeda, M.; Diaz-Najera, J.F.; Thiessen, L.D.; Shew, H.D. Validation of standard area diagrams to estimate the severity of Septoria leaf spot on stevia in Paraguay, Mexico, and the United States. Plant Dis. 2023, 107, 1829–1838. [Google Scholar] [CrossRef]
  36. Xie, D.; Ye, W.; Pan, Y.; Wang, J.; Qiu, H.; Wang, H.; Li, Z.; Chen, T. GCPDFFNet: Small Object Detection for Rice Blast Recognition. Phytopathology 2024, 114, 1490–1501. [Google Scholar] [CrossRef]
  37. Dehghani, A.; Sarbishei, O.; Glatard, T.; Shihab, E. A quantitative comparison of overlapping and non-overlapping sliding windows for human activity recognition using inertial sensors. Sensors 2019, 19, 5026. [Google Scholar] [CrossRef]
  38. Hamilton, A.; Culhane, M. Spherical redshift distortions. arXiv 1995, arXiv:astro-ph/9507021. [Google Scholar]
  39. Finkelstein, A.; Range, M. Image mosaics. In Electronic Publishing, Artistic Imaging, and Digital Typography, Proceedings of the International Conference on Raster Imaging and Digital Typography; Springer: Berlin/Heidelberg, Germany, 1998; pp. 11–22. [Google Scholar]
  40. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  41. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  42. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  43. Wang, Z.; Li, C.; Xu, H.; Zhu, X.; Li, H. Mamba YOLO: A simple baseline for object detection with state space model. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; AAAI Press: Palo Alto, CA, USA, 2025; Volume 39, pp. 8205–8213. [Google Scholar]
  44. Jácome Galarza, L.; Realpe, M.; Viñán-Ludeña, M.S.; Calderón, M.F.; Jaramillo, S. Agritransformer: A transformer-based model with attention mechanisms for enhanced multimodal crop yield prediction. Electronics 2025, 14, 2466. [Google Scholar] [CrossRef]
  45. Nejad, S.M.M.; Abbasi-Moghadam, D.; Sharifi, A.; Farmonov, N.; Amankulova, K.; László, M. Multispectral crop yield prediction using 3D-convolutional neural networks and attention convolutional LSTM approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 254–266. [Google Scholar] [CrossRef]
  46. Lu, H.; Liu, L.; Li, Y.N.; Zhao, X.M.; Wang, X.Q.; Cao, Z.G. TasselNetV3: Explainable plant counting with guided upsampling and background suppression. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4700515. [Google Scholar] [CrossRef]
  47. Zhang, P.; Zhang, S.; Wang, J.; Sun, X. Identifying rice lodging based on semantic segmentation architecture optimization with UAV remote sensing imaging. Comput. Electron. Agric. 2024, 227, 109570. [Google Scholar] [CrossRef]
  48. Dionisio, M.A.C.; Salazar, I.J.D.; Hortinela, C.C. EfficientNet-Lite 4-Based Classification System for Grading Philippine Strawberries. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Sabah, 26–28 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 366–371. [Google Scholar]
  49. Liang, W.; Tan, J.; He, H.; Xu, H.; Li, J. Detection of small objects from UAV imagery via an improved Swin transformer. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 9134–9138. [Google Scholar]
  50. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  51. Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in vision: A comprehensive survey of techniques and applications. arXiv 2024, arXiv:2410.03105. [Google Scholar] [CrossRef]
  52. Jin, T.; Bercea, G.T.; Le, T.D.; Chen, T.; Su, G.; Imai, H.; Negishi, Y.; Leu, A.; O’Brien, K.; Kawachiya, K.; et al. Compiling onnx neural network models using mlir. arXiv 2020, arXiv:2008.08272. [Google Scholar] [CrossRef]
  53. Yan, Y.; Song, F.; Sun, J. The application of UAV technology in maize crop protection strategies: A review. Comput. Electron. Agric. 2025, 237, 110679. [Google Scholar] [CrossRef]
  54. Gao, C.; He, B.; Guo, W.; Qu, Y.; Wang, Q.; Dong, W. SCS-YOLO: A real-time detection model for agricultural diseases—A case study of wheat fusarium head blight. Comput. Electron. Agric. 2025, 238, 110794. [Google Scholar] [CrossRef]
  55. Wang, J.; Wang, P.; Tian, H.; Tansey, K.; Liu, J.; Quan, W. A deep learning framework combining CNN and GRU for improving wheat yield estimates using time series remotely sensed multi-variables. Comput. Electron. Agric. 2023, 206, 107705. [Google Scholar] [CrossRef]
  56. Song, D.; Sun, H.; Ngumbi, E.; Kamruzzaman, M. Multispectral image reconstruction from RGB image for maize growth status monitoring based on window-adaptive spatial-spectral attention transformer. Comput. Electron. Agric. 2025, 239, 111062. [Google Scholar] [CrossRef]
Figure 1. Challenges hindering the unified integration and deployment of collaborative disease perception and yield-estimation approaches on UAV edge platforms.
Figure 2. Schematic illustration of the UAV-based data acquisition campaign across heterogeneous agronomic environments. The platform (DJI Matrice 300 RTK) dynamically adjusts flight altitude (30–50 m) to maintain a consistent ground sampling distance (GSD) of 1.5 cm/pixel, enabling high-resolution capture of northern corn leaf blight (NCLB) lesions across both the flat, irrigated Hetao Irrigation District and the hilly, rainfed Jungar Banner site.
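The altitude-driven GSD control described in Figure 2 follows the standard linear relation between flight altitude, pixel pitch, and focal length. The sketch below illustrates it; the focal length and pixel pitch are placeholder values chosen for round numbers, not the actual Matrice 300 payload specification.

```python
# Illustrative sketch of the altitude <-> GSD relation in Figure 2.
# focal_mm and pixel_um are hypothetical optics, NOT the real payload spec.
def gsd_cm_per_px(altitude_m: float, focal_mm: float = 8.8, pixel_um: float = 3.3) -> float:
    """Ground sampling distance (cm/pixel) = altitude * pixel_pitch / focal_length."""
    return altitude_m * 100.0 * (pixel_um * 1e-6) / (focal_mm * 1e-3)

def altitude_for_gsd(target_cm: float, focal_mm: float = 8.8, pixel_um: float = 3.3) -> float:
    """Invert the relation: the altitude (m) that yields the target GSD."""
    return (target_cm / 100.0) * (focal_mm * 1e-3) / (pixel_um * 1e-6)
```

With these placeholder optics, holding a 1.5 cm/pixel GSD corresponds to flying at 40 m; terrain-following flight simply re-solves `altitude_for_gsd` as ground elevation changes.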
Figure 3. Visual examples of maize leaf disease severity grades, including the healthy state and four progressive infection levels. From left to right, the samples correspond to Healthy, Grade 1, Grade 2, Grade 3, and Grade 4.
Figure 4. Structure of the micro-texture stream for efficient extraction of fine-grained disease texture features.
Figure 5. Macro-context-scan stream based on a visual state space with two-dimensional selective scanning to capture global spatial dependencies with linear complexity.
Figure 6. Pathology–biomass collaborative interaction module integrating bidirectional Mamba, linear attention, and a health-aware gating mechanism for causal feature calibration.
Figure 7. Visualization of confusion matrices for the proposed SymbioMamba and comparative baseline models on the maize disease severity classification task. The matrices illustrate the per-class classification performance across five disease grades (Healthy, Grade 1–4). Darker blue on the diagonal indicates higher prediction accuracy, demonstrating SymbioMamba’s superior capability in distinguishing fine-grained disease severity levels compared to SOTA methods like YOLOv11 and VMamba.
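The diagonal intensities in Figure 7's confusion matrices correspond to per-class recall: each diagonal entry divided by its row sum, with rows denoting the true grade. A minimal sketch (the 3×3 matrix below is a toy example, not measured data):

```python
def per_class_recall(cm):
    """Per-class recall from a confusion matrix whose rows are true classes.

    Each diagonal entry divided by its row sum gives the fraction of that
    class's samples predicted correctly -- the diagonal shading in Figure 7.
    """
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Toy 3-class example (illustrative only):
cm = [[90, 5, 5],
      [10, 80, 10],
      [0, 20, 80]]
```

For the paper's task the matrix would be 5×5, matching the five grades (Healthy, Grade 1–4).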
Figure 8. Speed–accuracy trade-off comparison of different detection models on the maize dataset. The x-axis represents the inference speed (FPS) measured on an embedded edge device, while the y-axis denotes the detection accuracy (mAP@0.5). The bubble size corresponds to the model’s parameter count (Params). For the comparative baselines, circle colors indicate architectural categories: gray signifies general-purpose hierarchical backbones (ConvNeXt, Swin Transformer, and VMamba), dark blue represents YOLO-series detection frameworks (YOLOv11 and YOLO-Mamba), and green denotes dedicated lightweight architectures (MobileViT). The green-shaded region highlights the real-time operating zone (FPS > 30). SymbioMamba (marked with a red star) achieves the optimal balance, delivering state-of-the-art accuracy with real-time performance and a low parameter footprint compared to other baselines.
Figure 9. Scatter plots comparing predicted versus measured maize yield (kg/ha) for SymbioMamba and state-of-the-art baseline models. The solid black line represents the ideal 1:1 prediction ( y = x ), while the shaded blue ellipses indicate the 95% confidence intervals of the prediction distribution. Statistical metrics (n, R 2 , and RMSE) are provided for each model, highlighting the superior goodness-of-fit ( R 2 = 0.915 ) and reduced error achieved by the proposed SymbioMamba framework in field-scale yield estimation.
Figure 10. Heat map visualization of the fitting results of SymbioMamba and baselines for maize yield prediction in the test set.
Figure 11. Training dynamics and ablation analysis of the TACAD strategy. The subplots display: (a) the convergence of total training loss, (b) the evolution of disease detection accuracy (mAP@0.5), and (c) the progression of yield prediction performance ( R 2 ) over 300 epochs. Shaded areas represent the standard deviation across multiple runs. Inset plots highlight the performance gap in the final 50 epochs, demonstrating the stability and accuracy gains provided by transferring global knowledge from the teacher model to the student network.
Figure 12. User interface of the onboard intelligent monitoring system deployed on NVIDIA Jetson AGX Orin, displaying the real-time field-scale disease mapping result.
Table 1. Technical specifications of the DJI Matrice 300 RTK UAV platform.

| Parameter | Specification |
|---|---|
| Dimensions (unfolded) | 810 × 670 × 430 mm |
| Weight (with batteries) | Approx. 6.3 kg |
| Max Takeoff Weight | 9.0 kg |
| Max Flight Time | 55 min (no payload) |
| Max Flight Speed | 23 m/s (S Mode) |
| Hovering Accuracy | Vertical: ±0.1 m; Horizontal: ±0.1 m (RTK enabled) |
| Operating Temperature | −20 °C to 50 °C |
| Ingress Protection | IP45 |
| Max Payload | 2.7 kg |
Table 2. Statistical distribution of the collected maize dataset across disease grades. The yield data represent the mean value ± standard deviation measured from ground-truth sampling plots corresponding to each disease severity level.

| Grade | Infection Ratio (%) | No. of Patches | Percentage (%) | Avg. Yield (kg/ha) |
|---|---|---|---|---|
| Healthy | 0 | 3547 | 28.0 | 11,250 ± 420 |
| Grade 1 | <10 | 3203 | 25.6 | 10,840 ± 510 |
| Grade 2 | 10–25 | 2819 | 22.4 | 9650 ± 680 |
| Grade 3 | 25–50 | 1459 | 14.4 | 7820 ± 850 |
| Grade 4 | >50 | 1046 | 9.6 | 5430 ± 940 |
| Total | – | 12,074 | 100.0 | – |
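The infection-ratio bins in Table 2 imply a simple grading rule, sketched below. How values falling exactly on a bin boundary (10%, 25%, 50%) are assigned is our assumption; the table lists the bins without stating a boundary convention.

```python
def severity_grade(infection_ratio_pct: float) -> str:
    """Map a leaf-area infection ratio (%) to the severity grades of Table 2.

    ASSUMPTION: boundary values (10, 25, 50) are assigned to the higher
    grade; the paper's exact convention is not stated.
    """
    if infection_ratio_pct <= 0:
        return "Healthy"
    if infection_ratio_pct < 10:
        return "Grade 1"
    if infection_ratio_pct < 25:
        return "Grade 2"
    if infection_ratio_pct < 50:
        return "Grade 3"
    return "Grade 4"
```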
Table 3. Quantitative comparison of disease-detection performance with SOTA methods on the collected maize dataset (mean ± std). All models were tested on an embedded edge computing device. Bold indicates the best performance, italics indicate the second best, and the asterisk (*) denotes statistically significant improvements (p < 0.05) over the second-best baseline according to a two-tailed paired t-test.

| Method | P (%) | R (%) | F1 (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|---|---|
| ConvNeXt | 84.5 ± 0.4 | 82.1 ± 0.5 | 83.3 ± 0.4 | 85.3 ± 0.3 | 58.4 ± 0.4 | 28.6 | 4.5 | 24.1 ± 0.8 |
| EfficientNet-Lite4 | 82.1 ± 0.5 | 79.8 ± 0.6 | 80.9 ± 0.5 | 81.5 ± 0.4 | 52.6 ± 0.5 | **5.4** | **1.6** | **44.5 ± 1.1** |
| Swin Transformer | 85.2 ± 0.3 | 83.8 ± 0.4 | 84.5 ± 0.3 | 86.1 ± 0.3 | 60.2 ± 0.3 | 29.3 | 4.8 | 14.2 ± 0.5 |
| MobileViT | 83.2 ± 0.4 | 81.5 ± 0.5 | 82.3 ± 0.4 | 83.8 ± 0.4 | 55.4 ± 0.5 | 6.0 | _2.1_ | _39.2 ± 0.9_ |
| VMamba | 85.8 ± 0.4 | 84.5 ± 0.3 | 85.1 ± 0.3 | 86.9 ± 0.3 | 61.5 ± 0.4 | 26.2 | 4.3 | 28.5 ± 0.9 |
| AgriTransformer | 86.8 ± 0.3 | 85.4 ± 0.4 | 86.1 ± 0.3 | 87.5 ± 0.3 | 62.8 ± 0.3 | 32.4 | 6.2 | 12.8 ± 0.4 |
| YOLOv11 | 86.5 ± 0.3 | 85.1 ± 0.4 | 85.8 ± 0.3 | _88.2 ± 0.2_ | 63.4 ± 0.3 | 9.4 | 3.2 | 37.6 ± 1.1 |
| YOLO-Mamba | _87.1 ± 0.3_ | _86.3 ± 0.3_ | _86.7 ± 0.3_ | 88.0 ± 0.2 | _64.1 ± 0.2_ | 12.8 | 3.8 | 35.8 ± 1.0 |
| SymbioMamba (Ours) | **88.3 ± 0.2** * | **87.5 ± 0.2** * | **87.9 ± 0.2** * | **89.4 ± 0.2** * | **65.8 ± 0.2** * | _5.8_ | 2.4 | 38.2 ± 0.7 |
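The F1 column in Table 3 is the harmonic mean of precision and recall; e.g., SymbioMamba's 88.3% precision and 87.5% recall give F1 ≈ 87.9%, matching the reported value. A one-function check:

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall (both in %), as in Table 3."""
    return 2.0 * precision_pct * recall_pct / (precision_pct + recall_pct)
```

The same relation reproduces the other rows, e.g. YOLO-Mamba: 2 × 87.1 × 86.3 / (87.1 + 86.3) ≈ 86.7.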
Table 4. Quantitative comparison of yield prediction performance with SOTA methods on the collected maize dataset (mean ± std). All models were tested on an embedded edge computing device. Bold indicates the best performance, italics indicate the second best, and the asterisk (*) denotes statistically significant improvements (p < 0.05) over the second-best baseline according to a two-tailed paired t-test.

| Method | R² | RMSE (kg/ha) | MAE (kg/ha) | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|
| ConvNeXt | 0.845 ± 0.005 | 685.4 ± 12.5 | 512.3 ± 10.2 | 28.6 | 4.5 | 24.1 ± 0.8 |
| EfficientNet-Lite4 | 0.830 ± 0.006 | 712.4 ± 14.2 | 535.6 ± 11.5 | **5.4** | **1.6** | **44.5 ± 1.1** |
| Swin Transformer | 0.892 ± 0.004 | 540.2 ± 9.8 | 405.8 ± 8.5 | 29.3 | 4.8 | 14.2 ± 0.5 |
| MobileViT | 0.838 ± 0.005 | 698.2 ± 11.6 | 522.4 ± 9.4 | 6.0 | _2.1_ | _39.2 ± 0.9_ |
| VMamba | 0.901 ± 0.003 | 515.8 ± 8.2 | 388.4 ± 6.9 | 26.2 | 4.3 | 28.5 ± 0.9 |
| AgriTransformer | _0.905 ± 0.003_ | _502.4 ± 7.9_ | _375.8 ± 6.2_ | 32.4 | 6.2 | 12.8 ± 0.4 |
| ConvLSTM | 0.875 ± 0.004 | 598.5 ± 11.4 | 442.7 ± 9.3 | 34.5 | 12.6 | 8.5 ± 0.3 |
| TasselNetV3 | 0.868 ± 0.005 | 612.3 ± 10.8 | 465.2 ± 8.8 | 14.2 | 3.9 | 35.6 ± 1.1 |
| SymbioMamba (Ours) | **0.915 ± 0.002** * | **485.6 ± 7.5** * | **362.1 ± 5.4** * | _5.8_ | 2.4 | 38.2 ± 0.7 |
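The R², RMSE, and MAE columns in Table 4 follow the standard regression definitions; a self-contained sketch, assuming no weighting or other adjustments beyond the textbook formulas:

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard R^2 (coefficient of determination), RMSE, and MAE."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae
```

In the paper's setting, `y_true` would hold measured plot yields (kg/ha) and `y_pred` the model estimates, as plotted in Figure 9.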
Table 5. Ablation study on the disease detection task (mean ± SD). Bold indicates the best performance.

| Model Variant | P (%) | R (%) | mAP@0.5 (%) |
|---|---|---|---|
| SymbioMamba w/o Macro Stream | 86.2 ± 0.3 | 84.8 ± 0.4 | 87.5 ± 0.3 |
| SymbioMamba w/o Micro Stream | 81.5 ± 0.5 | 79.2 ± 0.6 | 82.3 ± 0.5 |
| SymbioMamba w/o PBCI | 87.5 ± 0.3 | 86.4 ± 0.3 | 88.6 ± 0.2 |
| SymbioMamba w/o TACAD | 88.0 ± 0.3 | 87.1 ± 0.3 | 89.1 ± 0.2 |
| SymbioMamba (Full) | **88.3 ± 0.2** | **87.5 ± 0.2** | **89.4 ± 0.2** |
Table 6. Ablation study on the yield prediction task (mean ± SD). Bold indicates the best performance.

| Model Variant | R² | RMSE (kg/ha) |
|---|---|---|
| SymbioMamba w/o Macro Stream | 0.821 ± 0.005 | 724.5 ± 14.2 |
| SymbioMamba w/o Micro Stream | 0.884 ± 0.004 | 568.2 ± 10.5 |
| SymbioMamba w/o PBCI | 0.891 ± 0.003 | 545.3 ± 9.6 |
| SymbioMamba w/o TACAD | 0.908 ± 0.003 | 498.7 ± 8.2 |
| SymbioMamba (Full) | **0.915 ± 0.002** | **485.6 ± 7.5** |
Table 7. Quantification of model robustness under multi-disease and complex environmental interference (mean ± SD). Bold indicates the best performance, italics indicate the second best, and statistical significance (p < 0.05) compared to the best baseline is marked with an asterisk (*).

| Method | Common Rust mAP@0.5 (%) | Banded Blight mAP@0.5 (%) | Weed/Rain Interference mAP@0.5 (%) | Yield R² (Normal) | Yield R² (Environmental Stress) | FPS |
|---|---|---|---|---|---|---|
| YOLOv11 | 85.4 ± 0.4 | 82.1 ± 0.6 | 76.2 ± 0.8 | 0.882 ± 0.005 | 0.784 ± 0.009 | 37.6 ± 1.1 |
| VMamba | 86.2 ± 0.3 | 83.5 ± 0.5 | 79.4 ± 0.7 | 0.901 ± 0.003 | 0.825 ± 0.007 | 28.5 ± 0.9 |
| AgriTransformer | _86.9 ± 0.3_ | _84.2 ± 0.4_ | _80.1 ± 0.6_ | _0.905 ± 0.003_ | _0.842 ± 0.006_ | 12.8 ± 0.4 |
| SymbioMamba (Ours) | **88.1 ± 0.2** * | **86.3 ± 0.3** * | **83.4 ± 0.5** * | **0.915 ± 0.002** * | **0.886 ± 0.004** * | _38.2 ± 0.7_ |
Table 8. Rigorous hardware and generalization benchmarks for SymbioMamba.

| Category | Parameter/Metric | Observed Value |
|---|---|---|
| Generalization | Cross-Sensor mAP@0.5 | 87.2% |
|  | Cross-Regional F1-score | 86.4% |
| Energy Profile | Peak Power Consumption | 45–55 W |
|  | Battery Endurance Retention | >90% |
| Deployment Latency | Inference Speed | 38.2 FPS |
|  | End-to-End Latency | 26.2 ms |
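The throughput and latency entries in Table 8 are mutually consistent: 1000 ms / 38.2 FPS ≈ 26.2 ms per frame, suggesting the reported end-to-end latency is essentially the reciprocal of the frame rate. A one-line sanity check:

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency (ms) implied by a throughput in frames per second."""
    return 1000.0 / fps
```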
Wang, Z.; Wang, Y.; Zhou, B.; Yan, X.; Guo, P.; Yang, H.; Song, Y. SymbioMamba: An Efficient Dual-Stream State-Space Framework for Real-Time Maize Disease and Yield Analysis on UAV Platforms. Agriculture 2026, 16, 801. https://doi.org/10.3390/agriculture16070801