CIDR-MobileNet: A Monocular Pseudo-Depth and Cross-Modal Feature Fusion Approach for Chili Pepper Above-Ground Biomass Estimation

Wang, Yi; Deng, Jingtao; Yang, Lin; Ruan, Shangjing; Wang, Weijie; Hu, Wenwu; Jiang, Ping

doi:10.3390/agriculture16131457

Open Access Article

by Yi Wang ¹, Jingtao Deng ¹, Lin Yang ¹, Shangjing Ruan ¹, Weijie Wang ¹, Wenwu Hu ² and Ping Jiang ²,*

¹ College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China

² College of Mechanical and Electrical Engineering, Hunan Agricultural University, Changsha 410128, China

* Author to whom correspondence should be addressed.

Agriculture 2026, 16(13), 1457; https://doi.org/10.3390/agriculture16131457

Submission received: 14 May 2026 / Revised: 30 June 2026 / Accepted: 1 July 2026 / Published: 2 July 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Accurate real-time estimation of above-ground biomass is critical for intelligent chilli pepper harvesting. This study proposes CIDR-MobileNet, a lightweight end-to-end framework that addresses the limitations of destructive sampling, reliance on additional depth sensors, and weak regression robustness in existing methods. Pseudo-depth maps are generated from single-view RGB images using Depth Anything V2, providing low-cost structural information without requiring extra hardware. A cross-modal feature interaction module adaptively fuses RGB texture with pseudo-depth geometry, while a multi-branch distribution regression head models AGB prediction as a probabilistic task to improve robustness against occlusion and noise. A ranking loss is also introduced to preserve the relative order of predictions. Validated on 275 in-field chilli pepper samples via ten-fold cross-validation, the model achieves an R² of 0.972, MAE of 174.56 g, RMSE of 230.74 g, and MAPE of 9.56%, with only 3.28 M parameters. Comparative experiments demonstrate that CIDR-MobileNet outperforms mainstream lightweight networks while maintaining high inference efficiency (10.56 ms CPU latency). The method strikes a favourable balance between prediction accuracy, hardware cost, and real-time performance, offering a practical solution for non-destructive biomass monitoring in precision agriculture.

Keywords: chilli pepper; biomass; monocular pseudo-depth; deep learning; cross-modal feature fusion; distribution regression

1. Introduction

Pepper (Capsicum annuum L.) is an economically important crop widely cultivated worldwide. Its yield and quality directly affect agricultural productivity and supply chain stability [1]. Above-ground biomass (AGB) is not only a core indicator of plant growth status but also a key determinant of photosynthetic efficiency, resource utilisation, and yield formation [2,3]. With the continuous advancement of agricultural mechanisation and intelligent technologies, the mode of crop information acquisition for harvesters is shifting from offline statistics to online perception, imposing higher demands on real-time, non-destructive, and high-throughput biomass estimation. In mechanised pepper harvesting scenarios in particular, plant biomass is directly related to the dynamic control of harvesting parameters (e.g., picking force, path optimisation, and grading strategies), making it a critical perceptual variable for intelligent harvesting and precision operations [4].

However, traditional biomass acquisition methods mainly rely on destructive sampling. Although these methods offer high accuracy, they are time-consuming, labour-intensive, and non-repeatable, and therefore cannot meet the requirements of real-time field monitoring and online decision-making for intelligent equipment [5,6]. As agricultural production moves toward automation and intelligence, harvesting machines increasingly demand a non-destructive, high-throughput, and real-time biomass estimation method.

In recent years, computer vision-based non-destructive phenotyping has emerged as a promising alternative. Early efforts primarily utilised RGB imagery. Wang et al. [7] applied deep learning to estimate NDVI from RGB images over the full growth cycle of multiple crops. Cardenas-Gallegos et al. combined RGB-depth imaging with morphometric descriptors and machine learning to improve biomass estimation in hydroponic lettuce [8], while Dhawi et al. employed smartphone-based RGB images together with machine learning and convolutional neural networks for non-destructive biomass prediction in pearl millet [9]. These studies illustrate the diversity of RGB-based strategies.

Subsequently, deep learning techniques, particularly convolutional neural networks (CNNs) and vision transformers, have become mainstream and achieved notable success across various crops. Carlier et al. compared CNNs with partial least squares regression for estimating biophysical variables of wheat organs via proximal sensing [10]. Schreiber et al. utilised UAV-based RGB imagery and deep learning for wheat above-ground biomass estimation [11]. Okada et al. applied deep learning models to UAV remote sensing data for soybean biomass phenotyping [12], and Jin et al. proposed a Transformer-based symmetric diffusion segmentation network for wheat growth monitoring and yield counting [13]. Meanwhile, the integration of unmanned aerial vehicles (UAVs) has further extended model applicability to large-scale field environments. Bazrafkan et al. reviewed UAS and satellite data integration strategies in precision agriculture [14]. Liu et al. combined optical, structural, and textural canopy measurements from UAV systems to estimate potato above-ground biomass [15], and Castro et al. demonstrated the effectiveness of deep learning with UAV-based RGB imagery for biomass phenotyping in forage grasses [16].

Despite these advances, all the aforementioned methods rely inherently on two-dimensional (2D) image information. For crops with complex canopy structures and severe occlusions—such as potato and wheat—the lack of adequate three-dimensional (3D) spatial representation constrains estimation accuracy. To address this limitation, Zhu et al. integrated UAV-based multispectral, RGB, and LiDAR point cloud data with an improved PointNet++ network for wheat biomass prediction, showing that the fusion of point cloud depth features with spectral and textural indices significantly outperforms 2D-only models [17]. Deng et al. compared 2D and 3D vegetation species mapping using UAV-LiDAR point clouds and improved deep learning methods, revealing the advantages of depth information [18]. Furthermore, Chang et al. systematically reviewed AI-driven 3D point cloud analysis in plant phenotyping, highlighting its potential to overcome the spatial limitations of conventional 2D approaches [19].

To compensate for the deficiency of 2D information, incorporating 3D structural information has become an important research direction. Common approaches include the use of LiDAR, structured-light cameras, and multi-view stereo (MVS) to acquire 3D point cloud data of crop canopies [20,21,22]. These methods can directly capture canopy structure and have improved biomass estimation accuracy to some extent [23]. Nevertheless, active sensors are often expensive, bulky, and computationally intensive, making them difficult to deploy in real time on mobile field equipment [24,25]. MVS-based reconstruction methods are sensitive to illumination changes and are prone to incomplete reconstruction in texture-poor or heavily occluded agricultural environments [26]. Meanwhile, existing RGB-D fusion methods mostly adopt simple feature concatenation or weighted summation and lack effective cross-modal interaction mechanisms, which leads to underutilisation of multimodal information [27,28].

In recent years, monocular depth estimation technology has offered a new low-cost possibility for acquiring 3D structural information. In particular, the Depth Anything series, trained on large-scale unlabelled data, has significantly improved the generalisation capability of depth estimation [29]. Its improved version, Depth Anything V2, further integrates synthetic data and pseudo-labelling strategies, achieving more stable and accurate depth prediction in complex scenes [30]. Such methods can generate high-quality depth maps from a single RGB image, thereby providing an approximate 3D structural representation without additional hardware and offering a feasible path for lightweight deployment in agricultural scenarios [31]. Consequently, “RGB + pseudo-depth” is gradually becoming a potential solution that balances cost and performance.

Although monocular pseudo-depth information offers the above advantages, existing RGB-D fusion methods still have several critical limitations. First, in terms of feature fusion, most studies perform only shallow, simple fusion and fail to fully exploit the complementary relationship between RGB and depth information [32,33]. Second, conventional regression approaches typically adopt point-value predictions, neglecting the distributional characteristics of biomass and the rank relationships among samples, which reduces model robustness in complex field environments [34,35]. Moreover, research on cash crops such as pepper in actual harvesting scenarios remains limited, and systematic solutions that simultaneously achieve accuracy, real-time performance, and cost-effectiveness are still lacking [36,37].

Notably, monocular depth estimation technology has moved from initial exploration to extensive validation in agricultural applications, providing a solid methodological basis for this study. In crop segmentation, Cao et al. proposed DepthCropSeg, which uses depth maps generated by Depth Anything V2 to construct high-quality pseudo-masks, achieving nearly supervised performance in an almost unsupervised crop segmentation task across multiple real-world scenarios (field, UAV, and high-throughput platforms), thereby significantly reducing manual annotation costs [38]. In orchard monitoring, Vučetić et al. applied Depth Anything V2 to monocular depth estimation-driven canopy segmentation in olive groves and used it for accurate calculation of vegetation indices. Their results showed that by focusing on tree canopy regions, the proposed segmentation method effectively eliminated noise from surrounding vegetation and soil, significantly improving the estimation accuracy of NDVI and NDRE indices [39]. Furthermore, Zhuo and You developed PlantMDE, a pioneering model that achieves 3D plant reconstruction from a single RGB image via monocular depth estimation. They constructed the PlantDepth dataset and demonstrated the effectiveness of the method for fine-scale organ-level morphological reconstruction [40]. These studies fully demonstrate that monocular depth estimation based on Depth Anything V2 has shown reliable performance and great potential in various agricultural vision tasks, including crop segmentation, canopy analysis, and 3D phenotyping reconstruction. However, the application of this technology to above-ground biomass estimation of pepper remains largely unexplored. In particular, how to leverage pseudo-depth information to improve the prediction accuracy of pepper plant biomass in complex field environments is an urgent direction for further research.

To address the above issues, this study proposes a method for pepper above-ground biomass estimation based on monocular pseudo-depth and cross-modal feature interaction fusion, termed CIDR-MobileNet. The main innovations are as follows:

(1): Pseudo-depth information is generated from a single RGB image using the Depth Anything V2 model, providing low-cost 3D structural features and overcoming the limitation of traditional methods that rely on additional depth sensors.
(2): A cross-modal feature interaction fusion (CFIF) module is designed to achieve deep semantic interaction between RGB and pseudo-depth features via a bidirectional attention mechanism, enhancing multimodal feature complementarity.
(3): A multi-branch distribution-based regression head (MDBR-Head) is constructed to formulate biomass prediction as a distribution estimation problem, improving model robustness and uncertainty awareness.
(4): A ranking loss is introduced to enforce ordinal constraints among predicted biomass values, strengthening the modelling of relative relationships, while maintaining a lightweight design suitable for edge-device deployment.
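As a rough illustration of item (4), a pairwise hinge (margin) ranking loss is one common way to impose ordinal constraints on regression outputs. The exact formulation used in CIDR-MobileNet is not given in this section, so the hinge form, the `margin` parameter, and the function name below are assumptions made purely for illustration.

```python
import numpy as np

def pairwise_ranking_loss(pred, target, margin=0.0):
    """Mean hinge penalty over sample pairs whose predicted order
    disagrees with the order of the ground-truth biomass values.
    Illustrative sketch only; not the loss defined by the authors."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # sign[i, j] = +1 if sample i is truly heavier than j, -1 if lighter, 0 if tied
    sign = np.sign(target[:, None] - target[None, :])
    pred_diff = pred[:, None] - pred[None, :]
    # Penalise pairs whose predicted gap has the wrong sign
    # (or is smaller than the margin).
    hinge = np.maximum(0.0, margin - sign * pred_diff)
    mask = sign != 0  # ignore ties and the diagonal
    return float(hinge[mask].mean()) if mask.any() else 0.0

# Correctly ordered predictions incur no penalty; reversed ones do.
loss_ordered = pairwise_ranking_loss([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
loss_reversed = pairwise_ranking_loss([3.0, 2.0, 1.0], [10.0, 20.0, 30.0])
```

A loss of this kind depends only on the relative order of predictions within a batch, so it would normally be combined with a standard regression loss rather than used alone.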

Preliminary experimental results demonstrate that the proposed CIDR-MobileNet method significantly outperforms existing RGB-based and simple fusion-based approaches, providing a cost-effective, non-destructive, and accurate solution for pepper above-ground biomass estimation in real-world harvesting scenarios. These findings are expected to support intelligent decision-making in precision agriculture, particularly for dynamic harvesting control and yield prediction.

2. Materials and Methods

2.1. Dataset Description

The data collection in this study was carried out on 3 October 2025, in Shawan City, Tacheng Prefecture, Xinjiang Uygur Autonomous Region, China. The experimental site was located in a typical agricultural–semi-desert transitional zone northwest of Shawan City, adjacent to the G30 Lianhuo Expressway. This region experiences a temperate continental climate with abundant sunshine and large diurnal temperature variations, and is one of the major production areas for red chilli pepper (Capsicum annuum L.) in Xinjiang. This challenging field environment—characterised by drastic illumination changes, bare soil background, and complex surface textures—was deliberately chosen to evaluate the robustness of the proposed model.

Data Acquisition Method

To construct a dataset that accurately reflects the real-field scenario for chilli pepper above-ground biomass prediction, a rigorous in situ synchronous acquisition scheme was designed. The collection process simulated the actual harvesting environment to ensure that the acquired data closely matched real production conditions, faithfully representing the visual characteristics of chilli pepper plants during harvest. The detailed procedure was as follows (Figure 1).

First, a Huiboshi 48-megapixel high-resolution camera module (manufactured by Shenzhen Huibo Technology Co., Ltd., Shenzhen, China) was mounted on top of the harvester and set at a 45° downward angle from the horizontal to capture RGB images of the chilli plants in the field. The module is equipped with high-sensitivity performance, enabling precise recording of the colour distribution, texture details, and canopy structure under natural illumination. These high-fidelity visual data lay the foundation for subsequent analyses.

Image frame acquisition mode was selected instead of video streaming mode. This strategy effectively avoids sample redundancy while facilitating subsequent procedures such as data cleaning, annotation, and pseudo-depth map generation. During actual field operation, the camera module moved synchronously with the harvester and automatically captured image frames at preset time intervals, ensuring that samples covered chilli plant morphologies under varying planting densities, thereby satisfying the requirement for diversity. Ultimately, the raw dataset comprised high-quality images from different growth stages in the field, providing a solid basis for subsequent modelling.

2.2. Dataset Construction and Annotation

2.2.1. Dataset Size

A total of 335 raw image samples were acquired during data collection. After rigorous data cleaning and filtering—removing blurred images, heavily occluded samples, and those with annotation abnormalities—a standardised dataset of 275 high-quality samples was finally constructed. Prior to destructive sampling, all candidate plants were subjected to a brief but systematic visual inspection. Plants were excluded if they showed any clear signs of disease (such as leaf lesions, wilting, or fungal growth), pest injury, mechanical damage, or severe nutritional discoloration. The remaining plants were visually healthy and morphologically representative of the experimental population. It is worth noting that the field received no fungicide or insecticide treatment during the growing season; thus, despite the exclusion of obviously defective individuals, the final dataset retains a realistic range of plant size, vigour, and canopy architecture—an important consideration for training a model that must generalise to variable field conditions. All 275 samples were used for model training, validation, and testing, ensuring the representativeness and reliability of the dataset.

2.2.2. Annotation Procedure

To establish an accurate mapping between image features and biomass, a one-to-one annotation strategy termed “manual weighing–image binding” was adopted. A standardised workflow was followed to ensure the credibility and traceability of the labelled data.

Biomass ground-truth acquisition: An in situ destructive sampling method was employed. Immediately after image acquisition, the chilli pepper plants within the corresponding sampling unit were completely uprooted, and their above-ground parts were retained. The fresh weight (FW) was measured on site using an electronic balance with a precision of 0.01 g. This measured value served as the gold label for each sample, guaranteeing the authenticity of the biomass ground truth.

Image-label binding mechanism:

(1): Spatial binding: During image acquisition, a tag carrying a unique code for the sampling unit was placed within the camera’s field of view. The spatial correspondence between the image and the sampling unit was established through visual markers, achieving a one-image-one-code precise binding.
(2): Temporal binding: Image acquisition and biomass weighing were completed within the same time window, and the start and end times of both operations were recorded using timestamps. This approach avoided the influence of plant growth dynamics on the mapping relationship and ensured temporal consistency.

Through the above two-fold (spatial and temporal) binding mechanism, a one-to-one annotated dataset of 275 image–biomass pairs was finally constructed, providing high-quality supervisory signals for model training. The detailed partitioning and statistical information of the dataset are presented in Table 1.
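The spatial and temporal binding described above could be represented by a record such as the following. This is a minimal sketch: the field names, the `BiomassSample` class, and the 30-minute window are hypothetical, since the paper does not state the record layout or the length of the time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class BiomassSample:
    unit_code: str          # code on the tag visible in the image (spatial binding)
    image_path: str
    image_time: datetime    # timestamp of image acquisition
    weigh_time: datetime    # timestamp of the fresh-weight measurement
    fresh_weight_g: float   # above-ground fresh weight in grams

def is_temporally_bound(sample, max_gap=timedelta(minutes=30)):
    """True if weighing followed image capture within the allowed window.
    The 30-minute default is an assumed value, not taken from the paper."""
    gap = sample.weigh_time - sample.image_time
    return timedelta(0) <= gap <= max_gap

# Hypothetical example record.
sample = BiomassSample(
    unit_code="U001",
    image_path="images/U001.jpg",
    image_time=datetime(2025, 10, 3, 10, 0),
    weigh_time=datetime(2025, 10, 3, 10, 12),
    fresh_weight_g=1520.35,
)
ok = is_temporally_bound(sample)
```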

2.3. Pseudo-Depth Map Generation

2.3.1. Method Origin

To supplement three-dimensional structural information from images, pseudo-depth maps were generated based on monocular depth estimation technology. This technique infers scene depth information from a single RGB image without requiring multi-sensor support. The lightweight and accurate Depth Anything V2 model was adopted as the core engine. The pre-trained weights of Depth Anything V2 were used directly without fine-tuning on our chilli pepper dataset, as the model already demonstrated sufficient generalisation ability on large-scale general data to handle the complex canopy structure of chilli pepper. The model offers the following advantages: a light architecture suitable for real-time processing in complex field scenarios; strong generalisation ability to effectively resolve the irregular canopy structure of chilli pepper plants; and reliable accuracy that meets the depth resolution requirements of the biomass estimation task.

2.3.2. Generation Workflow

The pseudo-depth map generation followed a standardised processing pipeline:

(1): Input preprocessing: The acquired RGB images (maintaining their original resolution) were pixel-normalised to comply with the model input specifications.
(2): Feature extraction and depth regression: The model extracted multi-scale visual features using an encoder, and after decoder-based fusion, predicted relative depth values at the pixel level.
(3): Output mapping: A grayscale pseudo-depth map with a resolution identical to that of the input RGB image was generated, in which the grayscale intensity of each pixel represents the relative height and spatial topological relationships of the plant canopy.

As shown in Figure 2, the RGB image (Figure 2a) records the apparent features of the plants, while the pseudo-depth map (Figure 2b) intuitively presents the three-dimensional canopy structure through grayscale gradients. This multi-modal data construction strategy effectively transforms two-dimensional visual information into three-dimensional geometric representations, providing critical data support for the model to explore non-linear associations between image features and biomass.
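The output-mapping step of the workflow can be sketched as a min-max rescaling of the predicted relative depth to an 8-bit grayscale image of the same resolution. The depth model itself is not called here; `rel_depth` stands for whatever relative-depth array the monocular estimator returns, and the min-max scheme is an assumption, as the paper does not specify how depth values are mapped to grey levels.

```python
import numpy as np

def depth_to_grayscale(rel_depth):
    """Min-max scale a relative-depth array (H, W) to an 8-bit grayscale map.
    Assumed mapping for illustration; the paper does not state the exact scheme."""
    d = np.asarray(rel_depth, dtype=np.float64)
    lo, hi = d.min(), d.max()
    if hi == lo:                       # flat scene: avoid division by zero
        return np.zeros(d.shape, dtype=np.uint8)
    scaled = (d - lo) / (hi - lo)      # relative values in [0, 1]
    return np.round(scaled * 255).astype(np.uint8)

# Toy 2 x 2 "depth" array standing in for a model prediction.
gray = depth_to_grayscale([[0.0, 0.5], [1.0, 2.0]])
```

Because the rescaling is done per image, grey levels encode relative rather than metric depth, which matches the relative nature of the pseudo-depth described above.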

2.4. Overall Network Architecture

2.4.1. Dual-Branch Structure

To achieve non-destructive estimation of above-ground biomass for chilli pepper, a dual-branch network named CIDR-MobileNet (Cross-modal Interactive Fusion Distribution Regression MobileNet) is proposed, treating biomass estimation as a continuous-value regression task. The network consists of an independent RGB branch and a pseudo-depth branch, which extract colour, texture, and spectral features from RGB images and height, geometric, and volumetric features from pseudo-depth images, respectively. In the complex field environment of harvester operation, the dual-branch design exploits the complementarity of multi-modal data: the RGB branch captures the appearance of the crop canopy, while the pseudo-depth branch provides approximate (relative) three-dimensional structural information. The two branches interact at the feature level, avoiding the limitations of single-modality information. The architecture is illustrated in Figure 3 and mainly comprises a backbone network, a feature fusion module, and a regression head.

2.4.2. Feature Extraction Backbone Network

The choice of backbone network is a core factor determining model performance and efficiency. In the complex field environment of harvester operation, the model must balance high-speed inference, low resource consumption, and robust feature extraction. To this end, the lightweight convolutional neural network MobileNetV3-Small was selected as the backbone feature extraction network for both the RGB and depth branches [41]. This selection achieves a competitive trade-off between computational efficiency and representational capacity and meets the stringent requirements of edge computing devices (e.g., the Jetson series modules mounted on harvesters) for real-time biomass monitoring in dynamic field scenarios. The core advantages of MobileNetV3-Small are reflected in the following designs:

Depth-wise Separable Convolutions: By decoupling a standard convolution into two steps—depth-wise convolution and point-wise convolution—the number of parameters and floating-point operations (FLOPs) is significantly reduced while maintaining feature extraction performance comparable to that of standard convolutions [42]. This technique provides a crucial foundation for model deployment in resource-constrained scenarios, achieving a win-win situation for efficiency and accuracy.

Inverted Residuals and Linear Bottlenecks: This design adopts a three-stage feature transformation paradigm of “expansion-convolution-compression”, performing sufficient feature extraction and non-linear mapping in a high-dimensional space, while ensuring the integrity of the information flow through linear bottlenecks and residual connection mechanisms [43]. This design effectively avoids information loss that may be caused by the ReLU activation function in low-dimensional spaces, significantly enhancing the generalisation ability and representational quality of the features.
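The two design ideas above can be made concrete with a small NumPy sketch: a parameter count comparing a standard convolution with its depth-wise separable counterpart, and a stripped-down inverted residual block. The channel sizes and function names are illustrative assumptions; the sketch omits batch normalisation as well as the squeeze-and-excitation and hard-swish components of the real MobileNetV3 blocks.

```python
import numpy as np

def standard_conv_params(c_in, c_out, k):
    return k * k * c_in * c_out            # bias terms ignored

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in               # one k x k filter per input channel
    pointwise = c_in * c_out               # 1 x 1 convolution mixing channels
    return depthwise + pointwise

def relu(x):
    return np.maximum(x, 0.0)

def pointwise(x, w):
    """1 x 1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def depthwise3x3(x, w):
    """3 x 3 depth-wise convolution with zero padding; w is (C, 3, 3)."""
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + height, j:j + width]
    return out

def inverted_residual(x, w_expand, w_dw, w_project):
    hidden = relu(pointwise(x, w_expand))    # expansion to a wider space
    hidden = relu(depthwise3x3(hidden, w_dw))  # per-channel spatial filtering
    hidden = pointwise(hidden, w_project)    # linear bottleneck: no activation
    return x + hidden                        # residual connection

# Illustrative sizes: 64 -> 128 channels with a 3 x 3 kernel.
ratio = depthwise_separable_params(64, 128, 3) / standard_conv_params(64, 128, 3)
# 8768 / 73728, i.e. about 0.119 of the standard convolution's parameters
```

For a k × k kernel the ratio is 1/C_out + 1/k², so the saving grows with the number of output channels and is bounded by roughly 1/k² for wide layers.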

In the backbone initialisation stage, ImageNet-1K pre-trained weights were adopted for knowledge transfer to accelerate model convergence and improve feature stability. For the RGB branch, the standard three-channel pre-trained weights were directly loaded to inherit the prior knowledge of general visual features. For the depth branch, the input channel of the first convolutional layer was adjusted to one, and efficient parameter transfer was achieved through channel-wise mean initialisation. This strategy preserves the advantages of cross-modal knowledge transfer while adapting to the specific characteristics of agricultural multi-modal data, effectively overcoming the slow convergence problem associated with training from scratch, and strengthening the model’s robustness under adverse field conditions such as complex illumination, dust, and vibration.
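The channel-wise mean initialisation for the depth branch can be sketched as averaging the pre-trained first-layer kernels over their RGB axis. The function name and the example kernel shape are assumptions for illustration; loading the actual pre-trained weights is framework-specific and not shown.

```python
import numpy as np

def rgb_to_single_channel_weights(w_rgb):
    """Average first-layer kernels over the RGB axis.
    w_rgb: (C_out, 3, k, k) -> returns (C_out, 1, k, k)."""
    w_rgb = np.asarray(w_rgb, dtype=float)
    if w_rgb.ndim != 4 or w_rgb.shape[1] != 3:
        raise ValueError("expected weights of shape (C_out, 3, k, k)")
    return w_rgb.mean(axis=1, keepdims=True)

# Stand-in for pre-trained weights (shape chosen for illustration only).
w_pretrained = np.arange(16 * 3 * 3 * 3, dtype=float).reshape(16, 3, 3, 3)
w_depth = rgb_to_single_channel_weights(w_pretrained)  # shape (16, 1, 3, 3)
```

Only the first convolution needs this treatment; all later layers see the same number of channels in both branches and can load their pre-trained weights unchanged.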

In summary, MobileNetV3-Small provides an efficient, compact, and information-rich feature representation that underpins the subsequent CFIF feature fusion module and the MDBR-Head regression head, supporting the joint optimisation of efficiency and effectiveness in edge computing scenarios.

2.4.3. Overall Workflow

The input RGB image (B × 3 × H × W) and pseudo-depth image (B × 1 × H × W) are fed into their respective backbones to extract feature maps (576 output channels for both branches). The extracted feature maps first pass through the CFIF (Cross-Feature Interactive Fusion) module for cross-modal fusion, producing a fused feature map. The fused features are then input into the MDBR-Head (Multi-Branch Distribution Regression Head) for multi-branch regression, which outputs a normalised biomass prediction (B × 1). The whole process realises an end-to-end mapping from raw multi-modal images to continuous biomass values.

2.5. CFIF Feature Fusion Module

2.5.1. Cross-Feature Interactive Fusion (CFIF) Module

Most existing feature fusion methods use simple element-wise addition or concatenation, often ignoring the inherent complementarity and conflict between RGB texture information and depth geometric information. To address this issue, a Cross-Feature Interactive Fusion (CFIF) module is proposed herein. This module aims to mine the correlation between the two modalities through explicit mathematical modelling and to solve the problems of alignment and enhancement of RGB and depth information in the feature space. The module receives intermediate feature maps from the two backbone branches as inputs, denoted as RGB features $F_{rgb} \in \mathbb{R}^{B \times C \times H \times W}$ and depth features $F_{depth} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels (set to $C = 576$ in this study), and $H$ and $W$ are the height and width of the feature maps, respectively. The module outputs a fused feature map $F_{fused} \in \mathbb{R}^{B \times C \times H \times W}$ that incorporates complementary information from both modalities. During this process, the spatial resolution of the feature maps remains unchanged to ensure precise alignment of spatial position information. The structure is illustrated in Figure 4.

2.5.2. Fusion Mechanism

A lightweight cross-feature interaction strategy was designed in this study, consisting of an explicit interaction layer and an adaptive fusion layer. The complementarity and correlation between modalities were explicitly modelled to enhance feature representation.

(1): Explicit Interaction Layer

To capture the inter-modal differences (complementarity) and commonalities (correlation), a difference feature diff and a product feature prod were defined as follows:

$$\mathrm{diff} = (F_{rgb} - F_{depth}) \odot \alpha_{diff} \tag{1}$$

$$\mathrm{prod} = (F_{rgb} \odot F_{depth}) \odot \alpha_{prod} \tag{2}$$

where $\odot$ denotes the Hadamard (element-wise) product. Both $\alpha_{diff}$ and $\alpha_{prod}$ are learnable scalar scaling parameters, initially set to 0.1. This initialisation implements a “warm-start” mechanism: in the early training stage, the original features dominate; as training proceeds, the optimal interaction weights are gradually learned, balancing the contributions of the difference and product information while preventing gradient oscillations.

(2): Adaptive Fusion Layer

The original dual features ($F_{rgb}$, $F_{depth}$) and the interactive features ($\mathrm{diff}$, $\mathrm{prod}$) were concatenated along the channel dimension, forming a dense feature representation with $4C$ channels. Subsequently, a 1 × 1 convolution was applied for channel compression and cross-channel information shuffling, reducing the dimensionality back to $C$. Batch normalisation (BN) and a ReLU activation function were then applied to generate the preliminary fused features.

(3): Residual Topology

To further preserve the integrity of the information flow, a lightweight residual connection was introduced into the module. The final output was calculated as:

$$F_{fused} = \mathrm{Conv}_{1 \times 1}(\mathrm{concat}) + \beta \cdot \frac{F_{rgb} + F_{depth}}{2} \tag{3}$$

where $\mathrm{Conv}_{1 \times 1}(\mathrm{concat})$ denotes the 1 × 1 convolution applied to the channel-wise concatenation built in the adaptive fusion layer (the original features together with the difference and product features), which compresses the channel dimension and integrates cross-modal cues into a compact representation. The second term, $\beta \cdot \frac{F_{rgb} + F_{depth}}{2}$, is a residual connection that maintains a direct pathway from the original modality features to the final output. The learnable parameter $\beta$ is a scaling factor that controls the contribution of this residual branch; it is initialised with a small positive value so that the network relies primarily on the learned cross-modal features during the early training phase and progressively adjusts the residual strength as training proceeds. This design follows the common practice of residual learning, where the identity mapping facilitates gradient flow and mitigates the risk of gradient oscillations, especially when the network is trained with limited data. The learnable scaling factor provides additional flexibility, enabling the model to balance the preservation of original modality information against the integration of newly learned cross-modal features.
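The three steps of the fusion mechanism (explicit interaction, adaptive fusion, residual topology) can be traced in a minimal NumPy sketch. Several details are assumptions made for illustration: the initial value of `beta` (the text only says "a small positive value"), the use of batch statistics without learnable affine terms for batch normalisation, and the placement of BN and ReLU before the residual addition. The scalars are plain floats here, whereas in the described module they are learnable.

```python
import numpy as np

def cfif(f_rgb, f_depth, w, alpha_diff=0.1, alpha_prod=0.1, beta=0.1, eps=1e-5):
    """Sketch of the CFIF forward pass.
    f_rgb, f_depth: (B, C, H, W); w: (C, 4C) weights of the 1 x 1 convolution."""
    diff = (f_rgb - f_depth) * alpha_diff                        # Eq. (1)
    prod = (f_rgb * f_depth) * alpha_prod                        # Eq. (2)
    cat = np.concatenate([f_rgb, f_depth, diff, prod], axis=1)   # (B, 4C, H, W)
    y = np.einsum("oc,bchw->bohw", w, cat)                       # 1 x 1 conv: 4C -> C
    # Simplified batch normalisation (batch statistics, no affine terms).
    mean = y.mean(axis=(0, 2, 3), keepdims=True)
    var = y.var(axis=(0, 2, 3), keepdims=True)
    y = np.maximum((y - mean) / np.sqrt(var + eps), 0.0)         # BN then ReLU
    return y + beta * (f_rgb + f_depth) / 2.0                    # residual term, Eq. (3)

# Small illustrative shapes (the paper uses C = 576).
rng = np.random.default_rng(0)
f_rgb = rng.normal(size=(2, 4, 3, 3))
f_depth = rng.normal(size=(2, 4, 3, 3))
fused = cfif(f_rgb, f_depth, rng.normal(size=(4, 16)))           # shape (2, 4, 3, 3)
```

The spatial dimensions pass through unchanged because every operation is either element-wise or a 1 × 1 convolution, consistent with the statement that the module preserves spatial resolution.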

The proposed CFIF module offers three main advantages: (1) explicit modelling of the complementary relationship between RGB and pseudo-depth features through difference and product operations, which substantially enhances cross-modal interaction; (2) learnable scaling factors coupled with residual connections, which ensure training stability and convergence; and (3) effective compensation for single-modality deficiencies in agricultural scenarios, achieving robust information complementarity and improving biomass prediction accuracy under field conditions.

2.6. Multi-Branch Distribution Regression Head (MDBR-Head)

In precision agriculture and crop phenotyping tasks, regression prediction of plant biomass presents considerable challenges. Because plant growth is influenced by micro-environmental changes, varying light conditions, shading, and individual morphological differences, the relationship between visual features and actual weight is often highly non-linear and involves considerable uncertainty. Conventional point estimation methods, which output a single deterministic value, are susceptible to outliers and noise, fail to capture the underlying distribution of the data, and therefore suffer from limited generalisation ability in complex field environments.

To address these issues, a Multi-Branch Distribution Regression Head (MDBR-Head) is proposed. This module decouples the fused features and models biomass from three complementary perspectives: global semantics, local structure, and probability distribution. The final prediction is then generated through an adaptive fusion strategy. Such a design notably enhances the model’s robustness against noise and uncertainty in complex agricultural environments.

2.6.1. Architecture of the Regression Head

The MDBR-Head is attached to the output of the Cross-Modal Feature Interaction Fusion (CFIF) module, taking the fused feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ as input. Instead of relying on a single fully connected layer, a common practice in conventional regression designs, the proposed head adopts a parallel branching strategy that tackles biomass prediction from three complementary angles: global semantics, local details, and probability distribution. Accordingly, the head comprises three branches (a global semantic branch, a local structural branch, and a distribution regression branch), followed by a dynamic weight fusion module that adaptively combines their outputs.

Specifically, the global branch applies global average pooling to compress the spatial dimensions of the feature map, then passes the resulting vector through two fully connected layers to capture holistic semantic representations. The local branch, in contrast, preserves spatial resolution and employs depth-wise convolution operations to extract fine-grained structural cues that are sensitive to local canopy variations. The distribution branch departs from point estimation entirely: it treats biomass prediction as a discrete distribution learning problem, predicting probability masses over a set of predefined intervals (bins) and then computing the expected value as the branch output. Features from the global and distribution branches are concatenated and fed into a small sub-network to generate adaptive fusion weights via a Softmax layer. Finally, the outputs of all three branches—together with a residual regression connection—are aggregated in a weighted manner to produce the final biomass estimate.

This three-pronged architecture allows the model to capture both coarse-level context and fine-scale patterns, while explicitly accounting for the inherent uncertainty in field-based biomass measurements. Figure 5 illustrates the overall structure of the proposed MDBR-Head.

2.6.2. Multi-Granularity Feature Decoupling Head

To fully exploit the complex mapping between multi-scale features and biomass, the MDBR-Head builds three parallel heterogeneous branches, each modelling the target from a different perspective.

(1): Global Semantic Branch

This branch extracts global statistical information from the feature map to characterise the overall growth status of the plants. Specifically, global average pooling (GAP) is applied to the input feature map $X$ to obtain a channel descriptor vector $X_{gap}$. Two fully connected layers (with a ReLU activation in between) then perform non-linear mapping and output the global prediction $\hat{y}_{global}$. This branch provides a stable overall trend estimate and serves as the fundamental component of the regression task.

(2): Local Structural Branch

Because plant leaf overlap, curling, and variations in spatial distribution affect biomass, relying solely on global information may neglect critical local structural features. Hence, this branch introduces depthwise convolution to enhance the modelling of local spatial patterns. A 3 × 3 depthwise convolution extracts local features independently for each channel, preserving spatial structure while reducing computational complexity. Global average pooling and fully connected layers then produce the local prediction $\hat{y}_{local}$. This branch helps capture fine-grained information such as leaf edges, canopy density, and stem distribution.
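The per-channel behaviour of the depthwise step, followed by global average pooling, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the zero padding, unit stride, nested-list tensor layout, and the function names `depthwise_conv3x3` and `global_average_pool` are assumptions chosen for illustration.

```python
def depthwise_conv3x3(x, kernels):
    """Apply one 3x3 kernel per channel (zero padding, stride 1; both assumed).

    x:       nested lists shaped [C][H][W]
    kernels: nested lists shaped [C][3][3], one kernel per channel
    """
    out = []
    for chan, k in zip(x, kernels):
        h, w = len(chan), len(chan[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # outside pixels count as zero
                            acc += k[di + 1][dj + 1] * chan[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def global_average_pool(x):
    """Average each channel's spatial map into a single number."""
    return [sum(sum(row) for row in chan) / (len(chan) * len(chan[0])) for chan in x]


features = [[[1.0, 2.0], [3.0, 4.0]]]            # one channel, 2 x 2
identity = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]   # centre tap only
box = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]        # sums the 3 x 3 neighbourhood

local_map = depthwise_conv3x3(features, identity)
summed_map = depthwise_conv3x3(features, box)
descriptor = global_average_pool(features)
```

Because each channel has its own kernel and channels are never mixed, the cost grows with the number of channels rather than with its square, which is the saving the text refers to.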

(3): Distribution Regression Branch

This branch represents the core innovation of the MDBR-Head. Unlike conventional point regression, this branch casts biomass prediction as a distribution estimation problem, thereby improving robustness against noise and outliers.

Concretely, the input feature map is globally pooled and mapped to a $K$-dimensional output space, yielding a logits vector $l = (l_{1}, \ldots, l_{K})$. A Softmax function then computes the probability distribution over the $K$ intervals:

p_{k} = \frac{\exp (l_{k})}{\sum_{j = 1}^{K} \exp (l_{j})}

(4)

The final regression value is obtained by taking the expectation:

{\hat{y}}_{dist} = \sum_{k = 1}^{K} p_{k} \cdot c_{k}

(5)

where $p_{k}$ is the predicted probability for the $k$-th biomass interval (bin) and $c_{k}$ is the predefined centre value of that interval. This method explicitly models prediction uncertainty, making the model more stable when facing distribution shifts and anomalous samples in complex field environments.
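As a minimal sketch of Equations (4) and (5), the snippet below applies a Softmax to a logits vector and takes the expectation over the bin centres. The equal-width bins over a normalised range of [0, 1], the bin count, and the function names are assumptions made for illustration; the paper does not state how the bins are defined.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of logits (Eq. 4)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centres(lo, hi, k):
    """Centres of k equal-width bins covering [lo, hi] (assumed binning scheme)."""
    width = (hi - lo) / k
    return [lo + (i + 0.5) * width for i in range(k)]


def expected_biomass(logits, centres):
    """Expectation of the bin centres under the predicted distribution (Eq. 5)."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centres))


centres = bin_centres(0.0, 1.0, 4)                          # 0.125, 0.375, 0.625, 0.875
uniform = expected_biomass([0.0, 0.0, 0.0, 0.0], centres)   # flat distribution
peaked = expected_biomass([0.0, 0.0, 8.0, 0.0], centres)    # mass concentrated on bin 3
```

A flat distribution yields the midpoint of the range, while a sharply peaked one yields a value close to the centre of the dominant bin. A single noisy logit therefore shifts the estimate only in proportion to the probability mass it moves.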

2.6.3. Dynamic Adaptive Fusion Mechanism

To effectively integrate the heterogeneous predictions from the three branches, the MDBR-Head adopts a dynamic weight fusion strategy rather than fixed weights or simple concatenation. Specifically, the global feature vector $X_{gap}$ is fed into a fully connected layer to generate three branch-specific adaptive weights $W = [w_{1}, w_{2}, w_{3}]$, which are then normalised using a Softmax function such that $\sum_{i = 1}^{3} w_{i} = 1$. The final fused result is computed as:

{\hat{y}}_{fused} = w_{1} \cdot {\hat{y}}_{global} + w_{2} \cdot {\hat{y}}_{local} + w_{3} \cdot {\hat{y}}_{dist}

(6)

Furthermore, to improve training stability and alleviate the vanishing gradient problem, a linear residual connection is introduced:

{\hat{y}}_{final} = {\hat{y}}_{fused} + {FC}_{res} (X_{gap})

(7)

where ${FC}_{res}$ is a residual fully connected layer. This residual path enables the model to rely on a linear mapping for fast convergence during the early training stage, and then gradually learn complex non-linear relationships as training proceeds. This design is particularly beneficial for small-sample scenarios, such as the 275-sample dataset evaluated with 10-fold cross-validation in this study. Experimental results show that this approach significantly improves both convergence speed and prediction accuracy.
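Equations (6) and (7) can be sketched in a few lines of plain Python. The layer shapes, the parameter values, and the helper names `linear` and `fuse` are hypothetical and serve only to make the data flow concrete; the actual layer sizes are not given in this section.

```python
import math


def softmax(values):
    """Numerically stable Softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(x_gap, branch_preds, w_fc, b_fc, w_res, b_res):
    """Softmax-weighted sum of branch predictions (Eq. 6) plus a linear residual (Eq. 7)."""
    weights = softmax(linear(x_gap, w_fc, b_fc))            # w1, w2, w3 sum to 1
    fused = sum(w * y for w, y in zip(weights, branch_preds))
    residual = linear(x_gap, [w_res], [b_res])[0]           # FC_res(X_gap)
    return fused + residual, weights


x_gap = [0.2, -0.1]                       # toy 2-dimensional descriptor
branch_preds = [0.3, 0.6, 0.9]            # global, local, distribution outputs
zero_fc = [[0.0, 0.0]] * 3                # zero weights give equal branch weights
y_final, w = fuse(x_gap, branch_preds, zero_fc, [0.0] * 3, [0.0, 0.0], 0.0)
```

With zero-initialised weight and residual layers the output reduces to the plain average of the three branch predictions. Training then moves the weights away from this average on a per-input basis, since they depend on $X_{gap}$.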

2.7. Pairwise Ranking Loss

In biomass regression tasks for plants, conventional loss functions such as mean squared error or Smooth L1 loss mainly focus on the absolute difference between the predicted and ground-truth values. However, in actual field environments, due to annotation errors, occlusion, and variations in plant morphology, relying solely on numerical regression can easily lead the model to make incorrect relative judgments between samples with similar visual features. Compared with accurately predicting absolute values, learning the relative ranking relationship between samples (i.e., “which plant is heavier”) is often more stable. Therefore, a ranking constraint is introduced in addition to the regression loss in this paper. By jointly optimising numerical error and ranking consistency, the robustness and generalisation ability of the model in complex scenarios are enhanced.

2.7.1. Random Pairwise Ranking Modelling Mechanism

Considering the high computational cost of full pairwise ranking, $O(B^{2})$ for a batch of size $B$, a random pairwise sampling strategy was adopted. Within each training batch, a lightweight ranking supervision was constructed. Specifically, for the predictions $\hat{y}$ and the ground-truth values $y$ in a batch, the sample indices were randomly permuted to form random sample pairs $(i, j)$. The corresponding prediction difference and ground-truth difference were defined as:

Δ {\hat{y}}_{ij} = {\hat{y}}_{i} - {\hat{y}}_{j}

(8)

Δ y_{ij} = y_{i} - y_{j}

(9)

The ranking relationship was represented by a sign function:

s_{ij} = sign (Δ y_{ij})

(10)

To reduce noise interference, only those sample pairs whose ground-truth difference exceeded a threshold margin were involved in the ranking constraint. This mechanism effectively avoided imposing unstable ranking constraints on pairs with very small differences, thereby improving training stability. In this study, the ground-truth biomass values were normalised to the range [0, 1] before being fed into the model. The margin was set to 0.1, meaning that the ranking loss was computed only when the absolute difference in normalised biomass between two samples exceeded 0.1. This threshold filters out noise caused by weighing errors or minor plant variations while preserving meaningful relative ordering information.

| Δ y_{ij} | > margin (margin = 0.1 in the normalised biomass space)

(11)

2.7.2. Definition of the Ranking Loss

On the basis of the above random sample pairs, a margin-based Hinge Ranking Loss was adopted:

L_{rank} = \frac{1}{N} \sum_{(i, j)} \max (0, 0.1 - Δ {\hat{y}}_{ij} \cdot s_{ij})

(12)

where $L_{rank}$ denotes the ranking loss, which learns the relative magnitude relationship between samples and improves the ranking consistency of the model, and $N$ is the number of valid sample pairs satisfying the condition in Equation (11). When the predicted ranking is consistent with the true ranking and the margin is sufficient, the loss becomes zero; otherwise, a penalty is generated, guiding the model to learn the correct relative ordering.

Although this method does not explicitly construct all sample pairs, the random sampling strategy gradually covers different combinations across multiple training batches, thus approximating global ranking modelling without increasing computational overhead.
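The random pairing, margin filter, and hinge penalty of Equations (8) to (12) can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that the hinge margin equals the 0.1 filtering threshold, and the optional `perm` argument is a hypothetical addition that makes the pairing reproducible.

```python
import random


def pairwise_ranking_loss(preds, targets, margin=0.1, perm=None):
    """Hinge ranking loss over the pairs (i, perm[i]) of one batch."""
    n = len(preds)
    if perm is None:
        perm = random.sample(range(n), n)    # random permutation of batch indices
    losses = []
    for i, j in enumerate(perm):
        dy = targets[i] - targets[j]         # Eq. 9
        if abs(dy) <= margin:                # Eq. 11: skip near-ties (and i == j)
            continue
        s = 1.0 if dy > 0 else -1.0          # Eq. 10
        d_pred = preds[i] - preds[j]         # Eq. 8
        losses.append(max(0.0, margin - s * d_pred))  # Eq. 12, margin assumed 0.1
    return sum(losses) / len(losses) if losses else 0.0


targets = [0.1, 0.4, 0.7, 1.0]               # normalised biomass
perm = [1, 2, 3, 0]                          # fixed pairing for the example
loss_ordered = pairwise_ranking_loss(targets, targets, perm=perm)
loss_reversed = pairwise_ranking_loss(targets[::-1], targets, perm=perm)
```

Only one pair per sample is evaluated, so the cost per batch is linear in the batch size. Predictions that preserve the ground-truth order with enough separation incur no penalty, while reversed predictions are penalised in proportion to how far they are out of order.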

2.7.3. Stage-Wise Training Strategy

To prevent the ranking constraint from interfering with model convergence in the early training stage, a stage-wise training strategy was employed. Based on preliminary experiments, the model’s loss decreases rapidly and remains unstable during the first 10 epochs; introducing the ranking loss too early would cause gradient oscillations and convergence difficulties. Therefore, during the initial stage (the first 10 epochs), only the Huber Loss was used to optimise the model, allowing the network to quickly learn the overall distribution characteristics of biomass. In the subsequent stage, the ranking loss was introduced and a hybrid loss function was constructed:

L_{total} = L_{huber} + λ \cdot L_{rank}

(13)

where λ is a fixed weight coefficient (set to 0.3 in this paper). This strategy ensures numerical convergence while further strengthening the model’s ability to model relative relationships between samples.
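The two-stage schedule can be written as a small switch on the epoch index. In this sketch the zero-based epoch counting, the Huber threshold `delta=1.0`, and the function names are assumptions; only the 10-epoch warm-up and λ = 0.3 come from the text.

```python
def huber(pred, target, delta=1.0):
    """Huber loss for one sample; delta=1.0 is an assumed threshold."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)


def total_loss(huber_loss, rank_loss, epoch, warmup_epochs=10, lam=0.3):
    """Eq. 13 with warm-up: Huber only for the first `warmup_epochs` epochs."""
    if epoch < warmup_epochs:                 # epochs counted from 0 (assumption)
        return huber_loss
    return huber_loss + lam * rank_loss


early = total_loss(0.2, 0.5, epoch=3)         # warm-up stage: ranking term ignored
late = total_loss(0.2, 0.5, epoch=10)         # hybrid stage: 0.2 + 0.3 * 0.5
```

Keeping the ranking term out of the objective during warm-up means its gradient cannot compete with the regression signal while the predictions are still far from the targets.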

In summary, the proposed ranking loss, through random pairwise ranking modelling and a stage-wise optimisation strategy, effectively improves ranking consistency while maintaining computational efficiency. This property is particularly important in the harvester scenario, where field operations rely more on the relative growth status among plants, thus providing a more reliable basis for graded harvesting and intelligent decision-making.

3. Results

3.1. Experimental Setup and Evaluation Metrics

To ensure the scientific rigour, objectivity, and reproducibility of the results presented in this study, this section details the hardware and software environment, the data processing workflow, the specific parameter settings used for model training and evaluation, and the metrics adopted to measure model performance.

3.1.1. Experimental Environment and Hyperparameter Settings

The experiments were conducted in the following hardware and software environment. The hardware platform consisted of a workstation equipped with an AMD Ryzen 9 7945HX processor (with Radeon Graphics), 16 GB DDR5 memory, and an NVIDIA GeForce RTX 4060 Laptop GPU. The operating system was Windows 11 Professional. The software environment was built on Python 3.9.21, with PyTorch 2.6.0 as the deep learning framework and CUDA 12.4 enabled for GPU acceleration. Key dependency libraries included openpyxl 3.1.5, TorchVision 0.21, NumPy 1.26.4, SciPy 1.10.1, and Pandas 2.3.1. The basic hyperparameter settings are summarised in Table 2.

3.1.2. Evaluation Metrics

To comprehensively evaluate model performance on the biomass prediction task for mature chilli pepper, four widely used regression metrics were adopted.

Root Mean Square Error (RMSE):

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}}

(14)

This metric measures the dispersion between predicted and true values, being more sensitive to large errors. A smaller RMSE indicates higher prediction accuracy.

Mean Absolute Error (MAE):

MAE = \frac{1}{N} \sum_{i = 1}^{N} | y_{i} - {\hat{y}}_{i} |

(15)

MAE reflects the average absolute magnitude of prediction errors. It is less sensitive to outliers than RMSE and provides an intuitive indication of the actual error level.

Mean Absolute Percentage Error (MAPE):

MAPE = \frac{100\%}{N} \sum_{i = 1}^{N} \left| \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} \right|

(16)

MAPE quantifies the relative error as a percentage, eliminating the influence of scale and facilitating comparisons across prediction tasks with different magnitudes.

Coefficient of Determination (R²):

R^{2} = 1 - \frac{\sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{N} (y_{i} - \bar{y})^{2}}

(17)

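This metric indicates the proportion of the variance in the target variable that is explained by the model. Values closer to 1 indicate a better fit; a value of 0 corresponds to always predicting the mean of the targets, and the value can become negative when the model fits worse than that.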

Although a ranking loss based on pairwise ranking constraints was introduced during training, that loss was primarily used to assist the modelling of relative relationships between samples, thereby indirectly improving regression accuracy. Therefore, the regression metrics described above remained the main criteria for evaluation, ensuring intuitive interpretation and practical engineering value.
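For reference, Equations (14) to (17) translate directly into the following sketch. The function name and the toy values are illustrative only and are not taken from the dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent) and R² as defined in Eqs. (14)-(17)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 / n * sum(abs(e / t) for e, t in zip(errors, y_true)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy biomass values in grams (illustrative, not measured data).
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3900.0]
metrics = regression_metrics(y_true, y_pred)
```

MAPE divides by the ground truth, so it is undefined for zero-valued targets and weights errors on light plants more heavily than the same absolute error on heavy plants.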

3.2. Multi-Modal Effectiveness Analysis

To validate the contribution of pseudo-depth information to biomass estimation, three comparative models were designed: RGB input only, pseudo-depth input only, and fused RGB and pseudo-depth input. The contribution and complementarity of different modalities were quantitatively analysed. The experimental results are presented in Table 3.

The results show that the single-modal models underperformed the dual-modal fusion model across all metrics, indicating strong complementarity between RGB and pseudo-depth information in biomass characterisation. Among the single-modal configurations, RGB-Only outperformed Depth-Only in MAE (241.12 g vs. 311.64 g), RMSE (296.77 g vs. 403.71 g), and R² (0.943 vs. 0.923), suggesting that colour, texture, and appearance features provided by RGB images are the dominant information source for biomass estimation, directly reflecting crop growth status and organ distribution. In contrast, Depth-Only achieved a lower MAPE (17.78% vs. 19.90%) and exhibited smaller fluctuation across validation folds (see Table 3 note for standard deviations), indicating that pseudo-depth information supplies stable spatial structural cues that effectively mitigate relative errors caused by illumination changes and occlusion. However, using pseudo-depth alone remains susceptible to estimation noise and lacks semantic discrimination ability, making high-precision regression difficult.

The dual-modal fusion model achieved the best performance across all metrics, with R² increasing to 0.955 (95% CI: 0.917–0.993) and MAPE decreasing to 14.86%. The reduced error variability across folds further confirms the complementary advantages of RGB appearance information and pseudo-depth structural information. The parameter count of the dual-modal model increased by roughly a factor of two (1.855 M vs. 0.928 M for single-modal models; model size is independent of cross-validation folds, as noted in Table 3), representing a manageable computational cost. Nevertheless, the improvement obtained with the baseline fusion strategy remained modest, indicating that simple concatenation cannot fully exploit cross-modal correlations. This finding highlights the need for more efficient cross-modal feature interaction and fusion mechanisms, providing experimental support for the CFIF module proposed later in this paper.

3.3. Comparative Experiments

To validate the effectiveness and generalisation ability of the proposed method, several mainstream lightweight convolutional neural networks and vision Transformer models were selected for comparison, including MobileNetV3-Large, ShuffleNetV2, GhostNetV3-Small, EfficientNet-B0, MobileViT-S, Tiny ViT-5M, and RepViT-0.9. All models were trained under the same experimental environment, with identical RGB and pseudo-depth dual-modal inputs and the same dynamic loss strategy. Table 4 and Table 5 list the biomass prediction performance and computational cost of each model, respectively.

As shown in Table 4, the performance of conventional lightweight models varies considerably on this task. Among traditional CNNs, EfficientNet-B0 achieved an MAE of 238.52 g and an R² of 0.956, outperforming MobileNetV3_Large and ShuffleNetV2, though at the cost of 8.02 M parameters. Transformer-based hybrid models (MobileViT-S and Tiny ViT-5M) delivered moderate gains in feature modelling, but their parameter counts (9.88 M and 10.10 M) and GFLOPs (2.88 and 1.69) increased substantially, making them less attractive for resource-limited agricultural equipment. RepViT-0.9, a state-of-the-art efficient Transformer, achieved competitive results with an MAE of 175.71 g and an R² of 0.968, yet required 13.15 M parameters, 3.57 GFLOPs, and a CPU latency of 53.28 ms—still too heavy for real-time harvester deployment.

Our proposed CIDR-MobileNet, with only 3.28 M parameters and 0.18 GFLOPs, attained an MAE of 174.56 g and an R² of 0.972. Compared with RepViT-0.9, it cuts parameters by 75%, computation by 95%, and latency by a factor of five, while delivering nearly identical MAE (174.56 vs. 175.71 g) and a marginally higher R² (0.972 vs. 0.968). Relative to MobileViT-S, CIDR-MobileNet reduces MAE by 26.5% and RMSE by 21.3%, with only one-third the parameters and one-sixteenth the GFLOPs. Notably, the baseline model (MobileNetV3_Small with simple concatenation and point regression) achieved an MAE of 273.64 g and an R² of 0.955 with 1.86 M parameters. CIDR-MobileNet, with a modest increase in GFLOPs (from 0.12 to 0.18), lowers MAE by 36.2% and improves R² by 1.7 percentage points. This gain comes from three synergistic contributions: the CFIF module models difference and product features for deeper RGB-pseudo-depth interaction; the MDBR head handles global, local, and distributional information separately; and the ranking loss enforces pairwise order constraints to suppress noise. The ablation study (Table 6) shows that CFIF alone reduces MAE to 234.37 g, MDBR alone to 214.68 g, and ranking loss alone to 219.82 g; their combination yields the best result.

On the efficiency side, CIDR-MobileNet runs at 10.56 ms CPU latency, considerably faster than RepViT-0.9 (53.28 ms) and MobileViT-S (40.92 ms), while training time stays comparable to the baseline at roughly two hours. The proposed method thus improves prediction accuracy without compromising real-time performance.
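The relative savings quoted in this section can be re-derived from the reported figures. The check below is only arithmetic on numbers stated in the text; the helper name `pct_reduction` is introduced for illustration.

```python
def pct_reduction(old, new):
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


# CIDR-MobileNet versus RepViT-0.9 (values as reported in the text).
params_cut = pct_reduction(13.15, 3.28)       # parameters in millions
flops_cut = pct_reduction(3.57, 0.18)         # GFLOPs
latency_ratio = 53.28 / 10.56                 # CPU latency in ms

# CIDR-MobileNet versus the concatenation baseline.
mae_cut = pct_reduction(273.64, 174.56)       # MAE in grams
r2_gain = 0.972 - 0.955
```

These evaluate to roughly 75%, 95%, a five-fold latency ratio, a 36.2% MAE reduction, and an R² gain of 0.017, in agreement with the figures stated above.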

Overall, CIDR-MobileNet strikes a favourable balance between accuracy and efficiency. The integration of cross-modal interaction, multi-branch distribution regression, and ranking loss overcomes common limitations of existing methods, including insufficient feature fusion and vulnerability to noise. Its lightweight nature makes it viable for edge deployment on harvesters, fulfilling the practical requirements of online biomass monitoring.

3.4. Ablation Experiments

To systematically evaluate the individual contributions and synergistic effects of each key module on chilli pepper biomass prediction performance, a cumulative ablation study was designed. A dual-branch MobileNetV3-Small model (using only feature concatenation and point regression) was used as the baseline. The cross-modal feature interaction fusion module (CFIF), the multi-branch distribution regression head (MDBR-Head), and the ranking loss were then introduced sequentially. The experimental results are presented in Table 6. The following analysis covers three aspects: feature fusion, regression modelling, and optimisation strategy.

3.4.1. Effectiveness of the Cross-Modal Feature Interaction Fusion Module (CFIF)

When CFIF was introduced alone, MAE decreased from 273.64 g to 234.37 g, a reduction of 14.3%, and RMSE dropped from 345.58 g to 304.40 g. This indicates that simple feature concatenation or element-wise addition is insufficient to fully capture the complex relationship between RGB and pseudo-depth features. By explicitly computing difference features ($F_{rgb} - F_{Depth}$) and product features ($F_{rgb} \odot F_{Depth}$), CFIF captures both complementarity and commonality between the two modalities, thereby enhancing the richness of feature representation. Furthermore, the learnable scaling parameters and residual connection maintain the dominance of original features in the early training stage, avoiding gradient oscillations and allowing the model to gradually learn the optimal fusion strategy. Consequently, CFIF effectively alleviates the problem of missing information from a single modality and improves the robustness of biomass estimation in complex field environments.

3.4.2. Contribution of the Multi-Branch Distribution Regression Head (MDBR-Head)

Adding MDBR-Head alone to the baseline model reduced MAE to 214.68 g, a reduction of 21.5%, and RMSE to 285.01 g, while R² increased to 0.9592. Compared with using CFIF alone, MDBR-Head brought a more substantial accuracy gain. This improvement mainly stems from its multi-branch decoupling mechanism: the global branch captures the overall growth trend, the local branch focuses on fine-grained structural information such as leaf edges and canopy density, and the distribution branch converts biomass prediction into interval probability estimation, effectively handling the non-linearity, heteroscedasticity, and annotation noise common in agricultural data. Notably, when CFIF and MDBR-Head were both introduced (without ranking loss), the MAE was 220.27 g, slightly higher than the 214.68 g achieved by MDBR-Head alone. This may be due to some feature competition between the two modules in the absence of ranking constraints. However, their combination significantly lowered RMSE (277.59 g vs. 285.01 g), indicating a more concentrated prediction distribution. When all three modules were enabled, all metrics reached their best values, with MAE as low as 174.56 g and R² as high as 0.9715, fully demonstrating the synergistic enhancement effect of CFIF, MDBR-Head, and the ranking loss.

3.4.3. Optimisation Effect of the Ranking Loss

When the ranking loss was introduced alone (without CFIF or MDBR-Head), MAE decreased from 273.64 g to 219.82 g, a reduction of 19.7%, RMSE dropped to 283.66 g, and R² increased to 0.9574. This indicates that even without modifying the network architecture, imposing pairwise ranking constraints through the loss function effectively guides the model to focus on the relative order of biomass values, thereby alleviating over-fitting to noisy data in point regression. When the ranking loss was combined with MDBR-Head (row 6 in Table 6), MAE further decreased to 208.18 g and RMSE to 266.83 g, outperforming either MDBR-Head alone (row 3) or ranking loss alone (row 4). This suggests a complementarity between the two: MDBR-Head reduces absolute errors through distribution modelling, while the ranking loss enforces ranking consistency via pairwise constraints. Finally, when CFIF, MDBR-Head, and the ranking loss were all enabled together, all metrics reached their best values, with MAE as low as 174.56 g and R² as high as 0.9715. The synergy among the three components is evident: CFIF provides high-quality cross-modal features, MDBR-Head decouples multi-granularity regression information, and the ranking loss corrects relative relationships among samples at the optimisation level. Together, they jointly improve both accuracy and robustness of biomass prediction in complex field environments.

3.5. Visualisation Analysis

To intuitively evaluate the performance of the proposed CIDR-MobileNet on the chilli pepper biomass prediction task, this section presents a visualisation analysis from four aspects: model convergence, fitting ability, error distribution, and stability.

3.5.1. Training Convergence and Stability Analysis

Figure 6 shows the training and validation loss curves over epochs for each fold under 10-fold cross-validation. The following observations can be made.

Fast convergence. During the first 20 epochs, the loss values of all folds drop rapidly, indicating that the proposed model quickly extracts effective information from the high-dimensional feature space and approaches the optimal solution. This property is particularly important for real-time applications such as harvester operations.

Generalisation ability. The validation loss curves (dashed lines) and training loss curves (solid lines) exhibit highly consistent trends in the later stages of training, converging synchronously without obvious divergence or a “cross-over” phenomenon. This suggests that the model does not over-fit the training set but instead learns feature representations that generalise well across samples.

Robustness analysis. Owing to differences in the distribution of data subsets, the initial loss fluctuations vary slightly among folds (e.g., Fold 4 and Fold 2 show somewhat larger fluctuations at the beginning). Nevertheless, all folds eventually converge steadily to a low loss range below 0.2. These results confirm that the model is robust to variations in sample distribution caused by different planting densities, growth stages, and lighting conditions, and can adapt to the diversity of real field scenarios.

3.5.2. Analysis of Prediction Fitting Ability

Figure 7 shows the scatter plot of ground-truth biomass versus predicted values across all test samples. Most data points are tightly distributed around the ideal reference line (y = x), and the trend line (orange solid line) closely coincides with the reference line, indicating that the model achieves favourable fitting ability.

One nuance worth pointing out is that the global R² displayed in Figure 7 is 0.967, slightly lower than the mean R² of 0.972 reported in Table 4. This difference is expected: the table reports the arithmetic mean of fold-wise R² values from ten separate validation sets, whereas the figure gives the overall R² computed from all predictions pooled together. Both values are essentially consistent and well above 0.96, reinforcing the model’s stability across different data splits.
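The gap between a fold-averaged R² and a pooled R² is easy to reproduce on toy data. The two folds below are invented for illustration and the helper name `r2` is an assumption; the point is only that the two aggregation schemes normalise by different variances and therefore need not agree.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for one set of predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Two toy validation folds: (ground truth, prediction) in grams.
folds = [
    ([1000.0, 2000.0, 3000.0], [1100.0, 2000.0, 2900.0]),
    ([4000.0, 5000.0, 6000.0], [4300.0, 5000.0, 5700.0]),
]

fold_mean_r2 = sum(r2(t, p) for t, p in folds) / len(folds)

pooled_true = [v for t, _ in folds for v in t]
pooled_pred = [v for _, p in folds for v in p]
pooled_r2 = r2(pooled_true, pooled_pred)
```

Each fold-wise R² is normalised by the variance within that fold, whereas the pooled value is normalised by the variance of the whole dataset. Depending on how the errors and the target spread are distributed over the folds, the pooled value can be higher or lower than the fold mean.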

Looking at the error distribution, the model is quite accurate in the low-biomass range (1000–4000 g), where the predictions cluster tightly around the diagonal. For samples above 5000 g, a few points fall slightly below the line, suggesting a mild underestimation of the heaviest plants. This is probably due to dense leaf occlusion at the late growth stage, which limits the visual cues available for biomass estimation. Even so, the overall MAE of 216.12 g indicates that the errors stay within a tolerable margin.

3.5.3. Residual Error Distribution Analysis

To further investigate the prediction performance and error statistical characteristics of the model across different biomass magnitudes, this study plots the residual distribution between predicted and true values (Figure 8). Specifically, Figure 8a presents the scatter distribution of residuals (Ground Truth − Predicted) against predicted biomass, where the black dashed line denotes the zero-error baseline. Figure 8b shows the frequency statistics of the residuals via a histogram, with the red dashed line indicating the zero-residual position and the pink dashed line marking the mean residual.

As seen in Figure 8a, the vast majority of data points are tightly distributed on both sides of the zero-residual line, with no obvious systematic curvature or tilt, indicating that the model does not exhibit significant systematic bias. The mean residual is 12.09 g, very close to zero, further supporting the unbiasedness of the predictions. Notably, as the predicted biomass increases (moving rightward along the horizontal axis), the vertical dispersion of residuals gradually expands, which indicates heteroscedasticity. In the low-biomass range (<3000 g), residuals are mostly confined within ±400 g; in the high-biomass range (>4000 g), some residuals spread to ±1000 g. This reflects the increased difficulty of prediction under high-biomass conditions due to intensified canopy occlusion and greater individual morphological variation.

The residual histogram in Figure 8b exhibits a near-normal, bell-shaped distribution, with the peak centred near zero residual, indicating that the prediction errors are mainly composed of random noise and that most predictions closely match the ground truth. The mean absolute error (MAE) is 216.12 g and the residual standard deviation (Std Residual) is 301.19 g, together defining the spread of the errors. Moreover, the pink dashed line representing the mean residual lies slightly to the right of the red dashed zero-residual line (positive region), suggesting a very mild underestimation tendency (GT − Pred > 0). This observation is consistent with the earlier finding that some points in the high-value region fall below the reference line, indicating that the model adopts a slightly conservative prediction strategy for extreme high-value samples.
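The residual statistics discussed here follow from the sign convention residual = ground truth − prediction, so a positive mean indicates underestimation. The sketch below uses invented values; the population standard deviation, the single split threshold, and the function names are assumptions, since the text does not say how the reported spread was computed.

```python
import statistics


def residual_summary(y_true, y_pred):
    """Mean, standard deviation and MAE of residuals (ground truth minus prediction)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return {
        "mean": statistics.fmean(residuals),
        "std": statistics.pstdev(residuals),       # population form, assumed
        "mae": statistics.fmean([abs(r) for r in residuals]),
    }


def max_abs_residual_by_range(y_true, y_pred, threshold):
    """Largest absolute residual below and above a predicted-biomass threshold."""
    low = [abs(t - p) for t, p in zip(y_true, y_pred) if p < threshold]
    high = [abs(t - p) for t, p in zip(y_true, y_pred) if p >= threshold]
    return max(low, default=0.0), max(high, default=0.0)


# Toy values in grams (illustrative only).
y_true = [1000.0, 2000.0, 5000.0, 6000.0]
y_pred = [1050.0, 1950.0, 4600.0, 6300.0]
summary = residual_summary(y_true, y_pred)
spread = max_abs_residual_by_range(y_true, y_pred, threshold=3000.0)
```

A larger maximum residual in the upper range than in the lower range is the numerical counterpart of the widening scatter described for Figure 8a.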

3.5.4. Model Stability Evaluation Based on K-Fold Cross-Validation

The previous results (Table 4 and the scatter plot) show the model performance on a single random data split (R² ≈ 0.96). To eliminate chance effects caused by a particular data partition and to further assess generalisation ability, a 10-fold cross-validation experiment was conducted.

As shown in Figure 9, the R² values of the ten folds range from 0.903 to 0.978, with a mean of 0.965 and a standard deviation of only 0.022. This result is highly consistent with the single-test-split performance, indicating that the model’s performance does not stem from a fortunate split that happened to produce an easy test set. Although two folds (field1 and field8) gave slightly lower R² values (0.903 and 0.909, respectively), all folds exceeded 0.90, and the overall fluctuation is small. This demonstrates that the model adapts well to different training–test data combinations, without severe performance oscillations due to variations in sample distribution.

In summary, the cross-validation results firmly confirm the reliability and robustness of the proposed model for chilli pepper biomass estimation.

4. Discussion

This study tackles the trade-off between accuracy and cost in non-destructive AGB monitoring for chilli pepper under field conditions. CIDR-MobileNet outperforms the baseline across all metrics and confirms that monocular pseudo-depth can serve as a practical substitute for active depth sensors. The discussion covers four aspects: feature interaction, distribution regression, application scenarios, and statistical significance.

4.1. Synergistic Enhancement of Model Performance by Core Modules

Most RGB-D fusion methods for plant phenotyping rely on channel concatenation or element-wise addition followed by a few convolutional layers. These approaches largely overlook the complementary yet conflicting nature of RGB texture and depth geometry. The baseline MobileNetV3_Small_Dual, using simple concatenation, achieved an MAE of 273.64 g and an R² of 0.955—clear room for improvement.

Our CFIF module explicitly models difference and product features across modalities, capturing both agreement (shared structures) and disagreement (complementary cues). Unlike gating-based or bilinear pooling methods, CFIF directly models pairwise cross-modal interactions rather than re-weighting channels or spatial positions.

Comparable studies are few. Sosa-Herrera et al. used Mask R-CNN with RGB aerial imagery for pepper health assessment, but did not estimate biomass [44]. Li et al. proposed a detection-segmentation framework for pepper seedlings and reported an R² of 0.813 between leaf area and dry weight—well below our 0.972, likely because field plants are more morphologically complex than greenhouse seedlings [45]. The BioUMixer model, which combines supervised contrastive regression with a U-shaped residual fusion network, achieved an MAE of 201.98 g and RMSE of 252.18 g on a Pepper_Biomass dataset—both worse than our 174.56 g and 230.74 g [46]. This suggests that explicit cross-modal interaction is more effective than contrastive learning for this task.

CFIF is computationally efficient. The learnable scaling parameters α_diff and α_prod, together with the residual factor β, act as a warm start that stabilises training, an issue rarely addressed in agricultural deep learning. In the ablation study (Table 6), CFIF alone reduces MAE from 273.64 g to 234.37 g (about 14%), and the full model reduces it by 36.2% to 174.56 g, while parameters increase only modestly from 1.86 M to 3.28 M. This efficiency is critical for real-time deployment on harvesters.
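The interaction idea can be illustrated with a minimal sketch. It assumes two feature vectors of equal length and a simple residual combination. The function name `cfif_fuse`, the default parameter values, the signed difference, and the way the terms are combined are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def cfif_fuse(
    rgb: Sequence[float],
    depth: Sequence[float],
    alpha_diff: float = 0.1,  # hypothetical initial value
    alpha_prod: float = 0.1,  # hypothetical initial value
    beta: float = 0.1,        # hypothetical initial value
) -> List[float]:
    """Fuse two equally sized feature vectors (illustrative sketch only)."""
    if len(rgb) != len(depth):
        raise ValueError("rgb and depth features must have the same length")
    fused = []
    for r, d in zip(rgb, depth):
        diff = r - d  # disagreement between modalities (complementary cues)
        prod = r * d  # agreement between modalities (shared structures)
        interaction = alpha_diff * diff + alpha_prod * prod
        # Residual form: with beta = 0 this reduces to plain additive fusion.
        fused.append((r + d) + beta * interaction)
    return fused


print(cfif_fuse([1.0, 2.0], [3.0, 2.0], alpha_diff=0.5, alpha_prod=0.5, beta=1.0))
# prints [4.5, 6.0]
```

Starting the three scalars near zero keeps the fused output close to plain additive fusion early in training, which is one plausible reading of the warm-start behaviour described above.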

4.2. Distribution Regression vs. Point Estimation

Most existing biomass models use point estimation with MSE, MAE, or Huber loss. They assume a deterministic feature-to-biomass mapping and ignore uncertainty from occlusion, lighting, and morphological variation. Schreiber et al. [11] used UAV RGB imagery and deep learning for wheat biomass, while Okada et al. [12] applied similar methods to soybean—both used point estimation without explicit uncertainty modelling.

By contrast, our MDBR-Head formulates biomass prediction as distribution regression: the biomass range is discretised into bins, a probability distribution over bins is predicted, and the final estimate is the expectation. To our knowledge, this is the first application of distribution regression to in-field chilli pepper biomass prediction. Distribution regression is inherently robust to label noise and outliers. Our ground-truth biomass comes from destructive sampling, which inevitably includes measurement errors and natural variability. Point-regression baselines (e.g., MobileNetV3_Small_Dual) tend to overfit such noise—evident from their larger error standard deviations. By modelling probability mass across intervals, the distribution branch learns which biomass values are more plausible for a given input, suppressing anomalous samples. This is supported by the results: CIDR-MobileNet achieved a MAPE of 9.56% and an R² of 0.972, clearly outperforming the baseline.
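The bin-and-expectation step can be sketched as follows. This is a minimal illustration that assumes equal-width bins, a softmax over bin logits, and bin centres as representative values; the function names and the bin layout are assumptions, since the text does not specify them.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centres(low: float, high: float, n_bins: int) -> List[float]:
    """Centres of n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def expected_biomass(logits: Sequence[float], low: float, high: float) -> float:
    """Expectation of the predicted distribution over biomass bins."""
    probs = softmax(logits)
    centres = bin_centres(low, high, len(logits))
    return sum(p * c for p, c in zip(probs, centres))


# Four equally likely bins over 0-400 g give the midpoint of the range.
print(expected_biomass([0.0, 0.0, 0.0, 0.0], 0.0, 400.0))  # prints 200.0
```

Because the estimate is a probability-weighted average, a single implausible bin receiving a small probability shifts the output only slightly, which is consistent with the robustness argument made above.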

Another novel aspect is the dynamic weighting of three branches (global, local, distribution). Most multi-branch networks use fixed weights or simple averaging. Here, learnable weights are generated from the global feature vector, allowing the model to adapt branch contributions to the input—emphasising the local branch for well-structured plants and the distribution branch for heavily occluded ones. Such adaptivity is rarely seen in agricultural regression models.
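A minimal sketch of input-dependent branch weighting is given below. It assumes a single linear layer followed by a softmax to turn the global feature vector into three weights; the layer shape, the softmax normalisation, and all names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_weights(
    global_feature: Sequence[float],
    weight_matrix: Sequence[Sequence[float]],  # one row per branch (learnable)
    bias: Sequence[float],                     # one entry per branch (learnable)
) -> List[float]:
    """Map the global feature vector to one non-negative weight per branch."""
    scores = [
        sum(w * g for w, g in zip(row, global_feature)) + b
        for row, b in zip(weight_matrix, bias)
    ]
    return softmax(scores)


def combine_branches(predictions: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the global, local, and distribution branch outputs."""
    return sum(p * w for p, w in zip(predictions, weights))


# With zero weights and biases every branch contributes equally.
w = branch_weights([0.2, -0.1], [[0.0, 0.0]] * 3, [0.0, 0.0, 0.0])
print(combine_branches([300.0, 600.0, 900.0], w))
```

With trained parameters the weights would differ per image, so the same three branch outputs can be blended differently for an occluded plant than for a clearly visible one.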

4.3. Application Scenarios and Comparison with Existing Solutions

Deploying biomass monitoring on harvesters requires balancing sensing cost, speed, and accuracy. High-precision studies often use LiDAR or structured-light cameras—expensive, bulky, and computationally demanding [24,25]. Low-cost RGB-only solutions, by contrast, lack sufficient three-dimensional structural information. Our study demonstrates that pseudo-depth generated by Depth Anything V2 offers a viable alternative.

Depth Anything V2 was selected for three key advantages over its predecessor: synthetic images for teacher training, larger teacher capacity, and lightweight student models trained on large-scale pseudo-labelled real images [30]. Compared with MiDaS and LeReS, Depth Anything V2 generalises better to diverse scenes and produces finer depth predictions. Recent benchmarks confirm these strengths. In wildlife monitoring, Depth Anything V2 achieved an MAE of 0.454 m and a correlation of 0.962, substantially outperforming ZoeDepth (MAE: 3.087 m) [47]. Compared with Stable Diffusion-based models such as Marigold, Depth Anything V2 is over ten times faster and more accurate [30]. Agricultural applications, such as crop segmentation with depth-informed pseudo-labelling and wheat spike detection, further support its suitability for plant phenotyping [38].

That said, pseudo-depth estimation has clear limitations. Monocular depth models are designed primarily for rigid objects and structured scenes, whereas plant canopies are non-rigid, fine-scaled, and highly irregular, which makes depth reconstruction intrinsically more difficult. Depth Anything V2 outputs relative rather than metric depth; absolute scale must be inferred from other cues, which matters when height or volume is a key predictor. The model was pre-trained on general-domain images, not agricultural data, and existing synthetic training sets lack sufficient agricultural scene diversity, potentially limiting generalisation to unseen crops or planting configurations. Despite these limitations, our results (Table 3) show that even without fine-tuning, pseudo-depth provides useful structural information: Depth-Only achieves a lower MAPE than RGB-Only (17.78% vs. 19.90%), and simple fusion of both modalities yields the lowest MAPE (14.86%) and the highest R² (0.955) of the three, although RGB-Only retains the lowest MAE (241.12 g).

Compared with other crop biomass studies, most use UAV-based high-resolution RGB images and heavy backbones such as ResNet or Vision Transformers, with tens of millions of parameters—making deployment on embedded devices difficult without substantial optimisation. BioUMixer [46] is competitive but complex. Schreiber et al. [11] and Okada et al. [12] both relied on UAV platforms and relatively heavy networks, limiting edge deployment. By contrast, CIDR-MobileNet was designed for edge deployment from the start: a MobileNetV3-Small backbone, depthwise separable convolutions in the local branch, and CFIF with negligible extra FLOPs. CPU latency is just 10.56 ms, and GPU memory footprint is low. Very few in-field biomass studies report such favourable efficiency while maintaining high accuracy.

Another distinction is plant-wise K-fold cross-validation. Many studies split images randomly, causing data leakage when multiple images of the same plant appear in both training and test sets, artificially inflating performance. By assigning all images of a given plant to the same fold, our evaluation provides a more realistic estimate of generalisation. The small standard deviations in Table 4 indicate that CIDR-MobileNet’s performance is stable across folds, suggesting it learns genuine biomass-related features rather than overfitting to individual plants.
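The plant-wise split can be sketched in a few lines. This is a minimal illustration, assuming each image carries a plant identifier; the function name, the round-robin assignment, and the example identifiers are hypothetical. The seed of 42 mirrors the random seed listed in Table 2.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def plant_wise_folds(
    plant_ids: Sequence[Hashable], n_folds: int = 10, seed: int = 42
) -> List[List[int]]:
    """Assign image indices to folds so that each plant stays in one fold."""
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    by_plant: Dict[Hashable, List[int]] = defaultdict(list)
    for index, plant in enumerate(plant_ids):
        by_plant[plant].append(index)
    plants = sorted(by_plant, key=str)   # fixed order before shuffling
    random.Random(seed).shuffle(plants)  # reproducible shuffle of plants
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for position, plant in enumerate(plants):
        folds[position % n_folds].extend(by_plant[plant])
    return folds


# Hypothetical identifiers: three plants, six images in total.
ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
for k, fold in enumerate(plant_wise_folds(ids, n_folds=2)):
    print(k, sorted(fold))
```

Splitting on plants rather than images is what prevents the leakage described above: no plant contributes images to both the training and the validation side of any fold.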

4.4. Statistical Significance

To test whether CIDR-MobileNet’s improvements over baselines are statistically significant, we performed paired tests. For each metric (MAE, RMSE, MAPE, R²), the ten fold-wise values of CIDR-MobileNet were compared with those of each baseline using paired t-tests (or Wilcoxon signed-rank tests when normality was violated, assessed by Shapiro–Wilk). Results are in Table 7.

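All comparisons against the baseline yielded p < 0.001 across all metrics. Against RepViT-0.9, the strongest competitor, all metrics reached p < 0.05, with RMSE and R² at p < 0.01 (MAE: p = 0.012; MAPE: p = 0.018). Note that RepViT-0.9 attains a slightly lower mean RMSE (226.71 g vs. 230.74 g, Table 4), so the RMSE difference does not favour CIDR-MobileNet. For the remaining metrics, these results indicate that the observed gains are unlikely to be due to chance and provide statistical support for the effectiveness of CFIF, MDBR-Head, and the ranking loss.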
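The fold-wise paired comparison can be sketched as below. This minimal example computes only the paired t statistic and compares it with the two-sided 5% critical value for 9 degrees of freedom (2.262), rather than computing an exact p-value. The function name is an assumption, and the ten fold values are made-up numbers for illustration, not results from this study.

```python
import math
import statistics
from typing import Sequence

T_CRIT_DF9 = 2.262  # two-sided 5% critical value, Student's t, 9 degrees of freedom


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Made-up fold-wise MAE values (g) for two hypothetical models.
model_a = [150.0, 180.0, 170.0, 160.0, 190.0, 175.0, 165.0, 185.0, 155.0, 172.0]
model_b = [240.0, 260.0, 300.0, 250.0, 280.0, 270.0, 255.0, 290.0, 245.0, 265.0]
t_value = paired_t_statistic(model_a, model_b)
print(abs(t_value) > T_CRIT_DF9)  # True means significant at the 5% level
```

Pairing by fold removes between-fold variation from the comparison, which is why a paired test is appropriate when both models are evaluated on identical splits.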

4.5. Limitations and Future Work

Several limitations should be noted. First, the dataset is modest (275 samples) and comes from a single region (Xinjiang, China) and a single season. This restricts generalisability. As other studies have shown, models trained on site-specific data often degrade when applied to different climates, soils, or cultivation practices. A single season also means the model has not seen inter-annual variation in weather, pests, or management—all of which affect canopy structure and biomass. Multi-site, multi-season data are needed to confirm robustness across environments and varieties. Also, our samples were visually healthy; plants with disease, pest damage, mechanical injury, or nutrient deficiency were excluded. This ensured data quality, but the model’s performance on stressed plants remains untested.

Second, pseudo-depth maps were generated by a pre-trained Depth Anything V2 without fine-tuning on our field data. While generalisation is good, domain-specific fine-tuning would likely improve depth accuracy, especially for dense chilli pepper canopies. Third, the current model predicts fresh weight only. Future multi-task versions could include dry matter or fruit yield.

Future work has three directions. First, expand the dataset to multiple seasons and at least three major pepper-growing regions in China (e.g., Xinjiang, Shandong, Hunan) with distinct climates and soils, enabling better assessment of generalisation and cross-region transfer. Second, explore domain adaptation and few-shot learning to reduce labelling costs for new targets, accelerating deployment to new crops and environments. Third, pursue model compression and hardware deployment—quantisation and pruning to reduce size while preserving accuracy, and field tests on an actual harvester with a Jetson Orin module to evaluate performance under vibration, dust, and variable lighting. Self-supervised or semi-supervised learning may also reduce reliance on labour-intensive destructive sampling.

Overall, CIDR-MobileNet offers a low-cost, high-accuracy, real-time solution for biomass monitoring in chilli pepper harvesting. Its explicit cross-modal interaction, distribution regression head, and lightweight design address key limitations of existing methods and provide a practical path towards intelligent agricultural perception.

5. Conclusions

In this study, a lightweight dual-branch network named CIDR-MobileNet was proposed for accurate above-ground biomass estimation of chilli pepper. By using monocular pseudo-depth as a substitute for active sensors, the method enables low-cost and high-precision field monitoring. The model was evaluated on 275 field-collected samples using a plant-wise 10-fold cross-validation strategy. The main findings are summarised as follows.

Pseudo-depth effectively compensated for the insufficient information provided by RGB images. In single-modal comparisons (Table 3), Depth-Only achieved a MAPE of 17.78%, which was better than that of RGB-Only (19.90%), and also showed smaller MAPE fluctuations across folds (±13.35% vs. ±26.99%). With the full dual-modal model, the MAE decreased to 174.56 g, the RMSE to 230.74 g, the R² reached 0.9715, and the MAPE was only 9.56%, all outperforming either single modality.

The core mechanisms brought substantial improvements. Compared with the baseline model (simple concatenation fusion plus point regression), the introduction of the CFIF explicit interaction module, the MDBR distribution regression head, and the ranking loss reduced the MAE by 36.2% and increased R² from 0.9553 to 0.9715. The distribution regression, which models probabilities over biomass intervals, enhanced robustness against occlusion and illumination variations.

The proposed model is lightweight and efficient, outperforming mainstream alternatives. It has only 3.28 M parameters, 0.18 GFLOPs, and a CPU latency of 10.56 ms. Compared with RepViT-0.9, the model achieved a 75% reduction in parameters, a 95% reduction in computational cost, and a 5-fold improvement in speed, while maintaining an almost identical MAE (174.56 g vs. 175.71 g). Compared with MobileViT-S, the MAE was reduced by 26.5%, with only one-third the parameters and one-sixteenth the computational cost.
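The ratios quoted above follow directly from Tables 4 and 5. The short check below recomputes them; the helper name `reduction_pct` is an assumption introduced here, and the numbers are copied from the tables.

```python
def reduction_pct(ours: float, other: float) -> float:
    """Relative reduction of `ours` with respect to `other`, in percent."""
    return (1.0 - ours / other) * 100.0


# Values from Table 5 (parameters in M, GFLOPs, CPU latency in ms)
# and Table 4 (MAE in g).
param_cut = reduction_pct(3.28, 13.15)    # vs. RepViT-0.9, about 75.1 %
flops_cut = reduction_pct(0.18, 3.57)     # vs. RepViT-0.9, about 95.0 %
speedup = 53.28 / 10.56                   # vs. RepViT-0.9, about 5.05x
mae_cut = reduction_pct(174.56, 237.40)   # vs. MobileViT-S, about 26.5 %
param_ratio = 3.28 / 9.88                 # vs. MobileViT-S, about 0.33
flops_ratio = 2.88 / 0.18                 # vs. MobileViT-S, about 16x

print(round(param_cut, 1), round(flops_cut, 1), round(speedup, 2))
print(round(mae_cut, 1), round(param_ratio, 2), round(flops_ratio, 1))
```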

In summary, CIDR-MobileNet achieves high-accuracy biomass prediction at very low cost, providing a feasible solution for real-time monitoring on intelligent agricultural machinery. Future work will focus on model quantisation and multi-site validation.

Author Contributions

Conceptualization, Y.W. and J.D.; methodology, Y.W. and J.D.; software, L.Y.; validation, S.R., L.Y. and W.H.; data curation, S.R. and W.W.; writing—original draft preparation, J.D.; writing—review and editing, Y.W.; visualisation, W.W. and W.H.; funding acquisition and supervision, P.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2022YFD2002001). The work presented in this study was partially supported by the sub-theme “Research and Development of Key Common Technologies and System for Intelligent Harvesting of Special Economic Crops”.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are available on request, due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AGB	above-ground biomass
NDVI	Normalised Difference Vegetation Index
NDRE	Normalised Difference Red Edge
CNN	convolutional neural network
UAV	unmanned aerial vehicle
UAS	unmanned aerial system
LiDAR	Light Detection and Ranging
MVS	multi-view stereo
CFIF	cross-feature interactive fusion
GAP	global average pooling
ReLU	Rectified Linear Unit
BN	batch normalisation
MDBR	multi-branch distribution regression
MAE	mean absolute error
MAPE	mean absolute percentage error
RMSE	root mean square error

References

1. Karim, K.M.R.; Rafii, M.Y.; Misran, A.B.; Ismail, M.F.B.; Harun, A.R.; Khan, M.M.H.; Chowdhury, M.F.N. Current and Prospective Strategies in the Varietal Improvement of Chilli (Capsicum annuum L.) Specially Heterosis Breeding. Agronomy 2021, 11, 2217.
2. Liu, Y.; Feng, H.; Yue, J.; Fan, Y.; Jin, X.; Song, X.; Yang, H.; Yang, G. Estimation of Potato Above-Ground Biomass Based on Vegetation Indices and Green-Edge Parameters Obtained from UAVs. Remote Sens. 2022, 14, 5323.
3. Smith, E.N.; van Aalst, M.; Tosens, T.; Niinemets, Ü.; Stich, B.; Morosinotto, T.; Alboresi, A.; Erb, T.J.; Gómez-Coronado, P.A.; Tolleter, D.; et al. Improving Photosynthetic Efficiency toward Food Security: Strategies, Advances, and Perspectives. Mol. Plant 2023, 16, 1547–1563.
4. Tai, S.; Tang, Z.; Li, B.; Wang, S.; Guo, X. Intelligent Recognition and Automated Production of Chili Peppers: A Review Addressing Varietal Diversity and Technological Requirements. Agriculture 2025, 15, 1200.
5. Buxbaum, N.; Lieth, J.H.; Earles, M. Non-destructive Plant Biomass Monitoring with High Spatio-Temporal Resolution via Proximal RGB-D Imagery and End-to-End Deep Learning. Front. Plant Sci. 2022, 13, 758818.
6. Munser, L.; Sathyanarayanan, K.K.; Raecke, J.; Mansour, M.M.; Uland, M.E.; Streif, S. Precise and Continuous Biomass Measurement for Plant Growth Using a Low-Cost Sensor Setup. Sensors 2025, 25, 4770.
7. Wang, J.; Chen, C.; Wang, J.; Yao, Z.; Wang, Y.; Zhao, Y.; Sun, Y.; Wu, F.; Han, D.; Yang, G.; et al. NDVI Estimation Throughout the Whole Growth Period of Multi-Crops Using RGB Images and Deep Learning. Agronomy 2025, 15, 63.
8. Cardenas-Gallegos, J.S.; Lacerda, L.N.; Severns, P.M.; Peduzzi, A.; Klimeš, P.; Ferrarezi, R.S. Advancing Biomass Estimation in Hydroponic Lettuce Using RGB-Depth Imaging and Morphometric Descriptors with Machine Learning. Comput. Electron. Agric. 2025, 234, 110299.
9. Dhawi, F.; Ghafoor, A.; Almousa, N.; Ali, S.; Alqanbar, S. Predictive Modelling Employing Machine Learning, Convolutional Neural Networks (CNNs), and Smartphone RGB Images for Non-Destructive Biomass Estimation of Pearl Millet (Pennisetum glaucum). Front. Plant Sci. 2025, 16, 1594728.
10. Carlier, A.; Dandrifosse, S.; Dumont, B.; Mercatoris, B. Comparing CNNs and PLSr for Estimating Wheat Organs Biophysical Variables Using Proximal Sensing. Front. Plant Sci. 2023, 14, 1204791.
11. Schreiber, L.V.; Atkinson Amorim, J.G.; Guimarães, L.; Motta Matos, D.; Maciel da Costa, C.; Parraga, A. Above-Ground Biomass Wheat Estimation: Deep Learning with UAV-Based RGB Images. Appl. Artif. Intell. 2022, 36, 2055392.
12. Okada, M.; Barras, C.; Toda, Y.; Hamazaki, K.; Ohmori, Y.; Yamasaki, Y.; Takahashi, H.; Takanashi, H.; Tsuda, M.; Hirai, M.Y.; et al. High-Throughput Phenotyping of Soybean Biomass: Conventional Trait Estimation and Novel Latent Feature Extraction Using UAV Remote Sensing and Deep Learning Models. Plant Phenomics 2024, 6, 0244.
13. Jin, Z.; Hong, W.; Wang, Y.; Jiang, C.; Zhang, B.; Sun, Z.; Liu, S.; Lv, C. A Transformer-Based Symmetric Diffusion Segmentation Network for Wheat Growth Monitoring and Yield Counting. Agriculture 2025, 15, 670.
14. Bazrafkan, A.; Igathinathane, C.; Bandillo, N.; Flores, P. Optimizing Integration Techniques for UAS and Satellite Image Data in Precision Agriculture—A Review. Front. Remote Sens. 2025, 6, 1622884.
15. Liu, Y.; Feng, H.; Yue, J.; Fan, Y.; Bian, M.; Ma, Y.; Jin, X.; Song, X.; Yang, G. Estimating Potato Above-Ground Biomass by Using Integrated Unmanned Aerial System-Based Optical, Structural, and Textural Canopy Measurements. Comput. Electron. Agric. 2023, 213, 108229.
16. Castro, W.; Marcato Junior, J.; Polidoro, C.; Osco, L.P.; Gonçalves, W.; Rodrigues, L.; Santos, M.; Jank, L.; Barrios, S.; Valle, C.; et al. Deep Learning Applied to Phenotyping of Biomass in Forages with UAV-Based RGB Imagery. Sensors 2020, 20, 4802.
17. Zhu, S.; Zhang, W.; Yang, T.; Wu, F.; Jiang, Y.; Yang, G.; Zain, M.; Zhao, Y.; Yao, Z.; Liu, T.; et al. Combining 2D Image and Point Cloud Deep Learning to Predict Wheat Above Ground Biomass. Precis. Agric. 2024, 25, 3139–3166.
18. Deng, L.; Fu, B.; Wu, Y.; He, H.; Sun, W.; Jia, M.; Deng, T.; Fan, D. Comparison of 2D and 3D Vegetation Species Mapping in Three Natural Scenarios Using UAV-LiDAR Point Clouds and Improved Deep Learning Methods. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103588.
19. Chang, J.; Walsh, J.J.; Mangina, E.; Ruan, J.; Negrão, S. AI-Driven 3D Point Cloud Analysis in Plant Phenotyping: A Systematic Review. Smart Agric. Technol. 2026, 13, 101869.
20. Li, Y.; Liang, Z.; Liu, B.; Yin, L.; Wan, F.; Qian, W.; Qiao, X. Applications of 3D Reconstruction Techniques in Crop Canopy Phenotyping: A Review. Agronomy 2025, 15, 2518.
21. Li, S.; Cui, Z.; Yang, J.; Wang, B. A Review of Optical-Based Three-Dimensional Reconstruction and Multi-Source Fusion for Plant Phenotyping. Sensors 2025, 25, 3401.
22. Wu, Z.; He, N.; Wu, Y.; Guo, X.; Wen, W. Point Cloud Data-Driven Methods for Estimating Maize Leaf Biomass. Smart Agric. 2026, 8, 156–166.
23. Wang, F.; Jia, W.; Guo, H.; Zhang, X.; Li, D.; Li, Z.; Sun, Y. Point Cloud-Based Crown Volume Improves Tree Biomass Estimation: Evaluating Different Crown Volume Extraction Algorithms. Comput. Electron. Agric. 2024, 225, 109288.
24. Verma, M.K.; Yadav, M. 3D LiDAR-Based Techniques and Cost-Effective Measures for Precision Agriculture: A Review. Rev. Int. Geomat. 2025, 34, 855–879.
25. Rosell-Polo, J.R.; Gregorio, E.; Gené, J.; Llorens, J.; Torrent, X.; Arnó, J.; Escolà, A. Kinect v2 Sensor-Based Mobile Terrestrial Laser Scanner for Agricultural Outdoor Applications. IEEE/ASME Trans. Mechatron. 2017, 22, 2420–2427.
26. Liu, X.; Xu, K.; Shinoda, R.; Santo, H.; Okura, F. Masks-to-Skeleton: Multi-View Mask-Based Tree Skeleton Extraction with 3D Gaussian Splatting. Sensors 2025, 25, 4354.
27. Liu, Z.; Liu, J.; Zuo, X.; Hu, M. Multi-Scale Iterative Refinement Network for RGB-D Salient Object Detection. Eng. Appl. Artif. Intell. 2021, 106, 104473.
28. Chen, H.; Zhou, H.; Zhang, Y.; Lin, Z.; Deng, Y. Dissecting RGB-D Learning for Improved Multi-Modal Fusion. IEEE Trans. Image Process. 2026, 35, 1846–1857.
29. Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2024; pp. 10371–10381.
30. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. Adv. Neural Inf. Process. Syst. 2024, 37, 21875–21911.
31. Shang, Y.; Sun, G.; Zhang, H. DepthCL-Seg: Dual-Stream Feature Fusion for Green Fruit Instance Segmentation Based on Monocular Depth. Agriculture 2026, 16, 283.
32. Li, Y.; Song, J.; Chen, C.; Liu, X. Learning Modality Complementarity for RGB-D Salient Object Detection via Dynamic Neural Network. Electronics 2026, 15, 1361.
33. Yang, Y.; Huang, N.; Zhang, Q.; Han, J. Mitigating Fusion Bias for RGB-D Salient Object Detection. Pattern Recognit. 2026, 171, 112135.
34. Gonçalves, F.M.F.; Pedronette, D.C.G.; da Silva Torres, R. Regression by Re-Ranking. Pattern Recognit. 2023, 140, 109577.
35. Sun, T.; Sheldon, D.; O’Connor, B. A Probabilistic Approach for Learning with Label Proportions Applied to the US Presidential Election. In 2017 IEEE International Conference on Data Mining (ICDM); IEEE: New York, NY, USA, 2017; pp. 445–454.
36. Paul, A.; Machavaram, R. Evolution of Sweet Pepper (Capsicum annuum) Harvesting: From Traditional Practices to Mechanization and Robotics. Curr. Res. Agric. Sci. 2025, 2, 100047.
37. Han, D.; Wang, C.; Zhang, H.; Pang, H.; Wang, X.; Chen, X.; Wen, X. Advances in Mechanized Harvesting Technologies and Equipment for Chili Peppers. Agriculture 2025, 15, 1129.
38. Cao, S.; Xu, B.; Zhou, W.; Zhou, L.; Zhang, J.; Zheng, Y.; Hu, W.; Han, Z.; Lu, H. The blessing of Depth Anything: An almost unsupervised approach to crop segmentation with depth-informed pseudo labeling. Plant Phenomics 2025, 7, 100005.
39. Papić, V.; Bugarin, N.; Marin, I.; Gotovac, S.; Gugić, J. Monocular depth estimation driven canopy segmentation for enhanced determination of vegetation indices in olive grove monitoring. Remote Sens. 2025, 17, 3245.
40. Zhuo, Y.; You, F. 3D plant phenotyping from a single image: Learning fine-scale organ morphology with monocular depth estimation. Comput. Electron. Agric. 2025, 239, 110925.
41. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Weyand, T.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2019; pp. 1314–1324.
42. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
43. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2018; pp. 4510–4520.
44. Sosa-Herrera, J.A.; Alvarez-Jarquin, N.; Cid-Garcia, N.M.; López-Araujo, D.J.; Vallejo-Pérez, M.R. Automated Health Estimation of Capsicum annuum L. Crops by Means of Deep Learning and RGB Aerial Images. Remote Sens. 2022, 14, 4943.
45. Li, H.; Shi, D.; Shi, H.; Li, M.; Diao, M. A Stage-Aware Cascaded Detection–Segmentation Framework for Leaf Phenotyping and Leaf Dry Biomass Estimation of Pepper Seedlings. Plants 2026, 15, 1912.
46. Zhang, R.; Peng, J.; Chen, H.; Peng, H.; Wang, Y.; Jiang, P. Enhancing Aboveground Biomass Prediction through Integration of the SCDR Paradigm into the U-Like Hierarchical Residual Fusion Model. Sensors 2024, 24, 2464.
47. Niccoli, N.; Seidenari, L.; Greco, I.; Rovero, F. Benchmark on Monocular Metric Depth Estimation in Wildlife Setting. In Image Analysis and Processing—ICIAP 2025 Workshops; Rodolà, E., Galasso, F., Masi, I., Eds.; Springer Nature: Cham, Switzerland, 2026; pp. 529–539.

Figure 1. Field data acquisition platform and representative images. (a) The mobile harvester platform equipped with an RGB camera operating in the chilli pepper field under real-world conditions. (b) A sample high-resolution RGB image captured by the onboard camera.

Figure 2. Examples of the constructed multi-modal chilli pepper biomass dataset. (a) Original RGB images collected in the field under different light intensities, which completely preserve the colour, texture, and morphological features of the chilli pepper plants. (b) Corresponding pseudo-depth maps generated by the Depth Anything V2 model, where the grayscale values intuitively reflect the relative height and spatial topology of the chilli pepper canopy.

Figure 3. Overall architecture of the proposed CIDR-MobileNet. The framework consists of three core components: (1) a dual-branch lightweight backbone (MobileNetV3-Small) for extracting features from RGB images and pseudo-depth maps generated by Depth Anything V2; (2) a Cross-Feature Interactive Fusion (CFIF) module to adaptively fuse RGB, depth, difference, and product features; and (3) a Multi-Branch Distribution Regression Head (MDBR-Head) that combines global, local, and distribution-aware branches for robust above-ground biomass prediction.

Figure 4. Schematic diagram of the proposed Cross-Feature Interactive Fusion (CFIF) module. The CFIF module explicitly models the complementary relationship between RGB and pseudo-depth features. It generates difference and product features to capture modal discrepancies and correlations, then adaptively fuses these multi-modal cues with a residual connection to enhance feature representation for biomass estimation.

Figure 5. Architecture of the proposed Multi-Branch Distribution Regression (MDBR) head.

Figure 6. K-Fold cross-validation loss curves. The solid lines represent the training loss, and the dashed lines denote the validation loss. The curves illustrate the convergence performance of the model over 100 epochs.

Figure 7. Scatter plot of Ground Truth vs. Predicted values across all folds.

Figure 8. Model prediction residual error distribution analysis. (a) Residual scatter plot. (b) Residual histogram.

Figure 9. Performance of R² across ten folds in cross-validation. The bar chart shows R² for each fold, with error bars indicating the standard deviation. The dashed line denotes the mean R² (0.965).

Table 1. Dataset partitioning and statistics.

Dataset	Total Samples (N)	Target Variable Range (g)	Training/Validation Split (10-fold)	Mean ± Std of Biomass (g)
Dataset Description	275	271–6451	90%/10%	2724.51 ± 1683.95

Table 2. Basic hyperparameter settings.

Parameter	Setting
Optimizer	AdamW
Initial learning rate	0.0001
Batch size	16
Training epochs	100
Weight decay	0.0001
Input size	224 × 224
Loss function	Dynamic loss (Huber Loss for the first 10 epochs; thereafter Huber Loss + 0.3 × Ranking Loss)
K-fold cross-validation	10 folds
Random seed	42
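The loss schedule in Table 2 can be sketched as follows. Only the schedule itself (Huber loss alone for the first 10 epochs, then Huber loss plus 0.3 times a ranking loss) comes from the table. The pairwise hinge form of the ranking term, the Huber threshold `delta`, the zero margin, and all function names are assumptions made for illustration.

```python
from typing import Sequence


def mean_huber(preds: Sequence[float], targets: Sequence[float], delta: float = 1.0) -> float:
    """Mean Huber loss; delta is an assumed threshold."""
    total = 0.0
    for p, t in zip(preds, targets):
        err = abs(p - t)
        total += 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)
    return total / len(preds)


def pairwise_ranking(preds: Sequence[float], targets: Sequence[float], margin: float = 0.0) -> float:
    """Assumed pairwise hinge loss: penalise pairs predicted in the wrong order."""
    losses = []
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            if targets[i] == targets[j]:
                continue  # ties carry no ordering information
            sign = 1.0 if targets[i] > targets[j] else -1.0
            losses.append(max(0.0, margin - sign * (preds[i] - preds[j])))
    return sum(losses) / len(losses) if losses else 0.0


def dynamic_loss(
    preds: Sequence[float],
    targets: Sequence[float],
    epoch: int,               # counted from 0
    warmup_epochs: int = 10,  # from Table 2
    rank_weight: float = 0.3, # from Table 2
) -> float:
    loss = mean_huber(preds, targets)
    if epoch >= warmup_epochs:
        loss += rank_weight * pairwise_ranking(preds, targets)
    return loss


print(dynamic_loss([2.0, 1.0], [1.0, 2.0], epoch=0))   # prints 0.5
print(dynamic_loss([2.0, 1.0], [1.0, 2.0], epoch=10))  # prints 0.8
```

Delaying the ranking term lets the regression output settle on a sensible scale before pairwise order constraints are added.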

Table 3. Comparison of multi-modal experimental results.

Models	MAE (g)	MAPE (%)	RMSE (g)	R² (95% CI)	Parameters (M)	Size (M)
RGB_Only	241.12	19.90	296.77	0.943 [0.912, 0.974]	0.928	0.928
Depth_Only	311.64	17.78	403.71	0.923 [0.881, 0.966]	0.928	0.928
Dual	273.64	14.86	345.58	0.955 [0.917, 0.993]	1.855	1.855

Note: All metrics are reported as the mean over ten folds of cross-validation (N = 275, biomass range: 271–6451 g). The standard deviations across the ten folds are: for RGB_Only, MAE = ±122.46 g, MAPE = ±26.99%, RMSE = ±146.34 g; for Depth_Only, MAE = ±127.83 g, MAPE = ±13.35%, RMSE = ±167.85 g; for Dual, MAE = ±109.73 g, MAPE = ±9.59%, RMSE = ±131.84 g. The 95% confidence intervals for R² (in brackets) were computed using the t-distribution with 9 degrees of freedom. Model parameters and file size are independent of cross-validation folds.
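The confidence intervals described in the note can be reproduced with a short calculation. The sketch below assumes the interval is mean ± t × s/√n over the fold-wise values, with the two-sided 95% critical value for 9 degrees of freedom (2.262). The function name and the example values are hypothetical and are not fold results from this study.

```python
import math
import statistics
from typing import Sequence, Tuple

T_975_DF9 = 2.262  # 97.5th percentile of Student's t with 9 degrees of freedom


def t_confidence_interval(
    values: Sequence[float], t_crit: float = T_975_DF9
) -> Tuple[float, float]:
    """Two-sided interval mean +/- t_crit * s / sqrt(n).

    The default t_crit is only valid for n = 10 values (9 degrees of freedom).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical fold-wise R² values, for illustration only.
folds = [0.95, 0.97, 0.96, 0.98, 0.94, 0.97, 0.96, 0.99, 0.95, 0.97]
low, high = t_confidence_interval(folds)
print(round(low, 3), round(high, 3))
```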

Table 4. Comparison of experimental results for chilli pepper biomass prediction.

Models	MAE (g)	MAPE (%)	RMSE (g)	R² (95% CI)
ShuffleNetV2	294.09	19.52	367.48	0.934 [0.890, 0.978]
GhostNetV3 Small	290.24	19.74	366.63	0.933 [0.895, 0.970]
MobileNetV3_Large	309.10	24.71	380.73	0.934 [0.892, 0.976]
EfficientNet-B0	238.52	13.24	307.40	0.956 [0.922, 0.990]
MobileViT-S	237.40	16.88	293.05	0.955 [0.922, 0.988]
Tiny ViT 5M	277.93	19.21	357.40	0.938 [0.901, 0.974]
RepViT-0.9	175.71	11.95	226.71	0.968 [0.956, 0.979]
MobileNetV3_Small (Baseline)	273.64	14.86	345.58	0.955 [0.917, 0.993]
CIDR-MobileNet (Ours)	174.56	9.56	230.74	0.972 [0.945, 0.998]

Note: All metrics are reported as the mean over ten folds of cross-validation (N = 275, biomass range: 271–6451 g). The standard deviations across the ten folds are: for ShuffleNetV2, MAE = ±158.43 g, MAPE = ±12.00%, RMSE = ±178.17 g; GhostNetV3 Small, MAE = ±135.37 g, MAPE = ±10.80%, RMSE = ±146.10 g; MobileNetV3_Large, MAE = ±161.34 g, MAPE = ±34.29%, RMSE = ±200.17 g; EfficientNet-B0, MAE = ±127.40 g, MAPE = ±10.67%, RMSE = ±157.34 g; MobileViT-S, MAE = ±96.29 g, MAPE = ±16.07%, RMSE = ±121.08 g; Tiny ViT 5M, MAE = ±129.22 g, MAPE = ±14.06%, RMSE = ±159.45 g; RepViT-0.9, MAE = ±60.44 g, MAPE = ±11.53%, RMSE = ±77.60 g; MobileNetV3_Small (Baseline), MAE = ±109.73 g, MAPE = ±9.59%, RMSE = ±131.84 g; CIDR-MobileNet (Ours), MAE = ±82.02 g, MAPE = ±5.26%, RMSE = ±111.33 g. The 95% confidence intervals for R² (in brackets) were computed using the t-distribution with 9 degrees of freedom.

Table 5. Comparison of model complexity and inference speed.

Models	Parameters (M)	Size (M)	GFLOPs	CPU Latency (ms)	Training Time
ShuffleNetV2	3.29	12.91	0.36	18.27	1 h 56 m 42 s
GhostNetV3 Small	4.23	18.10	0.57	102.36	2 h 11 m 32 s
MobileNetV3_Large	5.95	22.87	0.44	19.66	2 h 30 m 46 s
EfficientNet-B0	8.02	30.90	0.12	40.47	2 h 19 m 23 s
MobileViT-S	9.88	37.77	2.88	40.92	2 h 35 m 8 s
Tiny ViT 5M	10.10	38.98	1.69	36.47	3 h 24 m 45 s
RepViT-0.9	13.15	50.23	3.57	53.28	2 h 32 m 48 s
MobileNetV3_Small (Baseline)	1.86	7.17	0.12	10.86	2 h 1 m 8 s
CIDR-MobileNet (Ours)	3.28	12.59	0.18	10.56	2 h 2 m 51 s

Table 6. Ablation experiment results for chilli pepper biomass prediction.

CFIF	MDBR-Head	Ranking Loss	MAE (g)	MAPE (%)	RMSE (g)	R²
×	×	×	273.64	14.86	345.58	0.955
√	×	×	234.37	11.69	304.40	0.956
×	√	×	214.68	11.33	285.01	0.959
×	×	√	219.82	12.28	283.66	0.957
√	√	×	220.27	12.57	277.59	0.962
×	√	√	208.18	10.22	266.83	0.961
√	×	√	225.84	13.08	277.86	0.963
√	√	√	174.56	9.56	230.74	0.972

Note: All metrics are reported as the mean over ten folds of cross-validation (N = 275, biomass range: 271–6451 g). The standard deviations across the ten folds are: for the baseline (all ×), MAE = ±109.73 g, MAPE = ±9.59%, RMSE = ±131.84 g; CFIF only, MAE = ±111.20 g, MAPE = ±9.90%, RMSE = ±148.20 g; MDBR-Head only, MAE = ±111.37 g, MAPE = ±8.30%, RMSE = ±152.31 g; Ranking Loss only, MAE = ±94.42 g, MAPE = ±8.85%, RMSE = ±120.77 g; CFIF + MDBR, MAE = ±122.90 g, MAPE = ±11.00%, RMSE = ±158.47 g; MDBR + Ranking Loss, MAE = ±102.73 g, MAPE = ±5.44%, RMSE = ±139.48 g; CFIF + Ranking Loss, MAE = ±107.83 g, MAPE = ±9.77%, RMSE = ±122.21 g; full model (all √), MAE = ±82.02 g, MAPE = ±5.26%, RMSE = ±111.33 g. The 95% confidence intervals for R² (in brackets) were computed using the t-distribution with 9 degrees of freedom. √ indicates that the corresponding component is enabled; × indicates disabled. The best result in each column is highlighted in bold.

Table 7. Statistical significance (p-values from paired t-tests over ten folds).

Comparison	MAE	MAPE	RMSE	R²
CIDR-MobileNet vs. MobileNetV3_Small (Baseline)	<0.001	<0.001	<0.001	<0.001
CIDR-MobileNet vs. RepViT-0.9	0.012	0.018	<0.001	0.009
CIDR-MobileNet vs. EfficientNet-B0	<0.001	<0.001	<0.001	<0.001
CIDR-MobileNet vs. MobileViT-S	<0.001	<0.001	<0.001	0.002
