Wide + Tiles Vision Transformer Framework for Smartphone-Based Grassland Biomass Prediction in Heterogeneous Field Conditions

Arystanova, Ranida; Zeinulla, Darkhan; Kabzhanova, Gulnara; Bissembayev, Anuarbek; Bekseitova, Roza; Sarsekova, Dani; Saule, Bakhbayeva; Arystanov, Asset; Sagin, Janay; Nurtay, Margulan

doi:10.3390/agriculture16131401


by Ranida Arystanova ¹, Darkhan Zeinulla ², Gulnara Kabzhanova ³, Anuarbek Bissembayev ⁴, Roza Bekseitova ¹, Dani Sarsekova ⁵, Bakhbayeva Saule ⁶, Asset Arystanov ¹, Janay Sagin ⁷,⁸,* and Margulan Nurtay ²,*

¹ Faculty of Geography and Environmental Sciences, Farabi University, 71 Al-Farabi, 050040 Almaty, Kazakhstan
² Information and Computing Systems Department, Abylkas Saginov Karagandy Technical University, 100027 Karagandy, Kazakhstan
³ LLP “Skyterra”, 38 Alikhan Bokeikhan St., 010000 Astana, Kazakhstan
⁴ LLP “Scientific and Production Centre for Animal Husbandry and Veterinary”, Kenesary 40, 010000 Astana, Kazakhstan
⁵ Faculty of Forestry and Land Resources, Kazakh National Agrarian Research University, 8 Abay, 050010 Almaty, Kazakhstan
⁶ Department of Biology and Ecology, Toraighyrov University, 64 Lomov St., 140008 Pavlodar, Kazakhstan
⁷ School of Information Technology and Engineering (SITE), Kazakh British Technical University, 050005 Almaty, Kazakhstan
⁸ Department of Geological and Environmental Sciences, Western Michigan University, Kalamazoo, MI 49008, USA
* Authors to whom correspondence should be addressed.

Agriculture 2026, 16(13), 1401; https://doi.org/10.3390/agriculture16131401 (registering DOI)

Submission received: 18 May 2026 / Revised: 23 June 2026 / Accepted: 23 June 2026 / Published: 27 June 2026

(This article belongs to the Topic Advances in Smart Agriculture with Remote Sensing as the Core and Its Applications in Crops Field, 2nd Edition)


Abstract

This study addresses the problem of accurate and rapid aboveground biomass estimation in rangeland ecosystems: traditional clipping-based methods are labor-intensive, while modern remote sensing techniques often require expensive equipment and controlled conditions. The goal of this work is to develop an efficient and accessible approach for estimating the biomass of natural pastures from ground-level RGB images captured with smartphones. For this purpose, a dataset of 1196 field images with corresponding biomass values collected from 40 districts in southern Kazakhstan was used, and a wide + tiles architecture based on the DINOv3 Vision Transformer was proposed. The model used attention pooling and feature fusion to integrate global and local features, and several preprocessing and augmentation strategies were compared. Experimental results showed that the proposed method achieves high accuracy (best result: R² = 0.733, MAE ≈ 0.779 c/ha), with DINOv3 showing a clear advantage over ConvNeXtV2. Furthermore, the impact of preprocessing strategies was minimal, whereas high-resolution input images proved important. The results show that the proposed method performs consistently under heterogeneous field conditions and allows for reliable biomass estimation without the need for specialized equipment. This makes it a practical tool for monitoring pastures, planning forage supply, and supporting agronomic decision-making.

Keywords:

grassland biomass estimation; vision transformer; DINOv3; smartphone-based monitoring; rangeland management; deep learning; RGB imagery; Kazakhstan pastures

1. Introduction

Pasture ecosystems are among the most widespread on the planet and perform key functions in the global carbon cycle, the maintenance of biodiversity, and the provision of a forage base for livestock production [1,2]. The Republic of Kazakhstan ranks fifth in the world in terms of pasture area, and approximately 40% of the country’s population depends directly or indirectly on these resources [3]. Under conditions of increasing anthropogenic pressure, climate change, and post-crisis transformations of post-Soviet land use, monitoring the state of vegetation cover and assessing aboveground biomass acquire strategic importance for the sustainable management of rangelands [3]. Accurate and timely biomass assessment is essential for determining permissible grazing pressure, identifying degraded areas, and supporting evidence-based agronomic decision-making at both the farm and regional levels [4,5]. Yet, despite this need, reliable and scalable biomass assessment methods remain inaccessible for many practitioners in remote pastoral regions.

The traditional method for biomass estimation—clipping, drying, and weighing plant samples within sampling quadrats—is considered the most accurate reference approach; however, it is labor-intensive, destructive to vegetation cover, and unsuitable for regular large-scale monitoring [6,7,8]. This limitation has stimulated the rapid development of remote sensing methods employing unmanned aerial vehicles (UAVs), satellite sensors, and ground-based instrumental systems. Among UAV-based approaches, photogrammetric reconstruction using the Structure from Motion (SfM) technique has become widely adopted, enabling the generation of three-dimensional models of vegetation structure for estimating canopy height and biomass. Each of these directions offers certain advantages, yet also introduces specific constraints in terms of cost, equipment requirements, or applicability under heterogeneous field conditions.

UAV-based approaches have provided important benchmarks for grassland and forage biomass estimation because they can capture high-resolution structural information. UAV/SfM and UAV RGB methods have achieved strong performance in grassland, forage, and crop systems, including canopy-height-based biomass estimation, Random Forest models using RGB-derived indices, and neural-network-based prediction from UAV imagery [1,6,7,8,9,10,11,12]. In parallel, RGB-based deep learning models have shown that visual cues such as canopy density, color, texture, and structural patterns can be informative for biomass prediction [13,14,15,16]. However, many of these studies rely on UAV imagery, managed experimental plots, crop-specific datasets, or structured acquisition protocols, which limits their direct transferability to low-cost ground-level monitoring of heterogeneous natural rangelands.

Satellite-based and multisource remote sensing approaches provide broader spatial coverage and are useful for regional-scale biomass monitoring [17,18,19,20,21]. However, they are constrained by spatial resolution, cloud contamination, mixed pixels, limited sensitivity to fine-scale canopy structure, and the need for field calibration. Recent RGB-based deep learning studies also show that model robustness across different locations, vegetation communities, soil backgrounds, illumination conditions, and sampling seasons remains a major challenge [22,23]. In addition, attention-based image analysis methods have shown potential for extracting complex spatial patterns from remote-sensing imagery [24], but such approaches still need to be adapted and validated for ground-level pasture biomass estimation.

The studies most directly related to the present work are those using ground-based or smartphone-captured RGB imagery. Smartphone RGB images combined with transfer learning and machine learning regressors have been used for non-destructive biomass estimation of pearl millet, with XGBoost achieving R² = 0.98 and RMSE = 0.26 when using a comprehensive feature set [25]. However, this work focused on a specific crop and was conducted under controlled image-acquisition conditions, which limits its direct transferability to natural multi-species pastures. A public ground-based pasture dataset from Australia provided 1162 top-view images collected across 19 locations, with each 70 cm × 30 cm quadrat paired with component-wise biomass, vegetation height, and NDVI measurements [26]. This dataset is important for pasture biomass modeling, but its standardized top-view quadrat protocol differs from wide-field smartphone images collected under natural pasture conditions.

Recent field-oriented work has begun to address this limitation. Woodrow et al. used digital photographs from grazing lands to estimate pasture biomass with continuous-output neural networks based on DenseNet121 [27]. Their dataset was created from an archive of mobile-phone quadrat photographs and corresponding hand-cut biomass measurements collected across diverse field monitoring sites. The final dataset included 1007 images from ten general locations, derived from 73 sites sampled between 2003 and 2021. Unlike laboratory or standardized datasets, the photographs were collected without a consistent imaging protocol: camera height, angle, orientation, lighting, shadows, and background conditions varied substantially. The authors reported that individual quadrat errors remained large, but site-mean biomass estimates were more promising, and model performance decreased on sites with vegetation conditions different from those represented in the training data. This study is particularly relevant because it demonstrates both the practical value and the difficulty of using ground-based RGB photographs under uncontrolled field conditions.

More recently, Mandal evaluated vision foundation models and cross-view fusion strategies on the CSIRO Pasture Biomass benchmark [28]. The benchmark consists of 357 dual-view images from 19 sites across four Australian states collected between 2014 and 2017, with laboratory-validated component-wise biomass targets obtained by harvesting, sorting, oven-drying, and weighing vegetation from 70 cm × 30 cm quadrats. Across 17 configurations, DINOv3-ViT-L with a two-layer gated depthwise convolution fusion module achieved the best weighted R² = 0.903, outperforming cross-view attention transformers (R² = 0.833), bidirectional state-space models (R² = 0.819), full Mamba fusion (R² = 0.793), and the no-fusion identity baseline (R² = 0.819). The study also showed that backbone pretraining scale was more important than increasing fusion complexity, and that the DINOv2-to-DINOv3 upgrade alone improved performance by approximately 5 R² points. Nevertheless, this benchmark still relies on quadrat-level dual-view imagery and does not fully represent wide-field smartphone photographs containing sky, terrain, distant background, bare soil, and other elements typical of natural rangeland scenes.

The literature reviewed above reveals three central gaps. First, UAV/SfM, multispectral, and satellite-based approaches can achieve strong performance, but they often require specialized sensors, structured acquisition protocols, canopy height information, or additional processing steps [1,6,7,8,9,10,11,12,17,18,19,20,21,29,30,31]. Second, many RGB-based deep learning models have been validated on crop-specific, UAV-based, laboratory, or standardized quadrat datasets, which limits their demonstrated transferability to heterogeneous natural pastures [13,14,15,16,22,23,25,26]. Third, directly relevant ground-based RGB studies remain limited: existing smartphone or proximal-image approaches either focus on controlled crop conditions, standardized top-view quadrats, or still report substantial uncertainty under uncontrolled field conditions [25,26,27,28]. Therefore, there remains a need for a validated low-cost approach for aboveground biomass estimation from ground-level smartphone RGB images in the arid and semi-arid rangelands of Kazakhstan.

In addition to these application-specific gaps, the potential of modern self-supervised Vision Transformer architectures for biomass estimation from ground-level RGB images remains insufficiently explored. DINOv3 provides strong pretrained visual representations and has shown promising performance in sparse agricultural regression settings [28,32,33]. Unlike conventional convolutional models, Vision Transformers can capture long-range spatial dependencies through self-attention, which may be useful for pasture images where biomass-relevant cues are distributed across both the full scene and local vegetation patches [34]. ConvNeXtV2 provides a strong modern convolutional baseline for evaluating whether transformer-based representations offer an advantage over advanced CNN-type features [35]. However, these architectures have not yet been sufficiently evaluated for real-world smartphone-based biomass estimation in heterogeneous natural rangelands.

The aim of this study is to develop and evaluate a low-cost, scalable approach for estimating aboveground biomass of natural pastures in Kazakhstan using ground-level RGB images acquired with smartphones and pretrained deep vision models. To address the identified gaps, this study proposes a wide + tiles architecture based on DINOv3 (ViT-L/16), in which the global context of the full image is combined with local vegetation texture features extracted from image tiles through attention pooling and feature fusion. The main contributions of this work are as follows: (1) the use of a field-collected dataset of 1196 smartphone RGB images paired with clipping-based biomass measurements from 40 administrative districts across five arid and semi-arid regions of southern Kazakhstan; (2) the development of a DINOv3-based wide + tiles framework for integrating global scene context and local vegetation structure from a single ground-level image; (3) the systematic evaluation of original, masked, and cropped preprocessing strategies, together with the effect of image resolution on prediction accuracy; and (4) the assessment of the practical potential and limitations of smartphone-based biomass estimation under heterogeneous real-world pasture conditions without the need for specialized equipment.

2. Materials and Methods

The study was conducted across five administrative regions of southern Kazakhstan (Kyzylorda, Turkestan, Zhambyl, Almaty, and Zhetysu), covering a total area of approximately 709.5 thousand km² and encompassing 40 administrative districts (Figure 1). The climate of the study area is predominantly arid and semi-arid, characterized by low annual precipitation, high summer temperatures, and pronounced seasonal variability in vegetation cover. The primary land-use type examined in the study was pasture. The vegetation cover was predominantly characterized by grass–wormwood–mixed herbaceous communities, while soil conditions were mainly associated with gray soils that were weakly saline and composed of fine-textured loamy materials. These conditions result in high spatial heterogeneity of aboveground biomass across districts and seasons, posing a significant challenge for image-based estimation models.

For model training and comparative experiments, each sampling location was associated with a wide-field RGB image captured using a smartphone. In total, 1196 images were included in the analysis. In addition to vegetation cover, the images contained elements such as the sky, bare soil, terrain features, distant background, and occasional extraneous objects. This visual heterogeneity increases the practical value of the dataset, as it allows models to be evaluated under conditions approximating real-world field environments, rather than exclusively under laboratory or strictly controlled settings.

Field Research Methodology

To develop permissible grazing load standards for agricultural livestock on pastures in the Almaty, Zhambyl, Zhetysu, Kyzylorda, and Turkestan regions of the Republic of Kazakhstan at the district scale, the following activities were carried out [36,37,38,39,40,41,42,43]:

Development of digital maps of pasture lands by administrative districts;
Selection and thematic processing of remote sensing data to identify classification units by types of pasture vegetation;
Preparation of test plots with presumed types of pasture vegetation and preliminary analysis of the study area;
Conducting ground-based geobotanical surveys according to the approved methodology within test plots by districts;
Development of scientifically grounded livestock grazing load standards by districts based on spatial analysis of pasture vegetation types.

The object-cartographic method was applied in organizing the research. This method made it possible to generate spatial data arrays for a detailed description of vegetation with administrative referencing and subsequent classification of pastures within a geographic information system (GIS). Remote sensing data played a key role, including preliminary and thematic processing of satellite imagery, classification of vegetation cover, selection of optimal interpretation algorithms, and spatial analysis.

For geobotanical surveys, routes were designed, and test plots were selected, with at least 30 plots per district. Their selection was based on the results of remote vegetation classification and the identification of spectral homogeneity of vegetation formations. The description of control plots was carried out according to the Methodology. During the surveys, coordinates, landscape-geographical zone, soil type, plant species composition, projective cover, and other field parameters were recorded (Figure 2).

Ground-based imagery and biomass measurements were collected during the June–August period of 2024–2025. At each observation point, multiple photographs were captured; however, a wide-field RGB image was selected as the primary input for modeling. The images were acquired using a smartphone camera under natural lighting conditions, at an approximate height of 1.2–1.5 m, and oriented toward the ground surface, with an average spatial resolution of 4000 × 3000 pixels. During field acquisition, variations in illumination, shadowing, background, and terrain were not constrained, thereby preserving the full extent of natural variability. Examples of field images are presented in Figure 3.

Biomass values were determined using the clipping method [44], whereby vegetation at each observation point was completely harvested, and its productivity was calculated in centners per hectare (c/ha). Each image was paired with the corresponding biomass value, resulting in a dataset structured for a supervised regression task. The biomass measurements exhibited a wide range of variation (mean: 3.63 c/ha; standard deviation: 2.41 c/ha; minimum: 0.0 c/ha; maximum: 13.75 c/ha), reflecting the inherent complexity of the problem.

To assess the influence of background information on biomass prediction, three preprocessing strategies were considered: original full images (original), masked images (masked), and cropped images (cropped). These approaches were designed to enhance informative features associated with vegetation cover while reducing the impact of background elements irrelevant to biomass estimation. Examples of the preprocessing stages are presented in Figure 4.

In the baseline strategy, the original RGB images were used without modification, thereby preserving the spatial context of the scene. In the second strategy, semantic segmentation based on the SegFormer architecture was applied to attenuate irrelevant regions—such as the sky, mountains, roads, and other non-informative areas—while primarily retaining vegetation cover and the ground surface. In the third strategy, guided by the segmentation results, cropping was performed on the lower, relevant portion of the frame, reducing extraneous information from the upper and distant background.

Although the primary objective of preprocessing was to reduce background influence, spatial context may, in certain cases, serve as an important predictive feature. Therefore, the proposed strategies were empirically compared, and their impact on biomass prediction performance was systematically evaluated.

In the study, the wide + tiles strategy was employed to leverage both global and local visual information simultaneously for biomass prediction. The full image was used as the primary input, preserving the overall scene context, while additional local information was obtained by dividing the image into multiple smaller regions. This approach allows the model to learn both the general structural layout of the scene and fine-grained local textural features concurrently.

For the original and masked images, each image was partitioned into a 3 × 3 grid, resulting in 9 tiles, enabling detailed analysis of local structures. For cropped images, where the relevant region had already been preselected, the image was divided into a 2 × 3 grid, producing 6 tiles, which avoids excessive fragmentation while retaining sufficient local information.

The same image was passed to a shared backbone network both as global context (wide) and as multiple local patches (tiles) in parallel. The backbone was DINOv3 (ViT-L/16), which yields a 1024-dimensional feature vector for each input. Tile features were aggregated into a single vector via attention pooling and then concatenated with the wide feature, so that the model can use multi-scale information. The resulting 2048-dimensional vector was processed through a two-layer MLP (512 units, ReLU, Dropout), and the final linear layer predicted the biomass value. A schematic of the architecture is shown in Figure 5.
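The grid partitioning described above can be illustrated with a short sketch. It is not the authors' implementation: the function name `tile_boxes`, the reading of "2 × 3" as 2 rows by 3 columns, the absence of tile overlap, and the cropped image size are all assumptions made for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into rows x cols non-overlapping tiles."""
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    # Shared integer edges guarantee that tiles cover the image without gaps
    x_edges = [round(c * width / cols) for c in range(cols + 1)]
    y_edges = [round(r * height / rows) for r in range(rows + 1)]
    return [
        (x_edges[c], y_edges[r], x_edges[c + 1], y_edges[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]


# 3 x 3 grid for original and masked images (9 tiles)
full_tiles = tile_boxes(4000, 3000, rows=3, cols=3)

# 2 x 3 grid for cropped images (6 tiles); the 4000 x 1500 size is hypothetical
crop_tiles = tile_boxes(4000, 1500, rows=2, cols=3)
```

Each box can then be cropped from the image and resized to the tile input resolution before being passed to the backbone.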

The architecture can be formally described by Equation (1):

$$\hat{y} = \mathrm{MLP}\left(\left[\, f_{\mathrm{wide}} \,\|\, \mathrm{AttnPool}\left(\{f_{t_k}\}_{k=1}^{K}\right) \right]\right), \tag{1}$$

where $f_{\mathrm{wide}} = \phi(x_{\mathrm{wide}})$ denotes the global feature vector, $f_{t_k} = \phi(x_{t_k})$ is the feature vector of the $k$-th tile, $\phi$ represents the shared backbone network, $K$ is the number of tiles, and $\|$ denotes the concatenation operation.

The attention pooling mechanism aggregates tile features as a weighted sum, with weights obtained by a softmax over per-tile importance scores:

$$\mathrm{AttnPool}\left(\{f_{t_k}\}_{k=1}^{K}\right) = \sum_{k=1}^{K} \alpha_k f_{t_k}, \qquad \alpha_k = \frac{\exp\left(s(f_{t_k})\right)}{\sum_{j=1}^{K} \exp\left(s(f_{t_j})\right)}, \tag{2}$$

where $s(\cdot)$ is the linear scoring function with Tanh activation.
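A minimal pure-Python sketch of Equations (1) and (2) on toy vectors may help make the data flow concrete. It assumes the scoring function has the form s(f) = tanh(w · f + b); the source only states that it is linear with Tanh activation, so the exact layer layout, the parameter names `w` and `b`, and the 4-dimensional toy features are illustrative assumptions. In the described model each feature is 1024-dimensional and the fused vector is 2048-dimensional.

```python
import math
from typing import List, Sequence, Tuple


def score(f: Sequence[float], w: Sequence[float], b: float) -> float:
    """Assumed scoring function: s(f) = tanh(w . f + b)."""
    return math.tanh(sum(wi * fi for wi, fi in zip(w, f)) + b)


def attn_pool(
    tiles: List[Sequence[float]], w: Sequence[float], b: float = 0.0
) -> Tuple[List[float], List[float]]:
    """Return (pooled vector, attention weights) following Equation (2)."""
    scores = [score(f, w, b) for f in tiles]
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]  # softmax over the K tiles
    dim = len(tiles[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, tiles)) for d in range(dim)]
    return pooled, alphas


def fuse(
    f_wide: Sequence[float],
    tiles: List[Sequence[float]],
    w: Sequence[float],
    b: float = 0.0,
) -> List[float]:
    """Concatenate the wide feature with the pooled tile feature (Equation (1))."""
    pooled, _ = attn_pool(tiles, w, b)
    return list(f_wide) + pooled


# Toy example: one wide feature and K = 3 tile features of dimension 4
f_wide = [0.2, -0.1, 0.4, 0.0]
tiles = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.3, -0.2, 0.9], [0.4, 0.4, 0.4, 0.4]]
w = [0.5, -0.25, 0.1, 0.3]
fused = fuse(f_wide, tiles, w)  # this vector would be the input to the MLP head
```

Because the weights are a softmax, they are positive and sum to one, so the pooled vector stays within the range spanned by the tile features regardless of K.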

To enhance the model’s robustness to heterogeneous field conditions and to prevent overfitting on the limited dataset, a comprehensive augmentation strategy was implemented. The augmentations combined geometric, photometric, and vegetation-specific spectral transformations. For both the wide and tile images, augmentations were applied independently, yet within the same typological framework. Table 1 presents all applied augmentations, along with their parameters and associated probabilities of application (Figure 6).

The dataset of 1196 samples was divided into training and validation subsets using a geographically constrained 80/20 split at the administrative-district level. The split was not performed by randomly assigning individual images independently. Instead, samples were first grouped according to their administrative district, and selected districts were held out entirely for validation. All samples from each held-out district were assigned only to the validation subset, while samples from the remaining districts were used for training. As a result, no administrative district was shared between the training and validation subsets. This district-level holdout strategy was used to reduce spatial leakage and to provide a more realistic assessment of model generalization to geographically distinct pasture areas. The administrative districts assigned to the training and validation subsets are listed in Table 2. The held-out validation districts were not used during model training or model selection.
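The district-level holdout can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the source lists its held-out districts in Table 2 and does not say how they were chosen, so the random selection, the function name `district_holdout_split`, and the synthetic district labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def district_holdout_split(
    districts: Sequence[str], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, val_idx) such that no district appears in both."""
    by_district: Dict[str, List[int]] = defaultdict(list)
    for i, name in enumerate(districts):
        by_district[name].append(i)

    names = sorted(by_district)
    random.Random(seed).shuffle(names)  # assumed: districts are picked at random

    target = val_fraction * len(districts)
    train_idx: List[int] = []
    val_idx: List[int] = []
    for name in names:
        # Whole districts go to validation until the target size is reached
        if len(val_idx) < target:
            val_idx.extend(by_district[name])
        else:
            train_idx.extend(by_district[name])
    return sorted(train_idx), sorted(val_idx)


# Synthetic labels: 40 hypothetical districts with 30 samples each
labels = [f"district_{d:02d}" for d in range(40) for _ in range(30)]
train_idx, val_idx = district_holdout_split(labels, val_fraction=0.2, seed=0)
```

Splitting by group rather than by image is what prevents near-duplicate scenes from the same district appearing on both sides of the split.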

The AdamW optimizer was employed with a weight decay of 10⁻² and gradient clipping (maximum norm = 1.0). During training, CUDA Automatic Mixed Precision (AMP) was enabled to improve computational efficiency. The weights of the best-performing model were automatically saved based on the maximum R² value on the validation set. All experiments were conducted on an NVIDIA GeForce RTX 4080 Laptop GPU with 12 GB of VRAM.

The choice of backbone models was guided by the objective of comparing two strong but conceptually different families of modern visual representations. DINOv3 was selected as the primary backbone because self-supervised Vision Transformer representations have demonstrated strong transferability to downstream visual tasks, particularly when labeled domain-specific datasets are limited. This property is relevant for pasture biomass estimation, where collecting field images paired with destructive biomass measurements is labor-intensive. In addition, the self-attention mechanism of Vision Transformers allows the model to capture long-range spatial relationships and global scene context, which may be important for heterogeneous ground-level pasture images. ConvNeXtV2 was selected as a modern convolutional baseline because it represents a strong CNN-type architecture with improved feature extraction capacity compared with earlier convolutional models. Therefore, comparing DINOv3 with ConvNeXtV2 allowed us to evaluate whether transformer-based representations provide an advantage over a recent high-performing convolutional alternative under identical experimental conditions.

For biomass prediction, the Vision Transformer-based DINOv3 was employed as the primary model [32], while ConvNeXtV2 was introduced for comparison [35]. The inputs consisted of the wide-view image (wide image) and the corresponding local patches (tiles) derived from it. The wide branch captured the global context, whereas the tile features encoded local textural characteristics of the vegetation cover. Tile features were aggregated via attention pooling and subsequently fused with the wide feature through feature fusion [34]. The combined features were processed through a multilayer perceptron, and biomass values were predicted using a regression head. The ConvNeXtV2 model followed the same architecture, implemented solely by replacing the backbone.

Additionally, a multi-task approach was explored, where regression and interval-based classification were performed simultaneously. Target values were normalized using a log1p transformation and discretized into intervals for classification purposes. Furthermore, DINO-style self-supervised pretraining was applied: the backbone was pretrained using a multi-crop strategy and subsequently adapted via fine-tuning [33].
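The target preparation for the multi-task variant can be sketched as below. The source does not report the interval boundaries or whether binning was applied to raw or log-transformed values, so the bin edges in `BIN_EDGES_C_HA` and the choice to bin raw c/ha values are purely hypothetical.

```python
import math
from bisect import bisect_right
from typing import Sequence

# Hypothetical interval edges in c/ha; the source does not report the bins used
BIN_EDGES_C_HA = [1.0, 2.5, 5.0, 8.0]


def to_log_target(y_c_ha: float) -> float:
    """Regression target: log(1 + y), well defined for y = 0."""
    return math.log1p(y_c_ha)


def from_log_target(z: float) -> float:
    """Invert the log1p transform to recover biomass in c/ha."""
    return math.expm1(z)


def to_class(y_c_ha: float, edges: Sequence[float] = BIN_EDGES_C_HA) -> int:
    """Classification target: index of the interval that contains y."""
    return bisect_right(edges, y_c_ha)


# Reported range of the biomass values: 0.0 to 13.75 c/ha, mean 3.63 c/ha
example = [(y, to_log_target(y), to_class(y)) for y in (0.0, 3.63, 13.75)]
```

The log1p transform compresses the long upper tail of the biomass distribution, and predictions are mapped back to c/ha with `from_log_target` before computing error metrics in original units.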

During model training, the Smooth L1 loss was employed as the loss function [45]. This function behaves quadratically for small errors and linearly for large deviations:

$$L(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^{2}, & |y - \hat{y}| < 1 \\ |y - \hat{y}| - \frac{1}{2}, & |y - \hat{y}| \geq 1 \end{cases} \tag{3}$$

Here, $y$ denotes the true biomass value, and $\hat{y}$ represents the model prediction. The Smooth L1 loss was chosen because, compared to pure mean squared error (MSE), it is less sensitive to outliers, and compared to the mean absolute error (MAE), it provides a smooth gradient near zero, ensuring stable gradient-based training. Given that the biomass dataset spans a wide range of values from 0.0 to 13.75 c/ha, this property proved practically important.
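Equation (3) translates directly into a few lines of code. The sketch below is a plain-Python illustration with a threshold of 1, as in the equation; the function names are illustrative, and averaging over samples is an assumption about how the per-sample losses are reduced.

```python
from typing import Sequence


def smooth_l1(y: float, y_hat: float) -> float:
    """Per-sample Smooth L1 loss with threshold 1, as in Equation (3)."""
    diff = abs(y - y_hat)
    if diff < 1.0:
        return 0.5 * diff * diff  # quadratic branch for small errors
    return diff - 0.5  # linear branch for large errors


def mean_smooth_l1(ys: Sequence[float], y_hats: Sequence[float]) -> float:
    """Average the per-sample loss over a batch (assumed reduction)."""
    if len(ys) != len(y_hats) or not ys:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(smooth_l1(y, p) for y, p in zip(ys, y_hats)) / len(ys)


small_error = smooth_l1(3.0, 3.5)  # 0.5 * 0.5**2
large_error = smooth_l1(3.0, 5.0)  # 2.0 - 0.5
```

The two branches meet at |y − ŷ| = 1 with the same value (0.5) and the same slope, which is why the gradient stays bounded for large errors while remaining smooth near zero.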

To provide simpler and more interpretable reference baselines, additional machine learning models were trained using handcrafted image features extracted only from the smartphone RGB images. No field-derived variables, such as projective cover, forage capacity, vegetation type, soil type, geographic coordinates, or other metadata, were used as predictors in these baselines. For each image, RGB channel statistics, RGB-based vegetation color indices, gray-level co-occurrence matrix (GLCM) texture descriptors, and coarse 3 × 3 spatial features were extracted. RGB statistics included channel-wise mean, standard deviation, median, minimum, maximum, and percentile values. Vegetation indices included ExG, ExR, ExGR, NGRDI, VARI, GLI, and RGBVI. Texture information was represented using GLCM descriptors, including contrast, dissimilarity, homogeneity, energy, correlation, and ASM. Coarse spatial features were calculated by dividing each image into a 3 × 3 grid and extracting local RGB and vegetation-related summaries from each grid cell. These handcrafted RGB-derived features were used to train Random Forest, XGBoost, and CatBoost regression models. All baseline models were trained and evaluated using the same administrative-district-level training and validation split as the deep learning models.
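As an illustration of the vegetation color indices listed above, the sketch below uses their commonly cited definitions. The source does not give the formulas it used, so these definitions are assumptions. Computing the indices from channel means rather than per pixel, and the helper names, are also simplifications made here.

```python
from typing import Dict


def _safe_div(num: float, den: float, eps: float = 1e-9) -> float:
    """Divide, returning 0.0 when the denominator is numerically zero."""
    return num / den if abs(den) > eps else 0.0


def rgb_vegetation_indices(red: float, green: float, blue: float) -> Dict[str, float]:
    """Compute RGB-based vegetation indices from channel values (e.g. image means)."""
    total = red + green + blue
    # Chromatic coordinates used by ExG and ExR
    r = _safe_div(red, total)
    g = _safe_div(green, total)
    b = _safe_div(blue, total)
    exg = 2.0 * g - r - b
    exr = 1.4 * r - g
    return {
        "ExG": exg,
        "ExR": exr,
        "ExGR": exg - exr,
        "NGRDI": _safe_div(green - red, green + red),
        "VARI": _safe_div(green - red, green + red - blue),
        "GLI": _safe_div(2.0 * green - red - blue, 2.0 * green + red + blue),
        "RGBVI": _safe_div(green**2 - blue * red, green**2 + blue * red),
    }


# A green-dominated patch versus a neutral gray patch (8-bit channel means)
green_patch = rgb_vegetation_indices(50.0, 150.0, 50.0)
gray_patch = rgb_vegetation_indices(100.0, 100.0, 100.0)
```

All greenness indices are positive for the green patch and close to zero for the gray patch, which is the contrast a tree-based regressor can exploit as a proxy for vegetation cover.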

To evaluate and compare model performance, three standard metrics were employed for the regression task. For a validation set comprising $n$ observations, with true values $y_i$ and predicted values $\hat{y}_i$, the metrics are calculated as follows.

The Mean Absolute Error (MAE) represents the average magnitude of prediction errors in the original units (c/ha) and is robust to outliers:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{4}$$

The Root Mean Squared Error (RMSE) amplifies larger deviations, making practically significant large errors more pronounced and readily detectable:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}} \tag{5}$$

The coefficient of determination ($R^{2}$) measures the proportion of variance in the true values explained by the model; $R^{2} = 1$ corresponds to perfect prediction, whereas $R^{2} \leq 0$ indicates a model no better than a naive baseline that always predicts the mean of the observed values:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^{2}}, \tag{6}$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the mean of observed values. During training, $R^{2}$ on the validation set was used as the primary monitoring metric: the weights of the best-performing model were saved at the point where this metric reached its maximum. In model selection and hyperparameter tuning, $R^{2}$ served as the primary criterion, while MAE and RMSE were used additionally for practical interpretation.
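The three metrics in Equations (4)–(6) can be computed together in a few lines. The sketch below is a plain-Python illustration; the function name and the toy values are not from the source.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE and R^2 following Equations (4)-(6)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}


# Toy values in c/ha, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
metrics = regression_metrics(y_true, [1.5, 2.5, 2.5, 3.5])
# A constant predictor of the mean gives R^2 = 0 by construction
baseline = regression_metrics(y_true, [2.5, 2.5, 2.5, 2.5])
```

In the toy case every prediction is off by 0.5 c/ha, so MAE and RMSE coincide; RMSE exceeds MAE only when the errors differ in magnitude.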

3. Results

3.1. Ablation Study

Based on the proposed architecture, several configurations were compared. These configurations differ in terms of backbone type, input image strategy, tile grid, and input resolution. All results are summarized in Table 3.

All metrics in Table 3 were computed on the geographically held-out district-level validation subset described in Section 2 and detailed in Table 2.

In the overall comparison, configurations based on DINOv3 demonstrated a clear performance advantage over ConvNeXtV2. The highest R² values were achieved with the masked input (R² = 0.733) and the original input (R² = 0.732), which can be considered practically equivalent. Across all DINOv3 configurations, the MAE remained at approximately 0.8 c/ha, whereas the worst performance was observed for ConvNeXtV2 (MAE = 0.951 c/ha, R² = 0.614).

To directly evaluate the contribution of the proposed wide + tiles design, additional branch-level ablation experiments were conducted using the original-image configuration. The wide-only model, which used only the full input image, achieved MAE = 0.978 c/ha, RMSE = 1.305 c/ha, and R² = 0.676. The tiles-only model, which used only local image patches, achieved MAE = 0.794 c/ha, RMSE = 1.268 c/ha, and R² = 0.696. The full wide + tiles configuration achieved the best performance, with MAE = 0.774 c/ha, RMSE = 1.186 c/ha, and R² = 0.732. These results indicate that local tile-level features are more informative than global context alone, while the combination of global and local information provides the highest predictive accuracy. Therefore, the direct ablation supports the contribution of the proposed wide + tiles architecture to biomass prediction performance.

The present ablation study did not include comparisons with lightweight architectures such as MobileNetV3, EfficientNetV2-S, or MobileViT. This represents a limitation of the current experimental design. The primary objective of this work was to establish the accuracy potential of ground-level smartphone-based biomass estimation using state-of-the-art pretrained visual representations, rather than to optimize inference efficiency. Because the proposed method was designed primarily for offline or batch processing of field images, computational cost was not treated as the main constraint in the present study. Nevertheless, developing lightweight alternatives that approach the accuracy of DINOv3-based models remains an important direction for future deployment on mobile or edge devices.

DINOv3 (ViT-L/16) and ConvNeXtV2-Large were compared under identical experimental conditions (cropped input, 2 × 3 grid, wide = 1024 px, tile = 896 px). DINOv3 achieved R² = 0.731, while ConvNeXtV2 obtained R² = 0.614, a performance gap of ΔR² = 0.117. In terms of MAE, DINOv3 reduced the error from 0.951 to 0.794 c/ha.

These results indicate that Vision Transformer representations obtained via self-supervised pretraining outperform ConvNeXt-type CNN features in pasture biomass prediction. The global attention mechanism inherent to the ViT-based DINOv3 model more effectively captures heterogeneous textural patterns of vegetation cover, whereas the locally sensitive convolutional operations of ConvNeXtV2 are less capable of fully modeling contextual relationships in pasture imagery.

The results across the three input strategies are as follows: original (R² = 0.732), masked (R² = 0.733), and cropped (R² = 0.731). All three configurations were implemented using the same DINOv3 backbone, identical training hyperparameters, and high input resolution (wide = 1024 px).

The difference between masking (R² = 0.733) and the original full image (R² = 0.732) is negligible (ΔR² ≈ 0.001) and not statistically significant. This is an important finding: complete removal of the background does not lead to a substantial improvement in biomass prediction. It is likely that background elements such as the sky, soil, and terrain exhibit some degree of correlation with biomass values (for example, the relationship between terrain slope and pasture density), and thus their complete removal may result in a slight loss of informative context.

In addition, attention maps were analyzed for both masked and original unmasked input configurations (Figure 7 and Figure 8). The masked configuration shows the model’s focus after background suppression, whereas the original unmasked configuration retains the full scene context, including vegetation, soil, sky, terrain, and distant background elements. In the masked configuration, attention was concentrated mainly on vegetation and ground-surface regions (Figure 7). Importantly, the attention map generated from the original unmasked image showed a similar focus on vegetation-dominated areas, while sky and distant background regions received lower attention (Figure 8). This provides a stronger qualitative indication that the model can identify biomass-relevant regions even without segmentation-based background removal. Nevertheless, attention maps are interpreted as qualitative diagnostic tools rather than as formal proof of background-invariant behavior.

The cropped input strategy also achieved comparable performance (R² = 0.731). However, this configuration exhibited slightly lower training stability: occasional performance drops were observed in certain epochs, and the 2 × 3 grid provides fewer tiles (6) than the 3 × 3 configuration (9).

Overall, the results across all three strategies were closely aligned, indicating that the model demonstrates robust performance regardless of the chosen input preprocessing approach. Consequently, it is feasible to obtain highly accurate biomass predictions even when using raw, unprocessed images as input.

High resolution (wide = 1024 px, tile = 896 px) was compared with lower resolution (wide = 896 px, tile = 640 px). With the original input and a 3 × 3 grid, the higher resolution yielded R² = 0.732, whereas the lower resolution yielded R² = 0.671. This difference (ΔR² = 0.061) is of substantial practical significance.

In pasture imagery, local textural features—such as individual plant stems, the spacing between leaves, and surface-level ground patterns—may be directly associated with biomass values. At lower resolutions, these features are degraded during interpolation, thereby reducing the informativeness of the resulting feature vectors. Consequently, it was concluded that the use of high-resolution input is critically important for the task of pasture biomass prediction.

Two auxiliary strategies were evaluated: (1) a combination of regression and interval classification (reg + bin cls, with a log1p-transformed target), and (2) DINO-style self-supervised pretraining (SSL).

The reg + bin cls configuration incorporated log1p normalization and an additional 8-bin classification head (α_cls = 0.05). This setup achieved an R² of 0.623 and an MAE of 0.868 c/ha, substantially worse than the supervised DINOv3 baseline (R² = 0.732). During training, the regression loss improved rapidly, but the classification component performed poorly, and R² showed only marginal improvement before plateauing. The discretization into eight intervals did not improve performance, while the log1p transformation compressed the dynamic range of the target variable, thereby diminishing the contribution of high biomass values to the loss function.
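The structure of such a combined objective can be sketched as follows. This is a minimal illustration and not the study's implementation: the squared-error regression term, the equal-width bins in log1p space, the upper bound of 13.75 c/ha (the observed biomass range), and the function names are all assumptions made for the example; only the log1p target, the eight bins, and α_cls = 0.05 come from the text.

```python
import math


def bin_index(y, y_max=13.75, n_bins=8):
    """Map a biomass value to one of n_bins equal-width intervals in log1p space.

    Assumption: equal-width bins up to y_max; the actual bin edges may differ.
    """
    t = math.log1p(y) / math.log1p(y_max)   # position in [0, 1]
    return min(int(t * n_bins), n_bins - 1)


def combined_loss(pred_log, logits, y, alpha_cls=0.05):
    """Squared error on the log1p target plus alpha_cls times cross-entropy."""
    reg = (pred_log - math.log1p(y)) ** 2
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    ce = log_z - logits[bin_index(y)]
    return reg + alpha_cls * ce


# Perfect regression output with an uninformative (uniform) classifier:
loss = combined_loss(math.log1p(3.63), [0.0] * 8, 3.63)
```

The compression effect mentioned above is visible directly in the target: `math.log1p(10) - math.log1p(9)` is about 0.095, whereas `math.log1p(1) - math.log1p(0)` is about 0.693, so a 1 c/ha error at high biomass contributes far less to a log-space loss than the same error near zero.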

In the SSL pretraining scenario, the DINOv3 backbone was pretrained using a DINO-style approach (multi-crop augmentation with a teacher–student exponential moving average framework), followed by regression fine-tuning. The resulting performance (R² ≈ 0.598) was considerably inferior to that of the ImageNet-pretrained supervised DINOv3 model (R² = 0.732). This outcome may be attributed to the relatively limited size of the SSL pretraining dataset (approximately 2000 images) and the high computational demands of the training process. In domain-specific tasks involving ground-level imagery, surpassing ImageNet pretraining typically requires substantially larger SSL datasets. The overall results of these auxiliary strategy configurations are presented in Table 4.
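For readers unfamiliar with the teacher–student scheme mentioned above, the exponential-moving-average step can be sketched in a few lines. This is a generic illustration on plain lists of weights; the function name and the momentum value of 0.996 (a commonly used starting value in DINO-style training) are assumptions, not settings reported in this study.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return the teacher weights after one exponential-moving-average step.

    Assumption: momentum=0.996 is illustrative only; it is not a value
    reported in the study.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student)   # teacher moves slightly toward the student
```

Because the teacher changes only by a small fraction per step, it provides slowly varying targets for the student. Learning useful features this way generally takes many updates over a large and varied image collection, which is consistent with the limited benefit observed from roughly 2000 pretraining images.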

Neither of the evaluated auxiliary strategies outperformed the supervised DINOv3 baseline pretrained on ImageNet. This result indicates that fine-tuning a high-quality pretrained backbone on the available dataset remains an effective approach.

To determine whether the proposed deep learning framework provides a meaningful improvement over conventional RGB feature-based biomass estimation methods, additional interpretable baseline models were evaluated. Random Forest, XGBoost, and CatBoost regressors were trained using the same handcrafted RGB-derived feature set, including color statistics, vegetation indices, GLCM texture descriptors, and coarse 3 × 3 spatial features. These models were evaluated on the same geographically held-out validation subset used for the deep learning models. The results are presented in Table 5.

The classical RGB feature-based baselines achieved substantially lower predictive performance than the proposed DINOv3 masked wide + tiles configuration. Among the handcrafted-feature baselines, CatBoost performed best, with MAE = 1.461 c/ha, RMSE = 2.223 c/ha, and R² = 0.085. However, this performance remained considerably below the DINOv3 model, which achieved MAE = 0.779 c/ha, RMSE = 1.183 c/ha, and R² = 0.733 on the same validation subset. These results indicate that RGB statistics, vegetation color indices, texture descriptors, and coarse spatial summaries contain some biomass-related information, but they are insufficient to capture the complex spatial and structural patterns present in heterogeneous pasture images. The superior performance of the DINOv3 framework suggests that pretrained visual representations and the wide + tiles design provide a clear advantage over conventional handcrafted RGB feature-based approaches.
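To make the notion of handcrafted RGB features concrete, the sketch below computes one widely used colour index, Excess Green (ExG), and averages it over a coarse 3 × 3 grid. It is an illustrative example only: the specific indices used for the baselines are not listed in this section, so the choice of ExG, the function names, and the plain-list image format are assumptions.

```python
def excess_green(r, g, b):
    """Excess Green index, 2g - r - b, on chromatic (sum-normalized) coordinates."""
    total = r + g + b
    if total == 0:
        return 0.0          # black pixel: index undefined, treated as 0 here
    return (2 * g - r - b) / total


def grid_mean_exg(image, rows=3, cols=3):
    """Mean ExG per cell of a rows x cols grid.

    `image` is a list of pixel rows, each pixel an (r, g, b) tuple, and must
    be at least rows x cols pixels in size.
    """
    h, w = len(image), len(image[0])
    features = []
    for i in range(rows):
        for j in range(cols):
            cell = [
                excess_green(*image[y][x])
                for y in range(i * h // rows, (i + 1) * h // rows)
                for x in range(j * w // cols, (j + 1) * w // cols)
            ]
            features.append(sum(cell) / len(cell))
    return features


# Toy 3 x 3 image of pure green pixels (not study data).
green_image = [[(0, 255, 0)] * 3 for _ in range(3)]
features = grid_mean_exg(green_image)
```

Features of this kind summarize colour within large image regions and discard fine spatial structure, which is one plausible reason why tree-based regressors trained on them fall well short of the pretrained representations.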

To visually assess model prediction performance, three diagnostic plots were employed: a scatter plot of predicted versus observed values (Figure 9), a histogram of residuals (Figure 10), and a residuals-versus-predicted-values scatter plot (Figure 11). All models are ordered by increasing R² from (a) to (z).

The scatter plots (Figure 9) clearly illustrate the general patterns of model performance: as R² increases, the points cluster more tightly around the ideal agreement line y = ŷ, and the slope of the linear approximation approaches unity. Models with lower R² values (e.g., DINOv3 (SSL) and ConvNeXtV2) exhibit weak correlation between predictions and observations and higher dispersion. The best-performing model, DINOv3 masked 3 × 3 (R² = 0.733, MAE = 0.779 c/ha, RMSE = 1.183 c/ha), shows the densest clustering around the ideal line and the lowest dispersion among all configurations.

Furthermore, a systematic pattern is observed across all models: the slope of the regression line is consistently less than one. This phenomenon is attributed to the inherent imbalance in field data: high biomass values (y > 10 c/ha) are underrepresented in the dataset. Consequently, models tend to regress toward the more frequent intermediate values. This effect leads to systematic overestimation in the lower-value range (y < 1 c/ha) and underestimation in the higher-value range (y > 10 c/ha).
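The link between a slope below one and errors at both extremes can be illustrated with a small least-squares fit. The numbers below are toy values, not study data, and the sketch assumes that predictions are plotted on the vertical axis against observations on the horizontal axis.

```python
def ols_slope(observed, predicted):
    """Slope b of the least-squares line predicted = a + b * observed."""
    n = len(observed)
    mx = sum(observed) / n
    my = sum(predicted) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(observed, predicted))
    sxx = sum((x - mx) ** 2 for x in observed)
    return sxy / sxx


# Toy values: predictions shrunk halfway toward an intermediate value of 5 c/ha.
observed = [0.5, 2.0, 5.0, 8.0, 12.0]
predicted = [5.0 + 0.5 * (y - 5.0) for y in observed]
slope = ols_slope(observed, predicted)
```

Here the slope is 0.5: the lowest observation (0.5 c/ha) is predicted as 2.75 c/ha and the highest (12 c/ha) as 8.5 c/ha, reproducing the overestimation at low values and underestimation at high values described above.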

The residuals distribution histogram (Figure 10) shows an approximately symmetric distribution around zero for all models, confirming the absence of pronounced systematic bias. The best performance is observed in the DINOv3 masked 3 × 3 model, with a mean residual of 0.162 c/ha, a standard deviation of 1.172 c/ha, and a skewness coefficient of 0.723. The slight positive skew present across all models corresponds to the underestimation effect observed at higher biomass values.
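The three residual statistics quoted above can be computed as in the following sketch. It assumes residuals are defined as observed minus predicted (so that underestimation at high biomass gives positive residuals and a positive skew) and uses population rather than sample moments; the study may have used a different convention, and the example values are illustrative only.

```python
import math


def residual_summary(observed, predicted):
    """Mean, population standard deviation, and skewness of residuals y - y_hat."""
    res = [y - p for y, p in zip(observed, predicted)]
    n = len(res)
    mean = sum(res) / n
    var = sum((r - mean) ** 2 for r in res) / n
    std = math.sqrt(var)
    skew = sum((r - mean) ** 3 for r in res) / n / std ** 3
    return mean, std, skew


# Toy data: accurate at low values, one underestimated high-biomass sample.
mean, std, skew = residual_summary([1.0, 2.0, 3.0, 12.0], [1.0, 2.0, 3.0, 8.0])
```

In this toy case a single large underestimate produces a positive mean residual and a positive skew, the same qualitative signature reported for the evaluated models.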

The residuals-versus-predicted-values scatter plot (Figure 11) shows that absolute residuals are distributed without a strong visual dependence on predicted values. However, this plot should be interpreted cautiously, because the same absolute error may have different practical implications at different biomass levels. In particular, an error of approximately 0.8 c/ha is relatively more important in low-biomass areas than in high-biomass areas. Therefore, scale-independent and relative error measures were additionally considered to complement MAE, RMSE, and R².

To provide a scale-independent assessment of prediction accuracy, the Normalized Root Mean Squared Error (NRMSE) was calculated as NRMSE = RMSE/(y_max − y_min). For the best-performing configuration (DINOv3 masked 3 × 3), RMSE = 1.183 c/ha, and the observed biomass range was 0.0–13.75 c/ha. Therefore, NRMSE = 1.183/(13.75 − 0.0) = 0.086, indicating that the prediction error corresponds to approximately 8.6% of the full biomass range. This metric provides a scale-independent characterization of model accuracy across the heterogeneous biomass range of the dataset.

In addition, the relative mean absolute error (rMAE) was calculated by normalizing MAE by the mean observed biomass, rMAE = MAE/mean(y). For the best-performing configuration, with MAE = 0.779 c/ha and a mean observed biomass of 3.63 c/ha, rMAE was approximately 21.5%. This indicates that the average absolute prediction error corresponded to about one-fifth of the mean biomass level, providing a more practical interpretation of model accuracy than absolute error alone.

Overall, the error analysis indicates that the DINOv3 masked 3 × 3 configuration predicts biomass values in the 2–8 c/ha range with high accuracy. However, to improve performance at the extremes, it is recommended to incorporate additional data containing high and low biomass values or to employ loss functions that assign greater weight to rare observations.
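The two normalized error measures can be reproduced from the reported values with a short calculation. The function names are chosen for this example; the inputs are the figures given in the text.

```python
def nrmse(rmse, y_min, y_max):
    """RMSE normalized by the observed range of the target."""
    return rmse / (y_max - y_min)


def rmae(mae, mean_y):
    """MAE normalized by the mean observed value."""
    return mae / mean_y


print(round(nrmse(1.183, 0.0, 13.75), 3))    # 0.086
print(round(100 * rmae(0.779, 3.63), 1))     # 21.5
```

The two measures answer different questions: NRMSE relates the error to the full spread of biomass values, whereas rMAE relates it to a typical biomass level, which is why the second figure is considerably larger.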

3.2. Error Analysis Across Biomass Ranges

To further evaluate the practical reliability of the model, prediction errors were analyzed across three biomass ranges: low biomass (0–2 c/ha), medium biomass (>2–6 c/ha), and high biomass (>6 c/ha). This analysis was performed on the geographically held-out validation subset using the best-performing DINOv3 configuration. For each biomass range, the number of validation samples, mean observed biomass, MAE, RMSE, and relative MAE were calculated. Relative MAE was computed as MAE divided by the mean observed biomass within the corresponding biomass group. The results are presented in Table 6.

The biomass-range analysis shows that the model performed differently across biomass levels. The low-biomass group showed the lowest absolute error, with MAE = 0.281 c/ha and RMSE = 0.422 c/ha. The medium-biomass group represented the largest portion of the validation subset and showed higher error values, with MAE = 0.786 c/ha and RMSE = 1.109 c/ha. The high-biomass group had the highest absolute error in terms of MAE = 1.672 c/ha and RMSE = 2.059 c/ha, although its relative MAE was 21.4%, lower than the medium-biomass group. These results indicate that the model is most accurate in low-biomass conditions, while prediction uncertainty increases in medium- and high-biomass ranges. The higher errors at larger biomass values may be associated with greater structural complexity of vegetation and fewer high-biomass samples in the validation subset. Therefore, the proposed model is useful for rapid pasture monitoring and relative comparison across sites, but predictions in medium- and high-biomass conditions should be interpreted with additional caution.
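A range-wise breakdown of this kind can be sketched as follows. The group boundaries follow the definition given above (0–2, >2–6, and >6 c/ha); the function names and the toy inputs are assumptions made for illustration and are not study data.

```python
import math


def range_label(y):
    """Biomass group: low (0-2 c/ha), medium (>2-6 c/ha), high (>6 c/ha)."""
    if y <= 2.0:
        return "low"
    if y <= 6.0:
        return "medium"
    return "high"


def errors_by_range(observed, predicted):
    """Per-group sample count, mean biomass, MAE, RMSE, and relative MAE."""
    groups = {}
    for y, p in zip(observed, predicted):
        groups.setdefault(range_label(y), []).append((y, p))
    table = {}
    for name, pairs in groups.items():
        n = len(pairs)
        mean_y = sum(y for y, _ in pairs) / n
        mae = sum(abs(y - p) for y, p in pairs) / n
        rmse = math.sqrt(sum((y - p) ** 2 for y, p in pairs) / n)
        # Relative MAE is undefined when the group mean is zero.
        rel = mae / mean_y if mean_y > 0 else float("nan")
        table[name] = {"n": n, "mean": mean_y, "mae": mae, "rmse": rmse, "rmae": rel}
    return table


# Toy example with one sample per group.
table = errors_by_range([1.0, 4.0, 8.0], [1.5, 3.0, 6.0])
```

In the toy example the high group has twice the MAE of the medium group but the same relative MAE, because its mean biomass is also twice as large. The same mechanism explains how the high-biomass group in the study can show the largest absolute error and yet a lower relative MAE than the medium group.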

4. Discussion

This study proposed a novel approach for predicting pasture biomass from ground-level RGB imagery acquired with smartphones: a DINOv3 (ViT-L/16)-based wide + tiles architecture, integrated with attention pooling and a two-stage fine-tuning protocol. The proposed method demonstrated high accuracy under heterogeneous field conditions (R² = 0.733, MAE = 0.779 c/ha, RMSE = 1.183 c/ha), indicating that the model is robust to complex textural variations in real-world scenarios.

The comparison with Random Forest, XGBoost, and CatBoost baselines further supports the advantage of the proposed deep learning framework. Although handcrafted RGB statistics, vegetation indices, GLCM texture descriptors, and coarse spatial features provided interpretable reference models, their predictive performance was substantially lower than that of the DINOv3 masked wide + tiles configuration. The best handcrafted-feature baseline was CatBoost, which achieved MAE = 1.461 c/ha, RMSE = 2.223 c/ha, and R² = 0.085, whereas the DINOv3 masked wide + tiles model achieved MAE = 0.779 c/ha, RMSE = 1.183 c/ha, and R² = 0.733 on the same geographically held-out validation subset. This suggests that pretrained visual representations capture biomass-relevant spatial and structural patterns more effectively than conventional handcrafted RGB feature descriptors under heterogeneous field conditions.

These results suggest that the main advantage of the proposed approach is not only its predictive performance, but also its practical simplicity. The model achieved competitive accuracy using only smartphone RGB images, whereas many previously reported high-performing approaches require UAV acquisition, canopy height models, multispectral data, or controlled imaging conditions. The relatively small differences among preprocessing strategies suggest that the model was not strongly dependent on a single background-removal pipeline, which is important for real-world field deployment. The wide + tiles strategy provides additional information by combining large-scale and local features. However, the model systematically underestimates high biomass values (y > 10 c/ha), which can be attributed to the imbalance present in the dataset.

To place the proposed approach in context, its performance was compared with previous studies on pasture biomass estimation. UAV-based methods remain one of the dominant approaches in this domain because they can exploit high-resolution structural information. In study [6], RGB images acquired via UAVs were processed using a photogrammetric Structure-from-Motion (SfM) workflow to construct canopy height models (CHMs), achieving biomass prediction with R² values in the range of 0.59–0.81. However, this approach requires ground control points (GCPs) for georeferencing, specialized flight planning, and dedicated processing software. Similarly, Ref. [7] combined canopy height metrics with vegetation indices, yielding R² = 0.57–0.73, but prediction accuracy remained sensitive to illumination conditions and the precision of the digital terrain model (DTM).

Studies applying modern deep learning methods to UAV data have reported very high accuracy. For example, Ref. [29] reported R² values close to 0.94; however, these results were obtained under limited and highly controlled conditions. Likewise, Ref. [1] achieved R² = 0.93–0.94, yet its reliance on accurate DTMs restricts applicability in extensive natural pastures. Therefore, these studies provide an important performance benchmark, but their experimental conditions differ substantially from the low-cost ground-level smartphone scenario addressed in the present study.

In comparison, the proposed ground-level smartphone-based RGB approach (R² = 0.733, MAE = 0.779 c/ha) delivered competitive results without requiring specialized flight equipment, canopy height modeling, or controlled experimental setups. Moreover, it effectively captured the wide-ranging variability of natural pasture conditions across 40 districts in Southern Kazakhstan.

Satellite-based and multi-source remote sensing studies provide additional scientific context. In [12], UAV-acquired RGB images were used to compute vegetation indices, and a Random Forest model achieved R² = 0.73; however, UAV deployment is still mandatory for this approach. A satellite-based study [18] integrated Sentinel-1, Sentinel-2, and climatic data to achieve R² = 0.75, but the need to combine multiple data sources and the associated computational complexity may limit rapid field-level applicability. Similarly, Ref. [17] achieved R² = 0.76 using multispectral UAV data, yet the high cost of specialized sensors remains a major constraint. In contrast, smartphone RGB imagery provides inexpensive, high-resolution ground observations and can complement satellite or UAV-based monitoring where rapid local assessment is needed. Deep learning approaches, such as [22], achieved R² = 0.81 with a CNN–SE–Fourier architecture, but the study was restricted to data from a single region. Studies using smartphone imagery, e.g., [25], reported extremely high accuracy (R² = 0.98); however, these results were obtained under artificial lighting conditions (700–720 lux) and for a single crop type. Recent ground-based pasture studies further show that performance under natural field conditions is more challenging than under controlled imaging. Woodrow et al. [27] used mobile-phone quadrat photographs from diverse grazing lands and reported that individual quadrat-level errors remained large, although site-level estimates were more promising. Similarly, Mandal [28] showed strong performance of DINOv3-based models on the CSIRO Pasture Biomass benchmark, achieving weighted R² = 0.903, but the benchmark was based on standardized dual-view quadrat imagery rather than wide-field smartphone scenes. These studies support the relevance of ground-based RGB biomass estimation while also confirming the need for validation under uncontrolled natural pasture conditions.

Taken together, these comparisons indicate that the proposed method provides a practical balance between accuracy, cost, and field deployability. Although some UAV-based, multispectral, or controlled-environment studies report higher accuracy, the proposed approach achieves useful performance using only smartphone RGB images collected under heterogeneous natural pasture conditions. Several limitations should be considered when interpreting the results. The dataset is imbalanced, with high biomass values (>10 c/ha) being sparsely represented, leading the model to systematically underestimate in this range. From a practical perspective, this is important because underestimation in high-biomass areas may affect forage planning and grazing-load decisions. Additionally, the data were collected exclusively from southern regions of Kazakhstan, so the model’s applicability to other climatic zones has not been evaluated. Moreover, the proposed approach relies solely on RGB visual information, without multispectral or hyperspectral channels, which limits the potential use of certain spectral indices.

Although the overall performance metrics are encouraging, the scatter plots indicate that individual predictions may still vary around the observed values. This is expected for field-based biomass estimation under heterogeneous natural conditions, where vegetation structure, illumination, soil background, and species composition vary substantially between samples. Therefore, the proposed model should be interpreted primarily as a tool for rapid, low-cost, and non-destructive monitoring rather than as a replacement for destructive sampling in highly precise plot-level measurements. For operational grazing-management decisions, predictions should preferably be aggregated across multiple images within a pasture unit, which can reduce the influence of individual prediction errors and provide more stable estimates for decision support.
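The expected benefit of aggregation can be estimated with a simple calculation, under the strong assumption that the errors of individual images are independent and have no shared bias. The function name is hypothetical; the value of 1.172 c/ha is the residual standard deviation reported for the best-performing model.

```python
import math


def std_of_mean(sigma, n_images):
    """Standard deviation of the average of n_images independent errors."""
    return sigma / math.sqrt(n_images)


for n in (1, 4, 9):
    print(n, round(std_of_mean(1.172, n), 3))
# 1 1.172
# 4 0.586
# 9 0.391
```

In practice the gain is likely to be smaller than this idealized estimate: images taken within the same pasture unit tend to share illumination and vegetation conditions, so their errors are correlated, and any systematic component of the error (such as underestimation at high biomass) does not average out.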

The practical relevance of the proposed method can be considered on multiple levels. A smartphone RGB-based model allows any field worker to conduct independent monitoring without specialized equipment or flight permissions, which is particularly valuable for remote pastures. The achieved accuracy (MAE ≈ 0.78 c/ha) appears suitable for rapid agronomic monitoring and decision-support tasks, providing consistent estimates of pasture biomass while significantly accelerating data collection compared to traditional destructive methods. Timely and accurate biomass prediction supports agronomic decision-making, including pasture rotation, identification of degraded areas, and preemptive forage planning for livestock. Furthermore, the minimal impact of preprocessing strategies on prediction performance suggests that the model may remain useful under variable lighting, terrain, and sky-coverage conditions in smartphone imagery.

Based on the study’s results and limitations, several promising directions for future work can be proposed. Expanding the dataset, particularly in high-biomass regions, would help reduce systematic prediction errors, while increasing temporal and geographic coverage would enable the model to better capture seasonal and regional variability. External validation across additional regions of Kazakhstan and other arid or semi-arid rangelands is also needed to assess transferability. In addition to RGB imagery, integrating meteorological data, soil moisture measurements, and satellite observations could further improve prediction accuracy.

Domain-adaptive self-supervised learning (SSL), especially using large image collections from diverse geographic regions, could enhance pretraining and improve model robustness. Adapting the model for real-time mobile applications is also important; model compression techniques such as distillation or quantization may be applied to reduce computational requirements. Finally, uncertainty estimation and attention-map visualization should be further developed to support agronomic decision-making and provide more transparent explanations of model predictions.

5. Conclusions

This study proposes a novel approach for estimating aboveground biomass in the natural pastures of Kazakhstan based on RGB images captured via smartphones. The proposed method integrates the DINOv3-based wide + tiles architecture, an attention pooling mechanism, and a two-stage fine-tuning strategy, allowing the model to utilize global and local features simultaneously. Experimental results demonstrated that the model performs consistently and with high accuracy under heterogeneous field conditions (R² ≈ 0.73). Furthermore, the impact of preprocessing strategies was found to be minimal, indicating that the model is robust to the choice of input preprocessing, while high input resolution was found to be a crucial factor for the accuracy of biomass prediction.

The main scientific novelty of the study encompasses several aspects: (i) the proposal of a new architectural approach for biomass estimation in natural pastures based on smartphone RGB images; (ii) the integration of multi-scale features through the wide + tiles strategy; (iii) a comprehensive analysis of the impact of preprocessing strategies and demonstration of the model's robustness to them; and (iv) the evaluation of the model on field data collected from different regions of Kazakhstan. The proposed approach, which does not require specialized equipment, enables rapid and accessible pasture monitoring based on smartphone imagery, making it directly applicable under real field conditions. This method could be a practical tool for supporting agronomic decisions, such as evaluating pasture capacity, planning forage supply, and identifying degraded areas. Promising directions for further research include expanding the dataset to cover all regions of Kazakhstan, accounting for the phenological stages of vegetation, exploring domain adaptation methods for transferring the model to other arid and semi-arid regions of Central Asia, integrating multispectral and satellite data, and adapting the model for mobile devices.

Author Contributions

Conceptualization, R.A., J.S. and M.N.; methodology, R.A., A.A., J.S. and M.N.; software, D.Z., A.A. and M.N.; validation, R.A., J.S., M.N. and G.K.; formal analysis, R.A., A.A. and J.S.; investigation, R.A., R.B., D.S. and B.S.; resources, G.K., A.B. and R.B.; data curation, R.A., A.A., D.Z. and M.N.; writing—original draft preparation, R.A., A.A. and J.S.; writing—review and editing, J.S., M.N., G.K. and A.B.; visualization, A.A., D.Z. and M.N.; supervision, J.S. and M.N.; project administration, J.S. and M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the Ministry of Agriculture of the Republic of Kazakhstan (Grant No. BR22883585 “Development of effective technologies to increase productive potential and rational use of pastures”).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors express their sincere gratitude to the administrators and rural farmers of the five regions of Kazakhstan included in this study—Kyzylorda, Turkestan, Zhambyl, Almaty, and Zhetysu regions, including their associated districts—for granting access to their lands and properties for field data collection.

Conflicts of Interest

Authors Gulnara Kabzhanova and Anuarbek Bissembayev were employed by LLP “Skyterra” and LLP “Scientific and Production Centre for Animal Husbandry and Veterinary”, respectively. The authors declare that they have no conflicts of interest related to this manuscript. The companies had no role in the study design; collection, analysis, or interpretation of data; writing of the manuscript; or the decision to submit the manuscript for publication. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Theau, J.; Lauzier-Hudon, É.; Aubé, L.; Devillers, N. Estimation of forage biomass and vegetation cover in grasslands using UAV imagery. PLoS ONE 2021, 16, e0245784.
da Silva, R.C.E.; Tommaselli, A.M.G.; Imai, N.N.; Martins-Neto, R.P.; da Silva da Silveira, D.; Moro, E. Dry mass grassland estimation using UAV ultra-wide RGB images. ISPRS Ann. 2024, X-3, 69–75.
Spaeth, K.E.; Weltz, M.A.; Nesbit, J.; Qi, J.; Rutherford, W.A.; Williams, C.J.; Toledo, D.; Newingham, B.A.; Iskakova, G.; Kussainova, M.; et al. Rangeland resource assessment in the Aqmola region of Kazakhstan. Rangel. Ecol. Manag. 2025, 98, 389–398.
Zhao, B.; Hiller, J.; Awada, T.; Wardlow, B.; Erickson, G.; Shi, Y. Forage biomass estimation using UAV-based remote sensing and machine learning. Ecol. Inform. 2025, 90, 103361.
Zhang, H.; Sun, Y.; Chang, L.; Qin, Y.; Chen, J.; Qin, Y.; Du, J.; Yi, S.; Wang, Y. Estimation of grassland canopy height and aboveground biomass at the quadrat scale using UAV. Remote Sens. 2018, 10, 851.
Grüner, E.; Astor, T.; Wachendorf, M. Biomass prediction of heterogeneous temperate grasslands using an SfM approach based on UAV imaging. Agronomy 2019, 9, 54.
Lussem, U.; Bolten, A.; Menne, J.; Gnyp, M.L.; Schellberg, J.; Bareth, G. Estimating biomass in temperate grassland with high resolution canopy surface models from UAV-based RGB images and vegetation indices. J. Appl. Remote Sens. 2019, 13, 034525.
Bendig, J.; Bolten, A.; Bennertz, S.; Broscheit, J.; Eichfuss, S.; Bareth, G. Estimating Biomass of Barley Using Crop Surface Models (CSMs) Derived from UAV-Based RGB Imaging. Remote Sens. 2014, 6, 10395–10412.
Viljanen, N.; Honkavaara, E.; Näsi, R.; Hakala, T.; Niemeläinen, O.; Kaivosoja, J. A novel machine learning method for estimating biomass of grass swards using a photogrammetric canopy height model. Agriculture 2018, 8, 70.
Acorsi, M.G.; das Dores Abati Miranda, F.; Martello, M.; Smaniotto, D.A.; Sartor, L.R. Estimating biomass of black oat using UAV-based RGB imaging. Agronomy 2019, 9, 344.
Guan, Q.; Jiang, M.; Du, W.; Chen, X.; Yan, B. Integrating UAV Visible and Multispectral Imagery to Assess Grazing-Induced Vegetation Responses in Sandy Grasslands. Front. Plant Sci. 2025, 16, 1730583.
Zhang, H.; Tang, Z.; Wang, B.; Meng, B.; Qin, Y.; Sun, Y.; Lv, Y.; Zhang, J.; Yi, S. A non-destructive method for rapid acquisition of grassland aboveground biomass for satellite ground verification using UAV RGB images. Glob. Ecol. Conserv. 2022, 33, e01999.
Schreiber, L.V.; Amorim, J.G.A.; Guimarães, L.; Matos, D.M.; da Costa, C.M.; Parraga, A. Aboveground biomass wheat estimation: Deep learning with UAV-based RGB images. Appl. Artif. Intell. 2022, 36, 2055392.
Castro, W.; Junior, J.M.; Polidoro, C.; Osco, L.P.; Gonçalves, W.; Rodrigues, L.; Santos, M.; Jank, L.; Barrios, S.; Valle, C.; et al. Deep learning applied to phenotyping of biomass in forages with UAV-based RGB imagery. Sensors 2020, 20, 4802.
Karila, K.; Oliveira, R.A.; Ek, J.; Kaivosoja, J.; Koivumäki, N.; Korhonen, P.; Niemeläinen, O.; Nyholm, L.; Näsi, R.; Pölönen, I.; et al. Estimating grass sward quality and quantity parameters using drone remote sensing with deep neural networks. Remote Sens. 2022, 14, 2692.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Alvarez-Mendoza, C.I.; Guzman, D.; Casas, J.; Bastidas, M.; Polanco, J.; Valencia-Ortiz, M.; Montenegro, F.; Arango, J.; Ishitani, M.; Selvaraj, M.G. Predictive modeling of above-ground biomass in Brachiaria pastures from satellite and UAV imagery. Remote Sens. 2022, 14, 5870.
Liu, W.; Xu, C.; Zhang, Z.; De Boeck, H.; Wang, Y.; Zhang, L.; Xu, X.; Zhang, C.; Chen, G.; Xu, C. Machine learning-based grassland aboveground biomass estimation and its response to climate variation in Southwest China. Front. Ecol. Evol. 2023, 11, 1146850.
Dong, W.; Mitchard, E.T.A.; Yu, H.; Hancock, S.; Ryan, C.M. Forest aboveground biomass estimation using GEDI and earth observation data through attention-based deep learning. arXiv 2023, arXiv:2311.03067.
Tian, X.; Li, J.; Zhang, F.; Zhang, H.; Jiang, M. Forest aboveground biomass estimation using multisource remote sensing data and deep learning algorithms. Remote Sens. 2024, 16, 1074.
Fan, X.; He, G.; Zhang, W.; Long, T.; Zhang, X.; Wang, G.; Sun, G.; Zhou, H.; Shang, Z.; Song, X. Sentinel-2 Images Based Modeling of Grassland Above-Ground Biomass Using Random Forest Algorithm: A Case Study on the Tibetan Plateau. Remote Sens. 2022, 14, 5321.
Kalmani, V.H.; Gadekar, P.R.; Adamuthe, A.C. Deep learning-based RGB image modelling for multi-component pasture biomass estimation in precision grazing. Agric. Sci. Dig. 2026; in press.
Nakajima, K.; Saito, K.; Tsujimoto, Y.; Takai, T.; Mochizuki, A.; Yamaguchi, T.; Ibrahim, A.; Mairoua, S.G.; Andrianary, B.H.; Katsura, K.; et al. Robustness of RGB image-based estimation for rice above-ground biomass using multi-location datasets. Smart Agric. Technol. 2025, 11, 100998.
Kan, X.; Lu, Z.; Zhang, Y.; Zhu, L.; Lim Kam Sian, K.T.C.; Wang, J.; Liu, X.; Zhou, Z.; Cao, H. DSRSS-Net: Improved-Resolution Snow Cover Mapping from FY-4A Satellite Images Using the Dual-Branch Super-Resolution Semantic Segmentation Network. Remote Sens. 2023, 15, 4431.
Dhawi, F.; Ghafoor, A.; Almousa, N.; Ali, S.; Alqanbar, S. Predictive modelling employing machine learning, CNNs, and smartphone RGB images for non-destructive biomass estimation of pearl millet (Pennisetum glaucum). Front. Plant Sci. 2025, 16, 1594728. [Google Scholar] [CrossRef] [PubMed]
Liao, Q.; Wang, D.; Haling, R.; Liu, J.; Li, X.; Plomecka, M.; Robson, A.; Pringle, M.; Pirie, R.; Walker, M.; et al. Estimating pasture biomass from top-view images: A dataset for precision agriculture. arXiv 2025, arXiv:2510.22916. [Google Scholar]
Woodrow, L.; Carter, J.; Fraser, G.; Barnetson, J. Using Continuous Output Neural Nets to Estimate Pasture Biomass from Digital Photographs in Grazing Lands. AgriEngineering 2023, 5, 1051–1067. [Google Scholar] [CrossRef]
Mandal, M. Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression. arXiv 2026, arXiv:2603.07819. [Google Scholar]
Rueda-Ayala, V.P.; Peña, J.M.; Höglind, M.; Bengochea-Guevara, J.M.; Andújar, D. Comparing UAV-based technologies and RGB-D reconstruction methods for plant height and biomass monitoring on grass ley. Sensors 2019, 19, 535. [Google Scholar] [CrossRef] [PubMed]
Vahidi, M.; Shafian, S.; Thomas, S.; Maguire, R. Pasture biomass estimation using ultra-high-resolution RGB UAV images and deep learning. Remote Sens. 2023, 15, 5714. [Google Scholar] [CrossRef]
Grüner, E.; Wachendorf, M.; Astor, T. The potential of UAV-borne spectral and textural information for predicting aboveground biomass and nitrogen fixation in legume-grass mixtures. PLoS ONE 2020, 15, e0234703. [Google Scholar] [CrossRef] [PubMed]
Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. [Google Scholar]
Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Arystanov, A.; Karabkina, N.; Sagin, J.; Nurguzhin, M.; King, R.; Bekseitova, R. Use of Indices Applied to Remote Sensing for Establishing Winter–Spring Cropping Areas in the Republic of Kazakhstan. Sustainability 2024, 16, 7548. [Google Scholar] [CrossRef]
Kabzhanova, G.; Arystanova, R.; Bissembayev, A.; Arystanov, A.; Sagin, J.; Nasiyev, B.; Kurmasheva, A. Remote Sensing Applications for Pasture Assessment in Kazakhstan. Agronomy 2025, 15, 526. [Google Scholar] [CrossRef]
Arystanov, A.; Sagin, J.; Karabkina, N.; Arystanova, R.; Yermekov, F.; Kabzhanova, G.; Bekseitova, R.; Aktymbayeva, A.; Kutymova, N. Automatic Classification of Agricultural Crops Using Sentinel-2 Data in the Rainfed Zone of Southern Kazakhstan. Agronomy 2025, 15, 2040. [Google Scholar] [CrossRef]
Arystanov, A.; Sagin, J.; Kabzhanova, G.; Sarsekova, D.; Bekseitova, R.; Molzhigitova, D.; Balkozha, M.; Yeleuova, E.; Satvaldiyev, B. Winter Cereal Re-Sowing and Land-Use Sustainability in the Foothill Zones of Southern Kazakhstan Based on Sentinel-2 Data. Sustainability 2026, 18, 1053. [Google Scholar] [CrossRef]
Sarsekova, D.; Sagin, J.; Perzadayeva, A.; Arystanova, R.; Arystanov, A.; Kezheneva, A.; Jumassultanova, S.; Satybaldiyeva, G.; Ospangaliyev, A. Farmers’ Land Sustainability Improvement with Soil, Geology, and Water Retention Assessment in North Kazakhstan. Sustainability 2026, 18, 1316. [Google Scholar] [CrossRef]
Arystanov, A.; Arystanova, R.; Boribay, E.; Sagin, J.; Karabkina, N.; Sarsekova, D.; Perzadayeva, A.; Munaitpassova, A.; Yelikbayeva, S.; Tleubekuly, E. Interannual Dynamics of Fallow Land Extent in North Kazakhstan Based on Sentinel-2 Data for the Recent Period (2021–2025). Agronomy 2026, 16, 1008. [Google Scholar] [CrossRef]
Mukanov, Y.; Arystanova, R.; Sagin, J.; Samarkhanov, K.; Usmanov, T.; Baisholanov, S.; Arystanov, A.; Koshim, A.; Duisebek, B.; Zhukenova, A. A Statistical Analysis of Multi-Decadal Trends in Temperature, Precipitation and Drought Indices in Eastern and Southeastern Kazakhstan Between 1981 and 2023. Agronomy 2026, 16, 1097. [Google Scholar] [CrossRef]
Kyrgyzbay, K.; Usmanov, T.; Sagin, J.; Duisebek, B.; Arystanova, R.; Kulbekova, S.; Utepov, A.; Amanzholova, R. Spatial Assessment of Flood Susceptibility in the Abai Region, Kazakhstan. Water 2026, 18, 817. [Google Scholar] [CrossRef]
Varthani, S.; Donaghy, D.J.; Kenyon, P.R.; Sneddon, N.W.; Cartmill, A.D. Measuring Herbage Mass: A Review. Agronomy 2025, 15, 2264. [Google Scholar] [CrossRef]
Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]

Figure 1. Relief and hydrographic network of the southern regions of Kazakhstan.

Figure 2. Conducting a geobotanical survey in Almaty Region, May 2024.

Figure 3. Examples of terrestrial RGB images taken in the field.

Figure 4. An example of the main stages of image preprocessing: (a) original image, (b) masked image, and (c) cropped image.

Figure 5. The proposed wide + tiles architecture scheme: DINOv3 backbone, attention pooling, feature fusion, and regression layer.
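
The data flow named in the Figure 5 caption can be illustrated with a minimal, framework-free sketch. It assumes that the wide view and each tile have already been encoded into feature vectors by the backbone, that attention pooling is a softmax-weighted sum scored against a single query vector, and that fusion is plain concatenation followed by a linear head. These choices, and the names `attention_pool` and `fuse_and_regress`, are illustrative assumptions rather than the authors' implementation; in a trained model the query and the head weights would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tile_features, query):
    """Pool tile feature vectors into one vector.

    Each tile is scored by its dot product with `query` (hypothetical,
    would be a learned parameter); the scores are turned into weights
    with a softmax and used for a weighted sum of the tile features.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in tile_features]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, tile_features))
        for d in range(dim)
    ]
    return pooled, weights


def fuse_and_regress(wide_feature, pooled_tiles, head_weights, bias):
    """Concatenate the wide and pooled tile features, then apply a linear head."""
    fused = list(wide_feature) + list(pooled_tiles)  # assumed fusion: concatenation
    return sum(w * x for w, x in zip(head_weights, fused)) + bias
```

With an all-zero query every tile receives the same weight, so the pooled vector reduces to the plain mean of the tile features; a non-zero query shifts weight toward tiles whose features align with it.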

Figure 6. Examples of the data augmentation methods used: (a) RandomResizedCrop, (b) HorizontalFlip, (c) VerticalFlip, (d) Rotate, (e) BrightnessContrast, (f) HueSaturationValue, (g) GaussNoise, (h) CLAHE, (i) Sharpen, (j) EnhanceGreenLAB, (k) GreenChannelEmphasis, and (l) ChannelShuffle.

Figure 7. Attention maps based on the Masked configuration. Warmer colors (yellow–red) indicate higher attention weights, whereas cooler colors (blue–purple) indicate lower attention weights.

Figure 8. Attention map generated from the original unmasked RGB image, where the full scene context, including background regions, was retained. Warmer colors (yellow–red) indicate higher attention weights, whereas cooler colors (blue–purple) indicate lower attention weights.

Figure 9. Scatter plot of predicted versus actual values. The red line is the ideal fit line $y = \hat{y}$.

Figure 10. Histogram of the distribution of residuals ($y - \hat{y}$).

Figure 11. Scatter of residuals relative to predicted values.

Table 1. Augmentations Used During Training.

Augmentation	Description	Parameters	p
Geometric Transformations
RandomScale + PadBlack	Randomly scales the image and fills empty areas with black color	scale ∈ [0.90, 1.00]	0.20
RandomResizedCrop	Randomly crops a region of the image and resizes it	scale ∈ [0.85, 1.00]; ratio ∈ [0.90, 1.10]	1.00
HorizontalFlip	Horizontal flipping	–	0.50
VerticalFlip	Vertical flipping	–	0.50
RandomRotate90	Rotation by multiples of 90°	–	0.50
Rotate	Small-angle arbitrary rotation	limit ∈ [−3°, +3°]	0.50
Photometric Transformations
RandomBrightnessContrast	Adjusts brightness and contrast	Δbrightness, Δcontrast ≤ 0.15	0.60
HueSaturationValue	Adjusts hue, saturation, and value	hue ≤ 8; sat ≤ 18; val ≤ 12	0.35
CLAHE	Enhances local contrast	clip_limit ∈ [1, 3]; tile = 8 × 8	0.15
Sharpen	Sharpens the image	–	0.10
GaussNoise	Adds Gaussian noise	–	0.15
Plant-Specific Spectral Transformations
EnhanceGreenLAB	Enhances the green channel in LAB color space	a-scale = 0.70	0.20
GreenChannelEmphasis	Computes green index (g/(r + b))	–	0.08
ChannelShuffle	Randomly permutes RGB channels	–	0.10
Regularization Transformations
RandomErasing	Randomly removes a region (filled with noise)	erase_area ∈ [1%, 12%]	0.25
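
To make the role of the p column concrete, the sketch below treats each row of Table 1 as an independent coin flip and samples which augmentations fire for one image. Independent gating is an assumption (the table does not say whether any transforms are grouped or mutually exclusive), and `sample_augmentations` and `green_index` are hypothetical helper names. The small `eps` in the green index is also an added assumption, included only to avoid division by zero on black pixels.

```python
import random

# (name, p) pairs copied from Table 1.
TABLE_1 = [
    ("RandomScale + PadBlack", 0.20),
    ("RandomResizedCrop", 1.00),
    ("HorizontalFlip", 0.50),
    ("VerticalFlip", 0.50),
    ("RandomRotate90", 0.50),
    ("Rotate", 0.50),
    ("RandomBrightnessContrast", 0.60),
    ("HueSaturationValue", 0.35),
    ("CLAHE", 0.15),
    ("Sharpen", 0.10),
    ("GaussNoise", 0.15),
    ("EnhanceGreenLAB", 0.20),
    ("GreenChannelEmphasis", 0.08),
    ("ChannelShuffle", 0.10),
    ("RandomErasing", 0.25),
]


def sample_augmentations(rng):
    """Return the names of the transforms that fire for one image.

    Assumes every transform is gated independently with its own p.
    """
    return [name for name, p in TABLE_1 if rng.random() < p]


def green_index(r, g, b, eps=1e-6):
    """Green index g / (r + b) from Table 1; eps is an added safeguard."""
    return g / (r + b + eps)


# Under independent gating, the expected number of transforms per image
# is simply the sum of the probabilities.
expected_count = sum(p for _, p in TABLE_1)

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_augmentations(rng))
    print(round(expected_count, 2))
```

Under this independence assumption the probabilities sum to 5.18, so an image would pass through roughly five transforms on average, with RandomResizedCrop always applied.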

Table 2. Administrative-district-level training and validation split.

Subset	Administrative Districts	Number of Districts
Training	Sarkand; Eskeldi; Korday; Karatal; Alakol; Merki; Koksu; Ryskulov; Aksu; Shu; Talas; Panfilov; Zhambyl; Kazygurt; Keles; Aral; Maktaaral; Sarysu; Zhanakorgan; Tolebi; Shardara; Ordabasy; Tulkibas; Zhetysai; Baidibek; Karmakshi; Kazaly; Syrdarya; Shieli; Sozak; Zhalagash; Turkestan	32
Validation	Zhualy; Baizak; Kerbulak; Moiynkum; Saryagash; Sauran; Sairam; Otyrar	8
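
A district-level split like the one in Table 2 can be sketched as follows. The district names are copied from the table; the function name `split_by_district` and the `(sample_id, district)` record layout are assumptions made for illustration. The point of the sketch is that every sample is routed by its district, so no district contributes images to both subsets.

```python
TRAIN_DISTRICTS = {
    "Sarkand", "Eskeldi", "Korday", "Karatal", "Alakol", "Merki", "Koksu",
    "Ryskulov", "Aksu", "Shu", "Talas", "Panfilov", "Zhambyl", "Kazygurt",
    "Keles", "Aral", "Maktaaral", "Sarysu", "Zhanakorgan", "Tolebi",
    "Shardara", "Ordabasy", "Tulkibas", "Zhetysai", "Baidibek", "Karmakshi",
    "Kazaly", "Syrdarya", "Shieli", "Sozak", "Zhalagash", "Turkestan",
}

VAL_DISTRICTS = {
    "Zhualy", "Baizak", "Kerbulak", "Moiynkum",
    "Saryagash", "Sauran", "Sairam", "Otyrar",
}


def split_by_district(samples):
    """Route (sample_id, district) records into train and validation lists.

    Raises ValueError for a district that appears in neither set, so that
    a typo cannot silently drop or misplace a sample.
    """
    train_ids, val_ids = [], []
    for sample_id, district in samples:
        if district in VAL_DISTRICTS:
            val_ids.append(sample_id)
        elif district in TRAIN_DISTRICTS:
            train_ids.append(sample_id)
        else:
            raise ValueError(f"unknown district: {district}")
    return train_ids, val_ids


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    demo = [("img_001", "Sarkand"), ("img_002", "Zhualy"), ("img_003", "Shu")]
    print(split_by_district(demo))
```

Splitting on the district rather than on individual images keeps nearby, visually similar plots from landing on both sides of the split, which is what makes the validation subset geographically held out.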

Table 3. Comparative results of configurations.

Model	Input	Grid	Window/Tile, px	MAE	RMSE	R²
DINOv3	original	3 × 3	1024/896	0.774	1.186	0.732
DINOv3	masked	3 × 3	1024/896	0.779	1.183	0.733
DINOv3	cropped	2 × 3	1024/896	0.794	1.190	0.731
DINOv3	cropped + strong aug	2 × 3	1024/992	0.826	1.245	0.705
DINOv3	Original-tiles	3 × 3	896	0.794	1.268	0.696
DINOv3	Original-wide	–	1024	0.978	1.305	0.676
DINOv3	original (small)	3 × 3	896/640	0.904	1.384	0.671
ConvNeXtV2	cropped	2 × 3	1024/896	0.951	1.425	0.614
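
For reference, the three metrics reported in Table 3 can be computed as in the sketch below. It uses the standard definitions of MAE, RMSE, and the coefficient of determination; the function name `regression_metrics` and the toy values in the usage example are assumptions, not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)       # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


if __name__ == "__main__":
    # Hypothetical biomass values in c/ha, for illustration only.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.0, 2.5, 4.0]
    print(regression_metrics(observed, predicted))
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, which is why the two columns in Table 3 do not always rank configurations identically.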

Table 4. Comparative results of alternative strategies.

Model	Input	Grid	Window/Tile, px	MAE	RMSE	R²
DINOv3 (SSL)	original	3 × 3	1024/896	0.896	1.454	0.598
DINOv3	cropped + bin cls	2 × 3	1024/896	0.868	1.407	0.623

Table 5. Comparison with interpretable RGB feature-based baseline models on the geographically held-out validation subset.

Model	Input/Features	Train Samples	Validation Samples	MAE, c/ha	RMSE, c/ha	R²	Relative MAE, %
Random Forest	RGB statistics + vegetation indices + GLCM texture features + 3 × 3 spatial features	956	240	1.495	2.227	0.082	40.4
XGBoost	RGB statistics + vegetation indices + GLCM texture features + 3 × 3 spatial features	956	240	1.472	2.232	0.078	39.8
CatBoost	RGB statistics + vegetation indices + GLCM texture features + 3 × 3 spatial features	956	240	1.461	2.223	0.085	39.5

Table 6. Error analysis across biomass ranges on the geographically held-out validation subset.

Biomass Range	Number of Samples	Mean Observed Biomass, c/ha	MAE, c/ha	RMSE, c/ha	Relative MAE, %
Low biomass (0–2 c/ha)	70	1.754	0.281	0.422	16.0
Medium biomass (>2–6 c/ha)	132	3.159	0.786	1.109	24.9
High biomass (>6 c/ha)	38	7.829	1.672	2.059	21.4
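
The figures in Table 6 can be cross-checked with a short calculation. The sketch below assumes that Relative MAE is MAE divided by the mean observed biomass of the range, and that the three ranges partition the validation subset (70 + 132 + 38 = 240 samples, matching Table 5), so that per-range MAE and RMSE can be pooled with sample-count weights. The variable and function names are hypothetical.

```python
import math

# (range label, n, mean observed c/ha, MAE c/ha, RMSE c/ha) copied from Table 6.
TABLE_6 = [
    ("low", 70, 1.754, 0.281, 0.422),
    ("medium", 132, 3.159, 0.786, 1.109),
    ("high", 38, 7.829, 1.672, 2.059),
]


def relative_mae_percent(mae, mean_observed):
    """Relative MAE as a percentage of the mean observed biomass."""
    return 100.0 * mae / mean_observed


total_n = sum(n for _, n, _, _, _ in TABLE_6)

# MAE pools as a sample-weighted mean of the per-range MAEs.
pooled_mae = sum(n * mae for _, n, _, mae, _ in TABLE_6) / total_n

# RMSE pools through the mean squared error, not through RMSE directly.
pooled_rmse = math.sqrt(
    sum(n * rmse ** 2 for _, n, _, _, rmse in TABLE_6) / total_n
)

relative = {
    label: relative_mae_percent(mae, mean_obs)
    for label, _, mean_obs, mae, _ in TABLE_6
}

if __name__ == "__main__":
    print(total_n, round(pooled_mae, 3), round(pooled_rmse, 3))
    print({k: round(v, 1) for k, v in relative.items()})
```

The recomputed relative errors round to 16.0%, 24.9%, and 21.4%, matching the table. The pooled MAE and RMSE round to 0.779 and 1.183 c/ha, which coincide with the masked DINOv3 row of Table 3; this section does not state which configuration the error analysis used, so that agreement is an observation rather than a confirmed link.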

