Article

A Hybrid CNN-Transformer Model for Soil Texture Estimation from Microscopic Images

Ming Pan, Wenhao Zhang, Zeyang Zhong, Xinyu Jiang, Yu Jiang, Caixia Lin, Long Qi and Shuanglong Wu
1 College of Engineering, South China Agricultural University, Guangzhou 510642, China
2 Guangdong Institute of Modern Agricultural Equipment, Guangzhou 510630, China
3 College of Artificial Intelligence and Low-Altitude Technology, South China Agricultural University, Guangzhou 510642, China
4 College of Water Conservancy and Civil Engineering, South China Agricultural University, Guangzhou 510642, China
5 State Key Laboratory of Agricultural Equipment Technology, Guangzhou 510642, China
6 Department of Biosystems Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada
7 Guangdong Engineering Technology Research Center of Rice Transplanting Mechanical Equipment, Guangzhou 510642, China
* Authors to whom correspondence should be addressed.
Agronomy 2026, 16(3), 333; https://doi.org/10.3390/agronomy16030333
Submission received: 25 December 2025 / Revised: 19 January 2026 / Accepted: 27 January 2026 / Published: 29 January 2026
(This article belongs to the Collection AI, Sensors and Robotics for Smart Agriculture)

Abstract

Soil texture is one of the key physical properties of soil. Traditional laboratory testing methods can determine soil texture with high accuracy, but they are time-consuming and costly. To achieve rapid and accurate acquisition of soil texture information, this study proposes RVFM, a hybrid deep learning model for soil texture detection from microscopic images. The model integrates a CNN branch, which extracts multi-dimensional texture features, with a Transformer branch, which captures global positional information, and fuses the two via a cross-attention module. This architecture effectively captures microscopic distribution characteristics to estimate soil composition proportions. Experimental results demonstrate high precision, with coefficients of determination (R2) for sand, silt, and clay reaching 0.971, 0.954, and 0.931, respectively, and corresponding root mean square errors (RMSE) of 3.789, 2.842, and 2.780. The model outperforms other classical network models on the test set and shows better fitting performance in generalisation tests, demonstrating practical value.

1. Introduction

Soil particle size distribution is one of the fundamental controlling factors of soil structure and function. Soil processes, properties, and specific characteristics are typically associated with these distributions, which are commonly referred to as soil texture [1]. Soil texture specifically refers to the weight proportion of soil particles smaller than 2 mm, which strongly influences soil water and nutrient retention, heat and air flow, as well as chemical and biological properties [2]. These properties are closely related to crop growth conditions. Therefore, understanding soil texture information in the field can assist field workers in selecting appropriate crops for planting, thereby achieving higher yields [3]. In the late 19th century, many countries began measuring soil mechanical composition and classifying soils accordingly. Today, numerous soil texture classification standards have been proposed worldwide, with the U.S. Department of Agriculture’s 1951 U.S. Soil Texture Classification Standard being the most widely used [4,5]. The American Standard classifies soil particles into sand (0.05–2 mm), silt (0.002–0.05 mm), and clay (<0.002 mm), and further divides them into 12 texture types based on the percentage of each particle size group.
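As a concrete illustration of these size boundaries, a minimal Python helper that assigns a single particle diameter to its USDA size fraction is sketched below; the function name and structure are introduced here for illustration only and are not part of any standard library or of the authors' code.

```python
def usda_size_fraction(diameter_mm: float) -> str:
    """Classify a fine-earth soil particle (< 2 mm) into a USDA size fraction by diameter."""
    if diameter_mm >= 2.0:
        raise ValueError("particles of 2 mm or larger fall outside the fine-earth fraction")
    if diameter_mm >= 0.05:       # sand: 0.05-2 mm
        return "sand"
    if diameter_mm >= 0.002:      # silt: 0.002-0.05 mm
        return "silt"
    return "clay"                 # clay: < 0.002 mm
```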
Traditional texture testing is typically conducted in laboratories, with standard methods including the pipette method [6] and the hydrometer method. Both methods are based on Stokes' law, determining the particle size distribution from the settling velocity of soil particles in a liquid. However, these laboratory methods require not only collecting, grinding, and drying the soil samples but also the use of chemical reagents such as hydrogen peroxide and sodium hexametaphosphate to remove organic matter and disperse the soil particles. Although some researchers have explored laser diffraction for the final particle size analysis of soil suspensions [7,8], these procedures remain cumbersome and time-consuming, and the chemical pretreatment leaves the processed samples environmentally hazardous.
To address the high costs and time-consuming nature of traditional soil texture testing methods, soil near-field sensor technology has also developed rapidly [9], with research primarily categorised into three types: electromagnetic induction devices, spectroscopy, and imaging.
Nocco et al. [10] found that the physical properties of the soil surface are strongly correlated with its apparent electrical conductivity at higher quantiles, whereas the correlation is poor in deeper soils; nevertheless, this confirms that a predictive model can be established between conductivity and soil physical properties. Additionally, other studies have shown that magnetic susceptibility is highly correlated with the sand and clay content of soil and have applied it to soil texture detection [11,12].
Spectroscopy serves as an efficient tool for detecting the physical and chemical properties of soil, and its application to soil texture analysis has been studied extensively. Ternikar et al. [13] utilised the ICRAF-ISRIC global soil spectral library, combining visible-near-infrared and short-wave infrared (VNIR/SWIR) laboratory spectroscopy with mid-infrared spectroscopy to achieve high-precision detection of specific soil types. Pavão et al. [14] and Benedet et al. [15] both combined portable X-ray fluorescence spectrometry (PXRF) with diffuse reflectance spectroscopy in the visible-near-infrared (Vis-NIR) region to perform regression predictions of soil texture, particularly accurate predictions of clay content. Ng et al. [16] argued that combined spectra can better characterise soil texture and therefore compared partial least squares regression, Cubist regression trees, and one- and two-dimensional convolutional neural networks on visible/near-infrared, mid-infrared, and combined spectra. They concluded that the one-dimensional CNN provided the best fit to the spectra, with the two-dimensional CNN applied to spectral maps performing next best.
The development of digital imaging technology has made cameras an efficient soil near-field sensor device. RGB imagery provides a more intuitive representation of soil aggregate spatial and morphological characteristics, offering particular advantages in texture detection. Pan et al. [17] developed a soil texture type identification application based on the camera functionality of an Android smartphone, which uses the Perceptual Hashing algorithm to achieve high-accuracy identification of field soil images. Barman and Choudhury [18] and Ngu et al. [19] classified soil texture types using support vector machines and random forest algorithms, respectively, achieving accuracy rates above 90%. Azadnia et al. [20] input colour, texture, and their combined features of soil images into different traditional machine vision models and CNN models for learning, finding that combined feature detection yielded the best results. However, when used alone, texture features better reflected soil information, a finding later confirmed by Ma et al. [3].
Compared with the aforementioned near-field sensors, electromagnetic induction devices, and spectroscopy, which have drawbacks such as operational difficulties and high costs, cameras also have limitations in accurately describing the distribution of particles smaller than 2 mm in soil. However, with technological advancements, microscope cameras are becoming more affordable and portable. Their high magnification capabilities allow us to observe clearer texture details compared with ordinary cameras [21,22], demonstrating significant potential for detecting soil physical and chemical properties. Sudarsan et al. [23] and Qi et al. [24] analysed soil micrographs using wavelet analysis and Bag of Visual Words algorithms, respectively, with each particle’s fitting coefficient exceeding 0.8. The development of deep learning technology in the field of image detection has also provided a direction for the application of microscopic imaging technology. Therefore, this study will combine microscopic imaging technology with artificial intelligence deep learning technology to conduct high-precision detection research on soil texture. Based on the foregoing we propose an innovative soil texture detection model, RVFM, which integrates convolutional neural networks with visual transformers through a cross-attention fusion mechanism. By employing multi-stage fusion to model the interaction between local spatial features and global contextual representations, this model achieves more efficient feature enhancement, enabling precise prediction of sand, silt, and clay content within highly heterogeneous microscopic images.
The research content is as follows:
(1) Collect soil samples from the Guangdong region of China and perform true value detection of soil texture. Simultaneously, capture microscopic images of the samples under controlled environmental conditions with varying environmental parameters to construct a soil microscopic image dataset. Develop and optimise a deep learning model, RVFM, that integrates CNN and Transformer features to perform soil texture detection tasks.
(2) Validate the performance of the constructed RVFM model through comparative experiments and select the optimal dataset. Concurrently, the model undergoes generalisation testing to validate its robustness in practical applications.

2. Materials and Methods

2.1. Soil Sample Collection

This study collected a total of 160 soil samples from a 20-centimetre surface layer depth, all taken from different regions within Guangdong Province, China. Guangdong Province is situated in the southernmost part of mainland China, with its land area roughly spanning between 20°09′ and 25°31′ north latitude, and 109°45′ and 117°20′ east longitude. The distribution of the samples is shown in Figure 1. This region is characterised by a subtropical monsoon climate, featuring warm and humid conditions, with an annual average temperature above 20 °C and annual precipitation of approximately 2000 mm. The combination of abundant natural rainfall and soil erosion caused by human agricultural activities has significantly influenced the soil characteristics of this region [25] and has also shaped its soil texture characteristics to a certain extent.
Some of the collected samples were sent to a testing laboratory, where the true soil texture values were determined using the traditional hydrometer method, serving as reference data for subsequent model training. Meanwhile, another portion of the soil samples underwent high-temperature drying, grinding, and sieving through a 2 mm mesh, to be used for constructing the subsequent soil microscopic image dataset.

2.1.1. Image Acquisition Device Setup

To capture the large number of training images required for deep learning, this study compressed the soil samples into shallow soil dishes, as shown in Figure 2a, and developed an efficient and easy-to-operate image acquisition device. First, the AM7515MZTL digital microscope manufactured by Dino-Lite, together with its DinoCapture 2.0 software, was selected as the primary image acquisition equipment. This microscope camera offers a maximum resolution of 2592 × 1944 pixels and an adjustable magnification range of 10× to 140×, enabling clear observation of soil particles approximately 2 mm in size. An opaque dark chamber was used as the primary imaging environment, equipped with a height-adjustable stand and a top-mounted light source controlled via a light-source controller, allowing artificial regulation of illumination within the enclosed space as required. The overall setup, illustrated in Figure 2b, creates a stable and controllable imaging environment to ensure maximum consistency and precision during the capture of training data. During imaging, the microscope camera was positioned directly above the soil sample to ensure perpendicular alignment. The imaging height was adjusted according to data provided for the AM7515MZTL microscope, with 30×, 40×, and 50× magnifications corresponding to imaging heights of 72.5 mm, 54.5 mm, and 43.5 mm, respectively.

2.1.2. Soil Microscopic Image Capture

Although the principal advantage of microscopic imaging over traditional imaging methods lies in its ability to capture highly detailed images of microscopic particles, observing and predicting overall texture information does not necessarily require such a fine depiction of the edges of each soil particle. Instead, a sub-microscopic imaging approach should be adopted, which involves capturing the entire form of larger sand grains on the soil surface while simultaneously describing the surface texture features of clay particles that differ by two or three orders of magnitude in size. This consideration arises not only from the poor imaging performance caused by the substantial size differences between particles but also from the fact that, when a microscopic camera images the soil surface, higher magnification ratios considerably reduce the field of view. This reduction poses a significant challenge to maintaining the uniformity and representativeness of the soil texture information presented by the camera [21,26]. Therefore, this study conducted an in-depth investigation into this issue. By adjusting the magnification settings during soil microscopic imaging, multiple soil microscopic image datasets were constructed. The three magnification parameters ultimately selected for the comparative experiments were 30×, 40×, and 50×.
Previous studies have shown that variations in illumination conditions can influence the clarity of soil images. Unlike the inherent colour of soil, which is affected by its internal organic matter content, lighting intensity can substantially affect the conversion and extraction of image information during recognition [17]. Therefore, this study also adjusted the light intensity to capture multiple images of the same soil surface, thereby achieving data augmentation and improving the model’s generalisation capability. Since the AM7515MZTL digital microscope requires different imaging heights at different magnifications, a built-in top-mounted light source was used to control the illumination intensity at the sample position on the stage. The selected light intensities were 1000 lux, 2000 lux, 3000 lux, and 4000 lux. Light intensity is controlled solely by the integrated light source at the top and measured using a lux metre positioned at the soil sample location. The imaging results are shown in Table 1 below.
To minimise the error caused by uneven soil surfaces, each sample was imaged three times. Each imaging session covered all three magnification levels, with the illumination intensity adjusted for each capture without moving the soil sample. Combined with the four light-intensity settings, the soil microscopic image dataset constructed at each magnification level therefore comprised 160 samples × 3 sessions × 4 illumination levels = 1920 images. Additionally, 32 samples were randomly selected for an additional round of imaging, which served as the test set for model training and evaluation of detection performance. The distribution of samples in the training and test sets is shown in Figure 3 and Table 2.

2.2. Soil Texture Detection Model Construction

2.2.1. Overall Structure of the Detection Model

This study integrates the local feature extraction capabilities of Convolutional Neural Networks (CNNs) with the global information processing capabilities of Transformers to construct a soil micro-image texture detection model (RVFM) based on feature fusion between ResNet-50 and the Vision Transformer (ViT). The model is divided into four components: the ResNet-50 branch, a lightweight Vision Transformer branch, a cross-attention-based feature fusion module, and an output layer. The overall structure of the model is illustrated in Figure 4.
Through the dual-branch architecture of the CNN and Transformer, the model can fully extract both the content and positional information of surface particles in soil microscopic images and utilise a cross-attention module to fuse the feature maps of the two branches. The entire network includes four feature-fusion stages, with each stage’s fused feature map encompassing particle information from soil images across different spatial dimensions. Finally, these feature maps are fused and concatenated, and a regression head is employed to predict the particle composition features of the soil independently.
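As a rough orientation, the overall pipeline described above could be organised along the lines of the following PyTorch sketch; the module and argument names (cnn_branch, vit_branch, fusion_modules, regression_head) are placeholders introduced for illustration and do not represent the authors' released implementation.

```python
import torch.nn as nn

class RVFMSkeleton(nn.Module):
    """High-level sketch: four CNN stages fused with shared Transformer tokens via cross-attention."""
    def __init__(self, cnn_branch, vit_branch, fusion_modules, regression_head):
        super().__init__()
        self.cnn_branch = cnn_branch                   # yields one feature map per residual stage
        self.vit_branch = vit_branch                   # yields global tokens with positional information
        self.fusions = nn.ModuleList(fusion_modules)   # one cross-attention fusion module per stage
        self.head = regression_head                    # maps the fused maps to (sand, silt, clay)

    def forward(self, x):
        stage_feats = self.cnn_branch(x)               # list of four feature maps at decreasing resolution
        tokens = self.vit_branch(x)                    # (B, N, embed_dim) global token features
        fused = [fuse(feat, tokens) for fuse, feat in zip(self.fusions, stage_feats)]
        return self.head(fused)                        # three regression outputs
```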

2.2.2. CNN Branch

Since the breakthrough achieved by AlexNet in 2012 [27], convolutional neural networks (CNNs) have developed rapidly and have attained a predominant position in the field of image processing. Unlike traditional image-processing methods that require manual extraction of attributes such as colour and texture, CNNs are capable of self-learning and non-linear modelling through convolution operations, thereby significantly improving the accuracy and robustness of digital image-processing techniques [28,29]. In 2015, He et al. proposed the deep residual network (ResNet) [30], which effectively addressed the vanishing-gradient problem in deep networks by introducing residual connections, thereby enabling neural networks to be trained to greater depths with improved performance.
In this study, ResNet-50 was employed as one branch of the proposed model. This branch comprises an input layer and four residual layers containing varying numbers of bottleneck blocks. The structure of a bottleneck block is illustrated in Figure 5a, and the output feature x i + 1 is calculated using the following formula:
$$x_{i+1} = \mathrm{ReLU}\left(F(x_i) + x_i\right)$$
where $x_i$ is the input feature, $F(x_i)$ is the residual feature-learning operation, and ReLU is the activation function.
The gradient calculation formula is also updated to
$$\frac{\partial\,\mathrm{loss}}{\partial x_i} = \frac{\partial\,\mathrm{loss}}{\partial x_{i+1}} \cdot \frac{\partial x_{i+1}}{\partial x_i} = \frac{\partial\,\mathrm{loss}}{\partial x_{i+1}}\left(1 + \frac{\partial F(x_i)}{\partial x_i}\right)$$
The equation prevents the gradient from vanishing during training.
Although the overall depth of the CNN branch is considerable, the multi-stage convolution operations of the bottleneck blocks, together with the use of max pooling in the input layer to reduce the feature-map dimensions, significantly decrease computational demands [31]. By leveraging the strong local feature extraction capability of ResNet-50, this branch extracts information on soil-particle distribution, overall colour, and texture from the feature maps of soil images at different stages, and utilises these features for subsequent feature-fusion processing. The feature maps extracted by this branch retain both low-dimensional spatial information and high-level semantic features processed by the ResNet-50 model when handling soil images, making this stage a critical step in enhancing the accuracy of soil-texture parameter predictions.
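For reference, a minimal PyTorch-style version of such a bottleneck block (1 × 1 reduction, 3 × 3 convolution, 1 × 1 expansion, plus the identity shortcut of the equation above) is sketched below; the channel sizes are illustrative, and the strided downsampling variants used between ResNet-50 stages are omitted.

```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """ResNet-style bottleneck block: x_{i+1} = ReLU(F(x_i) + x_i)."""
    def __init__(self, channels: int, mid: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1, bias=False)   # 1x1 channel reduction
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)  # 3x3 spatial convolution
        self.bn2 = nn.BatchNorm2d(mid)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)    # 1x1 channel expansion
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)   # identity shortcut keeps the gradient path open
```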

2.2.3. Transformer Branch

The Transformer architecture was originally proposed to address sequence-to-sequence problems in natural language processing. Subsequently, with the introduction of image-oriented variants such as the Vision Transformer [32] and the Swin Transformer [33], the dominance of CNNs in the field of image processing has been partially challenged. However, although Transformers exhibit strong performance in visual tasks, their efficiency and accuracy at comparable model sizes are often inferior to those of CNNs.
The Vision Transformer (ViT), based on the standard Transformer framework, is a deep learning model specifically designed for image analysis. It divides an input image into several equally sized patches, converts these patches into a sequence of tokens, and assigns to each token both a learnable feature embedding and a positional encoding. These token sequences are then processed through a Transformer Encoder to generate the output representation [34]. In this study, an improved ViT model is adopted as the Transformer branch of the feature-fusion network. Unlike the CNN branch, which extracts fine-grained local features via convolutional operations, the Transformer branch employs a self-attention mechanism, incorporating positional encoding and a multi-head attention module within its Encoder Blocks to capture global contextual relationships between image patches [35]. The structure of the Encoder Block is illustrated in Figure 5b. This block makes multiple uses of Layer Normalisation (LayerNorm) to stabilise training and accelerate convergence. In addition, a Multi-Layer Perceptron (MLP) block is incorporated to expand the feature dimensions, thereby enhancing the model’s ability to represent non-linear relationships among the patches. By effectively capturing the global dependencies and overall structural relationships among the three major particle types present in soil microscopic images, the Transformer branch complements the local texture and colour information extracted by the CNN branch, thus enabling more accurate prediction of soil-particle content. However, because the full ViT architecture involves substantial computational cost, its direct application would result in redundant processing and performance inefficiency. Therefore, this study applies a lightweight optimisation strategy to the Transformer branch. On the one hand, the input image is reduced from the original 448 × 448 size to a convolved 112 × 112 size, thereby decreasing the number of patches to 7 × 7, totalling 49 image tokens. On the other hand, the number of Encoder Block layers in the ViT model is reduced to three, with each layer configured to have four attention heads. These adjustments considerably reduce both the parameter count and the computational depth of the network. The resulting Transformer branch performs efficient global feature modelling with a lighter architecture, allowing it to extract high-level global information effectively. When integrated as a component of the hybrid RVFM model, it significantly enhances the accuracy and robustness of soil-texture detection.
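A compact sketch of this lightweight branch is given below, assuming a 16 × 16 patch embedding over the 112 × 112 input (7 × 7 = 49 tokens), three encoder layers, and four attention heads as stated above; the embedding width of 256 and the pre-norm encoder layout are assumptions for illustration, not values reported in the paper.

```python
import torch
import torch.nn as nn

class LightViTBranch(nn.Module):
    """Lightweight ViT branch: 112x112 input -> 49 patch tokens -> 3 encoder layers, 4 heads."""
    def __init__(self, in_ch=3, embed_dim=256, depth=3, heads=4, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, 49, embed_dim))  # one learnable position per token
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                # x: (B, 3, 112, 112)
        tokens = self.patch_embed(x)                     # (B, embed_dim, 7, 7)
        tokens = tokens.flatten(2).transpose(1, 2)       # (B, 49, embed_dim)
        return self.encoder(tokens + self.pos_embed)     # global token features
```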

2.2.4. Cross-Attention Fusion Module

Feature fusion refers to the process of combining features originating from different sources or hierarchical levels to enhance overall model performance. Traditional feature fusion approaches often involve a simple summation or concatenation of features after the manual assignment of weight coefficients. In contrast, cross-attention fusion, as a specialised feature fusion method, integrates the deep learning attention mechanism, which allows the model to selectively focus on key information. It enables one sequence (the Query) to attend to another sequence (the Key and Value) and to dynamically calculate and adjust weighting coefficients adaptively, thereby improving the model’s ability to capture complex relationships between different feature sequences.
The formula for calculating the attention score is as follows:
$$\mathrm{Attention\ Scores} = \frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}$$
where $Q$ denotes the Query, represented by the CNN branch features in this model; $K$ denotes the Key, derived from the Transformer branch features; and $d_k$ represents the dimensionality of the Key, used to scale the dot-product result and prevent excessively large values.
The attention weight is calculated as follows:
$$\mathrm{Output} = \mathrm{Softmax}\left(\mathrm{Attention\ Scores}\right) V$$
where Softmax denotes the normalisation operation, and V represents the Value, which also corresponds to the Transformer branch inputs.
The soil texture detection model employed in this study utilises a cross-attention fusion module for feature integration, the specific architecture of which is illustrated in Figure 6. It operates by extracting the feature maps from each stage of the CNN branch and inputting them, together with the feature maps of the Transformer branch, into the cross-attention mechanism. Within this module, a multi-head attention mechanism is executed, in which the CNN branch outputs are transformed into Queries, while the Transformer branch outputs serve as the Keys and Values. The module then retrieves and learns the relevant information from these Transformer features, computing the allocation of prediction weights across different regions of the image. Additionally, the CNN branch features are aligned with the output of the attention module through a 1 × 1 convolution operation, and a residual connection between the two is introduced to preserve the original feature information. Finally, a normalisation operation is applied to ensure the stability of the fused feature distribution and to generate the final fused feature map. The multi-stage feature extraction in the CNN branch preserves both low-dimensional and high-dimensional local information obtained during the training of the original soil microscopic images. Meanwhile, the Transformer branch contributes global contextual relationships to each feature map, effectively enabling the extraction of spatial distribution information about soil particles and providing a more accurate representation of their relative contribution to overall soil composition.
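The fusion step described above can be summarised in the following PyTorch-style sketch, in which the Queries come from a CNN stage feature map and the Keys/Values from the Transformer tokens; the exact placement of the 1 × 1 projection and the feature dimensions are assumptions made for illustration.

```python
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse a CNN stage feature map (Query) with Transformer tokens (Key/Value)."""
    def __init__(self, cnn_ch: int, embed_dim: int, heads: int = 4):
        super().__init__()
        self.to_query = nn.Conv2d(cnn_ch, embed_dim, kernel_size=1)      # 1x1 conv aligns CNN channels
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, cnn_feat, tokens):
        # cnn_feat: (B, C, H, W); tokens: (B, N, embed_dim)
        b, _, h, w = cnn_feat.shape
        q = self.to_query(cnn_feat).flatten(2).transpose(1, 2)           # (B, H*W, embed_dim)
        fused, _ = self.attn(query=q, key=tokens, value=tokens)          # cross-attention
        fused = self.norm(fused + q)                                     # residual keeps original CNN information
        return fused.transpose(1, 2).reshape(b, -1, h, w)                # back to a spatial feature map
```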

2.2.5. Output Layer

A dedicated output layer was constructed for the soil texture detection task, consisting of a feature fusion and dimensionality-reduction module and a regression head. The feature maps produced by the cross-attention modules contain rich multi-scale representations of the soil images and are first processed by the fusion and reduction module. Bilinear interpolation is employed to align the spatial resolutions of these feature maps, after which a 1 × 1 convolutional layer performs channel fusion to produce a multi-dimensional fused feature map. The resulting representation is further reduced by global average pooling. Finally, the reduced features are fed into the regression head, where a fully connected network maps them to predicted values for the clay, silt, and sand contents, respectively.
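A minimal sketch of that fusion-and-regression head is given below, assuming the fused maps are resized to a common resolution, concatenated, reduced by a 1 × 1 convolution and global average pooling, and mapped to three outputs; the channel sizes and the two-layer regressor are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputHead(nn.Module):
    """Align, fuse, pool, and regress the fused feature maps to (clay, silt, sand)."""
    def __init__(self, in_channels: int, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels, hidden, kernel_size=1)        # 1x1 channel fusion
        self.regressor = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, fused_maps):
        target = fused_maps[0].shape[-2:]                                # largest map sets the reference size
        aligned = [F.interpolate(m, size=target, mode="bilinear", align_corners=False)
                   for m in fused_maps]                                  # bilinear spatial alignment
        x = self.fuse(torch.cat(aligned, dim=1))                         # in_channels = sum of map channels
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)                       # global average pooling
        return self.regressor(x)                                         # predicted clay, silt, sand contents
```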

2.3. Model Evaluation Indicators

A total of 160 soil samples were utilised to construct three training datasets of equal size under different microscopic imaging magnifications, each containing 1920 images. In addition, 32 samples were randomly selected and re-imaged under varying surface conditions to form an independent test dataset. Model training was conducted for 80 epochs, ensuring that convergence was nearly achieved. The loss variation observed in the test set was used to determine and preserve the optimal model iteration at each stage. Following the training process, model performance was evaluated using two statistical indicators: the coefficient of determination (R2) and the root mean square error (RMSE). The R2 value was used to assess the model’s goodness of fit, with values closer to 1 indicating stronger explanatory power. The RMSE was calculated to measure the magnitude of the prediction error, as it is sensitive to outliers and reflects the deviation between predicted and observed values more effectively. The corresponding formulae are expressed as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $n$ is the number of soil samples, $y_i$ is the true value of the texture parameter obtained from laboratory testing, $\hat{y}_i$ is the predicted value obtained during model evaluation, and $\bar{y}$ is the mean of the true values.
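For concreteness, the two metrics can be computed directly from the predicted and observed values, for example with NumPy as sketched below; the function name is introduced here for illustration.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination and root mean square error for texture predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residual ** 2))
    return r2, rmse
```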

3. Results and Discussion

3.1. Performance Evaluation of the RVFM Model on Different Datasets

The RVFM model was trained for 80 epochs, employing stochastic gradient descent (SGD) as the optimiser and mean square error (MSELoss) as the loss function. The learning rate was set to 0.0001, the batch size to 2, and the input image resolution to 448 × 448 pixels. All training processes were carried out on a system equipped with an Intel Core i3–7100 processor and an NVIDIA GeForce GTX 1080 graphics card, operating under Ubuntu 18.04. The deep learning framework was implemented in PyTorch 2.4.1, and CUDA 11.8 was used to accelerate computation.
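A minimal training-loop sketch under the stated settings (SGD optimiser, MSE loss, learning rate 1e-4, batch size 2, 448 × 448 inputs, best checkpoint selected by test-set loss) is shown below; the model and dataset objects, the checkpoint filename, and the evaluation batching are placeholders rather than the authors' actual training script.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train_rvfm(model, train_set, test_set, epochs=80, device="cuda"):
    """Sketch of the training regime described above; model and dataset objects are assumed."""
    train_loader = DataLoader(train_set, batch_size=2, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=2)
    optimiser = optim.SGD(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    model.to(device)
    best_loss = float("inf")
    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:             # images: (B, 3, 448, 448); targets: (B, 3)
            optimiser.zero_grad()
            loss = criterion(model(images.to(device)), targets.to(device))
            loss.backward()
            optimiser.step()
        model.eval()                                     # track test-set loss to keep the best checkpoint
        with torch.no_grad():
            test_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                            for x, y in test_loader) / len(test_loader)
        if test_loss < best_loss:
            best_loss = test_loss
            torch.save(model.state_dict(), "rvfm_best.pt")
```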
The final training outcomes are presented in Figure 7. At an imaging magnification of 30×, the model achieved R2 values of 0.884, 0.923, and 0.952 and RMSE values of 3.593%, 3.662%, and 4.913% for clay, silt, and sand, respectively. Under 40× magnification, the corresponding R2 values for clay, silt, and sand were 0.931, 0.954, and 0.971, with RMSE values of 2.780%, 2.842%, and 3.789%, respectively. When the magnification was increased to 50×, the fitted R2 values were 0.877, 0.876, and 0.900, while the RMSE values were 3.705%, 4.648%, and 7.063%, respectively. It was observed that the fitting coefficients of all texture detection models exceeded 0.85, and the root mean square errors remained generally within 5%, indicating that the proposed model achieved high-precision detection of soil texture based on microscopic imaging. These results also confirm the feasibility of employing a microscopic camera as a near-field sensing device for soil texture detection. Among all experimental settings, the RVFM model trained at 40× magnification demonstrated the best overall performance, outperforming the other two magnification levels. Because different datasets were created by varying the imaging magnification while capturing images from the same soil surface, the primary distinctions between datasets arose from changes in the field of view caused by differing magnification ratios. Accordingly, it was concluded that soil micrographs obtained at 40× magnification are most suitable for soil texture detection. Compared with 30× magnification, the 40× level provided clearer imaging of fine soil particle textures and edges, facilitating the identification of larger sand particles and improving the estimation of the proportion of smooth surfaces formed by aggregated clay particles. Additionally, although a 50× magnification offers greater precision in localised details, its actual performance proves least effective. This is likely due to the narrower field of view being dominated by larger soil particles, thereby increasing measurement errors when capturing the overall distribution of soil texture information. Consequently, 40× magnification was deemed to achieve the optimal balance between local detail representation and overall texture consistency in the experiments.

3.2. Performance Evaluation Comparison of Different Models

To further assess the performance of the RVFM model proposed in this study, evaluation experiments were conducted on several other models using the 40× dataset under the same training environment and parameter settings as described in Section 3.1. The comparative models included ResNet-50, ViT-16B, Inception v3, and EfficientNet-B0. The detection performance of each model is presented in Table 3. It can be observed that, in the detection tasks involving the three types of soil particles, the RVFM model achieved the highest coefficient of determination, closest to 1, and the lowest root mean square error (RMSE), thereby indicating the best model-fitting performance.
The training loss curves are illustrated in Figure 8. ResNet-50 exhibited strong capability in extracting local features from images, resulting in relatively lower losses in the prediction of clay and silt particles, and Inception v3 also performed well in the prediction of silt particles. The RVFM model developed in this study matched or exceeded ResNet-50 and Inception v3 in predicting the fine fractions (clay and silt) and outperformed all other tested models in predicting the content of the coarse sand fraction, thereby demonstrating the best overall performance.

3.3. Ablation Experiment

To better evaluate the contribution of each module within the RVFM model to the overall performance, ablation experiments were conducted based on the CNN branch of the backbone network. The comparison items included the CNN branch of the complete RVFM model (ResNet-50) and a simple concatenation fusion model combining the CNN and ViT branches (CNN Branch + ViT Branch), in which the cross-attention fusion module was removed. The corresponding results are presented in Table 4. It can be observed that the ResNet-50, as a classical CNN network, exhibited excellent performance on the soil texture microscopic image dataset. Merely concatenating the feature maps of the two branches did not lead to an improvement in model performance. Nevertheless, through the introduction of an additional cross-attention fusion module, more comprehensive feature fusion between the two branches was achieved, thereby enhancing the overall performance of the model.

3.4. Generalisation Testing

To further assess the generalisation capability of the RVFM model, eighteen additional soil samples were collected from Guangzhou, Guangdong Province. Microscopic soil images were captured at identical locations under varying illumination intensities, and the trained model was applied directly to these images for prediction and evaluation. Table 5 presents a comparison of the predictive results across the different lighting conditions. Given the limited sample size, the R2 coefficient was prone to distortion; consequently, RMSE was used as the primary metric for performance evaluation. The results indicate that, for microscopic images, illumination ranging from 1000 lux to 4000 lux has a limited effect on the overall detection error. Taking the 4000 lux intensity as an example, the prediction error line graph is illustrated in Figure 9. The model demonstrates satisfactory fitting performance on the newly acquired soil samples. Although some samples exhibited larger prediction errors, this is likely due to a lack of training data for specific soil types or the fact that microscopic images may not fully represent macroscopic texture characteristics. Nonetheless, the results confirm that the model possesses strong generalisation capability.
The RVFM model employs an advanced dual-branch architecture that effectively synergises the strengths of CNNs in extracting fine-grained, high-dimensional spatial morphology with the capabilities of Transformers in capturing long-range global contextual dependencies. By incorporating cross-attention modules, the model achieves a deep fusion of local granular details with macro-level distribution patterns, significantly enhancing detection accuracy while demonstrating exceptional robustness and reliability across varying light intensities. Nevertheless, this performance gain remains contingent upon the availability of high-quality training data.

4. Conclusions

In conclusion, this research establishes a robust framework for soil texture detection by synergising microscopic imaging with a novel hybrid deep learning architecture. Firstly, we developed the RVFM model, which achieves superior predictive accuracy. Its dual-branch CNN and Transformer architecture integrates information through cross-attention modules, ensuring that both granular-level details and overall distribution patterns are comprehensively considered. Secondly, through a comparative analysis of multiple scales, we determined that 40× magnification provides the optimal balance of spatial resolution and field of view for capturing the critical granular features required for texture discrimination. Experimental results, with R2 values for sand, silt, and clay reaching 0.971, 0.954, and 0.931, respectively, substantiate the model’s precision and its resilience to varying illumination. Our experiments demonstrated the feasibility of using soil microscopic images for soil texture detection by presenting more detailed soil surface information and combining it with deep learning image processing techniques. Future research should further expand the dataset to enhance the feasibility of this method for soil texture detection.

Author Contributions

Conceptualization and methodology, M.P.; validation, W.Z., Z.Z. and X.J.; investigation, W.Z., Z.Z. and X.J.; resources, Y.J.; data curation, W.Z. and Z.Z.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z. and S.W.; visualisation, W.Z.; supervision, M.P.; project administration, C.L.; funding acquisition, L.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant number 2021YFD2000201.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Martín, M.Á.; Pachepsky, Y.A.; García-Gutiérrez, C.; Reyes, M. On soil textural classifications and soil-texture-based estimations. Solid Earth 2018, 9, 159–165. [Google Scholar] [CrossRef]
  2. Hartemink, A.E.; Minasny, B. Towards digital soil morphometrics. Geoderma 2014, 230–231, 305–317. [Google Scholar] [CrossRef]
  3. Ma, R.; Jiang, J.; Ouyang, L.; Yang, Q.; Du, J.; Wu, S.; Qi, L.; Hou, J.; Xing, H. Toward Flexible Soil Texture Detection by Exploiting Deep Spectrum and Texture Coding. Agronomy 2024, 14, 2074. [Google Scholar] [CrossRef]
  4. Wu, K.N.; Zhao, R. Classification of soil texture and its application in China. Acta Pedol. Sin. 2019, 56, 227–241. [Google Scholar]
  5. He, H.L.; Qi, Y.B.; Lu, J.L.; Peng, P.P.; Yan, Z.R.; Tang, Z.X.; Cui, K.; Zhang, K.Y. Development and Proposed Revision of the Chinese Soil Texture Classification System. J. Agric. Resour. Environ. 2023, 40, 501–510. [Google Scholar] [CrossRef]
  6. Ashworth, J.; Keyes, D.; Kirk, R.; Lessard, R. Standard procedure in the hydrometer method for particle size analysis. Commun. Soil Sci. Plant Anal. 2001, 32, 633–642. [Google Scholar] [CrossRef]
  7. Orhan, U.; Kilinc, E.; Albayrak, F.; Aydin, A.; Torun, A. Ultrasound Penetration-Based Digital Soil Texture Analyzer. Arab. J. Sci. Eng. 2022, 47, 10751–10767. [Google Scholar] [CrossRef]
  8. Faé, G.S.; Montes, F.; Bazilevskaya, E.; Añó, R.M.; Kemanian, A.R. Making Soil Particle Size Analysis by Laser Diffraction Compatible with Standard Soil Texture Determination Methods. Soil Sci. Soc. Am. J. 2019, 83, 1244–1252. [Google Scholar] [CrossRef]
  9. Nadporozhskaya, M.; Kovsh, N.; Paolesse, R.; Lvova, L. Recent Advances in Chemical Sensors for Soil Analysis: A Review. Chemosensors 2022, 10, 35. [Google Scholar] [CrossRef]
  10. Nocco, M.A.; Ruark, M.D.; Kucharik, C.J. Apparent electrical conductivity predicts physical properties of coarse soils. Geoderma 2019, 335, 1–11. [Google Scholar] [CrossRef]
  11. Filla, V.A.; Coelho, A.P.; Ferroni, A.D.; Bahia, A.S.R.D.; Júnior, J.M. Estimation of clay content by magnetic susceptibility in tropical soils using linear and nonlinear models. Geoderma 2021, 403, 115371. [Google Scholar] [CrossRef]
  12. Andrade, R.; Silva, S.H.G.; Faria, W.M.; Poggere, G.C.; Barbosa, J.Z.; Guilherme, L.R.G.; Curi, N. Proximal sensing applied to soil texture prediction and mapping in Brazil. Geoderma Reg. 2020, 23, e00321. [Google Scholar] [CrossRef]
  13. Ternikar, C.R.; Gomez, C.; Nagesh Kumar, D. Visible and infrared lab spectroscopy for soil texture classification: Analysis of entire spectra v/s reduced spectra. Remote Sens. Appl. Soc. Environ. 2024, 35, 101242. [Google Scholar] [CrossRef]
  14. Pavão, Q.S.; Ribeiro, P.G.; Maciel, G.P.; Silva, S.H.G.; Araújo, S.R.; Fernandes, A.R.; Demattê, J.A.M.; Souza Filho, P.W.M.E.; Ramos, S.J. Texture prediction of natural soils in the Brazilian Amazon through proximal sensors. Geoderma Reg. 2024, 37, e00813. [Google Scholar] [CrossRef]
  15. Benedet, L.; Faria, W.M.; Silva, S.H.G.; Mancini, M.; Dematte, J.A.M.; Guilherme, L.R.G.; Curi, N. Soil texture prediction using portable X-ray fluorescence spectrometry and visible near-infrared diffuse reflectance spectroscopy. Geoderma 2020, 376, 114553. [Google Scholar] [CrossRef]
  16. Ng, W.; Minasny, B.; Montazerolghaem, M.; Padarian, J.; Ferguson, R.; Bailey, S.; McBratney, A.B. Convolutional neural network for simultaneous prediction of several soil properties using visible/near-infrared, mid-infrared, and their combined spectra. Geoderma 2019, 352, 251–267. [Google Scholar] [CrossRef]
  17. Pan, H.; Liang, J.; Zhao, Y.; Li, F. Facing the 3rd national land survey (cultivated land quality): Soil survey application for soil texture detection based on the high-definition field soil images by using perceptual hashing algorithm (pHash). J. Soils Sediments 2020, 20, 3427–3441. [Google Scholar] [CrossRef]
  18. Barman, U.; Choudhury, R.D. Soil texture classification using multi class support vector machine. Inf. Process. Agric. 2020, 7, 318–332. [Google Scholar] [CrossRef]
  19. Ngu, N.H.; Thanh, N.N.; Duc, T.T.; Non, D.Q.; Thuy An, N.T.; Chotpantarat, S. Active learning-based random forest algorithm used for soil texture classification mapping in Central Vietnam. Catena 2024, 234, 107629. [Google Scholar] [CrossRef]
  20. Azadnia, R.; Jahanbakhshi, A.; Rashidi, S.; Khajehzadeh, M.; Bazyar, P. Developing an automated monitoring system for fast and accurate prediction of soil texture using an image-based deep learning network and machine vision system. Measurement 2022, 190, 110669. [Google Scholar] [CrossRef]
  21. Danchana, K.; Cerdà, V. Design of a portable spectrophotometric system part II: Using a digital microscope as detector. Talanta 2020, 216, 120977. [Google Scholar] [CrossRef]
  22. Sudarsan, B.; Ji, W.; Biswas, A.; Adamchuk, V. Microscope-based computer vision to characterize soil texture and soil organic matter. Biosyst. Eng. 2016, 152, 41–50. [Google Scholar] [CrossRef]
  23. Sudarsan, B.; Ji, W.; Adamchuk, V.; Biswas, A. Characterizing soil particle sizes using wavelet analysis of microscope images. Comput. Electron. Agric. 2018, 148, 217–225. [Google Scholar] [CrossRef]
  24. Qi, L.; Adamchuk, V.; Huang, H.; Leclerc, M.; Jiang, Y.; Biswas, A. Proximal sensing of soil particle sizes using a microscope-based sensor and bag of visual words model. Geoderma 2019, 351, 144–152. [Google Scholar] [CrossRef]
  25. Zhu, L.A.; Li, D.Q.; Wei, X.G.; Zhang, H.H. Analysis of Soil Erodibility Status and Influencing Factors in Guangdong Province. Subtrop. Soil Water Conserv. 2007, 19, 4–7+16. [Google Scholar] [CrossRef]
  26. Dasgupta, S.; Pate, S.; Rathore, D.; Divyanth, L.G.; Das, A.; Nayak, A.; Dey, S.; Biswas, A.; Weindorf, D.C.; Li, B.; et al. Soil fertility prediction using combined USB-microscope based soil image, auxiliary variables, and portable X-ray fluorescence spectrometry. Soil Adv. 2024, 2, 100016. [Google Scholar] [CrossRef]
  27. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks; Association for Computing Machinery: New York, NY, USA, 2017; Volume 60, pp. 84–90. [Google Scholar]
  28. Liu, X.; Feng, H.; Wang, Y.; Li, D.; Zhang, K. Hybrid model of ResNet and transformer for efficient image reconstruction of electromagnetic tomography. Flow Meas. Instrum. 2025, 102, 102843. [Google Scholar] [CrossRef]
  29. Sattar, K.; Maqsood, U.; Hussain, Q.; Majeed, S.; Kaleem, S.; Babar, M.; Qureshi, B. Soil texture analysis using controlled image processing. Smart Agric. Technol. 2024, 9, 100588. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  31. Panda, M.K.; Sharma, A.; Bajpai, V.; Subudhi, B.N.; Thangaraj, V.; Jakhetiya, V. Encoder and decoder network with ResNet-50 and global average feature pooling for local change detection. Comput. Vis. Image Underst. 2022, 222, 103501. [Google Scholar] [CrossRef]
  32. Yang, B.; Wu, J.; Ikeda, K.; Hattori, G.; Sugano, M.; Iwasawa, Y.; Matsuo, Y. Face-mask-aware Facial Expression Recognition based on Face Parsing and Vision Transformer. Pattern Recognit. Lett. 2022, 164, 173–182. [Google Scholar] [CrossRef]
  33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  34. Goceri, E. An efficient network with CNN and transformer blocks for glioma grading and brain tumor classification from MRIs. Expert Syst. Appl. 2025, 268, 126290. [Google Scholar] [CrossRef]
  35. Xing, L.; Jin, H.; Li, H.; Li, Z. Multi-scale vision transformer classification model with self-supervised learning and dilated convolution. Comput. Electr. Eng. 2022, 103, 108270. [Google Scholar] [CrossRef]
Figure 1. Soil sample distribution. (a) Soil sampling areas in Guangdong Province. (b) Number of samples collected in each region.
Figure 2. Soil microscopic image data acquisition platform. (a) Prepared soil surface samples. (b) Schematic diagram of the soil microscopic image capture device.
Figure 3. Distribution of the sample textures on the USDA soil texture triangle.
Figure 4. RVFM Model Structure.
Figure 5. (a) Bottleneck Block structure. (b) Encoder Block structure.
Figure 6. Cross-attention module structure.
Figure 7. Model performance of RVFM on the test set. (a) Scatter plot of predicted and actual soil texture parameters at 30× magnification. (b) Scatter plot of predicted and actual soil texture parameters at 40× magnification. (c) Scatter plot of predicted and actual soil texture parameters at 50× magnification.
Figure 8. Changes in test set loss during training of soil particle models. (a) Changes in loss during clay particle detection. (b) Changes in loss during silt particle detection. (c) Changes in loss during sand particle detection.
Figure 9. Linear error plots for model generalisation testing under 4000 lux illumination intensity. (a) Prediction error plot for clay. (b) Prediction error plot for silt. (c) Prediction error plot for sand.
Table 1. Comparison of imaging effects in different environments (example soil micrographs captured at 30×, 40×, and 50× magnification under 1000, 2000, 3000, and 4000 lux illumination).
Table 2. Statistical data on soil particle content.

Data Type | Soil Particle | Min (%) | Max (%) | Mean (%) | SD (%)
Training (n = 160) | Sand | 0.7 | 86.4 | 35.5 | 22.8
Training (n = 160) | Silt | 1.6 | 56.7 | 33.1 | 13.9
Training (n = 160) | Clay | 11.0 | 61.3 | 31.4 | 10.7
Testing (n = 32) | Sand | 0.7 | 85.6 | 36.5 | 22.7
Testing (n = 32) | Silt | 3.4 | 52.1 | 31.5 | 13.4
Testing (n = 32) | Clay | 11.0 | 51.9 | 32.0 | 10.7
Table 3. Performance evaluation of various models using the 40× magnification soil micrograph dataset.

Model | Soil Particle | R2 | RMSE
ResNet-50 | Sand | 0.952 | 4.875
ResNet-50 | Silt | 0.949 | 2.978
ResNet-50 | Clay | 0.910 | 3.172
ViT-16B | Sand | 0.908 | 6.775
ViT-16B | Silt | 0.899 | 4.194
ViT-16B | Clay | 0.786 | 4.889
Inception v3 | Sand | 0.931 | 5.879
Inception v3 | Silt | 0.942 | 3.191
Inception v3 | Clay | 0.853 | 4.054
EfficientNet-B0 | Sand | 0.923 | 6.214
EfficientNet-B0 | Silt | 0.923 | 3.654
EfficientNet-B0 | Clay | 0.850 | 4.090
RVFM | Sand | 0.971 | 3.789
RVFM | Silt | 0.954 | 2.842
RVFM | Clay | 0.931 | 2.780
Table 4. Experimental study on the ablation of different components of the model.

Model | Soil Particle | R2 | RMSE
CNN Branch | Sand | 0.952 | 4.875
CNN Branch | Silt | 0.949 | 2.978
CNN Branch | Clay | 0.910 | 3.172
CNN Branch + ViT Branch | Sand | 0.895 | 7.221
CNN Branch + ViT Branch | Silt | 0.865 | 4.846
CNN Branch + ViT Branch | Clay | 0.856 | 4.007
RVFM | Sand | 0.971 | 3.789
RVFM | Silt | 0.954 | 2.842
RVFM | Clay | 0.931 | 2.780
Table 5. Comparison of prediction error magnitudes in the RVFM model under different illuminations.

Light Intensity | Soil Particle | RMSE
1000 lux | Sand | 2.746
1000 lux | Silt | 2.389
1000 lux | Clay | 3.000
2000 lux | Sand | 4.343
2000 lux | Silt | 2.512
2000 lux | Clay | 3.497
3000 lux | Sand | 3.451
3000 lux | Silt | 2.205
3000 lux | Clay | 2.767
4000 lux | Sand | 2.467
4000 lux | Silt | 1.900
4000 lux | Clay | 2.409
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
