1. Introduction
In the context of the deep integration of the digital economy and the real economy, digital transformation has emerged as a critical pathway for traditional industries to achieve high-quality development [1,2,3,4,5]. As a major agricultural country, China is accelerating the modernization and green transformation of its agricultural sector. Agribusinesses, as the main drivers of agricultural technological innovation, play a crucial role in promoting the modernization of agriculture. However, compared with manufacturing and high-tech enterprises, the innovation of Chinese agribusinesses faces unique sector-specific constraints. Specifically, their production processes are non-standardized, and the industrial chains are long and fragmented. In addition, they hold insufficient collateral and have weak risk resilience. These inherent characteristics lead to low innovation efficiency and insufficient high-quality innovation output [6]. More significantly, the agricultural sector is facing mounting pressure to pursue green development. Therefore, green innovation has become an urgent imperative for Chinese agribusinesses to balance productivity growth with ecological sustainability.
Existing studies have confirmed that digital transformation can significantly promote firm innovation [7,8,9,10]. However, most of them focus on industrial enterprises, with limited attention paid to the unique characteristics of agricultural enterprises. Unlike industrial firms that implement digital transformation mainly within standardized production and closed supply chains, agribusinesses’ digitalization needs to cover the entire chain from “farm to table” [11], address the high uncertainty of natural production conditions [12,13], and coordinate with tens of thousands of dispersed smallholder farmers [14]. In addition, the existing literature measures digital transformation as a single-dimensional construct by aggregating the frequency of digital keywords in annual reports. This approach obscures the heterogeneous manifestations of digital transformation across different business links. In the context of agribusiness, digital transformation encompasses multiple facets, including production, circulation, traceability, governance, and institutional digitalization. The heterogeneous predictive impacts of these dimensions on green innovation remain unexamined.
Methodologically, most existing studies rely on traditional econometric models such as structural equation modeling or linear regression, which assume linear relationships between variables. However, the relationship between digital transformation and innovation may involve nonlinearities and complex feature interactions [15,16]. Different digital dimensions may have synergistic or threshold predictive effects on environmental innovation that linear models fail to capture. Explainable machine learning algorithms, such as XGBoost, offer advantages in capturing such complexities and identifying key predictive features [17,18,19]. Combining XGBoost with SHAP analysis enables a more nuanced understanding of the heterogeneous predictive effects of different digital dimensions. While recent studies have begun to apply machine learning methods to examine digital transformation, most still treat it as a single aggregate indicator and fail to unpack the heterogeneous predictive contributions of different digital dimensions. Moreover, few studies adopt explainable artificial intelligence techniques to provide actionable insights into the underlying mechanisms, which limits their practical value for both managers and policymakers.
Against this backdrop, this study addresses three core research questions (RQs) as follows.
RQ1. Which digital transformation dimension is the core predictive driver of agribusiness green innovation?
RQ2. How are different digital transformation dimensions heterogeneously associated with agribusiness green innovation?
RQ3. Is there a synergistic interaction between different digital transformation dimensions?
Drawing on the resource-based view and dynamic capabilities theory, these questions correspond to three testable propositions: first, digital resources embedded in different business links differ in their value, rarity, and inimitability, leading to a clear hierarchy of importance; second, the transformation of digital resources into innovation capabilities follows a nonlinear path bounded by absorptive capacity thresholds; and third, cross-dimensional digital resources generate complementarities that produce synergistic effects beyond the sum of their individual contributions.
To answer these questions, this study systematically analyzes the complex relationship between multi-dimensional digital transformation and agribusiness green innovation. Using panel data of Chinese A-share listed agricultural companies from 2011 to 2021, we deconstruct agribusiness digital transformation into five business dimensions and two structural features. We then employ the XGBoost algorithm with SHAP analysis to identify the most critical digital dimensions, uncover their nonlinear predictive patterns, and explore their interaction effects. This study moves beyond the traditional single-dimensional and linear analytical framework and provides a more nuanced understanding of how digital transformation is associated with green innovation in the agribusiness context.
This study makes three contributions to the existing literature. First, it refines the measurement system of agribusiness digital transformation by decomposing it into five distinct dimensions and two structural features. This is in contrast to prior research that typically measures digital transformation as a single aggregate indicator. This decomposition extends the resource-based view by distinguishing heterogeneous digital resources that underpin green innovation. Second, it introduces the XGBoost–SHAP explainable machine learning method to capture nonlinear relationships and threshold predictive effects, thereby expanding the application of dynamic capability theory in agricultural digitalization research. Unlike traditional linear models that impose restrictive assumptions, this approach identifies hidden nonlinear patterns and interactive effects that cannot be detected in conventional analyses. Third, by uncovering the nonlinear characteristics and synergistic interactions among digital dimensions and between digitalization and firm capabilities, this study provides actionable insights for agribusiness managers to prioritize core digital dimensions and for policymakers to design targeted interventions that account for threshold effects and complementarities. By doing this, this study advances both measurement and methodological approaches beyond the existing literature and provides new empirical evidence for institutional theory.
The remainder of this study is structured as follows. Section 2 provides a comprehensive review of the related literature. Section 3 builds a theoretical framework for analyzing the relationship between multi-dimensional digital transformation and agribusiness green innovation. Section 4 describes the data sources, variable definitions, and methodology. The model results and validation are presented in Section 5, and Section 6 discusses the findings and implications. Finally, Section 7 concludes the paper.
4. Data, Methodology and Research Design
4.1. Sample Selection and Data Source
This study selects Chinese A-share listed agricultural companies from 2011 to 2021 as the initial research sample. Because the sample period covers 2020–2021, years that were heavily affected by the COVID-19 pandemic, we conduct a robustness check by excluding observations from 2020–2021 to ensure the reliability of our findings, and the main results remain highly consistent. Corporate annual report textual data for measuring digital transformation are retrieved from the China Securities Information Network (CNINFO). Green patent data for measuring agribusiness green innovation are obtained from the Chinese Research Data Services Platform (CNRDS). Financial and corporate governance data are collected from the China Stock Market & Accounting Research Database (CSMAR) and Wind Database.
4.2. Variable Definitions
4.2.1. Explained Variable: Agribusiness Green Innovation
Drawing on the studies by [20,42], we use the number of green patent applications of enterprises to represent the level of agribusiness green innovation. Given the large number of zero values in the raw green patent application data, this variable is measured via logarithmic transformation of the annual number of enterprise green patent applications, i.e., ln(1 + number of patents), covering invention patents. It is worth noting that this measure primarily captures formal and codified technological innovation and may understate process-based, incremental, and operational green improvements that are common in agricultural practice.
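As a minimal illustration of the ln(1 + number of patents) transformation described above, the sketch below uses hypothetical patent counts and a hypothetical helper name (`green_innovation_score`); it is not the study's code. It shows why the transformation is convenient for zero-heavy count data: a firm–year with no applications maps to exactly zero rather than an undefined logarithm.

```python
import math


def green_innovation_score(patent_count: int) -> float:
    """Return ln(1 + patent_count); zero applications map to 0.0."""
    if patent_count < 0:
        raise ValueError("patent_count must be non-negative")
    return math.log1p(patent_count)


# Hypothetical annual green invention patent application counts.
counts = [0, 1, 4, 20]
scores = [green_innovation_score(c) for c in counts]

for c, s in zip(counts, scores):
    print(f"{c:>3} applications -> {s:.4f}")
```

The transformation is monotonic, so the ranking of firms by patent counts is preserved while the influence of very large counts is compressed.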
4.2.2. Multi-Dimensional Digital Transformation Variables
Referring to the research of [63,85], the level of firm-level digital transformation is measured by the occurrence frequency of keywords associated with digital transformation in the annual reports of Chinese listed agricultural enterprises. Nevertheless, a unidimensional indicator cannot capture the reality that digital transformation is not a monolithic phenomenon but rather a multifaceted process permeating various operational links of agribusiness. Therefore, this study decomposes digital transformation into five business dimensions and two structural features.
(1) Production digitalization (PD)
PD refers to the application of digital technologies in agricultural production processes. Through real-time monitoring and data-driven decision-making, production digitalization optimizes the allocation of agricultural inputs, thereby freeing up internal resources for R&D activities [31,86]. Furthermore, the operational data generated by precision farming delivers actionable insights for the development of eco-friendly agricultural technologies, directly fueling agribusiness green innovation [87].
(2) Circulation digitalization (CD)
CD captures the deployment of digital platforms and technologies across agricultural product distribution and marketing. A defining feature of agriculture is its long industrial chains and high circulation costs, which frequently drive information asymmetry between producers and consumers [88]. By enabling real-time information sharing and precision demand forecasting, circulation digitalization alleviates this asymmetry, allowing agribusinesses to accurately identify unmet green product demand [48,87]. This directly incentivizes targeted green innovation, as firms develop novel products and processes to match consumers’ evolving sustainability preferences [42].
(3) Traceability digitalization (TD)
TD refers to the deployment of traceability technologies across the full farm-to-table cycle of agricultural products, which elevates product transparency. For agribusinesses, a robust traceability system not only ensures compliance with increasingly stringent environmental regulations but also generates strong reputational incentives to adopt greener production practices [89]. The imperative to maintain full traceability records drives process innovations that reduce environmental footprints, while the granular data collected informs the development of novel, high-quality green products.
(4) Governance digitalization (GD)
GD refers to the application of digital technologies that support agribusinesses’ internal management, R&D, and decision-making processes. As the core technological infrastructure for digital operation, GD directly strengthens innovation capacity by streamlining information processing, optimizing knowledge management, and boosting R&D efficiency [90]. It enables firms to identify green innovation opportunities, facilitate cross-team R&D collaboration, and accelerate the development of novel low-environmental-footprint technologies, ultimately fostering both incremental and radical green innovation [91].
(5) Institutional digitalization (ID)
ID captures the digitalization of interactions between agribusinesses and external institutional environments. Agribusinesses face severe financial constraints due to the inherent seasonality of agricultural production and insufficient collateralizable assets, and institutional digitalization eases such constraints by reducing information asymmetry between agribusinesses, financial institutions and government authorities [92]. This expands access to external funding, allowing firms to increase investment in long-term, high-risk green innovation projects that would otherwise suffer from insufficient funding [93].
To ensure the accuracy and validity of the above dimensional measurement, we developed a standardized keyword dictionary through a rigorous expert review process. An initial keyword pool was developed based on authoritative literature and agricultural digitalization application scenarios. Three rounds of independent screening and revision were conducted by three experts in agricultural economics and digital management to ensure accuracy, relevance, and theoretical consistency. The complete keyword dictionary is presented in Table 1.
Beyond these business-specific dimensions that capture the content of digital transformation across operational links, the structural distribution of a firm’s digital investment and implementation also plays a non-negligible role in shaping its innovation outcomes. To fully characterize the multi-faceted nature of agribusiness digital transformation, we further construct the following two structural features to profile the overall pattern of firms’ digital transformation efforts.
(6) Digital transformation scope (DS)
To capture the extensiveness of a firm’s digital transformation across different functional dimensions, we measure digital transformation scope using the number of digital dimensions in which a firm exhibits non-zero keyword frequency as follows:
$$DS_{it} = \sum_{d=1}^{5} I\left(SCORE_{idt} > 0\right)$$
where $SCORE_{idt}$ is the normalized keyword frequency of dimension $d$ for firm $i$ in year $t$, and $I(\cdot)$ is an indicator function that takes the value of 1 if the normalized score is greater than zero, and 0 otherwise. DS ranges from 0 to 5, with higher values indicating that a firm engages in digital transformation across a broader range of business functions.
(7) Digital transformation balance (DB)
To capture the structural distribution of digital transformation across the five dimensions, we measure digital transformation balance using the inverse of the standard deviation across the five dimensions as follows:
$$DB_{it} = \frac{1}{\sigma_{it}}$$
where $\sigma_{it}$ is the standard deviation of keyword frequencies across the five dimensions for firm $i$ in year $t$. This specification ensures that higher values indicate more balanced digital development.
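The two structural features can be sketched in a few lines of Python. This is an illustrative sketch only: the firm scores are hypothetical, the function names are ours, and two details are assumptions not stated in the text, namely the use of the population standard deviation and a small constant `eps` added to the denominator so that a perfectly balanced firm (zero standard deviation) does not cause a division by zero.

```python
from statistics import pstdev

DIMENSIONS = ("PD", "CD", "TD", "GD", "ID")


def digital_scope(scores: dict) -> int:
    """DS: number of dimensions with a strictly positive normalized score."""
    return sum(1 for d in DIMENSIONS if scores[d] > 0)


def digital_balance(scores: dict, eps: float = 1e-6) -> float:
    """DB: inverse of the standard deviation across the five dimensions.

    eps is an assumed guard against division by zero, not taken from the paper.
    """
    sigma = pstdev([scores[d] for d in DIMENSIONS])
    return 1.0 / (sigma + eps)


# Hypothetical normalized scores for two firm-year observations.
firm_a = {"PD": 0.4, "CD": 0.0, "TD": 0.1, "GD": 0.6, "ID": 0.0}
firm_b = {"PD": 0.3, "CD": 0.3, "TD": 0.3, "GD": 0.3, "ID": 0.3}

print(digital_scope(firm_a), digital_balance(firm_a))
print(digital_scope(firm_b), digital_balance(firm_b))
```

In this toy case firm_a is active in three dimensions but unevenly, while firm_b spreads the same kind of effort evenly across all five, so firm_b scores higher on both scope and balance.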
4.3. Data Preprocessing and Cleaning
We conduct full-text extraction and standardized preprocessing of annual reports in Python 3.14, with all procedures fully reproducible. The specific steps are as follows:
Step 1. Noise removal: eliminate page numbers, headers, footers, tables, special symbols, URLs, and redundant blank lines from the raw text;
Step 2. Chinese word segmentation: perform text segmentation using the Jieba library in Python;
Step 3. Stop-word removal: exclude general function words, conjunctions, pronouns, and other semantically empty terms;
Step 4. Text standardization: unify character encoding, consolidate synonymous expressions, and standardize technical terms;
Step 5. Keyword matching: count the frequency of digital-transformation-related keywords using exact whole-word matching;
Step 6. Normalization: apply the Min–Max normalization method to map the scores of all digital dimensions to the [0, 1] interval.
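Steps 5 and 6 above can be illustrated with a short standard-library sketch. Everything concrete here is hypothetical: the mini-dictionary uses English stand-in terms rather than the study's Table 1 keywords, the token lists stand in for reports that have already passed through Steps 1 to 4 (in practice the tokens would come from a segmenter such as Jieba), and normalizing each dimension across all observations is an assumption about the scope of the Min–Max step.

```python
from collections import Counter

# Hypothetical mini-dictionary (English stand-ins); not the study's Table 1.
KEYWORDS = {
    "PD": {"sensor", "drone"},
    "TD": {"traceability", "blockchain"},
}


def count_keywords(tokens, keywords=KEYWORDS):
    """Step 5: count exact whole-token matches for each dimension."""
    counts = Counter(tokens)
    return {dim: sum(counts[w] for w in words) for dim, words in keywords.items()}


def min_max(values):
    """Step 6: map values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical reports, already cleaned, segmented and stop-word-filtered.
reports = {
    "firm_a": ["traceability", "platform", "blockchain", "traceability"],
    "firm_b": ["sensor", "sensor", "platform"],
    "firm_c": ["blockchain", "drone"],
}

raw = {firm: count_keywords(tokens) for firm, tokens in reports.items()}
pd_scaled = min_max([raw[firm]["PD"] for firm in reports])
td_scaled = min_max([raw[firm]["TD"] for firm in reports])
print(raw)
print(pd_scaled, td_scaled)
```

Because matching is done on whole tokens, a keyword is counted only when it appears as a complete segmented term, which mirrors the exact whole-word matching described in Step 5.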
To ensure data reliability, we process the raw data as follows: (1) firms under special treatment (ST and *ST) are excluded; (2) observations with substantial missing values for core variables are removed; and (3) continuous variables are winsorized at the 1% and 99% levels to mitigate the impact of outliers. The final valid sample comprises 155 firms and 1636 firm–year observations.
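The winsorization in step (3) can be sketched as follows. This is a minimal standard-library illustration with made-up data; the percentile interpolation rule (`method="inclusive"`) is an assumption, since the text does not state which rule was used.

```python
from statistics import quantiles


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(..., n=100) returns the 99 percentile cut points P1..P99.
    cuts = quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical variable with one extreme outlier.
values = list(range(1, 101)) + [10_000]
clipped = winsorize(values)
print(min(clipped), max(clipped))
```

Unlike trimming, winsorizing keeps every observation in the sample and only replaces the extreme values with the percentile bounds.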
4.4. Measurement Validity and Limitations
The dimensional division of digital transformation in this study follows the established valuation logic and scenario classification standards in authoritative agricultural digitalization studies, which supports the discriminant validity of the dimensions. Meanwhile, we acknowledge the inherent limitations of the annual report keyword measurement method. While this approach effectively captures the strategic emphasis and information disclosure level of corporate digital transformation, it cannot fully reflect the actual implementation depth, investment scale, or operational efficiency of digitalization. This limitation is common to text-mining-based empirical research. To mitigate potential bias caused by voluntary disclosure, we exclude firm–year observations with zero digital transformation keywords and further conduct a series of robustness checks in Section 5.6 to confirm the stability of our core conclusions.
4.5. Methodology
4.5.1. Explainable Machine Learning Method: XGBoost Algorithm
The eXtreme Gradient Boosting (XGBoost) algorithm proposed by [94] is a scalable machine learning method based on gradient boosting decision trees. XGBoost builds an ensemble of weak learners (decision trees) sequentially, where each new tree corrects the errors of the previous ones by optimizing a regularized objective function. Compared to traditional econometric models, XGBoost can capture nonlinear relationships and complex feature interactions without imposing a pre-specified functional form. The objective function of XGBoost is defined as
$$Obj = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
where $L$ denotes the loss function that measures the difference between the predicted value $\hat{y}_i$ and the true value $y_i$, and $\Omega(f_k)$ represents the regularization term for the $k$-th decision tree to avoid overfitting, and its calculation formula is
$$\Omega\left(f_k\right) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
In the formula, $T$ is the number of leaf nodes of the decision tree; $\gamma$ and $\lambda$ are regularization parameters; and $w_j$ is the weight of the $j$-th leaf node.
The Gain from splitting a decision tree node can be expressed as
$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda}\right] - \gamma$$
where $G_L$ and $H_L$ are the sums of the first derivatives and second derivatives of the loss function for samples in the left leaf node after splitting, respectively, and $G_R$ and $H_R$ are the corresponding sums for the right leaf node. If the Gain is positive, the node splitting is beneficial and will be retained; otherwise, the splitting is abandoned.
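To make the split criterion concrete, the sketch below evaluates the Gain for a toy split, assuming the standard XGBoost form in which each child contributes G²/(H + λ) and the complexity penalty γ is subtracted. The gradient values and the function name `split_gain` are hypothetical; for a squared-error loss of the form ½(y − ŷ)², each sample has first derivative ŷ − y and second derivative 1.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split from summed gradients (G) and Hessians (H)."""

    def score(g, h):
        # Structure score of a node: G^2 / (H + lambda).
        return g * g / (h + lam)

    children = score(g_left, h_left) + score(g_right, h_right)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (children - parent) - gamma


# Toy node with four samples under squared-error loss (Hessian = 1 each):
# the left child holds gradients [-2, -2], the right child holds [2, 2].
gain = split_gain(g_left=-4.0, h_left=2.0, g_right=4.0, h_right=2.0, lam=1.0)
penalized = split_gain(-4.0, 2.0, 4.0, 2.0, lam=1.0, gamma=6.0)

print(f"gain without penalty: {gain:.4f}")
print(f"gain with gamma=6:    {penalized:.4f}")
```

With no complexity penalty the split separates the two groups cleanly and is retained; with a sufficiently large γ the same split yields a negative Gain and would be abandoned.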
4.5.2. Input Features and Hyperparameter Optimization
The input features for the XGBoost model consist of the five business-specific digitalization dimension variables and two structural feature variables, yielding a total of 7 features. We divide the sample into a training set (2011–2018, 1190 observations) and a test set (2019–2021, 446 observations) to evaluate model performance.
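The chronological split described above can be sketched as follows. The rows, field names and helper name are hypothetical placeholders; the only element taken from the text is the cut-off, with 2011–2018 used for training and 2019–2021 held out for testing.

```python
FEATURES = ["PD", "CD", "TD", "GD", "ID", "DS", "DB"]


def chronological_split(rows, last_train_year=2018):
    """Split firm-year rows by year so the test period follows the training period."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


# Hypothetical firm-year rows (feature values omitted for brevity).
rows = [
    {"firm": "A", "year": 2011},
    {"firm": "A", "year": 2018},
    {"firm": "A", "year": 2019},
    {"firm": "B", "year": 2017},
    {"firm": "B", "year": 2021},
]

train, test = chronological_split(rows)
print(len(train), len(test))
```

Splitting by time rather than at random means that no observation from the test years can inform the model during training.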
XGBoost models require extensive parameter tuning, which often involves testing tens of thousands of parameter combinations. Identifying the optimal configuration within such a vast search space is known as hyperparameter optimization, with mainstream methods including grid search and Bayesian optimization. Bayesian optimization is a global optimization method for black-box functions, which uses the results of previously evaluated parameter settings (prior knowledge) to iteratively search for optimal hyperparameters [95]. Compared with grid search, it typically converges to well-performing parameters in fewer iterations, offering higher computational efficiency, and is less likely to become trapped in local optima. As a result, a growing number of studies use Bayesian optimization for hyperparameter tuning.
4.5.3. SHAP for Model Interpretation
To interpret the XGBoost model and uncover the complex relationships between digital transformation dimensions and green innovation, we employ the SHapley Additive exPlanations (SHAP) method proposed by [76]. It provides a unified framework that decomposes the prediction of each instance into additive feature contributions based on Shapley values from cooperative game theory. The core advantage of the SHAP model lies in its ability to quantify feature importance while also revealing the direction, nonlinear patterns, and interaction effects of feature impacts, thus addressing the “black box” problem of many machine learning models. It is important to stress that SHAP values measure each feature’s contribution to the model’s prediction; they do not imply causal relationships. The analysis that follows identifies predictive associations, marginal effects, and nonlinear shapes among variables, rather than causal effects. The basic SHAP formula is
$$g\left(z'\right) = \phi_0 + \sum_{i=1}^{N} \phi_i z'_i$$
where $g$ is the explainable model; $\phi_0$ is the average predicted value of all samples; $\phi_i$ is the SHAP value of the $i$-th feature, representing the marginal contribution of this feature to the prediction result; $z'_i$ is the standardized input value of the feature; and $N$ is the total number of features.
For the interaction effects between features, the SHAP model expands the formula by introducing interaction terms as
$$g\left(z'\right) = \phi_0 + \sum_{i=1}^{N} \phi_i z'_i + \sum_{i=1}^{N} \sum_{j \neq i} \phi_{ij} z'_i z'_j$$
where $\phi_{ij}$ is the SHAP interaction value between the $i$-th feature and the $j$-th feature, reflecting the additional contribution of their combined effect to the prediction result.
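The calculation of the SHAP value for a single feature follows the principle of fair distribution of the Shapley value, with the specific formula
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\left(|F| - |S| - 1\right)!}{|F|!} \left[ f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S\left(x_S\right) \right]$$
where $F$ is the set of all input features; $S$ is any subset that does not include the $i$-th feature; $|S|$ is the number of features in subset $S$; $f_{S \cup \{i\}}(x_{S \cup \{i\}})$ and $f_S(x_S)$ respectively represent the predicted values of the model when the $i$-th feature is included and excluded; and $\frac{|S|!(|F| - |S| - 1)!}{|F|!}$ is the weight coefficient of the feature subset, ensuring that the contribution of each feature is calculated fairly.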
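The Shapley weighting described above can be checked by brute force on a toy example. The sketch below enumerates every subset directly, which is only feasible for a handful of features and is not how SHAP computes values for tree models in practice. The value function `toy_value`, which plays the role of the model prediction for a given feature subset, is entirely hypothetical: GD contributes 2, ID contributes 1, the two together add a synergy of 1, and DS contributes nothing.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating all subsets S of F without i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Hypothetical prediction when only the features in `subset` are known."""
    value = 0.0
    if "GD" in subset:
        value += 2.0
    if "ID" in subset:
        value += 1.0
    if "GD" in subset and "ID" in subset:
        value += 1.0  # synergy between the two features
    return value


features = ["GD", "ID", "DS"]
phi = shapley_values(toy_value, features)
print(phi)
```

The synergy of 1 is shared equally between GD and ID, the uninformative feature DS receives zero, and the three values sum to the difference between the full prediction and the empty-set baseline, which is the additivity property that the basic SHAP formula relies on.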
In this study, SHAP values serve three purposes: (1) feature importance ranking based on the mean absolute SHAP values across all samples; (2) nonlinear effect plots (SHAP dependence plots) that show how the marginal contribution of a feature varies with its value; and (3) interaction effect plots (SHAP interaction values) that reveal how the contribution of one feature depends on the value of another feature. These visualizations allow us to identify whether digital dimensions exhibit diminishing returns, threshold effects, or synergistic interactions, all within a predictive rather than causal framework.
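The first of these uses, ranking features by mean absolute SHAP value, amounts to a simple column-wise aggregation. The sketch below uses a small hypothetical matrix of SHAP values (rows are observations, columns are features) and a hypothetical helper name; it does not reproduce the study's values.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean of |SHAP value| across observations."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three observations and three features.
feature_names = ["GD", "ID", "DS"]
shap_rows = [
    [0.5, -0.2, 0.0],
    [-0.7, 0.1, 0.0],
    [0.6, 0.3, 0.0],
]

ranking = rank_by_mean_abs_shap(shap_rows, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.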
4.5.4. Implementation Details
We implemented all analyses in Python using several well-established libraries. The XGBoost regressor was built with the xgboost package, with the objective function set to reg:squarederror. The evaluation metrics are RMSE, MAE, and R², and the input features include PD, CD, TD, GD, ID, DS, and DB. The dependent variable is agribusiness green innovation, measured as the natural logarithm of one plus the number of green invention patent applications. Model evaluation and data splitting relied on scikit-learn, while Bayesian hyperparameter optimization was conducted with the bayesian-optimization package. LightGBM was used for robustness checks.
Bayesian optimization was adopted to tune XGBoost hyperparameters and avoid overfitting, with the optimal hyperparameters and their search ranges presented in Table 2. With the optimal hyperparameters fixed, the final model was trained on the complete training set and interpreted using the SHAP framework. SHAP values were computed to generate feature importance rankings, nonlinear dependence plots and interaction effect plots, revealing the heterogeneous marginal effects and synergistic interactions among digital transformation dimensions.
5. Results
5.1. Bayesian Hyperparameter Optimization
This study employed Bayesian optimization to identify the optimal XGBoost hyperparameter configuration, balancing model capacity and overfitting risk given the moderate size of our dataset. To strictly prevent data leakage, cross-validation was nested exclusively within the training set (2011–2018, 1190 firm–year observations), while the test set (2019–2021, 446 firm–year observations) was held out entirely until the final model evaluation. The number of iterations for Bayesian optimization was set to 200, with each candidate configuration evaluated by 5-fold cross-validation. The process of adjusting the six hyperparameters using Bayesian optimization is depicted in Figure 1. Table 2 summarizes the search space for all seven hyperparameters and the final optimal values.
Figure 1 presents the performance of key XGBoost hyperparameters under 5-fold cross-validation, including max_depth, learning_rate, n_estimators, subsample, gamma, and min_child_weight. Each subplot maps the relationship between one hyperparameter and the out-of-sample R² score obtained during the search. The red and yellow points identify the configurations that achieved the best cross-validation performance. For several parameters, the R² rises sharply when moving from less favorable regions toward the optimum and then remains relatively flat or declines slightly, a pattern consistent with a well-converged search. The relatively stable exploration near the optimum, particularly for max_depth, learning_rate, and n_estimators, suggests that the Bayesian optimizer successfully identified a region of strong and consistent performance rather than fitting noise.
5.2. Model Validation and Comparison
We adopt root mean squared error (RMSE), mean absolute error (MAE), and R-squared (R²) to evaluate the performance of the XGBoost model. RMSE and MAE quantify prediction accuracy, with lower values indicating smaller deviations between predicted and observed values, while R² measures the model’s explanatory power for agribusiness green innovation, with higher values indicating stronger explanatory ability. The dataset was chronologically split into a training set (2011–2018, 1190 observations) and a test set (2019–2021, 446 observations). In order to verify the reliability and superiority of the Bayesian-optimized XGBoost model, its performance is comprehensively compared with representative mainstream machine learning models, including decision tree (DT), random forest (RF), Elastic Net, Light Gradient-Boosting Machine (LightGBM), and their grid-search-optimized variants. The detailed comparison results are summarized in Table 3.
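For reference, the three evaluation metrics can be computed as in the following standard-library sketch. The observed and predicted values are hypothetical and unrelated to the results in Table 3; the function name is ours.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R-squared for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical observed and predicted ln(1 + patents) values.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 3.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE penalizes large errors more heavily than MAE because errors are squared before averaging, which is why the two are usually reported together.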
Table 3 presents the performance comparison results of all evaluated models. The XGBoost model optimized by Bayesian optimization (XGBoost+BO) achieves the best overall performance across all three metrics, with the lowest RMSE (1.2657), the lowest MAE (0.6521), and the highest R² (0.6562). For readers unfamiliar with machine learning, recall that lower RMSE and MAE values indicate smaller deviations between predicted and observed values, and a higher R² indicates stronger explanatory power of the model for agribusiness green innovation. The linear Elastic Net model yields the lowest R² among all baseline models (0.2762), which is consistent with strong nonlinear relationships between digital transformation and agribusiness green innovation that traditional linear models cannot fully capture. Meanwhile, the grid-search-optimized variants show limited performance improvement compared with their default versions and even underperform the baseline models in some cases (e.g., decision tree+grid). This suggests that Bayesian optimization is more effective in navigating the hyperparameter space and avoiding local optima in this setting, thereby improving model performance and generalization ability. The superior overall performance and robustness of XGBoost+BO support its use as the core benchmark model for the subsequent SHAP-value-based mechanism interpretation and heterogeneity analysis.
5.3. Predictive Drivers of Digital Transformation to Agribusiness Green Innovation
This section tests Proposition 1 by examining the relative importance of different digital transformation dimensions. We use the term “predictive driver” to refer to features that make the largest contribution to the model’s predictive performance for agribusiness green innovation. This term does not imply causal relationships but rather reflects the relative importance of each feature in the XGBoost–SHAP framework.
To assess the relative importance of multi-dimensional digital transformation features, we employ SHAP values, which not only quantify feature importance but also reveal the direction and heterogeneity of their predictive contributions. Figure 2 depicts the predictive significance of multi-dimensional digital transformation features for agribusiness green innovation. Specifically, Figure 2a presents the XGBoost built-in feature importance ranking, where longer bars indicate a greater contribution to model prediction. Figure 2b reports the mean absolute SHAP value ranking, quantifying each feature’s average marginal predictive contribution to green innovation. Figure 2c presents the SHAP beeswarm (honeycomb) plot, where each dot corresponds to a firm–year observation of Chinese A-share listed agribusinesses, mapping the full distribution of SHAP values across the sample. The horizontal axis shows SHAP values, with positive and negative values indicating positive and negative predictive contributions to green innovation, respectively, while the color gradient indicates the value of each feature for that observation.
As illustrated in Figure 2a,b, GD ranks first in both XGBoost feature importance and SHAP total effect, establishing it as the dominant predictive driver. ID consistently ranks second in both importance rankings, emerging as the second most critical feature for predicting agribusiness green innovation. This indicates that the external digital institutional environment, including digital credit, digital agricultural insurance and intelligent subsidy application systems, is a key predictor of agricultural enterprises’ green innovation activities, second only to the enterprise’s internal underlying digital infrastructure.
The remaining features show minor discrepancies in their rankings across the two metrics. This is because the two importance measures have different emphases. The XGBoost Gain metric reflects the average reduction in the training loss achieved by splits on each feature, that is, its contribution to the model’s overall fit, while the mean absolute SHAP value measures each feature’s average marginal contribution to the predicted green innovation output across observations. Specifically, TD contributes more to the model’s overall explanatory power. PD, by contrast, delivers a more stable and widespread contribution to predicted green innovation across the full sample of firms. Both CD and DB register positive but more modest contributions in both rankings. This indicates they play a supportive, secondary role in predicting green innovation relative to core digital infrastructure and production-focused digitalization. Notably, DS ranks lowest in both rankings, with a near-zero mean absolute SHAP value. This suggests that simply expanding the number of digital dimensions has negligible predictive relevance for agricultural enterprises’ green innovation. Meaningful predictive associations only materialize when firms deepen digital implementation in their core business links.
The SHAP beeswarm (honeycomb) plot in Figure 2c further reveals the relationship between the value of each digital transformation dimension and the direction of its SHAP contribution to agribusiness green innovation. The scatter distribution of GD and ID is wider than that of other dimensions, indicating that these two dimensions exhibit a greater range of predictive influence on agribusiness green innovation. In contrast, DS shows a near-zero SHAP total effect, suggesting that simply increasing the number of digital dimensions without depth yields negligible predictive relevance.
5.4. Nonlinear Relationship Between Digital Transformation and Agribusiness Green Innovation
This section tests Proposition 2 by analyzing the nonlinear threshold characteristics of the predictive relationship between digital transformation and green innovation.
Figure 3 illustrates the predictive associations between various dimensions of digital transformation and agribusiness green innovation. The figure presents seven core digital transformation features, with each scatter point representing a sampled agricultural enterprise in our dataset. The red dashed line in each subplot depicts the marginal contribution of the corresponding digital transformation feature to the predicted agribusiness green innovation output measured by SHAP values, which captures the nonlinear changing pattern of the feature’s predictive association as its value increases.
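For readers less familiar with SHAP, the sketch below shows what a per-observation SHAP contribution is by computing exact Shapley values by brute force for a tiny model. It is a simplified illustration: absent features are replaced by a fixed baseline value, which is an assumption made here for clarity and differs from the tree-based estimation used with XGBoost. The function `exact_shapley_values` and the toy model are hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley_values(model, x, baseline):
    """Brute-force Shapley values of `model` at `x` relative to `baseline`."""
    m = len(x)

    def value(subset):
        # Features in `subset` take their observed value; the rest stay at baseline.
        return model([x[k] if k in subset else baseline[k] for k in range(m)])

    phi = []
    for i in range(m):
        others = [k for k in range(m) if k != i]
        contribution = 0.0
        for size in range(m):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for combo in combinations(others, size):
                s = set(combo)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_model(v):
    # Nonlinear in v[0] and with an interaction between v[0] and v[1].
    return v[0] ** 2 + v[0] * v[1]


x = [2.0, 3.0]
baseline = [0.0, 0.0]
phi = exact_shapley_values(toy_model, x, baseline)
print(phi)  # [7.0, 3.0]
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline (10 in this toy case), which is the additivity property that allows SHAP values to be plotted feature by feature as in a dependence plot.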
Among these features, CD and DB show positive contributions that strengthen steadily as the feature value increases. PD exhibits a mild U-shaped pattern in its SHAP contribution. Its marginal contribution turns negative at lower values (Ln(PD) roughly below 0.2), then gradually shifts upward and becomes weakly positive beyond that approximate region. TD presents an L-shaped pattern in its SHAP contribution, with a strong positive contribution at low TD levels, followed by a sharp decline in its marginal contribution as TD increases, and the contribution stabilizes at a low level with slight fluctuations after reaching the trough at Ln(TD) = 0.2. GD features a significant U-shaped pattern in its SHAP contribution, with negative contributions at low-to-medium levels that give way to strongly positive ones once the feature moves into the upper value range (Ln(GD) approximately above 0.25). ID demonstrates an exponentially increasing positive SHAP contribution, with a negligible contribution when Ln(ID) < 0.3, whereas its marginal contribution surges dramatically when the feature value exceeds this threshold. In contrast, DS shows no significant variation in marginal contribution across its full value range, indicating no meaningful linear or non-linear predictive association with agricultural enterprises’ green innovation. These approximate turning points should be read as empirical regularities observed in our sample rather than as exact universal thresholds.
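One simple way to read an approximate turning point off a dependence plot is to average the SHAP values within bins of the feature and look for the first change of sign. The sketch below illustrates this under stated assumptions: equal-width bins, a feature that is not constant, and synthetic U-shaped data constructed to cross zero at 0.25. It is not the procedure used in this study, and the function names are hypothetical.

```python
def binned_mean_curve(x, shap_values, n_bins):
    """Average SHAP value within equal-width bins of the feature value."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins  # assumes the feature is not constant
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for xi, si in zip(x, shap_values):
        idx = min(int((xi - lo) / width), n_bins - 1)
        sums[idx] += si
        counts[idx] += 1
    curve = []
    for b in range(n_bins):
        if counts[b]:
            centre = lo + (b + 0.5) * width
            curve.append((centre, sums[b] / counts[b]))
    return curve


def first_upward_crossing(curve):
    """Feature value where the binned mean SHAP first turns from negative to
    non-negative, located by linear interpolation between bin centres."""
    for (x0, s0), (x1, s1) in zip(curve, curve[1:]):
        if s0 < 0 <= s1:
            return x0 + (x1 - x0) * (-s0) / (s1 - s0)
    return None


# Synthetic U-shaped contributions: minimum at 0.1, zero crossing at 0.25.
xs = [i / 100 for i in range(51)]
toy_shap = [(v - 0.1) ** 2 - 0.0225 for v in xs]

curve = binned_mean_curve(xs, toy_shap, n_bins=10)
turning_point = first_upward_crossing(curve)
print(turning_point)
```

Because the estimate depends on the number of bins and on how densely firms populate each region, such a value is best reported as an approximate range rather than a precise threshold.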
5.5. Interaction Effects Between Features of Digital Transformation
This section tests Proposition 3 by exploring the synergistic interaction effects between different digital transformation dimensions.
Figure 4 shows the SHAP interaction effects of digital transformation features on predicted agribusiness green innovation. In each subplot, the red dashed curve represents the marginal contribution of a single feature to green innovation, while the green curve represents the interactive contribution after incorporating another feature. Observations with high values of the moderating feature are marked in red.
Figure 4a presents the interaction effect between GD and ID. The individual contribution of GD presents a significant U-shaped pattern. When ID enters the interaction, the green curve flattens substantially, staying close to zero across the GD range. This empirical pattern indicates that higher institutional digitalization correlates with the mitigation of GD’s early-stage negative predictive tendency while retaining its long-term positive relevance. High-ID firms (red points) cluster in the high-SHAP region once GD reaches elevated levels, indicating that the model predicts a stronger positive contribution of governance digitalization to green innovation when institutional digitalization is well-developed. Consistent with this predictive pattern, agricultural enterprises with early investment in data-driven decision protocols and digital compliance systems tend to show weaker negative SHAP contributions from GD at low levels and a steadier upward predictive contribution once governance digitization matures.
Figure 4b presents the interaction effect between GD and PD. The individual U-shaped contribution of GD is significantly weakened after interacting with PD, as reflected by the flat green curve near zero. Samples with high PD levels (marked in red) show higher SHAP values in the high GD interval, indicating that higher PD is associated with a pattern where the negative SHAP contribution of GD at low levels is weaker and the positive contribution at high GD levels is stronger. From a predictive perspective, agricultural enterprises with sensor-rich and data-ready production tend to show a smoother GD contribution trajectory: the early downward part of the U-curve is mitigated, and the late upward part is reinforced in the model’s prediction.
Figure 4c presents the interaction effect between ID and PD. The individual contribution of ID shows an exponentially increasing positive trend. After interacting with PD, the green curve is flat and close to zero, indicating that the interaction with PD substantially smooths the exponential growth pattern of ID’s individual contribution. Meanwhile, samples with high PD levels (marked in red) are concentrated in the high SHAP value region in the high ID interval, meaning that high-level production digitalization is associated with a significantly amplified positive contribution of institutional digitalization on green innovation, especially when ID breaks through the critical threshold. This predictive pattern suggests that production digitalization stabilizes ID’s contribution by channeling institutional innovations into green innovation through a feedback loop.
Figure 4d presents the interaction effect between GD and TD. After the interaction with TD, the green curve shows an almost flat trend near the zero line, indicating that the interaction between GD and TD significantly smooths the U-shaped nonlinear fluctuation of GD’s individual contribution. Meanwhile, samples with high TD levels (marked in red) are concentrated in the high SHAP value region when GD is at a high level, which means that the improvement of TD is associated with a stronger positive contribution of GD on green innovation after GD breaks through the critical threshold and a weaker negative contribution of GD at low levels. This predictive pattern suggests that, in the model, higher traceability digitalization tends to coincide with an earlier and stronger upward shift in GD’s SHAP contribution.
Figure 4e presents the interaction effect between GD and DB. After interacting with DB, the green curve is flat and close to the zero line, indicating that the interaction with DB significantly weakens the U-shaped nonlinear contribution pattern of GD’s individual contribution. Meanwhile, samples with high DB levels (marked in red) show higher SHAP values in the high GD interval, indicating that higher digital transformation balance is associated with a pattern where the early negative SHAP contribution of GD is attenuated and the later positive contribution is amplified. Specifically, when GD exceeds the critical threshold, higher DB tends to correspond with a further amplification of the positive contribution of GD on green innovation in the model’s prediction. This predictive pattern suggests that balanced multi-dimensional digitalization helps governance digitalization enter its positive contribution range more smoothly by providing complementary support, rather than operating in isolation.
Figure 4f presents the interaction effect between DB and ID. After interacting with ID, the green curve becomes largely flat and close to the zero line, indicating that the interaction with ID completely changes the inverted U-shaped nonlinear pattern of DB’s individual contribution. Meanwhile, samples with high ID levels (marked in red) are concentrated in the high SHAP value region in the medium and high DB interval, meaning that high-level institutional digitalization is associated with a significantly amplified positive contribution of digital transformation balance on green innovation, especially when DB breaks through the critical threshold.
Overall, the interaction effects between digital transformation features show significant synergistic patterns. Interactions between features from different dimensions can smooth the nonlinear fluctuation of individual features’ contributions, and high levels of the moderating features are associated with significantly amplified positive contributions of core features after they break through the critical threshold. This indicates that the coordinated development of multi-dimensional digital transformation is the key to fully releasing the predictive gains of digitalization for agricultural enterprises’ green innovation.
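The idea of a pairwise SHAP interaction value can be illustrated with a brute-force computation on a toy model. The sketch below uses the Shapley interaction index, in which the joint effect of two features is split equally between the (i, j) and (j, i) entries. As a simplifying assumption, absent features are set to a fixed baseline; the toy model, its coefficients, and the function names are hypothetical and unrelated to the fitted model in this study.

```python
from itertools import combinations
from math import factorial


def coalition_value(model, x, baseline, subset):
    """Model output when features in `subset` take their observed value
    and all other features stay at the baseline."""
    return model([x[k] if k in subset else baseline[k] for k in range(len(x))])


def shapley_interaction(model, x, baseline, i, j):
    """Exact Shapley interaction value between two distinct features i and j."""
    m = len(x)
    others = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(m - size - 2) / (2 * factorial(m - 1))
        for combo in combinations(others, size):
            s = set(combo)
            # Joint effect of i and j beyond the sum of their separate effects.
            delta = (
                coalition_value(model, x, baseline, s | {i, j})
                - coalition_value(model, x, baseline, s | {i})
                - coalition_value(model, x, baseline, s | {j})
                + coalition_value(model, x, baseline, s)
            )
            total += weight * delta
    return total


def toy_model(v):
    # Features 0 and 1 reinforce each other; feature 2 acts on its own.
    return v[0] + v[1] + 2.0 * v[0] * v[1] + v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
inter_01 = shapley_interaction(toy_model, x, baseline, 0, 1)
inter_02 = shapley_interaction(toy_model, x, baseline, 0, 2)
print(inter_01, inter_02)  # 1.0 0.0
```

In the toy model the product term contributes 2 when both features are present; each of the two symmetric interaction entries receives half of it, while the purely additive feature 2 shows no interaction.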
5.6. Robustness Checks
To ensure the reliability of our findings, we conduct three robustness tests. We first replace the original green innovation measure with the logged sum of both green patent applications and grants, namely ln(1 + applications + grants). The model yields a test-set R² of 0.3670, with RMSE and MAE of 0.3898 and 0.2143, respectively. Reassuringly, the SHAP-based feature importance ranking remains highly consistent with the baseline specification. This confirms that the direction, relative importance, and nonlinear patterns of the digitalization dimensions associated with agribusiness green innovation are robust to the choice of green innovation proxy.
The pandemic period introduced substantial operational disruptions. To rule out confounding effects, we drop observations from 2020 and 2021 and re-estimate the model using only the 2011 to 2019 subsample. The overall fit declines substantially: the resulting test-set R² is 0.0496, with RMSE and MAE of 0.9461 and 0.5995, respectively. This decline is expected given the reduced sample size and associated information loss. The considerably weakened predictive power reflects the substantial loss of variation in both digitalization and green innovation measures in this reduced sample rather than a fundamental change in the underlying relationships. The SHAP importance ranking nevertheless stays broadly stable, which indicates that our core inferences are not driven by the pandemic period.
Finally, we check whether the results depend on the choice of the base learner by switching from XGBoost to LightGBM. This alternative model returns a test-set R² of −0.0607 (a negative value means the model fits the test set worse than a constant mean prediction), with RMSE and MAE of 1.0781 and 0.6999, respectively. While the predictive performance is considerably weaker, the SHAP feature importance pattern remains highly consistent with the baseline. This consistency across different algorithms lends further support to the stability of our findings.
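For reference, the three fit statistics reported above can be computed as in the minimal sketch below, which assumes the conventional definition of the coefficient of determination (one minus the ratio of residual to total sum of squares on the evaluation sample). The data are invented solely to show that this quantity is zero for a constant mean prediction and becomes negative when predictions are worse than that.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """R-squared, RMSE and MAE on an evaluation sample."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Invented outcomes and two sets of predictions.
y_true = [0.0, 0.0, 1.0, 3.0]
mean_prediction = [1.0, 1.0, 1.0, 1.0]  # always predict the sample mean
poor_prediction = [2.0, 2.0, 0.0, 1.0]  # systematically off target

metrics_mean = regression_metrics(y_true, mean_prediction)
metrics_poor = regression_metrics(y_true, poor_prediction)
print(metrics_mean["r2"], metrics_poor["r2"])
```

A test-set value close to zero or below it therefore signals that the model adds little or no out-of-sample predictive accuracy over the mean, which is worth keeping in mind when reading SHAP rankings from such a model.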
In conclusion, these checks indicate that our main findings, namely which features matter most and how they relate nonlinearly to green innovation, are not sensitive to measurement choices, sample periods, or the choice of learner, although predictive accuracy itself varies considerably across the three specifications.
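One possible way to quantify how consistent two importance rankings are is Spearman’s rank correlation. The sketch below is an illustration only: the study does not report this statistic, the function name is hypothetical, and the two example orderings are invented rather than taken from the reported results.

```python
def spearman_from_rankings(ranking_a, ranking_b):
    """Spearman's rho for two orderings of the same features (no ties)."""
    if set(ranking_a) != set(ranking_b) or len(ranking_a) != len(set(ranking_a)):
        raise ValueError("rankings must contain the same unique features")
    n = len(ranking_a)
    position_b = {name: rank for rank, name in enumerate(ranking_b)}
    d_squared = sum((rank - position_b[name]) ** 2 for rank, name in enumerate(ranking_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical orderings from most to least important.
baseline_ranking = ["GD", "ID", "TD", "PD", "CD", "DB", "DS"]
alternative_ranking = ["GD", "ID", "PD", "TD", "CD", "DB", "DS"]  # TD and PD swapped

rho = spearman_from_rankings(baseline_ranking, alternative_ranking)
print(round(rho, 4))
```

A value of 1 indicates identical orderings and −1 a fully reversed ordering, so reporting this number alongside the fit statistics would give readers a compact summary of ranking stability.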
6. Discussion
6.1. Representative Features of Digital Transformation for Agribusiness Green Innovation
To answer the first research question (RQ1) and empirically examine Proposition 1, this study employs the XGBoost algorithm with SHAP analysis to unravel the heterogeneous associations of multi-dimensional digital transformation with agribusiness green innovation. It finds that DS has no significant linear or nonlinear association with agricultural enterprises’ green innovation across its full value range. This result indicates that simply expanding the coverage of digital transformation across business links without deepening the implementation of digitalization in specific dimensions is not associated with substantive green innovation performance improvements in agricultural enterprises. Theoretically, this finding qualifies the conventional undifferentiated understanding of digital transformation. It suggests that digital resources differ fundamentally in their association with value creation. Only those with distinct value and limited replicability are associated with sustainable competitive advantages. Superficial expansion of digital coverage cannot achieve this outcome.
This study identifies a clear hierarchy of predictively important digital transformation features for agribusiness green innovation. GD, which reflects enterprises’ underlying digital technology layout including artificial intelligence, big data, and cloud computing, ranks first in both XGBoost feature importance and SHAP total effect. This suggests that a complete underlying digital infrastructure is a key feature that shows a strong predictive association with agricultural enterprises’ capacity to carry out green innovation activities. ID consistently ranks second, indicating that the external digital institutional environment (including digital credit, digital agricultural insurance, and intelligent subsidy application systems) is a critical pillar that is predictively associated with green innovation, second only to internal digital infrastructure. From an institutional theory perspective, digital institutional environments correlate with reduced resource constraints for agricultural firms, thereby showing associations with innovation-related activities. TD and PD, which are deeply embedded in the core production and operation links of agricultural enterprises, also show strong explanatory power for green innovation. This aligns with the unique characteristics of agricultural enterprises, whose green innovation activities are highly dependent on the optimization of production links and the improvement of quality and safety management systems. Digital empowerment in production and traceability links is predictively associated with lower resource consumption and pollutant emissions in the production process and with enterprises’ compliance with environmental regulation and green certification requirements, which is consistent with its strong predictive relevance for green innovation output. In addition, CD and DB show positive but modest predictive associations, acting as important supplementary features for green innovation. In particular, the stable predictive pattern of DB indicates that balanced digital resource allocation is persistently associated with favorable green innovation performance.
It is important to note that the hierarchical structure identified in this study is based on a sample of Chinese A-share listed agricultural enterprises. These enterprises typically possess relatively abundant capital and technical resources to support the independent development of underlying digital infrastructure. For resource-constrained small- and medium-sized agricultural enterprises and family farms that dominate global agricultural production, as well as agricultural enterprises in emerging economies with less developed digital infrastructure, the relative importance of different digital dimensions may shift significantly. In these contexts, external digital institutional resources provided by governments or third-party platforms may play a more dominant role than internal governance digitalization. They allow small-scale operators to access digital capabilities without bearing prohibitive upfront costs. This contextual boundary highlights the need for differentiated digital transformation strategies tailored to the resource endowments of different types of agricultural operators.
6.2. Impact of Multi-Dimensional Digital Transformation on Agribusiness Green Innovation
6.2.1. Individual Nonlinear Effects
Unlike traditional econometric models that pre-assume a linear nexus between digital transformation and green innovation, this study systematically captures the complex nonlinear patterns and heterogeneous threshold characteristics of multi-dimensional digital transformation on agribusiness green innovation via SHAP dependence plots. This analysis empirically addresses the second research question (RQ2) and offers empirical support for Proposition 2.
Across the five business dimensions and two structural features of digital transformation examined, we observe pronounced heterogeneity in nonlinear predictive patterns. For core dimensions closely tied to enterprises’ underlying digital infrastructure and external institutional environment, i.e., GD and ID, we identify significant threshold effects. GD exhibits a robust U-shaped predictive pattern with an approximate turning point around Ln(GD) = 0.25. Its contribution to predicted green innovation is negative at low-to-medium levels, plausibly because high upfront investment in AI, big data and cloud computing crowds out green R&D resources, yet it turns strongly positive once the underlying digital system is fully built and integrated into operations. This U-shaped trajectory reflects a common pattern of technology adoption. Early-stage investment in digital infrastructure initially strains resources without immediate returns. Once a critical mass of digital capabilities is accumulated, the enterprise may become capable of applying these technologies in ways that are predictively associated with green innovation. It should be noted that the threshold value observed here applies primarily to medium and large listed agricultural enterprises with relatively sufficient capital reserves. Smaller operators with tighter financial constraints may face higher thresholds or even be unable to cross this initial investment barrier.
Meanwhile, ID presents an exponentially increasing positive nonlinear predictive pattern with a threshold at Ln(ID) = 0.3. Its marginal predictive contribution increases sharply only when the digital institutional system (covering digital credit, agricultural insurance and intelligent subsidy schemes) is sufficiently developed to unlock its network effects, which is associated with alleviation of the financing constraints and operational risks facing agricultural enterprises’ green innovation activities. This pattern suggests that the enabling role of external support systems becomes particularly pronounced only after reaching a certain scale and level of maturity. In regions with less developed digital public infrastructure, this threshold may be higher and the network effects may take longer to materialize.
For PD and TD, which are deeply embedded in agricultural enterprises’ core production and operation links, nonlinear effects also feature clear thresholds but with distinct patterns. PD has a mild U-shaped predictive pattern, with an initial negative predictive association below Ln(PD) = 0.2 driven by the high upfront asset-specific investment in smart agricultural production equipment, followed by a weak positive predictive association as the scale effects of resource conservation and production efficiency improvement materialize. This trajectory reflects the time needed for firms to integrate and reconfigure production-related digital resources into functional innovation capabilities. TD displays an L-shaped predictive pattern. It shows a strong initial positive predictive association by enabling enterprises to quickly meet environmental regulations and obtain green certification. Its marginal benefit diminishes sharply after Ln(TD) = 0.2 once basic regulatory compliance requirements are fulfilled.
In contrast, circulation digitalization (CD) and digital transformation balance (DB) demonstrate a continuously and steadily strengthening positive predictive pattern without obvious thresholds. They are associated with stable incremental benefits for green innovation throughout the whole process and show no evidence of the crowding-out effect caused by high upfront investment.
These heterogeneous nonlinear patterns highlight the importance of avoiding one-size-fits-all digital transformation strategies. It is important to note that the specific threshold values observed in this study are context-dependent. They may vary across countries with different levels of digital development and across enterprises with different resource endowments.
6.2.2. Interaction and Synergistic Effects
To answer the third research question (RQ3) and empirically examine Proposition 3, this study further reveals the significant synergistic interaction patterns between different digital transformation dimensions. The results show that the interaction between core digital transformation dimensions is associated with a smoothing of the nonlinear fluctuation of individual features’ predictive patterns. Specifically, the interaction between GD and ID is associated with a weakening of the U-shaped pattern of GD’s individual effect. High ID levels are associated with both a stronger positive predictive contribution of GD after the threshold and a less negative predictive association of GD at low levels. Similar synergistic patterns are observed in the interaction between GD and PD, as well as between ID and PD. These interaction patterns are consistent with a fundamental principle of digital resource deployment. When multiple digital resources are developed in tandem, they are associated with combined predictive benefits that appear to exceed the sum of their individual contributions. This appears to lower the threshold each dimension must cross before showing positive predictive returns.
In addition, DB shows a positive moderating pattern in the relationship between GD and green innovation. High-level DB is associated with an amplification of the positive predictive contribution of GD after the threshold, while the U-shaped pattern of GD is less pronounced under low DB levels. This finding further suggests that the coordinated development of multi-dimensional digital transformation is importantly associated with unlocking the predictive contribution of digitalization to agricultural enterprises’ green innovation. Single-dimensional digital transformation, even in the core governance dimension, is associated with higher threshold constraints and less stable innovation-related patterns. The results suggest that only through the coordinated development of multiple dimensions can agricultural enterprises more quickly show a positive association between digital transformation and green innovation, forming a more consistent predictive pattern for green innovation.
It is worth noting that the strength of these synergistic effects is also context-dependent. For enterprises with extremely limited resources, simultaneous investment in multiple digital dimensions may exacerbate resource constraints and delay the emergence of innovation dividends. In such cases, a phased development approach that first builds basic institutional digital capabilities may be more practical.
6.3. Implications
These findings provide clear and actionable practical implications for agricultural enterprises, managers, and policymakers. While the results are derived from Chinese listed agricultural firms, the core logic underlying these findings is that different digital dimensions play hierarchically distinct roles and interact synergistically. This logic offers reference value for agribusinesses in other emerging economies facing similar digitalization challenges.
For agricultural enterprises, three implications follow. First, priority should be given to advancing governance digitalization and institutional digitalization, which show the strongest predictive associations with green innovation. Enterprises should abandon the “wide but shallow” symbolic digital transformation strategy and shift from simply expanding the coverage of digitalization to deepening its implementation in core business dimensions. This means concentrating on the construction of underlying digital infrastructure and the deep integration of digitalization in core production and traceability links, so as to avoid the dispersion of limited resources caused by the blind layout of multiple digital dimensions. Concretely, enterprises should build AI-driven decision systems and cloud-based data platforms for governance digitalization and deploy IoT sensors and precision irrigation equipment in production rather than spreading limited budgets thinly across all digital dimensions. Second, enterprises should fully recognize the threshold effects of digital transformation. For core dimensions such as governance and institutional digitalization, they should maintain sustained investment to break through the critical threshold so as to release the innovation dividend of digitalization. Third, enterprises should pay attention to the balanced allocation of digital resources and the synergistic development of multi-dimensional digitalization. While focusing on the construction of core digital dimensions, they should also promote the coordinated development of digitalization in circulation, institutional and other links to amplify the driving effect of core digital dimensions on green innovation through synergistic effects. For managers, the nonlinear and threshold effects identified in this study indicate that enterprises should focus on deepening digitalization in key business links rather than blindly expanding the scope of digital transformation.
For policymakers, three implications follow. First, improving the digital institutional environment, including digital credit, digital agricultural insurance, and intelligent subsidy systems, can effectively support enterprises’ green innovation activities and help them cross critical digitalization thresholds. In operational terms, this means accelerating the rollout of digital credit products tailored to agribusiness cash-flow cycles, expanding pilot programs for digital agricultural insurance, and streamlining intelligent subsidy application platforms for faster disbursement. Second, policymakers should formulate differentiated support policies for agricultural enterprises’ digital transformation. For small- and medium-sized agricultural enterprises with an insufficient digital foundation, governments can provide targeted subsidies for purchasing cloud computing services and sensor equipment, alongside tax rebates for early-stage digital infrastructure investment, to alleviate the crowding-out effect on green R&D. For enterprises with a certain digital foundation, policies should guide them toward integrating digital tools into core operational processes, such as linking traceability systems with green certification schemes and embedding AI analytics in production planning, to translate digital investment into measurable green outcomes. Third, policymakers should build an open sharing platform for agricultural digital technology, promote the flow and sharing of digital technology knowledge between enterprises, and help agricultural enterprises form a collaborative development pattern of multi-dimensional digital transformation.
7. Conclusions
Digital transformation has become a core factor associated with agribusiness green innovation and a key strategic path for China’s agricultural sector to achieve low-carbon transformation, high-quality development and rural revitalization. Existing studies mostly treat digital transformation as a single aggregate variable, overlooking the heterogeneous associations of digitalization across different business links of agricultural enterprises, and traditional linear econometric models are limited in capturing the complex nonlinear patterns and synergistic interaction patterns between digital transformation and agribusiness green innovation. To fill these research gaps, this study takes 2011–2021 Chinese A-share listed agricultural companies as the research sample, decomposes digital transformation into five business dimensions (production, circulation, traceability, governance, and institutional digitalization) and two structural features (transformation scope and balance), and systematically examines the associations of multi-dimensional digital transformation on agribusiness green innovation through an explainable machine learning framework integrating Bayesian optimization, the XGBoost algorithm and the SHAP method.
The core findings of this study are threefold. First, the “wide but shallow” symbolic digital strategy that merely expands the number of digital dimensions is not associated with substantive green innovation performance for agricultural enterprises. The predictively important factors of agribusiness green innovation present a clear hierarchical structure. Governance digitalization is the most important predictive driver, followed by institutional digitalization, while traceability and production digitalization embedded in core production and operation links are also important predictive factors, and circulation digitalization and digital transformation balance play a positive supplementary role. Second, different dimensions of digital transformation show significant heterogeneous nonlinear patterns and clear threshold characteristics in their associations with agribusiness green innovation. Governance digitalization presents a significant U-shaped predictive pattern. Institutional digitalization shows an exponentially increasing positive nonlinear pattern with prominent network effects. Production digitalization has a mild U-shaped predictive pattern. Traceability digitalization presents an L-shaped pattern with diminishing predictive contributions, while circulation digitalization and digital transformation balance show a steadily strengthening positive pattern without obvious threshold constraints. Third, there are significant synergistic interaction patterns between different digital transformation dimensions. The interaction between core dimensions is associated with smoothing of the nonlinear fluctuation of individual patterns, and the balanced development of multi-dimensional digitalization is associated with a positive moderating effect on the predictive contributions of core digital dimensions to green innovation, which is importantly associated with unlocking the innovation potential of digital transformation. 
It should be noted that the XGBoost–SHAP framework identifies predictive associations rather than causal relationships. The observed nonlinear patterns and interaction patterns reflect the model’s estimation of feature contributions and should be interpreted with due caution.
Notwithstanding its findings, this study has several limitations. First, the sample is limited to Chinese A-share listed agricultural enterprises, which hold advantages in capital, resources and policy support over the small- and medium-sized agricultural enterprises and family farms that dominate China’s agricultural sector. The conclusions’ applicability to small and micro-operators with weak digital foundations thus needs further verification, and future research can expand the sample to explore heterogeneous associations across agricultural operators of different types and scales. Second, this study measures multi-dimensional digital transformation through annual report keyword frequency, a mainstream method that cannot fully capture digital transformation’s actual implementation, input–output efficiency and application depth. Future research can optimize this measurement system by combining field survey data, digital investment and operation indicators to more accurately identify digitalization’s associations with agribusiness green innovation. Third, green patent applications, while widely used, mainly capture formal technological innovation and may overlook process-oriented improvements and informal green practices that are common in the agricultural sector. Future research could therefore incorporate a broader set of indicators, such as green product certifications, the adoption of resource-saving production techniques, and environmental compliance records, to provide a more comprehensive assessment of agribusiness green innovation. Fourth, given the observational nature of the data, potential endogeneity concerns cannot be fully ruled out. Future research could employ quasi-experimental designs or instrumental variable approaches to further strengthen causal identification.
Fifth, while the explainable machine learning framework (XGBoost–SHAP) adopted in this study provides rich insights into heterogeneous patterns, the results should be interpreted as correlational rather than causal evidence. Future research could combine quasi-experimental designs with explainable machine learning to strengthen causal identification.