5.1. Quasi-Natural Experimental Design and Model Construction
This paper explores how the water resource fee-to-tax pilot policy affects mission-driven green energy-saving technological innovations, vision-driven green energy-saving technological innovations, green energy-saving management innovations, and the synergistic development of these three types of green energy-saving innovation in manufacturing enterprises. Causal inference is carried out with a double machine learning model. Machine learning methods handle high-dimensional data more adeptly than traditional statistical methods such as linear regression estimated by OLS, but here they are not used for prediction as an end in itself; rather, they fit the high-dimensional control variables flexibly so that the coefficient on the core explanatory variable can be estimated without bias. The model also imposes no strict assumptions on the data-generating process. Instead, it learns the functional form linking the variables from the data, which reduces the risk of model specification bias.
Since its formal introduction by Chernozhukov et al. in 2018 [55], research on the double machine learning (DML) method has primarily bifurcated into two directions. One stream of research focuses on leveraging DML to evaluate economic causal relationships. For instance, Yang et al. (2020) [56] employed Gradient Boosting to assess the effects of audit firms, demonstrating the superiority of their approach over propensity score matching. Zhang et al. (2022) [57] analyzed the multifaceted impacts of nocturnal subway services on London’s economy, while Farbmacher et al. (2022) [58] combined causal mediation analysis to explore the role of health insurance in youth health. Conversely, another line of inquiry prioritizes methodological innovations, such as Chiang et al.’s (2022) [59] multidirectional cross-fitting technique, which bolsters the robustness of high-dimensional data analysis, and Bodory et al.’s (2022) [60] integration of dynamic analysis, broadening the model’s applicability in dynamic contexts.
Compared to conventional causal inference models, DML offers distinct advantages in variable selection and model estimation, making it particularly suitable for the research question at hand. On the one hand, corporate green energy-saving management innovation and technological innovation are composite indicators influenced by myriad factors within the socioeconomic context. To estimate the policy effects accurately, the potential confounding effects of these other factors must be accounted for comprehensively. Traditional regression models, however, may struggle with the “curse of dimensionality” and multicollinearity when dealing with high-dimensional control variables. DML, using various machine learning algorithms and regularization techniques, automatically selects an effective subset of control variables from a preselected high-dimensional set, which improves prediction accuracy. This process alleviates the “curse of dimensionality” associated with redundant controls and mitigates the bias that arises from restricting the model to a limited set of primary controls. On the other hand, nonlinear relationships among variables are prevalent, and linear regression models can introduce specification bias and yield less robust estimates. DML harnesses the strengths of machine learning algorithms in handling nonlinear data, which helps avoid model misspecification, as highlighted by Yang et al. (2020) [56]. Additionally, through the use of instrumental variable functions, two-step prediction error regressions, and sample-splitting fitting strategies, DML addresses the “regularization bias” inherent in machine learning estimations, so that the treatment effect estimator remains approximately unbiased in finite samples.
Based on the foundational principles of double machine learning, this paper constructs the partially linear Model (1) as follows:
Model (1):
$$Y_i = \theta_0 D_i + g(X_i) + U_i \tag{1}$$
$$E[U_i \mid D_i, X_i] = 0 \tag{2}$$
where $D_i$ is the core explanatory variable of interest, representing the treatment variable of the water resource fee-to-tax policy pilot, making $\theta_0$ the estimation parameter of interest and the basis for inference in this study. $X_i$ denotes a series of high-dimensional control variables, which relate to the dependent variable of the corporate green energy-saving innovation “triple drive” system ($Y_i$) and to $D_i$ in a high-dimensional, complex nonlinear form through the unknown function $g(\cdot)$. To investigate the differential impacts of the water resource fee-to-tax policy on the three types of green energy-saving innovation, mission-driven green energy-saving technological innovation, vision-driven green energy-saving technological innovation, and green energy-saving management innovation are each set as the dependent variable $Y_i$ in a separate partially linear regression. $U_i$ denotes the error term with a conditional mean of zero. The direct estimation of Equation (1) yields the estimator for the treatment effect:
$$\hat{\theta}_0 = \left(\frac{1}{n}\sum_{i=1}^{n} D_i^{2}\right)^{-1} \frac{1}{n}\sum_{i=1}^{n} D_i\left(Y_i - \hat{g}(X_i)\right) \tag{3}$$
where $n$ denotes the sample size.
Based on this estimator, its estimation bias can be examined further:
$$\sqrt{n}\left(\hat{\theta}_0 - \theta_0\right) = \underbrace{\left(\frac{1}{n}\sum_{i=1}^{n} D_i^{2}\right)^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^{n} D_i U_i}_{a} + \underbrace{\left(\frac{1}{n}\sum_{i=1}^{n} D_i^{2}\right)^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^{n} D_i\left[g(X_i) - \hat{g}(X_i)\right]}_{b} \tag{4}$$
In this context, $a$ is asymptotically normal with a mean of zero, and $b$ collects the error made in estimating $g$. It is noteworthy that the double machine learning approach employs machine learning and regularization algorithms to estimate the specific functional form $\hat{g}(X_i)$, which inevitably introduces “regularization bias”. While regularization prevents excessive variance in the estimator, it also renders it biased: $\hat{g}$ converges towards $g$ only slowly, at a rate $n^{-\varphi_g}$ with $n^{-\varphi_g} > n^{-1/2}$. Consequently, as $n$ approaches infinity, $b$, which is of order $n^{1/2 - \varphi_g}$, also tends towards infinity, impeding the convergence of $\hat{\theta}_0$ to $\theta_0$.
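To speed up convergence and reduce the bias of the treatment effect estimator in finite samples, an auxiliary regression of the treatment variable $D_i$ on the control variables $X_i$ is added to Model (1):
$$D_i = m(X_i) + V_i, \qquad E[V_i \mid X_i] = 0 \tag{5}$$
Here, $m(X_i)$ denotes the regression function of the treatment variable on the high-dimensional control variables, whose form $\hat{m}(X_i)$ also has to be estimated with machine learning algorithms, and $V_i$ is the error term with a conditional mean of zero.
The operational procedure is as follows: first, estimate the auxiliary regression $\hat{m}(X_i)$ with a machine learning algorithm and obtain the residuals $\hat{V}_i = D_i - \hat{m}(X_i)$; second, apply a machine learning algorithm to estimate $\hat{g}(X_i)$ and rewrite the main regression equation as $Y_i - \hat{g}(X_i) = \theta_0 D_i + U_i$; finally, use $\hat{V}_i$ as an instrumental variable for $D_i$ and obtain the debiased coefficient estimator:
$$\check{\theta}_0 = \left(\frac{1}{n}\sum_{i=1}^{n} \hat{V}_i D_i\right)^{-1} \frac{1}{n}\sum_{i=1}^{n} \hat{V}_i\left(Y_i - \hat{g}(X_i)\right) \tag{6}$$
Similarly, Equation (6) can be approximated as
$$\sqrt{n}\left(\check{\theta}_0 - \theta_0\right) \approx \underbrace{\left[E\left(V_i^{2}\right)\right]^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^{n} V_i U_i}_{a^{*}} + \underbrace{\left[E\left(V_i^{2}\right)\right]^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left[m(X_i) - \hat{m}(X_i)\right]\left[g(X_i) - \hat{g}(X_i)\right]}_{b^{*}} \tag{7}$$
where $a^{*}$ is asymptotically normal with a mean of zero. Given that machine learning is applied twice, the overall convergence rate of $b^{*}$ depends on the rates at which $\hat{m}$ approaches $m$ and $\hat{g}$ approaches $g$, specifically $n^{-(\varphi_g + \varphi_m)}$. Compared to $b$ in Equation (4), $b^{*}$ converges to zero at a faster pace (it vanishes whenever $\varphi_g + \varphi_m > 1/2$), which makes it possible to obtain an unbiased estimator of the treatment effect coefficient.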
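The cross-fitted procedure can be illustrated with a minimal, self-contained Python sketch on simulated data. It is not the estimation code used in this study. It assumes a single control variable on [0, 1), swaps the Random Forest learner for a simple bin-average regression (`fit_bin_means`, a hypothetical helper), and uses the partialling-out variant of DML, in which both the outcome and the treatment are residualized on the controls before a residual-on-residual regression. All function names and simulation parameters are illustrative assumptions.

```python
import math
import random


def fit_bin_means(xs, ys, n_bins=20):
    """Stand-in nuisance learner: piecewise-constant regression on [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def dml_partial_linear(y, d, x, n_folds=2, seed=0):
    """Cross-fitted estimate of theta in Y = theta*D + g(X) + U."""
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    res_y, res_d = [0.0] * n, [0.0] * n
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        # Nuisance functions are fitted on the other folds only.
        l_hat = fit_bin_means([x[i] for i in train], [y[i] for i in train])
        m_hat = fit_bin_means([x[i] for i in train], [d[i] for i in train])
        for i in fold:
            res_y[i] = y[i] - l_hat(x[i])  # outcome residual
            res_d[i] = d[i] - m_hat(x[i])  # treatment residual
    svv = sum(v * v for v in res_d)
    theta = sum(v * r for v, r in zip(res_d, res_y)) / svv
    # Heteroskedasticity-robust standard error of the residual regression.
    meat = sum((v * (r - theta * v)) ** 2 for v, r in zip(res_d, res_y))
    return theta, math.sqrt(meat) / svv


def simulate(n=4000, theta=0.5, seed=1):
    """Toy data in which X confounds D and Y nonlinearly."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    d = [math.sin(2 * math.pi * xi) + rng.gauss(0, 1) for xi in x]
    y = [theta * di + 2 * math.sin(2 * math.pi * xi) + rng.gauss(0, 1)
         for di, xi in zip(d, x)]
    return y, d, x


y, d, x = simulate()
theta_hat, se = dml_partial_linear(y, d, x)

# Naive slope of Y on D without controls, for comparison.
d_bar, y_bar = sum(d) / len(d), sum(y) / len(y)
naive = (sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
         / sum((di - d_bar) ** 2 for di in d))
print(f"DML: {theta_hat:.3f} (se {se:.3f}), naive: {naive:.3f}")
```

In this toy setup, the naive slope absorbs the confounding that runs through the control variable, whereas the cross-fitted estimate should lie close to the simulated true value of 0.5.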
Next, this study explores the mechanism pathways “water resource fee-to-tax → green energy-saving management innovations → mission-driven green energy-saving technological innovations” and “water resource fee-to-tax → green energy-saving management innovations → vision-driven green energy-saving technological innovations”. To this end, a mediating effect model is first constructed with a double machine learning stepwise regression method (Model (2)):
Model (2):
$$Y_i = \theta_0 D_i + g_1(X_i) + U_{1i}$$
$$M_i = \alpha D_i + g_2(X_i) + U_{2i}$$
$$Y_i = \theta' D_i + \beta M_i + g_3(X_i) + U_{3i}$$
where $D_i$ is the policy treatment variable, $M_i$ is green energy-saving management innovation (the mechanism variable), $Y_i$ is either mission-driven or vision-driven green energy-saving technological innovation, and $X_i$ denotes the control variables. The first equation estimates the total effect of the policy treatment variable on the dependent variables, evaluating the overall impact of the water resource fee-to-tax policy on green energy-saving technological innovations. The second and third equations estimate the direct effect ($\theta'$) and the indirect effect (through $\alpha$ and $\beta$) along the pathway “$D \to M \to Y$” for the mission-driven and the vision-driven outcome, respectively. These equations measure the extent to which green energy-saving management innovations mediate the relationship between the policy and both mission-driven and vision-driven green energy-saving technological innovations.
To address the potential issues in parameter estimation that may arise with the three-step mediation effect model and to explore the effectiveness and scalability of the water resource fee-to-tax policy within a counterfactual framework, a causal mediation effect model (Model (3)) was constructed in this study. This model further investigated the aforementioned mechanism pathways as follows:
Model (3):
$$\theta(1) = E\left[Y(1, M(1)) - Y(0, M(1))\right]$$
$$\theta(0) = E\left[Y(1, M(0)) - Y(0, M(0))\right]$$
$$\delta(1) = E\left[Y(1, M(1)) - Y(1, M(0))\right]$$
$$\delta(0) = E\left[Y(0, M(1)) - Y(0, M(0))\right]$$
where $Y$ represents the outcome variable (mission-driven or vision-driven green energy-saving technological innovation), which is influenced by both the policy treatment variable $D$ and the mechanism variable $M$. Since $M$ is also affected by $D$, the potential outcome is written as $Y(d, M(d'))$, which leads to four equations. $\theta(1)$ and $\theta(0)$ represent the direct effects for the treatment group and control group, respectively. The former indicates the difference between the actual situation of the treatment group ($Y(1, M(1))$) and the scenario where the treatment group individuals are not subjected to the policy intervention but the mechanism variable is still influenced by the policy ($Y(0, M(1))$). The latter represents the difference between the scenario where the control group individuals are exposed to the policy intervention but the mechanism variable is not affected by the policy ($Y(1, M(0))$) and the actual situation of the control group ($Y(0, M(0))$). The magnitude of this result determines whether the water resource fee-to-tax policy can bypass the mechanism variable (green energy-saving management innovations) and directly affect the outcome variables (mission-driven and vision-driven green energy-saving technological innovations). This result also indicates whether this effect will remain when the water resource fee-to-tax policy is extended to non-pilot areas. Similarly, $\delta(1)$ and $\delta(0)$ capture the indirect effects for the treatment and control groups.
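The four effects can be made concrete with a small simulation in which every unit's potential mediators and potential outcomes are known. Writing Y(d, M(d')) for the outcome under policy status d when the mechanism variable takes the value it would have under status d', the Python sketch below computes both direct and both indirect effects. It is purely illustrative: the linear structural equations with a treatment-mediator interaction, the coefficient values, and the names `potential_mediator`, `potential_outcome`, and `avg_outcome` are assumptions chosen for the example, not the estimator used in this study, which has to identify these quantities from observed data.

```python
import random

rng = random.Random(0)
n = 10_000
# One pair of noise draws per unit, shared across all counterfactual scenarios.
noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


def potential_mediator(d, e_m):
    """Hypothetical mediator equation: M(d) = 0.8*d + noise."""
    return 0.8 * d + e_m


def potential_outcome(d, m, e_y):
    """Hypothetical outcome equation with a treatment-mediator interaction."""
    return 1.0 * d + 0.5 * m + 0.3 * d * m + e_y


def avg_outcome(d, d_med):
    """Sample analogue of E[Y(d, M(d_med))]."""
    return sum(potential_outcome(d, potential_mediator(d_med, e_m), e_y)
               for e_m, e_y in noise) / n


direct_treated = avg_outcome(1, 1) - avg_outcome(0, 1)    # theta(1)
direct_control = avg_outcome(1, 0) - avg_outcome(0, 0)    # theta(0)
indirect_treated = avg_outcome(1, 1) - avg_outcome(1, 0)  # delta(1)
indirect_control = avg_outcome(0, 1) - avg_outcome(0, 0)  # delta(0)
total = avg_outcome(1, 1) - avg_outcome(0, 0)

print(f"direct:   treated {direct_treated:.3f}, control {direct_control:.3f}")
print(f"indirect: treated {indirect_treated:.3f}, control {indirect_control:.3f}")
print(f"total:    {total:.3f}")
```

Because of the interaction term, the direct and indirect effects differ between the treatment and control groups. The total effect decomposes as the treatment-group direct effect plus the control-group indirect effect, or equivalently as the control-group direct effect plus the treatment-group indirect effect.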
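Lastly, to verify the external mechanisms, namely whether financial technology ($\mathit{FinTech}_i$) and the level of marketization ($\mathit{Market}_i$) moderate the policy effects of the water resource fee-to-tax reform, this study constructs Models (4) and (5):
Model (4):
$$Y_i = \theta_0 D_i + \theta_1 \left(D_i \times \mathit{FinTech}_i\right) + \gamma_1 \mathit{FinTech}_i + g(X_i) + U_i$$
Model (5):
$$Y_i = \theta_0 D_i + \theta_2 \left(D_i \times \mathit{Market}_i\right) + \gamma_2 \mathit{Market}_i + g(X_i) + U_i$$
$\theta_1$ and $\theta_2$ represent the coefficients of the interaction terms between the policy and the moderating variables. The direction of the moderating effect can be determined by comparing the signs of $\theta_1$ and $\theta_2$ with the sign of $\theta_0$ from Model (1). Specifically, if an interaction coefficient is significantly different from 0 and has the same sign as $\theta_0$, the corresponding moderating variable has a positive moderating effect on the original policy impact. Conversely, if the sign of $\theta_1$ or $\theta_2$ is opposite to that of $\theta_0$, it suggests a negative moderating effect.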
Except for the causal mediation effect analysis, all estimation in this study was carried out in Stata 17.0 interfaced with Python 3.11, covering the processing of the panel data, model training, out-of-sample prediction, and parameter estimation. The software offers various base learning algorithms such as LassoIC, LassoCV, Random Forest, Neural Networks, and Gradient Boosting. These algorithms play distinct roles within the double machine learning (DML) framework, each with its own principles and implementation steps:
LassoIC and LassoCV are used in the DML workflow primarily for variable selection in high-dimensional datasets. Both rely on the Lasso’s L1 regularization to induce sparsity, shrinking coefficients and setting those of uninformative variables exactly to zero. LassoIC chooses the regularization parameter with an information criterion such as AIC or BIC, while LassoCV tunes it by cross-validation. In DML, LassoIC helps identify influential control variables, and LassoCV provides a data-driven check on that choice; however, LassoIC can be sensitive to the criterion chosen, and LassoCV is computationally more demanding.
Random Forest (RF) bolsters prediction accuracy and model stability by combining multiple decision trees. Under DML, RF constructs trees from random data and feature subsets, aggregating their outcomes for predictions. RF excels in nonlinear modeling and interaction detection, crucial for complex causality. Its advantages are strong nonlinearity handling, stability, parallel computing support, and feature importance indication, but it can be less interpretable and computationally intensive with large datasets.
Neural Networks (NNets) within DML span the workflow from preprocessing to final effect estimation, involving data cleaning, architecture design to capture nonlinearity, and optimization through backpropagation with regularization. In the first stage of DML, the network predicts the outcome and the treatment from the control variables, and the resulting residuals serve as the purified variables for effect estimation in the second stage. While offering high flexibility and accuracy, NNets require substantial data and tuning, along with considerable computational resources.
Gradient Boosting Machine (GradBoost) in DML iteratively builds decision trees to minimize prediction errors. In the DML first phase, it estimates control variable effects, improving predictions via iterative learning. It effectively captures linear and nonlinear relationships and interactions, enhancing the analysis of complex data. However, GradBoost may overfit with overly complex models or insufficient data and lacks straightforward interpretability.
Ultimately, Random Forest was chosen as the primary algorithm for the DML framework examining the water resource fee-to-tax policy’s impact on corporate green innovations. This decision rested on its superior nonlinear modeling capabilities, robustness through ensemble trees, suitability for parallel computation, and provision of interpretable feature importance measures, offering a balanced combination of prediction accuracy, computational efficiency, and interpretability for this study’s requirements.
5.5. Control Variables
The following control variables were used in this study, and all were matched to the panel data of the Chinese A-share-listed companies from 2009 to 2020:
At the manufacturing enterprise level: the manufacturing enterprise size, measured as the natural logarithm of the total assets at the end of the year; the listing duration, measured as the number of years from the year of listing to the observation year; the debt-to-assets ratio, measured as the total liabilities/total assets; the ESG rating, measured using the China Securities ESG score; the cash flow level, measured as the net cash flow from operating activities/total assets; the growth potential, measured as (current period revenue − previous period revenue)/previous period revenue; the profitability, measured as the net profit/total assets; the market power, measured as ln(revenue/cost of goods sold + 1); the capital intensity, measured as ln(total fixed assets/number of employees); the shareholding concentration, measured as the shareholding percentage of the largest shareholder; the board size, measured as the number of board members; the board independence, measured as the number of independent directors/total number of board members; and the ownership structure, which was assigned as 1 if the company was state-owned or state-controlled, and 0 otherwise.
At the city level: the population density, measured as the population/city area; the per-capita GDP in real terms, measured as the real GDP/population; and the urbanization level, measured as the rate of urbanization.
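As an illustration of how the ratio-based firm-level controls follow from raw accounting items, the following Python sketch computes them for a single firm-year record. The field names (`total_assets`, `cogs`, and so on) and the example figures are hypothetical and do not come from the study's database; the ESG rating, shareholding concentration, and board size are taken directly from the data and are therefore omitted.

```python
import math


def firm_controls(rec):
    """Compute firm-level controls from one firm-year record (a dict)."""
    return {
        "size": math.log(rec["total_assets"]),
        "listing_age": rec["year"] - rec["listing_year"],
        "leverage": rec["total_liabilities"] / rec["total_assets"],
        "cash_flow": rec["operating_cash_flow"] / rec["total_assets"],
        "growth": (rec["revenue"] - rec["prev_revenue"]) / rec["prev_revenue"],
        "profitability": rec["net_profit"] / rec["total_assets"],
        "market_power": math.log(rec["revenue"] / rec["cogs"] + 1),
        "capital_intensity": math.log(rec["fixed_assets"] / rec["employees"]),
        "board_independence": rec["independent_directors"] / rec["board_size"],
        "state_owned": 1 if rec["state_controlled"] else 0,
    }


# Made-up firm-year record used only to exercise the formulas.
example = {
    "year": 2015, "listing_year": 2005,
    "total_assets": 1000.0, "total_liabilities": 400.0,
    "operating_cash_flow": 50.0, "net_profit": 80.0,
    "revenue": 600.0, "prev_revenue": 500.0, "cogs": 300.0,
    "fixed_assets": 200.0, "employees": 100,
    "independent_directors": 3, "board_size": 9,
    "state_controlled": True,
}
controls = firm_controls(example)
for name, value in controls.items():
    print(f"{name}: {value:.4f}")
```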
This study employed Python web scraping techniques to extract keywords related to urban policy and science and technology talent from the government work reports of Chinese cities from 2009 to 2020. The keywords included “basic research”, “scientific research”, “applied basic research”, “core technology”, “basic science”, “cutting-edge technology”, “original innovation”, “key technology”, “social welfare technology”, “talent resources”, “overseas high-level talents”, “returnees”, “talent team construction”, “science and technology system reform”, “talent-power strategy”, “strategy for invigorating China through science and education”, “scientific and technological achievements”, “intellectual property”, “scientific and energy-saving technological innovations”, “high-level talents”, “leading talents”, “innovations team”, “talent team”, “innovations and entrepreneurship”, “scientific researchers”, “double innovations”, and “innovation-driven”. Each government work report was first segmented into words and stripped of stop words; the frequency of the keywords and the total word count were then calculated. The ratio of the keyword frequency to the total word count served as a proxy variable for the city’s policy attention to basic research and talent.
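The construction of this ratio can be sketched in a few lines of Python. The example is a simplified illustration rather than the study's pipeline: the actual reports are in Chinese and require a dedicated word segmenter, whereas the sketch assumes English toy text, a regular-expression tokenizer, and a shortened keyword and stop-word list. The function name `keyword_attention` is hypothetical.

```python
import re

# Shortened, illustrative lists; the study uses the full keyword list above.
KEYWORDS = ["basic research", "core technology", "intellectual property"]
STOP_WORDS = {"the", "will", "and", "of", "a", "to"}


def keyword_attention(text, keywords=KEYWORDS, stop_words=STOP_WORDS):
    """Share of keyword occurrences in the stop-word-filtered word count."""
    tokens = [t for t in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
              if t not in stop_words]
    if not tokens:
        return 0.0
    hits = 0
    for keyword in keywords:
        # Filter the keyword the same way as the text so phrases still match.
        parts = [p for p in keyword.lower().split() if p not in stop_words]
        k = len(parts)
        if k == 0:
            continue
        hits += sum(tokens[i:i + k] == parts
                    for i in range(len(tokens) - k + 1))
    return hits / len(tokens)


report = ("The city will strengthen basic research and protect "
          "intellectual property. Basic research supports core technology.")
ratio = keyword_attention(report)
print(f"keyword attention ratio: {ratio:.4f}")
```

In the toy report, four keyword occurrences are found among twelve remaining words, so the ratio is one third.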
Referencing Zhao and Zhang [66], this study evaluated the development level of the digital economy in Chinese cities using the entropy method. This method combines five indicators: internet users per hundred people, the proportion of computer service and software personnel, per-capita telecommunications business volume, per-capita postal business, and mobile phone users per hundred people.
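A minimal Python sketch of the entropy weighting step is given below. It assumes a common textbook variant of the method (min-max normalization of positively oriented indicators, followed by entropy-based weights); the source does not specify the exact normalization, and the function name `entropy_weights` and the toy data are hypothetical.

```python
import math


def entropy_weights(rows):
    """Entropy-method weights and composite scores.

    rows: one list of indicator values per city; higher means more developed.
    """
    n = len(rows)
    z_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi == lo:
            raise ValueError("an indicator without variation carries no weight")
        z_cols.append([(v - lo) / (hi - lo) for v in col])  # min-max scaling

    divergences = []
    for z in z_cols:
        total = sum(z)
        shares = [v / total for v in z]
        # Terms with a zero share contribute nothing to the entropy.
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        divergences.append(1 - entropy)

    weights = [dv / sum(divergences) for dv in divergences]
    scores = [sum(w * z[i] for w, z in zip(weights, z_cols)) for i in range(n)]
    return weights, scores


# Toy data: four cities, two indicators (the study combines five).
data = [[10, 1], [20, 2], [30, 3], [40, 10]]
weights, scores = entropy_weights(data)
print("weights:", [round(w, 3) for w in weights])
print("scores: ", [round(s, 3) for s in scores])
```

In the toy data, the second indicator is more unevenly distributed across cities, so it has lower entropy and receives the larger weight.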