Research on the Impact of Public Data Openness on Green Technological Innovation: Empirical Evidence from Machine Learning Methods

Wang, Mengyu; Guo, Bingnan

doi:10.3390/su18104862

by Mengyu Wang and Bingnan Guo *

School of Humanity and Social Science, Jiangsu University of Science and Technology, Zhenjiang 212000, China

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(10), 4862; https://doi.org/10.3390/su18104862

Submission received: 21 March 2026 / Revised: 8 May 2026 / Accepted: 11 May 2026 / Published: 13 May 2026

(This article belongs to the Topic Big Data Analytics for Climate and Human Impacts on Terrestrial Ecosystems)

Abstract

With the digital economy emerging as a new driver of high-quality development, unlocking the value of data factors and stimulating innovation momentum has become a key component of national strategy. This study treats the establishment of government open data platforms as an exogenous policy shock reflecting the degree of public data openness. Based on a multidimensional dataset constructed from China’s full patent database covering the period 2003–2022, we empirically examine how public data openness affects green technological innovation. The results indicate that public data openness exerts a significant positive effect on green technological innovation. These conclusions remain robust and consistent after a battery of rigorous robustness checks. In terms of heterogeneity, the impact displays prominent regional disparities: it is more pronounced in the Beijing–Tianjin–Hebei region, the Yangtze River Delta, the urban agglomerations in the middle reaches of the Yangtze River, as well as in non-traditional industrial bases and transport hub cities. Mechanism analysis suggests that public data openness fosters green technological innovation by attracting innovative talents, boosting entrepreneurial dynamism, and advancing governance transparency. These conclusions offer new implications for governments to optimize the provision of public services in the digital era and advance high-quality economic growth.

Keywords:

digital economy; public data openness; siphoning effect; multiplier effect; green technological innovation

1. Introduction

Public data is a fundamental resource for promoting high-quality economic development and fostering innovation. Comprehensive promotion of the openness, sharing, and utilization of public data is essential for enhancing social governance capabilities and implementing innovation-driven development strategies [1,2,3]. The Chinese government emphasizes the need to strengthen the aggregation, sharing, and open development of public data, promote interconnectivity, and break down data barriers. This is of significant importance for improving the efficiency of social resource allocation, promoting green technological innovation, and achieving sustainable development goals [2,4,5]. However, a large amount of public data resources still remain trapped within various government departments, creating data barriers that severely restrict the effective utilization of the data elements. In light of this, leveraging the publicly available data resources across different regions to promote enterprises’ practices and development in the field of green technological innovation has become a pressing issue that requires urgent attention [6].

As emphasized by You et al. [7], data plays a crucial role in enterprises’ green technological innovation; nevertheless, the actual value embodied in public data openness remains underexplored in existing literature. Current research primarily focuses on theoretical frameworks, and the analysis of how different types of data influence innovation is still not sufficiently in-depth, particularly concerning the relationship between public data and green technological innovation, which remains relatively understudied. From the literature, public data can provide valuable information to enterprises, helping them better understand market dynamics and technological trends [8]. Therefore, the openness of public data allows enterprises to obtain the necessary resources at a low cost or even at no cost. This valuable information helps reduce the uncertainties faced in green technological innovation and further incentivizes enterprises to invest in innovation, promoting the continuous development of technology [9].

Although the openness of public data is considered by some scholars to be a potential means of promoting green technological innovation, some literature indicates that this openness does not necessarily stimulate innovation in practice. In reality, the nature of data varies, and different types of data may have significantly different impacts on the promotion of green technological innovation [10,11,12,13]. Based on the different sources of data, we can categorize data into personal data, enterprise data, and public data. Typically, personal data and enterprise data are collected by enterprises according to their own needs, which are associated with high costs for acquisition, processing, and storage. This results in these types of data having strong proprietary characteristics and competitive barriers [14]. In contrast, public data is collected by the government in the course of fulfilling its responsibilities and inherently possesses sharing attributes [15]. However, the openness of public data means that these resources are no longer exclusively owned by enterprises, which may enable competitors to access the same information and pursue similar technological development. In this case, the profit potential gained by enterprises through innovation may be limited, which could lead to hesitation when investing in green technological innovation.

It is evident that the impact of public data openness on green technological innovation is theoretically ambiguous. Whether the openness of public data can effectively promote green technological innovation and realize the empowering role of data elements still requires further theoretical analysis and empirical validation. Existing research on public data openness has examined its impact from the perspectives of carbon emissions [16,17,18], income inequality [19], and enterprise efficiency [20,21], while there is still a lack of literature analyzing and testing the value of public data openness from the perspective of technological innovation. As technological innovation serves as a core driving force for economic transformation and sustainable development, the relationship between technological innovation and public data openness urgently needs to be explored in depth.

There are obvious research gaps in current academic research. First, the actual value of public data openness in promoting green technological innovation has not been fully explored. Most existing studies focus on the construction of theoretical frameworks and lack in-depth analysis of its specific impact paths and mechanisms. Second, there is insufficient research on the heterogeneous impacts of different types of data on green technological innovation, especially the specific role and mechanism of public data with inherent sharing attributes in this process, which remains unclear. Third, existing studies rarely analyze and test the value of public data openness from the perspective of technological innovation. There is still controversy about whether public data openness can effectively promote enterprises’ green technological innovation, and there is a lack of sufficient empirical verification. Fourth, traditional research methods have limitations in high-dimensional data analysis, and few studies have adopted interpretable machine learning methods to explore the importance of control variables affecting green technological innovation and their nonlinear relationships with green technological innovation, which affects the accuracy and comprehensiveness of research conclusions.

Aiming at the above research gaps, this paper focuses on solving the following core research questions: (1) Can public data openness effectively promote green technological innovation, and what is its net effect? (2) Through what specific mechanisms does public data openness affect green technological innovation, and what are the specific action paths of these mechanisms? (3) Are there heterogeneous characteristics in the impact of public data openness on green technological innovation, and what are the specific manifestations of these heterogeneous characteristics?

Based on this, this paper first analyzes the impact of public data openness on green technological innovation. Utilizing a database of prefecture-level cities in China from 2003 to 2022, this study treats the launch of local government open data platforms as an exogenous policy shock related to public data openness and empirically tests its effect on green technological innovation using a double machine learning approach. Furthermore, this paper employs the SHapley Additive exPlanations (SHAP) method of interpretable machine learning to rank the importance of control variables that may influence green technological innovation, and investigates the nonlinear relationships between these control variables and green technological innovation. In the robustness checks, this study enhances the credibility of the baseline regression through various methods, including adjusting the research sample, conducting time dynamic tests, altering the dependent variable, and changing the machine learning prediction model. Additionally, this paper cross-checks heterogeneity using both grouped regression and causal forest methods, thereby revealing heterogeneous characteristics of the effect. Finally, based on the theoretical analysis of the mechanisms through which public data openness exerts its influence, this paper empirically tests its role in reducing the uncertainty surrounding green technological innovation from three perspectives: the siphoning effect of aggregating innovative talent, the multiplier effect of enhancing entrepreneurial vitality, and the leapfrogging effect of promoting government transparency.

The marginal contributions of this paper are primarily reflected in three aspects. First, in terms of theory and impact mechanisms, this study incorporates public data openness into the classic product quality model, exploring the net effect of public data openness on green technological innovation from both the enterprise and consumer perspectives. This approach expands the research scope on how public data openness creates value and deepens our understanding of this relationship. Unlike previous studies that often analyze the mechanisms influencing green technological innovation from the perspectives of R&D investment and environmental regulation, this study starts from the policy attributes of public data and systematically analyzes and verifies the macro effects of public data openness, as well as its empowering mechanisms for driving green technological innovation. Second, in terms of research themes, this study provides both a theoretical analysis and empirical evidence on how public data openness affects green technological innovation, clarifying the significant effects of public data openness in driving the process of green technological innovation. Existing literature typically lacks in-depth exploration of the relationship between public data and green technological innovation, and there is ongoing debate regarding whether public data can effectively promote enterprises’ green technological innovation. Through empirical analysis of local government public data openness policies, this study substantiates the positive impact of public data openness on green technological innovation, and helps fill a gap in the literature regarding the relative importance of the control variables that influence green technological innovation. Third, in terms of research methodology, this study adopts the SHAP method of interpretable machine learning together with the double machine learning method, mitigating the limitations of traditional policy evaluation methods in high-dimensional data analysis. This study compares and explains the differences in the predictive effects and SHAP values of control variables that may influence green technological innovation, introducing a machine learning research paradigm that contributes to the further promotion and application of machine learning methods.

2. Theoretical Research Hypotheses

2.1. The Basic Mechanisms of Public Data Openness in Promoting Green Technological Innovation

Public data covers standardized information in various fields, including environmental monitoring indicators, real-time meteorological parameters, electricity consumption data, and natural resource endowments [22]. Its standardized opening and sharing can effectively break down information barriers among different entities, significantly reduce information search costs and R&D trial-and-error costs for innovative actors such as enterprises and research institutions, provide accurate, comprehensive and reliable data support for the basic research and process optimization of green technologies, and consolidate the foundation for innovation [23,24].

Cross-field and cross-industry integration of public data can drive breakthrough innovations in green technologies in key scenarios such as the efficient utilization of clean energy, precise governance of industrial pollution, and optimization of low-carbon production models, facilitating the iterative upgrading of existing green technologies and the implementation and incubation of new green technologies [25]. The opening of public data can fully stimulate the innovation vitality of market entities and research institutions, and accelerate the transformation of green technologies from laboratory research and development to industrialized and large-scale applications. This is not only consistent with China’s strategic orientation of the coordinated development of the digital economy and green low-carbon growth, but also effectively improves the quality and efficiency of green technological innovation, providing strong support for ecological civilization construction and high-quality development. Based on this, we propose the hypothesis H1:

Hypothesis H1.

Public data openness can promote green technological innovation.

2.2. The Mechanisms of Public Data Openness in Promoting Green Technological Innovation

Public data openness can reduce information asymmetry in the field of green innovation, enhance the transparency and predictability of research and development directions [26], and establish a more inclusive and efficient innovation ecosystem, so as to effectively attract innovative talents with expertise in green technologies. The continuous inflow of innovative talents brings cutting-edge knowledge, interdisciplinary thinking and professional research and development capabilities to the sector, optimizes the human capital structure for green innovation, and directly improves the research and development efficiency and breakthrough capacity of green technological innovation [27,28]. In the meantime, talent agglomeration generates knowledge spillover and collaborative innovation effects, facilitates the in-depth integration of public data resources with green research and development practices, accelerates the iterative upgrading and application of green technologies, and ultimately forms a complete and coordinated transmission mechanism for public data openness, innovative talent agglomeration and green technological innovation. Accordingly, this paper proposes Hypothesis H2:

Hypothesis H2.

Public data openness positively promotes urban green technological innovation through the siphoning effect of attracting innovative talents.

Public data openness enriches the resource supply and application scenarios for green entrepreneurship [29,30], consolidates the factor support, and fosters a high-quality and efficient entrepreneurial environment, thus continuously boosting entrepreneurial dynamism in the green field. The rising entrepreneurial dynamism gathers various entrepreneurial entities, cultivates diverse green innovation formats, and strengthens market-oriented R&D momentum [31]. The benign interaction and exploration among entrepreneurial entities accelerate the transformation of technological achievements, promote the efficient linkage of capital, technology and market, and speed up the iteration of green technological innovation. Finally, it forms a complete transmission mechanism featuring the coordinated advancement of public data openness, boosted entrepreneurial dynamism and green technological innovation. Accordingly, this paper proposes Hypothesis H3:

Hypothesis H3.

Public data openness positively promotes urban green technological innovation through the multiplier effect of enhancing entrepreneurial vitality.

Public data openness drives a leapfrog improvement in governance transparency and continuously optimizes the institutional environment and government service ecosystem for green technological innovation [32]. Greater governance transparency enhances the stability and predictability of policy implementation, opens communication channels between government and enterprises, and ensures the precise implementation of green innovation policies and the efficient allocation of resources [33]. Meanwhile, it improves the efficiency of government environmental supervision and innovation services, reduces institutional operating costs for green innovation entities, and stimulates R&D investment [34]. Supported by this leapfrog upgrading of governance efficiency, green technological innovation obtains a solid institutional guarantee, forming a transmission mechanism that runs from public data openness, through improved governance transparency, to green technological innovation. Accordingly, this paper proposes Hypothesis H4:

Hypothesis H4.

Public data openness positively promotes urban green technological innovation through the leapfrogging effect of improving governance transparency.

3. Research Design

3.1. Model Setting

This paper focuses on public data open platforms, examining the driving role of data elements in green technological innovation. Currently, causal inference methods have been broadly utilized in policy evaluation studies. The difference-in-differences model requires a substantial sample size, and as the number of control variables increases, the estimation results may become unstable, raising the risk of bias in the outcomes [35]. The synthetic control method is prone to overfitting in high-dimensional settings, and the synthetic control group may not accurately reflect the actual control situation, thereby affecting the accuracy of causal inferences. The fixed effects model attempts to control for unobservable time-invariant characteristics; however, when a large number of control variables are used, it can lead to multicollinearity, reducing estimation efficiency.

Therefore, this paper adopts a Double Machine Learning (DML) framework to identify causal effects. The DML method is based on partially linear models in econometrics, utilizing Neyman orthogonalization and cross-fitting to ensure robust estimation of the structural parameters of interest. First, we construct the partially linear DML model as follows:

$$\mathit{Innov}_{it} = \theta_{0} \mathit{PData}_{it} + g(X_{it}) + U_{it} \tag{1}$$

$$E(U_{it} \mid \mathit{PData}_{it}, X_{it}) = 0 \tag{2}$$

The subscripts in the model indicate the cross-sectional and time-series dimensions: $i$ represents cities and $t$ represents years. $\mathit{Innov}_{it}$ is the level of green technological innovation, and $\mathit{PData}_{it}$ is the policy variable for the public data open platform. $\theta_{0}$ indicates the policy effect, revealing the sustained enabling impact of public data openness on green technological innovation. $X_{it}$ represents potential high-dimensional control variables, and $g(X_{it})$ captures the potential nonlinear relationship between these control variables and green technological innovation. Because its functional form is unknown, we estimate it using machine learning algorithms as $\hat{g}(X_{it})$. $U_{it}$ denotes the error term, with a conditional mean of 0. When using $\hat{g}(X_{it})$, it is necessary to introduce Neyman orthogonalization and construct the following score function:

$$\psi_{it}(\theta, g, m) = (\mathit{PData}_{it} - m(X_{it})) \left[ (\mathit{Innov}_{it} - g(X_{it})) - \theta (\mathit{PData}_{it} - m(X_{it})) \right] \tag{3}$$

where $m(X_{it})$ is the conditional mean of the policy variable given the control variables, as defined in Equation (10). If $\theta = \theta_{0}$, and machine learning can approximate $g(X_{it})$ and $m(X_{it})$ arbitrarily closely, then:

$$E[\psi_{it}(\theta_{0}, g, m)] = 0 \tag{4}$$

By expanding Equation (4), we can obtain:

$$E[(\mathit{PData}_{it} - m(X_{it}))(\mathit{Innov}_{it} - g(X_{it}))] - \theta_{0} E[(\mathit{PData}_{it} - m(X_{it}))^{2}] = 0 \tag{5}$$

By solving, we obtain:

$$\theta_{0} = \frac{E[(\mathit{PData}_{it} - m(X_{it}))(\mathit{Innov}_{it} - g(X_{it}))]}{E[(\mathit{PData}_{it} - m(X_{it}))^{2}]} \tag{6}$$

$$\hat{g}(X_{it}) \approx g(X_{it}), \quad \hat{m}(X_{it}) \approx m(X_{it}) \tag{7}$$

If only $\hat{g}(X_{it})$ is plugged into Equation (1), the resulting estimator is:

$$\hat{\theta}_{0} = \left( \frac{1}{n} \sum_{i \in I, t \in T} \mathit{PData}_{it}^{2} \right)^{-1} \frac{1}{n} \sum_{i \in I, t \in T} \mathit{PData}_{it} \left[ \mathit{Innov}_{it} - \hat{g}(X_{it}) \right] \tag{8}$$

We examine the estimation bias of the policy effect by decomposing the scaled estimation error:

$$\begin{aligned} \sqrt{n} (\hat{\theta}_{0} - \theta_{0}) = {} & \left( \frac{1}{n} \sum_{i \in I, t \in T} \mathit{PData}_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I, t \in T} \mathit{PData}_{it} U_{it} \\ & + \left( \frac{1}{n} \sum_{i \in I, t \in T} \mathit{PData}_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I, t \in T} \mathit{PData}_{it} \left[ g(X_{it}) - \hat{g}(X_{it}) \right] \end{aligned} \tag{9}$$

where the first term, $\left( \frac{1}{n} \sum_{i \in I, t \in T} \mathit{PData}_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I, t \in T} \mathit{PData}_{it} U_{it}$, follows a normal distribution with a mean of 0 and an undetermined variance. The second term depends on the estimation error $g(X_{it}) - \hat{g}(X_{it})$; because machine learning estimators generally converge more slowly than the $\sqrt{n}$ rate, this term need not vanish.

To accelerate the convergence of $\hat{\theta}_{0}$ to $\theta_{0}$, we construct an auxiliary regression:

$$\mathit{PData}_{it} = m(X_{it}) + V_{it} \tag{10}$$

$$E(V_{it} \mid X_{it}) = 0 \tag{11}$$
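
To make the role of the auxiliary regression concrete, the following is a minimal sketch of a cross-fitted, residual-on-residual estimator for the partially linear model in Equations (1) and (10). It is an illustration under stated assumptions, not the authors' implementation: it uses the common "partialling-out" variant, in which the outcome is residualized on its own conditional mean given the controls; the nuisance learner `fit_binned_mean` is a toy piecewise-constant regression standing in for the machine learning models used in the paper; and the function names and the synthetic data from `simulate` are hypothetical.

```python
import math
import random


def fit_binned_mean(xs, ys, n_bins=20):
    """Toy nuisance learner: piecewise-constant regression of y on a scalar x in [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    means = [sums[b] / counts[b] if counts[b] else overall for b in range(n_bins)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def dml_plr(y, d, x, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual estimate of theta in y = theta*d + g(x) + u."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    num = den = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Cross-fitting: nuisance functions are fitted on the other folds only.
        l_hat = fit_binned_mean([x[i] for i in train], [y[i] for i in train])  # E[y | x]
        m_hat = fit_binned_mean([x[i] for i in train], [d[i] for i in train])  # E[d | x]
        for i in held_out:
            v = d[i] - m_hat(x[i])  # treatment residual, V in Equation (10)
            r = y[i] - l_hat(x[i])  # outcome residual
            num += v * r
            den += v * v
    return num / den


def simulate(n=4000, theta=0.5, seed=1):
    """Synthetic data with a nonlinear confounder x; not the paper's data."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    d = [1.0 if rng.random() < 0.2 + 0.6 * xi else 0.0 for xi in x]
    y = [theta * di + math.sin(2 * math.pi * xi) + xi ** 2 + rng.gauss(0.0, 0.5)
         for di, xi in zip(d, x)]
    return y, d, x


y, d, x = simulate()
theta_hat = dml_plr(y, d, x)
print(f"true theta = 0.5, cross-fitted estimate = {theta_hat:.3f}")
```

Because the treatment probability in the synthetic data depends on `x`, a simple difference in means would be confounded; residualizing both the outcome and the treatment on `x` before regressing one on the other is what removes that dependence in this sketch.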

3.2. Variable Definition

3.2.1. Dependent Variable

This article uses green technological innovation as the dependent variable. Green technological innovation ($\mathit{Innov}_{ct}$) is based on the total number of green patent applications in city $c$ during year $t$. Following Bian & Luo and Liu et al. [36,37], it is measured as the natural logarithm of that total plus one.

3.2.2. Independent Variables

Public data openness ($\mathit{PData}_{ct}$) is a dummy variable. Drawing on the research of G. Liu et al., Yang et al. and Zhong et al. [38,39,40], the launch of a public data open platform is treated as an exogenous policy shock, and $\mathit{PData}_{ct}$ is constructed as the interaction of a grouping dummy variable ($\mathit{Treat}$) and a time dummy variable ($\mathit{Time}$). $\mathit{Treat}$ equals 1 for cities that have launched a public data open platform and 0 for those that have not. $\mathit{Time}$ uses the platform launch year as the cutoff point: it equals 1 in the launch year and all subsequent years, and 0 in all years prior to the launch.

3.2.3. Mediator Variables

To explore the mechanism through which public data openness influences green technological innovation, this article aims to demonstrate its mechanisms through three pathways: the siphoning effect of attracting innovative talents, the multiplier effect of enhancing entrepreneurial vitality, and the leapfrogging effect of promoting government transparency.

① The siphoning effect of attracting innovative talents.

This article develops a gravity model of talent inflow, incorporating Ricardo’s labor migration theory, which considers the differences in economic development and wage levels between regions as key factors attracting the migration of green innovation talents to the region. The talent inflow model is as follows:

$$\mathit{pfl}_{ij} = \frac{P_{i}^{\alpha} \cdot P_{j}^{\beta} \cdot (\mathit{Wag}_{j} / \mathit{Wag}_{i})^{\theta} \cdot (\mathit{Gdp}_{j} / \mathit{Gdp}_{i})^{\phi} \cdot Q_{j}^{\eta}}{d_{ij}^{b}} \tag{12}$$

Let $i$ and $j$ represent regional spatial research units, corresponding to the 292 prefecture-level cities in China. $\mathit{pfl}_{ij}$ denotes the flow of green innovation talent from the source region $i$ to the destination region $j$. A larger value indicates that more green innovation talent is attracted from region $i$ to region $j$, reflecting a stronger flow intensity. In this model, $P_{i}$ and $P_{j}$ denote the population sizes of the source region $i$ and the destination region $j$ respectively, and their values correspond directly to the level of participation in economic activities in the respective regions. $\alpha$ and $\beta$ represent the elasticities of the flow of green innovation talent with respect to the populations of the source and destination regions. $\mathit{Wag}_{j} / \mathit{Wag}_{i}$ denotes the ratio of the wage level in the destination region to that in the source region, indicating how income differences pull talent migration, and $\theta$ represents the impact weight of wage level differences. $\mathit{Gdp}_{j} / \mathit{Gdp}_{i}$ represents the ratio of the economic development level in the destination region to that in the source region, while $\phi$ denotes the impact weight of economic development differences. $Q_{j}$ signifies the investment in green technology research and development in the destination region, with $\eta$ representing the impact weight of this investment. $d_{ij}$ indicates the spatial distance between the source and destination regions, and $b$ is the distance decay coefficient, signifying how distance inhibits migration.
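
A small numerical sketch may help in reading Equation (12). The function below simply evaluates the formula; the default exponents and the input values are placeholders chosen for illustration, not estimates or data from the paper, and the name `talent_flow` is hypothetical.

```python
def talent_flow(p_i, p_j, wag_i, wag_j, gdp_i, gdp_j, q_j, d_ij,
                alpha=1.0, beta=1.0, theta=1.0, phi=1.0, eta=1.0, b=2.0):
    """Evaluate Equation (12); all default exponents are placeholders, not estimates."""
    pull = ((p_i ** alpha) * (p_j ** beta)
            * (wag_j / wag_i) ** theta
            * (gdp_j / gdp_i) ** phi
            * (q_j ** eta))
    return pull / (d_ij ** b)


# Hypothetical inputs: populations in millions, wages and GDP per capita in
# thousand yuan, green R&D investment in billion yuan, distance in km.
near = talent_flow(p_i=3.0, p_j=8.0, wag_i=60.0, wag_j=90.0,
                   gdp_i=50.0, gdp_j=100.0, q_j=2.0, d_ij=200.0)
far = talent_flow(p_i=3.0, p_j=8.0, wag_i=60.0, wag_j=90.0,
                  gdp_i=50.0, gdp_j=100.0, q_j=2.0, d_ij=400.0)
print(near, far, near / far)
```

With the placeholder value `b=2.0`, doubling the distance cuts the flow to one quarter, which is the distance-decay behavior the text attributes to $b$; a higher destination wage or GDP ratio raises the flow whenever the corresponding exponent is positive.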

② The multiplier effect of enhancing entrepreneurial vitality.

Academia typically measures entrepreneurial activity using two methods: the labor market method, based on the labor force population aged 15 to 64, and the ecological method, based on the number of existing enterprises in a region. Both methods standardize comparisons of new registered enterprises over a specific period, excluding regional scale effects [41]. However, the ecological method fails to adequately account for differences in enterprise size, often leading to the overestimation of areas with a concentration of large enterprises. Therefore, cross-national analyses frequently employ the labor market method. This paper uses the number of newly registered enterprises in information transmission, software, and IT services to gauge green innovation and entrepreneurship vitality, an indicator that well reflects the activity of digital technology-driven green entrepreneurial activities.

③ The leapfrogging effect of promoting government transparency.

Considering the positive role of government information disclosure and governance transparency in promoting green innovation among enterprises, this article selects the government transparency index for urban-principled government-business relations published by the National Development and Strategy Research Institute at Renmin University of China as a key indicator to measure government governance transparency capacity.

3.2.4. Control Variables

Leveraging regularization techniques, double machine learning effectively addresses the “curse of dimensionality” problem caused by a large number of control variables. This approach reduces estimation bias and enhances the precision and robustness of policy effect estimates [42]. Drawing on the works of Ling et al. [43] and Chen & Zhang [44], this article also incorporates additional factors that may influence green technological innovation as control variables, as outlined below:

① Population density (PD) is measured as the ratio of the total population at the end of the year to the urban area;
② Economic development level (Pgdp) is assessed through per capita regional gross domestic product;
③ Government intervention level (Gov) is measured by the proportion of local fiscal general budget expenditure to regional GDP;
④ Degree of openness (Open) is represented by the ratio of actually utilized foreign capital to regional GDP;
⑤ Financial development level (Fdi) is measured as the ratio of the total balance of deposits and loans of financial institutions to regional GDP;
⑥ Urban economic density (Ued) is measured using the natural logarithm of the regional gross domestic product per unit area;
⑦ Human capital level (Hcl) is expressed as the natural logarithm of the number of individuals receiving higher education per ten thousand people;
⑧ Degree of fiscal decentralization (Fisc) is measured by the proportion of local fiscal revenue to national fiscal revenue;
⑨ Urbanization level (Urban) is represented by the ratio of the urban population to the total population;
⑩ Transportation level is indicated by the natural logarithm of both highway passenger volume (Pass) and highway freight volume (Fre);
⑪ Unemployment rate (Unemp) is measured as the proportion of registered unemployed individuals at the end of the year to the total population.

To improve the accuracy of model fitting, this article introduces the quadratic terms of the control variables in the regression analysis. Additionally, to absorb unobserved differences across cities and over time, city and year dummy variables are included to control for city and year fixed effects.
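
As a rough illustration of how such a control set can be assembled, the sketch below appends squared terms and city and year dummies to each observation. It is a hypothetical helper, not the authors' code: the row layout, the function name `expand_controls`, and the choice to drop the first category of each dummy set are all assumptions made for the example, and the numbers are invented.

```python
def expand_controls(rows, control_names):
    """Append squared controls and city/year dummies to each observation."""
    cities = sorted({r["city"] for r in rows})
    years = sorted({r["year"] for r in rows})
    design = []
    for r in rows:
        feats = {}
        for name in control_names:
            feats[name] = r[name]
            feats[f"{name}_sq"] = r[name] ** 2  # quadratic term
        # Assumption: the first category of each dummy set is the reference.
        for c in cities[1:]:
            feats[f"city_{c}"] = 1 if r["city"] == c else 0
        for y in years[1:]:
            feats[f"year_{y}"] = 1 if r["year"] == y else 0
        design.append(feats)
    return design


# Invented observations using two of the control names from the list above.
rows = [
    {"city": "A", "year": 2020, "PD": 0.5, "Gov": 0.20},
    {"city": "A", "year": 2021, "PD": 0.6, "Gov": 0.22},
    {"city": "B", "year": 2020, "PD": 1.5, "Gov": 0.15},
    {"city": "B", "year": 2021, "PD": 1.6, "Gov": 0.16},
]
design = expand_controls(rows, ["PD", "Gov"])
print(sorted(design[3]))
```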

3.2.5. Sample Selection and Data Sources

This article selects panel data from 292 cities at or above the prefecture level in China from 2003 to 2022 as the research sample. The data sources are as follows: First and foremost, the data underpinning the core variables in this study are drawn primarily from three categories of publicly accessible, officially recognized statistical yearbooks—specifically, the China City Statistical Yearbook, the China Environmental Statistical Yearbook, and the China Science and Technology Statistical Yearbook. For the subset of missing values identified within the constructed dataset, the research team conducts targeted supplementation and rigorous calibration by leveraging the granular data disseminated in the official statistical yearbooks issued by each provincial-level administrative region. Second, the launch dates of government data open platforms are sourced from the China Local Government Data Open Report published by Fudan University’s Digital and Mobile Governance Laboratory. Based on this report, manual comparisons are conducted through publicly available internet reports and official announcements from various city government websites for cross-validation. Third, data related to green innovation is sourced from the National Intellectual Property Administration’s patent information system and the China Economic and Financial Research Database, among others. Table 1 presents the descriptive statistics of each variable.

4. Empirical Results

4.1. Correlation Analysis and SHAP Interpretability Methods

Before conducting the regression analysis, this article performed a correlation analysis on the core explanatory variables and various control variables. Figure 1 shows that all correlation coefficients are less than 0.85, indicating that there are no significant multicollinearity issues among the variables.
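
The screening rule described above can be sketched in a few lines. This is an illustrative check only: the variable values are invented (and deliberately include one highly correlated pair to show the flagging), the function names are hypothetical, and the 0.85 cutoff is the threshold quoted in the text.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def flag_high_correlations(columns, threshold=0.85):
    """Return variable pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented values for three of the control variables.
columns = {
    "Pgdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Ued": [1.1, 2.0, 2.9, 4.2, 5.1],
    "Gov": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = flag_high_correlations(columns)
print(flagged)
```

An empty result from `flag_high_correlations` corresponds to the situation reported for Figure 1, where no pairwise coefficient reaches 0.85.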

The SHAP method is used to explain the contribution or importance of each variable to the target prediction value in machine learning models. Based on the concept of cooperative game theory, SHAP considers all possible sequences in which a new variable can enter the model. It calculates the impact of adding the new variable on the model’s prediction results for each sequence. Finally, this study takes the mean value of various impact effects as the core measurement indicator to determine the overall contribution degree of the new variable to the model prediction results.

The SHAP values are computed as follows. For a dataset $F$ containing $|F|$ variables, the prediction of the machine learning model $f(\cdot)$ for a specific sample $x_{*}$ is $\hat{f}(x_{*})$. This prediction can be decomposed into the baseline value $\phi_{0}$ plus the SHAP values $\phi_{*}^{j}$ of each variable $j$:

$$\hat{f}(x_{*}) = \phi_{0} + \sum_{j=1}^{|F|} \phi_{*}^{j} \tag{13}$$

Here, the baseline value $\phi_{0}$ is typically the average prediction $E[\hat{f}(x)]$ over the full sample, while the SHAP value $\phi_{*}^{j}$ of each variable $j$ is calculated as:

$$\phi_{*}^{j} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ \nu(S \cup \{j\}) - \nu(S) \right] \tag{14}$$

where $S$ is a subset of $F \setminus \{j\}$ (the set $F$ excluding variable $j$), and the weight $\frac{|S|! \, (|F| - |S| - 1)!}{|F|!}$ is the share of the $|F|!$ possible orderings of the variables in which exactly the variables in $S$ enter the model before $j$. The term $\nu(S \cup \{j\}) - \nu(S)$ represents the change in the model's prediction when variable $j$ is added to the set of variables $S$. The SHAP value $\phi_{*}^{j}$ therefore indicates the average effect of introducing variable $j$ to the model across all possible orderings. The larger the absolute value $|\phi_{*}^{j}|$, the more prominent the marginal impact of this variable on the predicted value. A positive $\phi_{*}^{j}$ indicates that the variable raises the prediction relative to the baseline, while a negative $\phi_{*}^{j}$ indicates that it lowers the prediction.
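
To make the Shapley weighting behind Equations (13) and (14) concrete, the sketch below computes exact Shapley values for a small, hypothetical value function. The function `toy_value` and the feature names are assumptions for illustration; in SHAP, $\nu(S)$ would instead be the model's expected prediction when only the variables in $S$ are known.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set function `value` over `features`."""
    n = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for size in range(n):
            # Share of the n! orderings in which a given S of this size precedes j.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi[j] = total
    return phi


def toy_value(s):
    """Hypothetical value function: additive effects plus an a-b interaction."""
    v = 0.0
    if "a" in s:
        v += 1.0
    if "b" in s:
        v += 2.0
    if "a" in s and "b" in s:
        v += 1.0
    return v


phi = shapley_values(["a", "b", "c"], toy_value)
baseline = toy_value(frozenset())
full = toy_value(frozenset({"a", "b", "c"}))
```

The contributions sum to `full - baseline`, which is the additive decomposition in Equation (13), and the interaction between `a` and `b` is split equally between them. Exact enumeration grows exponentially with the number of variables, so practical SHAP implementations rely on model-specific or sampling-based approximations.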

Overall, SHAP provides an effective method for assessing feature importance, with several advantages: ① SHAP is model-agnostic, so it can be applied to any machine learning model regardless of its structure. ② SHAP maintains global consistency, ensuring that feature importance scores remain comparable across different prediction outcomes. ③ SHAP values clearly reflect the contribution of each feature to the model's prediction, enabling interpretability even in complex models.

This article illustrates the global SHAP results with a beeswarm plot. Figure 2 reports the importance of each control variable in predicting green technological innovation based on the XGBoost model, arranged in descending order of importance. Among these, PD and Unemp have the greatest impact on the model output, with the largest SHAP values, indicating their significant role in predicting green technological innovation. Overall, the beeswarm plot forms an inverted pyramid, with pronounced mixing of colors and varying widths of the color bands, suggesting a clear nonlinear relationship between the control variables and green technological innovation.
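
A common convention for ordering features in a global SHAP summary is the mean absolute SHAP value per feature. The sketch below assumes this convention; the source does not state how the ordering in Figure 2 was computed, and the feature names and SHAP values are hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[k]) for row in shap_rows) / n
        for k, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: rows are samples, columns are features.
features = ["x1", "x2", "x3"]
shap_rows = [
    [0.40, -0.10, 0.02],
    [-0.30, 0.20, -0.01],
    [0.50, -0.15, 0.03],
]
ranking = rank_by_mean_abs_shap(shap_rows, features)
```

Taking absolute values before averaging matters: a feature whose SHAP values are large but change sign across samples would otherwise appear unimportant.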

Next, this article examines the specific relationship between each control variable and green technological innovation using SHAP dependence plots. Figure 3 presents the SHAP dependence plots for each control variable, which reveal distinct threshold characteristics and synergistic effects on green technological innovation. The impact of a control variable can therefore change markedly within specific ranges. For instance, a low level of economic development restricts green technological innovation. However, when population density exceeds 5 (5 × 1.00) and the level of economic development rises above 10 (10 × 1.00), the synergy between the two factors becomes particularly distinct: it significantly strengthens the positive incentive for green technological innovation and underscores the benefits of integrating economic and demographic elements.

4.2. Baseline Estimation Results

This study employs a double machine learning approach to evaluate the impact of public data open platforms on urban green technological innovation. The sample is split in a 1:4 ratio, and the random forest algorithm is used to fit both the main regression and the auxiliary regression. The primary results are reported in Table 2. Model (1) controls for city and time fixed effects as well as linear terms of the city characteristic variables over the full sample. The coefficient of the core explanatory variable, the public data open platform, is positive and statistically significant at the 1% level, suggesting that public data openness advances green technological innovation. Model (2) further incorporates the quadratic terms of the city variables; the results remain significantly positive with only minor changes in coefficient magnitude. In addition, because the sample period is long and some data are missing, this study imputed the missing values. To examine whether imputation affects the results, Models (3) and (4) restrict the sample period to 2010–2022 and re-estimate the regression. Although the coefficient for the public data open platform decreases after the sample period is shortened, its positive effect on urban green technological innovation remains significant and the conclusions do not change substantially, thereby validating the hypothesis.
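
The cross-fitting logic of the partially linear double machine learning estimator can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: a piecewise-constant (bin-mean) learner stands in for the random forest, the 1:4 split is interpreted as five folds with one part held out and four used for training, there is a single covariate and no fixed effects, and the data are simulated with a known effect of 1.0.

```python
import random


def fit_bin_means(x, y, n_bins=20):
    """Fit a piecewise-constant regression of y on x in [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x, y):
        b = min(int(xi * n_bins), n_bins - 1)
        sums[b] += yi
        counts[b] += 1
    overall = sum(y) / len(y)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda xi: means[min(int(xi * n_bins), n_bins - 1)]


def dml_plr(y, d, x, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(X) + U."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    y_res, d_res = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        # Nuisance functions are fitted on four parts and predicted on the fifth.
        l_hat = fit_bin_means([x[i] for i in train], [y[i] for i in train])
        m_hat = fit_bin_means([x[i] for i in train], [d[i] for i in train])
        for i in folds[k]:
            y_res[i] = y[i] - l_hat(x[i])
            d_res[i] = d[i] - m_hat(x[i])
    # Residual-on-residual regression gives the treatment coefficient.
    return sum(a * b for a, b in zip(d_res, y_res)) / sum(a * a for a in d_res)


# Simulated data with a known effect of 1.0 (not the study's data).
rng = random.Random(42)
n = 4000
x = [rng.random() for _ in range(n)]
d = [1.0 if rng.random() < 0.2 + 0.6 * xi else 0.0 for xi in x]
y = [1.0 * di + xi ** 2 + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]
theta_hat = dml_plr(y, d, x)
```

Because each observation's residuals come from nuisance functions fitted without that observation, overfitting in the learners does not feed directly into the estimate of the treatment coefficient, which is the main reason for sample splitting in this framework.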

4.3. Robustness Analysis

4.3.1. Adjusting the Research Sample

Given the clear tiered differences in the maturity of government data openness across Chinese cities, regions differ markedly in the depth of policy implementation and the level of platform construction. This study therefore excludes the four municipalities directly under the central government (Beijing, Shanghai, Tianjin, and Chongqing), which have independent policy authority, and retains the remaining 288 prefecture-level cities as the research sample to improve the representativeness of the regression results and the accuracy of the estimates. The results are shown in Column (1) of Table 3.

The regression results indicate that, although the impact coefficient of public data openness on green technological innovation changed after excluding the four directly governed municipalities, its positive effect remains significant at the 1% level. This suggests that the baseline regression results demonstrate strong robustness.

4.3.2. Impact of Excluding Outliers

Outliers in the regression sample can impair the robustness of coefficient estimates and cause the results to deviate from the true relationships among variables. To mitigate this, the study winsorizes all variables (excluding the policy dummy variable $\mathit{PData}$) at the 1% and 5% levels separately. The results are reported in Columns (2) and (3) of Table 3, where Column (2) presents the results after 1% winsorization and Column (3) the results after 5% winsorization. After this treatment of outliers, the regression coefficients decrease slightly but remain significant, further validating the robustness of this study's conclusions.
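
Winsorization replaces extreme values with a chosen quantile rather than dropping observations. The sketch below is a minimal illustration that assumes two-sided winsorization (the stated percentage applied to each tail) and linear interpolation between order statistics; the source does not specify either choice, and the series is made up.

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def winsorize(values, tail=0.01):
    """Clip values below the `tail` quantile and above the `1 - tail` quantile."""
    ordered = sorted(values)
    lower, upper = quantile(ordered, tail), quantile(ordered, 1.0 - tail)
    return [min(max(v, lower), upper) for v in values]


# Made-up series with one extreme value at each end.
series = [-50.0] + [float(v) for v in range(1, 99)] + [500.0]
w1 = winsorize(series, tail=0.01)
w5 = winsorize(series, tail=0.05)
```

The sample size is unchanged after the operation, and the 5% version pulls the tails in further than the 1% version. A binary policy dummy would be left out of this step, as it is in the study.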

4.3.3. Excluding Interference from Concurrent Policies

Policy implementation usually has cyclical characteristics and synergistic effects. In the process of promoting government data openness, it is inevitable that there will be overlaps with other concurrent policies in terms of timing and influence. To more accurately identify and control these interfering factors, this study incorporated dummy variables for relevant policies, including the National Big Data Comprehensive Pilot Areas ($\mathit{Bigdata}$), the Broadband China Pilot Cities ($\mathit{Bandbord}$), Gigabit Cities ($\mathit{Gigabit}$), and Smart City Construction ($\mathit{Smartcity}$), into the baseline regression model. The results are presented in Table 4.

The findings indicate that, although the regression coefficient for public data openness decreases after controlling for other concurrent policies, it remains significantly positive. This suggests that the policy effect of public data openness may be overestimated when concurrent policies are not controlled for, yet its facilitative impact on green technological innovation remains significant.

4.3.4. Temporal Dynamic Testing

There may be temporal effects before and after the launch of government data openness platforms, which could affect the accuracy of the research conclusions. To test the robustness of the baseline regression [45,46], this study adopts the following strategies: (1) Following the implementation of the public data openness policy, pilot and control cities may exhibit inherent structural differences in the time trends of core indicators such as green technological innovation. To address this, a variable named $\mathit{Policy}_{\mathit{effect}}$ is constructed and included in the regression model as a control; (2) Because policy expectations may have an impact, a policy dummy variable for the year prior to implementation is additionally introduced and the regression is re-estimated. The regression results are presented in Table 5. After accounting for time trends and anticipation effects, the impact of public data openness on green technological innovation remains significantly positive, confirming the robustness of the research conclusions.
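
The anticipation check relies on a dummy for the year before the platform launch. The sketch below shows one way such indicators could be built for a city-year panel. It assumes a conventional staggered-adoption coding (the policy dummy equals 1 from the launch year onward), which the source does not spell out; the city names and launch years are hypothetical.

```python
def policy_dummies(year, launch_year):
    """Return (event, lead1) for one city-year.

    event: 1 from the platform launch year onward, else 0.
    lead1: 1 only in the year immediately before the launch, else 0.
    Cities that never launch a platform pass launch_year=None.
    """
    if launch_year is None:
        return 0, 0
    return int(year >= launch_year), int(year == launch_year - 1)


# Hypothetical launch years for illustration only.
launch_years = {"city_a": 2015, "city_b": None}
panel = {
    (city, year): policy_dummies(year, launch)
    for city, launch in launch_years.items()
    for year in range(2013, 2018)
}
```

If the coefficient on the lead dummy were large and significant, it would point to anticipation effects or pre-existing differences between pilot and control cities.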

4.3.5. Replacement of Dependent Variables

To examine the sensitivity of the results to the measurement of the dependent variable, this study uses two alternative measures of green technological innovation in place of the total number of green patent applications. Following Shen et al. [47], green invention patent applications ($\mathit{Innov1}$) and green patent grants ($\mathit{Innov2}$) are used to represent the level of green technological innovation. The regression results in Table 6 show that the estimated coefficients for public data openness are positive and significant for both measures, demonstrating the robustness of the research conclusions across different green innovation indicators.

4.3.6. Restructuring the Double Machine Learning Model

To eliminate potential biases from the configuration of the double machine learning model, this study conducts robustness checks on the baseline regression results from multiple perspectives: (1) Adjusting the sample split ratio from the original 1:4 to 1:2 and 1:7 to examine the impact of different partitioning methods on the results; (2) Testing the sensitivity of the conclusions to algorithm selection by replacing the original random forest algorithm with Lasso regression, gradient boosting, and support vector machines, and systematically comparing the consistency of the conclusions across these machine learning algorithms; (3) Optimizing the model specification. The main model employs a partially linear structure, which assumes that the treatment variable has a constant linear effect on the outcome (Equation (1)) while covariate effects are fitted non-parametrically (Equation (2)). To further reduce specification bias and enhance the model's adaptability to the complex relationship between covariates and the treatment variable, this study extends the model to a generalized interactive double machine learning model, allowing for any nonlinear and interactive relationships between $\mathit{Event}$ and $X$ and thereby yielding more robust and flexible causal effect estimates. In Equation (1), $\theta_{0}$ only allows for a constant marginal effect of $\mathit{Event}$ and cannot capture effects of $\mathit{Event}$ that vary with $X_{it}$ or any nonlinear interaction effects. If the true effect of $\mathit{Event}$ exhibits high-order interactions with $X_{it}$, the model's risk of bias increases. Therefore, $g(X_{it})$ is generalized to a function that can represent any relationship between $\mathit{Event}$ and $X_{it}$, namely $g(\mathit{Event}_{it}, X_{it})$.

The model is then rewritten as:

$$Y_{i,t+1} = g(\mathit{Event}_{it}, X_{it}) + U_{it} \tag{15}$$

$$\mathit{Event}_{it} = m(X_{it}) + V_{it} \tag{16}$$

The average treatment effect is:

$$\theta_{1} = E[g(\mathit{Event}_{it} = 1, X_{it}) - g(\mathit{Event}_{it} = 0, X_{it})] \tag{17}$$
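
The average treatment effect in Equation (17) can be estimated with a cross-fitted, doubly robust (augmented inverse probability weighting) score, which is the orthogonal score commonly used for interactive models. The sketch below is an illustration under stated assumptions: the source does not state which score was used, bin-mean learners stand in for the machine learning algorithms, there is a single covariate, and the data are simulated so that the true average effect is 1.0 while the individual effect varies with the covariate.

```python
import random


def bin_of(xi, n_bins):
    return min(int(xi * n_bins), n_bins - 1)


def fit_bin_means(x, y, n_bins, default):
    """Piecewise-constant regression of y on x in [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x, y):
        sums[bin_of(xi, n_bins)] += yi
        counts[bin_of(xi, n_bins)] += 1
    means = [s / c if c else default for s, c in zip(sums, counts)]
    return lambda xi: means[bin_of(xi, n_bins)]


def dml_irm_ate(y, d, x, n_folds=5, n_bins=10, seed=0):
    """Cross-fitted doubly robust (AIPW) estimate of the average treatment effect."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    scores = []
    for k in range(n_folds):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        t1 = [i for i in train if d[i] == 1]
        t0 = [i for i in train if d[i] == 0]
        y_bar = sum(y[i] for i in train) / len(train)
        # Outcome models g(1, X) and g(0, X) are fitted within each treatment arm.
        g1 = fit_bin_means([x[i] for i in t1], [y[i] for i in t1], n_bins, y_bar)
        g0 = fit_bin_means([x[i] for i in t0], [y[i] for i in t0], n_bins, y_bar)
        m = fit_bin_means([x[i] for i in train], [d[i] for i in train], n_bins, 0.5)
        for i in folds[k]:
            p = min(max(m(x[i]), 0.01), 0.99)  # trim extreme propensities
            scores.append(
                g1(x[i]) - g0(x[i])
                + d[i] * (y[i] - g1(x[i])) / p
                - (1 - d[i]) * (y[i] - g0(x[i])) / (1 - p)
            )
    return sum(scores) / n


# Simulated data: the effect is 0.5 + x, so the average effect is 1.0.
rng = random.Random(7)
n = 4000
x = [rng.random() for _ in range(n)]
d = [1 if rng.random() < 0.2 + 0.6 * xi else 0 for xi in x]
y = [di * (0.5 + xi) + xi ** 2 + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]
ate_hat = dml_irm_ate(y, d, x)
```

Unlike the partially linear specification, this estimator does not force the treatment effect to be constant, so the effect of $\mathit{Event}$ can vary with $X_{it}$.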

Table 7 displays the regression results after reconfiguring the double machine learning model, reported in columns (1)–(6). Regardless of adjustments to the model specification, the sample split ratio, or the machine learning algorithm, the effect of public data openness on green technological innovation remains significant, although these adjustments do change the estimated magnitude of the policy effect. The research conclusions are therefore robust across different model specifications, further validating the reliability of the baseline regression results.

4.3.7. Endogeneity Test

The launch of public data openness platforms is not fully random: it is often influenced by resource endowment factors such as the region's technological innovation foundation, the level of information infrastructure development, industrial structure, and the concentration of high-quality talent. This may introduce endogeneity. Drawing on the modeling approach of Shen, Zhang, & Wang [48], this study combines double machine learning with the partially linear instrumental variable model and constructs the following model:

$$Y_{it} = \theta_{0} \mathit{Event}_{it} + g(X_{it}) + U_{it} \tag{18}$$

$$\mathit{Instrument}_{it} = m(X_{it}) + V_{it} \tag{19}$$

Here, $\mathit{Instrument}_{it}$ represents the instrumental variable, which in this study is the interaction term of slope standard deviation and time trend. On the one hand, slope standard deviation measures the dispersion of elevation changes across different regions; it is a natural geographic characteristic determined primarily by geological and geomorphological conditions, and thus satisfies the exogeneity assumption of the instrumental variable. On the other hand, the complexity of terrain slope affects the construction of transportation and information infrastructure and thereby indirectly influences the advancement of regional public data openness, which satisfies the relevance assumption. Meanwhile, slope standard deviation does not directly affect green technological innovation; it exerts its influence only through the conditions for data openness, which satisfies the exclusion restriction. Column (7) of Table 7 indicates that the estimated coefficient of the policy variable remains significantly positive, consistent with the baseline regression results.
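
For reference, a commonly used orthogonal score for the partially linear instrumental variable model, and the estimator it implies, can be written as below. The auxiliary functions $l(X_{it}) = E[Y_{it} \mid X_{it}]$ and $r(X_{it}) = E[\mathit{Event}_{it} \mid X_{it}]$ are notation introduced here for illustration and do not appear in the source, which does not state the score it used.

```latex
\[
\psi(W_{it}; \theta, \eta)
  = \bigl(Y_{it} - l(X_{it}) - \theta\,(\mathit{Event}_{it} - r(X_{it}))\bigr)
    \bigl(\mathit{Instrument}_{it} - m(X_{it})\bigr)
\]
\[
\hat{\theta}_{0}
  = \frac{\sum_{i,t} \hat{V}_{it}\,\bigl(Y_{it} - \hat{l}(X_{it})\bigr)}
         {\sum_{i,t} \hat{V}_{it}\,\bigl(\mathit{Event}_{it} - \hat{r}(X_{it})\bigr)},
\qquad
\hat{V}_{it} = \mathit{Instrument}_{it} - \hat{m}(X_{it})
\]
```

Setting the sample average of the score to zero yields the ratio above: the outcome and the treatment are both residualized on the covariates, and the residualized instrument $\hat{V}_{it}$ from Equation (19) plays the role that the residualized treatment plays in the baseline model.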

4.4. Heterogeneity Analysis

4.4.1. Urban Agglomeration Heterogeneity

The regional innovation system theory suggests that innovative activities are deeply embedded in specific geographical spaces and collaborative networks. Urban agglomerations exhibit multi-level disparities in various aspects, including network connectivity, knowledge spillover effects, innovation support policies, and institutional environments. Consequently, the principles and intensity with which public data openness policies operate within different regions are not consistent. Various interwoven factors, both within and outside the region, contribute to the spatial heterogeneity characteristics of the impact of public data openness policies on green technological innovation.

Considering the significant differences in policy implementation effects between regions, this study categorizes the research samples into five major urban agglomerations—Beijing–Tianjin–Hebei, Yangtze River Delta, Middle Yangtze River, Chengdu–Chongqing, and Pearl River Delta—based on the classification standards outlined in the China Urban Agglomeration Development Report. By implementing spatial classification at the urban agglomeration level, this study will explore the policy effects of public data openness within each urban agglomeration in greater depth.

The regression results are reported in columns (1)–(5) of Table 8 and exhibit clear regional heterogeneity. By optimizing the allocation of innovation resources, public data openness exerts a significant positive impact on green technological innovation in the Yangtze River Delta and the Middle Reaches of the Yangtze River urban agglomerations; in contrast, the effect of the policy is relatively limited in the Beijing–Tianjin–Hebei, Chengdu–Chongqing, and Pearl River Delta urban agglomerations. Possible reasons for this discrepancy are as follows. The Yangtze River Delta and the Middle Reaches of the Yangtze River urban agglomerations have built solid economic strength, a strong foundation in green industry and high-tech sectors, and a pressing demand for data resources, which makes it easier for them to stimulate innovation through public data openness. These regions have well-developed data infrastructure, high governance levels, and significant policy implementation efficacy. Local governments have advantages in institutional guarantees and innovation environments, which strongly support green technological innovation. In contrast, the stakeholder relationships within the Beijing–Tianjin–Hebei urban agglomeration are complex and coordination is challenging, resulting in obstacles to policy execution that affect green technological innovation. Moreover, certain cities in the Chengdu–Chongqing and Pearl River Delta regions still face shortcomings in economic structural transformation and innovation system development, leading to weaker policy effects.

4.4.2. Industrial Foundation Heterogeneity

Evolutionary economic geography theory suggests that changes in regional industrial foundations and path dependence affect the absorption and diffusion of innovative technologies. In the context of green technological innovation, public data openness serves as a policy that promotes knowledge flow, and its effectiveness exhibits heterogeneity due to the differences in regional industrial foundations. There are disparities in industrial accumulation and technological development between traditional industrial bases and non-traditional industrial bases, which leads to more diverse paths and mechanisms for public data openness to promote green technological innovation, resulting in uneven policy effects.

Therefore, based on the list of old industrial cities outlined in the State Council Opinions on Promoting the Transformation and Upgrading of Old Industrial Cities and Resource-Based Cities, this study categorizes the research samples into traditional industrial bases and non-traditional industrial bases.

The regression results are presented in columns (1) and (2) of Table 9. Public data openness has a significant positive impact on green technological innovation in non-traditional industrial bases, but it does not demonstrate a notable effect in traditional industrial bases. This discrepancy may arise from two key factors: first, traditional industrial bases have long depended on conventional industries, resulting in limited innovation capabilities and a poor capacity to absorb and utilize new technologies and data; second, these areas suffer from a weak digital foundation, lacking the necessary infrastructure and skilled personnel to effectively convert data value. Most traditional industrial bases are dominated by large state-owned enterprises with closed innovation networks and limited collaboration between industry, academia, and research institutions, which hinders the development of an open and shared innovation ecosystem. In contrast, non-traditional industrial bases feature a more diverse industrial structure, higher levels of digitalization among enterprises, and a more responsive market to green innovation, facilitating the transformation of data openness into practical innovative applications.

4.4.3. Transportation Heterogeneity

Network association theory suggests that the degree of connectivity and interaction among cities within multi-level spatial networks influences the effectiveness of policy execution. In the field of green technological innovation, public data openness serves as a key policy tool, with its impact exhibiting significant spatial heterogeneity across cities with different network association statuses. The core differences between transportation hub cities and non-hub cities are reflected in two dimensions: transportation connectivity and resource integration mechanisms, and these two types of differences are precisely the key factors affecting the efficiency of public data flow and the scope of knowledge collaboration. Consequently, the dynamics of data sharing and the interaction of innovative resources differ between transportation hub cities and non-hub cities, resulting in varying policy feedback.

Therefore, this article categorizes the research samples into transportation hub cities and non-hub cities according to the classification standards outlined in the National Comprehensive Three-Dimensional Transportation Network Planning Outline (2021–2050).

The regression results are presented in columns (1) and (2) of Table 10, revealing that public data openness significantly promotes green technological innovation in transportation hub cities, while showing no notable effect in non-hub cities. This disparity arises from the differences in foundational conditions and developmental environments between transportation hub cities and non-hub cities. Transportation hub cities, benefiting from abundant resources and well-developed infrastructure, are better positioned to leverage public data to drive green technological innovation. These cities attract high-quality talent and funding, fostering close connections among innovation entities and creating a conducive collaborative atmosphere. In contrast, non-hub cities face insufficient capacity to promote green technological innovation due to weak resource allocation capabilities and lagging infrastructure. The innovation entities in these cities tend to be smaller in scale, with an incomplete industry-university-research collaboration system, confronting technological and coordination challenges, lacking advantages in factor aggregation, and receiving limited policy support, which makes it difficult to attract green technological innovation projects and restricts the effectiveness of public data application.

4.5. Causal Forest

To analyze comprehensively the heterogeneous effects of public data openness on green technological innovation, this article adopts a hierarchical and progressive empirical strategy. First, it conducts group regressions within the double machine learning framework to assess the average policy impact across different types of cities, revealing the overall distribution of effects. However, because group regression focuses on the macro level, it cannot effectively capture how the policy's effects vary at the individual level, and finer heterogeneity is overlooked. To address this issue, this article employs the causal forest method to identify individual-level heterogeneous causal effects and to analyze how the policy's impact is distributed across individual cities.

Figure 4 illustrates the distribution of the conditional average treatment effects (CATE) of the public data openness policy on green technological innovation across individual cities. The CATE values for the majority of cities cluster within the range of 0 to 0.15, and their distribution closely resembles a normal distribution. This indicates that public data openness platforms have a positive impact on green technological innovation in most cities. However, part of the distribution lies below zero, suggesting that a small number of cities experienced a reduction in innovation outcomes after implementing the policy. This disparity may be attributed to differences among cities in policy implementation intensity, technological foundation, and data utilization levels. Overall, the CATE distribution shows that the policy promotes green technological innovation on average, while its effectiveness varies across cities.

In the heterogeneity testing section of this study, a grouped regression analysis was first conducted from three dimensions: urban agglomeration affiliation, industrial foundation, and transportation hub. The results indicate that the public data openness platforms have a more pronounced positive impact on green technological innovation in core urban agglomerations such as Beijing–Tianjin–Hebei, the Yangtze River Delta, and the middle reaches of the Yangtze River, as well as in non-traditional industrial bases and transportation hub areas. Subsequently, when identifying heterogeneous characteristics using the causal forest method, the model results also demonstrate that the factors of urban agglomeration, industrial foundation, and transportation hub play a significant role in determining policy effects, which is largely consistent with the conclusions drawn from the grouped regression. Figure 5 further illustrates the conditional average treatment effects for each group, visually showcasing the prominent effects of public data openness in promoting green innovation in the aforementioned types of cities. The mutual confirmation of the results from both the grouped regression and the causal forest enhances the reliability of the research conclusions.

4.6. Mechanism Analysis

4.6.1. The Siphoning Effect of Gathering Innovative Talent

The effective exertion of the siphoning effect of innovative talent agglomeration constitutes a crucial pathway for optimizing the allocation of resources for green technological innovation. The openness of public data enhances the transparency and accessibility of information, creating an environment conducive to communication and collaboration among innovative talents. This, in turn, attracts more high-level professionals into the green technology sector. As talents gradually converge, innovative resources tend to become more concentrated, generating stronger synergies and continuously improving green technological innovation. Therefore, this study explores how public data openness can leverage the effectiveness of stimulating talent aggregation to support green technology growth from the perspective of talent inflow. The results are shown in Column (1) of Table 11, indicating that public data openness significantly promotes the influx of green innovation talent into the region. The talent introduction mechanism effectively strengthens the innovation capabilities of enterprises, enabling them to engage in deeper R&D investments in green technology, thereby enhancing their market competitiveness. As green innovation talent gathers, enterprises’ capacity to address environmental issues and achieve sustainable development is greatly amplified. Public data openness plays a significant role in promoting green technological innovation and highlights the critical significance of policies in advancing the transformation of the green economy.

4.6.2. The Multiplier Effect of Entrepreneurial Vitality

The effective activation of the multiplier effect of entrepreneurial vitality constitutes a crucial pathway for the optimization of resource allocation in green technological innovation. By enhancing the transparency and accessibility of information, public data openness fosters an environment conducive to creative communication and collaboration among entrepreneurs, which encourages more innovative enterprises to enter the green technology sector. Based on the entrepreneurial agglomeration effect, innovative resources can be allocated more efficiently, resulting in significant synergies that advance green technological innovation. This study analyzes how public data openness stimulates entrepreneurial vitality to drive green technological innovation. The results, as shown in Column (2) of Table 11, indicate that public data openness significantly improves entrepreneurs’ motivation and participation levels. The entrepreneurial support mechanism effectively strengthens enterprises’ innovation capabilities related to green technology, enabling deeper R&D investments and optimizing market competitiveness. As entrepreneurial vitality is stimulated, the potential of enterprises to address environmental challenges and achieve sustainable development is also significantly enhanced.

4.6.3. The Leapfrogging Effect of Driving Government Transparent Governance

Enhancing government governance transparency is a core channel for constructing an institutional environment conducive to green technological innovation. This mechanism is realized through public data openness, which breaks down the long-standing information barriers between the government and market entities. By improving the standardization of governmental openness, a highly interactive governance ecosystem is established, significantly enhancing the scientific nature of policy formulation and implementation processes. With this optimized policy environment and the foundation of data resources, enterprises can leverage precise insights to identify policy directions and market demands in green technological innovation, allowing for more rational and efficient optimization of their R&D decisions and resource allocation. The trust mechanism fostered by transparent governance not only reduces transaction costs arising from information asymmetry in government-enterprise collaborative innovation but also promotes deeper data sharing and technological cooperation across departments and fields. Consequently, this leads to an overall upgrade to the green technological innovation system. Based on this, the study selects government transparent governance capacity as a key variable to analyze the specific paths and mechanisms through which public data openness effectively drives green technological innovation. The results, shown in Column (3) of Table 11, indicate that public data openness significantly enhances government transparency and responsiveness. More importantly, as the transparent governance mechanism improves, enterprises’ intrinsic motivation for green technological innovation is greatly strengthened. This motivation encourages increased R&D investment and accelerates the conversion of technological achievements, gradually forming unique advantages in green development within the competitive market. Furthermore, it effectively enhances the efficiency of the regional green technological innovation system in resource integration, risk response, and sustainable development, highlighting the critical mediating role of government transparent governance capacity in promoting green technological innovation through public data openness.

5. Discussions

5.1. Comparative Analysis Between This Study and Existing Literature

This study systematically examines the causal relationship between public data openness and green technological innovation, and conducts a comprehensive comparison and dialogue between its core findings and existing relevant literature.

First, in terms of the core impact effect, mainstream existing studies generally agree that public data openness can reduce information asymmetry and lower R&D trial-and-error costs, thereby promoting corporate innovation activities. For example, studies by Dong et al. (2026) [49] and Yin et al. (2025) [50] confirm that government data openness significantly improves green innovation efficiency. The baseline regression results of this paper show that public data openness has a significantly positive promoting effect on urban green technological innovation. This conclusion is consistent with the mainstream views in the existing literature, further verifying the rationality and reliability of the theoretical framework and empirical design in this paper.

Second, regarding heterogeneity characteristics, existing studies mostly explore regional differences from single dimensions such as economic development level and city size. On this basis, this paper further expands the analysis to three dimensions: urban agglomerations, industrial foundations, and transportation locations. The results indicate that the driving effect of public data openness is more significant in the Yangtze River Delta, urban agglomerations in the middle reaches of the Yangtze River, non-traditional industrial bases, and transportation hub cities, whereas its effect is relatively weak in the Beijing–Tianjin–Hebei, Chengdu–Chongqing, and Pearl River Delta urban agglomerations as well as traditional industrial bases. This conclusion supplements and enriches the discussion on spatial heterogeneity in existing studies and provides new empirical evidence for explaining the regional imbalance in the effectiveness of data openness policies.

Third, in terms of mechanisms, previous literature mostly explains innovation-driven paths from the perspectives of R&D investment, environmental regulation, and digital finance. Based on the policy attributes and public-goods characteristics of public data, this paper proposes and empirically tests three transmission mechanisms: the siphoning effect of attracting innovative talent, the multiplier effect of stimulating entrepreneurial vitality, and the leapfrogging effect of improving the transparency of government governance. Together, these three mechanisms show how public data openness aggregates innovation factors, activates market entities, and optimizes the institutional environment, thereby deepening the understanding of how data elements empower green technological innovation.

Fourth, in terms of research methods, most existing policy evaluation studies adopt traditional difference-in-differences, fixed effects, or synthetic control methods, which have limitations in handling high-dimensional variables, nonlinear relationships, and endogeneity issues. This paper adopts the Double Machine Learning (DML) method combined with the SHAP interpretability framework, which not only yields robust estimates of causal effects but also provides a quantitative interpretation of the importance of control variables and of nonlinear relationships. This approach alleviates the curse of dimensionality and the estimation bias of traditional econometric models, improves the accuracy and credibility of the conclusions, and represents a methodological improvement over similar studies.
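To make the partialling-out logic behind DML concrete, the following minimal Python sketch estimates a partially linear model with cross-fitting on simulated data. It is an illustration under stated assumptions, not the estimation code used in this paper: the data-generating process, the function names (`cell_mean_learner`, `dml_partially_linear`, `simulate`, `naive_difference`), and the simulated effect of 0.5 are all hypothetical, and simple cell means over a single discrete confounder stand in for the random forest learner reported in Table 2 so that the example needs only the standard library.

```python
import random


def cell_mean_learner(xs, targets):
    """Nuisance learner for a discrete covariate: the mean of `targets`
    within each cell of x (falls back to the overall mean for unseen cells)."""
    sums, counts = {}, {}
    for x, t in zip(xs, targets):
        sums[x] = sums.get(x, 0.0) + t
        counts[x] = counts.get(x, 0) + 1
    overall = sum(targets) / len(targets)
    means = {x: sums[x] / counts[x] for x in sums}
    return lambda x: means.get(x, overall)


def dml_partially_linear(x, d, y, n_folds=5, seed=0):
    """Cross-fitted estimate of theta in y = theta * d + g(x) + u."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    num = den = 0.0
    for k, test_idx in enumerate(folds):
        # Fit both nuisance functions on the other folds only.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predict_y = cell_mean_learner([x[i] for i in train], [y[i] for i in train])
        predict_d = cell_mean_learner([x[i] for i in train], [d[i] for i in train])
        # Residualize outcome and treatment on the held-out fold.
        for i in test_idx:
            y_res = y[i] - predict_y(x[i])
            d_res = d[i] - predict_d(x[i])
            num += d_res * y_res
            den += d_res * d_res
    return num / den  # residual-on-residual regression slope


def simulate(n=4000, theta=0.5, seed=42):
    """Hypothetical data: x confounds a binary treatment d and the outcome y."""
    rng = random.Random(seed)
    x, d, y = [], [], []
    for _ in range(n):
        xi = rng.randrange(5)                           # discrete confounder
        di = 1 if rng.random() < 0.1 + 0.2 * xi else 0  # treatment depends on x
        yi = theta * di + 0.5 * xi ** 2 + rng.gauss(0.0, 1.0)
        x.append(xi)
        d.append(di)
        y.append(yi)
    return x, d, y


def naive_difference(d, y):
    """Raw treated-minus-control difference in means, ignoring x."""
    treated = [yi for di, yi in zip(d, y) if di == 1]
    control = [yi for di, yi in zip(d, y) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


x, d, y = simulate()
theta_dml = dml_partially_linear(x, d, y)
theta_naive = naive_difference(d, y)
print(f"naive: {theta_naive:.3f}  DML: {theta_dml:.3f}  (simulated theta = 0.5)")
```

Because treated units in the simulation are concentrated in cells with higher baseline outcomes, the naive difference in means overstates the effect, whereas the cross-fitted residual-on-residual estimate should land near the simulated value.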

5.2. Core Contributions

Compared with existing literature, the substantial contributions of this paper are mainly reflected in the following three aspects:

Theoretical contribution: This paper incorporates public data openness into the analytical framework of green technological innovation, clarifies the net effect of data elements on green innovation and how that effect varies across cities, and expands the theoretical boundary of how the digital economy empowers green development. Unlike most studies, which focus on the micro-enterprise level, this paper works at the urban macro level, reveals the governance value of public data openness, and provides a new theoretical perspective for understanding the coordinated development of data elements and ecological innovation.

Methodological contribution: This paper introduces double machine learning and the SHAP interpretability method into the field of public policy evaluation, mitigating the model specification bias, the difficulty of controlling for high-dimensional covariates, and the weak interpretability of results that affect traditional methods. By constructing a generalized interactive double machine learning model and combining it with causal forests to identify fine-grained heterogeneity, the paper forms a more rigorous and flexible policy evaluation paradigm, which can provide a methodological reference for subsequent research on data policies and innovation policies.

Practical contribution: The conclusions of this paper have clear policy implications. The regional heterogeneity results can provide a basis for the government to implement differentiated and precise data openness strategies; the three action mechanisms indicate a path for localities to optimize the innovation ecosystem and improve the data governance system; the robust positive effect provides empirical support for the nationwide promotion of public data platform construction and the release of data element value. Overall, the research conclusions can directly serve the policy practice of optimizing public services and achieving high-quality economic development in the digital era.
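The interactive model named under the methodological contribution, and the group-level heterogeneity that the practical contribution relies on, can be illustrated with a short sketch. The code below computes cross-fitted doubly robust (AIPW) scores for a binary treatment and averages them overall and within groups. It is a simplified stand-in, not the paper's implementation: the simulated data, the five hypothetical city groups, the function names (`fit_cell_means`, `aipw_scores`, `simulate`), and the use of cell means instead of random forests or causal forests are all assumptions made to keep the example self-contained.

```python
import random


def fit_cell_means(xs, ts):
    """Mean of `ts` within each cell of a discrete covariate
    (falls back to the overall mean for unseen cells)."""
    sums, counts = {}, {}
    for x, t in zip(xs, ts):
        sums[x] = sums.get(x, 0.0) + t
        counts[x] = counts.get(x, 0) + 1
    overall = sum(ts) / len(ts)
    means = {x: sums[x] / counts[x] for x in sums}
    return lambda x: means.get(x, overall)


def aipw_scores(x, d, y, n_folds=5, seed=0, clip=0.01):
    """Cross-fitted doubly robust score for each unit under a binary treatment."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    scores = [0.0] * len(y)
    for k, test_idx in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        treated = [i for i in train if d[i] == 1]
        control = [i for i in train if d[i] == 0]
        # Outcome models for each treatment arm and the propensity model.
        g1 = fit_cell_means([x[i] for i in treated], [y[i] for i in treated])
        g0 = fit_cell_means([x[i] for i in control], [y[i] for i in control])
        m = fit_cell_means([x[i] for i in train], [d[i] for i in train])
        for i in test_idx:
            p = min(max(m(x[i]), clip), 1.0 - clip)  # trim extreme propensities
            mu1, mu0 = g1(x[i]), g0(x[i])
            scores[i] = (mu1 - mu0
                         + d[i] * (y[i] - mu1) / p
                         - (1 - d[i]) * (y[i] - mu0) / (1.0 - p))
    return scores


def simulate(n=6000, seed=7):
    """Hypothetical data: the treatment effect 0.2 * x grows with the group index x."""
    rng = random.Random(seed)
    x, d, y = [], [], []
    for _ in range(n):
        xi = rng.randrange(5)                            # hypothetical city group
        di = 1 if rng.random() < 0.2 + 0.15 * xi else 0  # adoption depends on group
        yi = 0.2 * xi * di + 0.5 * xi + rng.gauss(0.0, 1.0)
        x.append(xi)
        d.append(di)
        y.append(yi)
    return x, d, y


x, d, y = simulate()
scores = aipw_scores(x, d, y)
ate = sum(scores) / len(scores)  # average treatment effect
cate = {}
for g in range(5):               # group-level average effects
    cell = [s for xi, s in zip(x, scores) if xi == g]
    cate[g] = sum(cell) / len(cell)

print(f"ATE estimate: {ate:.3f} (simulated value 0.4)")
for g in sorted(cate):
    print(f"group {g}: estimated effect {cate[g]:.3f}, simulated {0.2 * g:.1f}")
```

Averaging the same scores within groups is what allows one estimation pass to report both an overall effect and group-specific effects; a causal forest pursues the same goal with data-driven rather than predefined groups.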

6. Conclusions

6.1. Conclusion and Policy Recommendations

The open sharing and utilization of public data is an essential component of the foundational institutional framework for data resources and plays a fundamental, leading role in the development and utilization of data elements. Against this backdrop, this study employs the DML method to empirically assess the impact of public data openness on green technological innovation. The empirical results indicate that public data openness significantly promotes green technological innovation. The conclusion remains valid after a series of robustness tests, including adjusting the research sample, removing outliers, and accounting for interference from concurrent policies. Additionally, we analyze the heterogeneity of policy impacts across urban agglomerations, industrial foundations, and transportation hubs, and the causal forest results align closely with the grouped regression findings. The siphoning effect of gathering innovative talent, the multiplier effect of stimulating entrepreneurial vitality, and the leapfrogging effect of enhancing government transparency are the key mechanisms through which public data openness promotes green technological innovation.

This study provides important policy implications for improving the public data openness system and promoting green technological innovation.

Firstly, increase investment in public data openness platforms to enhance technological capabilities and data quality. The government should systematically increase investment in building these platforms, ensuring that they are technologically complete and that their data are reliable. This should encompass effective allocation of financial resources and the establishment of intelligent data management systems that make data acquisition, storage, and analysis convenient and efficient. Additionally, cities at or above the prefecture level should develop public data openness platforms tailored to their own economic and social characteristics, providing local enterprises with timely and comprehensive green technology-related data. A standardized data release system should be established to ensure the authenticity and standardization of published data, enhancing enterprises’ ability to respond to policies and market demands. Meanwhile, the government should regularly organize training and promotional activities on data usage to improve awareness and capabilities among enterprises and the public, ensuring that public data play a substantive role in green technology research and innovation decision-making.

Secondly, establish a cross-regional data sharing mechanism to enhance the interconnection of public data. In light of the significant regional disparities among city clusters such as Beijing–Tianjin–Hebei, the Yangtze River Delta, and the middle reaches of the Yangtze River, it is recommended to establish a cross-regional data sharing mechanism to achieve effective interconnectivity among cities. This mechanism can provide comprehensive market information for enterprise decision-making and stimulate the innovative drive of regional enterprises, optimizing resource allocation. The government should lead the establishment of collaborative and communication platforms across regions, allowing successful experiences, technical solutions, and best policy practices to be effectively shared. Simultaneously, cooperation should be strengthened to develop relevant cross-regional data standards and interoperability agreements, fostering closer collaborative relationships and enhancing the trust foundation to create a favorable data ecosystem for green technological innovation.

Thirdly, formulate policies to attract high-end talent and encourage cooperation in the green technology sector. The government should develop targeted policies to attract and cultivate high-end talent in green technology, particularly individuals who can engage in research and innovation activities on public data openness platforms. Incentives such as special funds, tax reductions, and scholarships should encourage deep collaboration between universities, research institutions, and enterprises to cultivate talent that meets the demands of green technological innovation. Furthermore, an innovative talent exchange and cooperation platform should be established to facilitate tight collaboration between higher education institutions and enterprises, promoting talent mobility and knowledge sharing. Additionally, collaboration between national and local governments should be encouraged, leveraging local industrial characteristics to conduct targeted training and continuing education programs to enhance the skills of existing employees, providing robust talent support for the ongoing development of green technologies.

Finally, enhance policy transparency and public participation to ensure information access and policy feedback. The government should commit to improving policy transparency, ensuring that the public and enterprises can conveniently access green technology-related policy information. Specific measures include establishing multi-tiered online platforms to publish policy information, implementation details, and related laws and regulations, along with regular updates for easy access. Furthermore, the government should encourage broad participation from various sectors in the formulation, evaluation, and revision of policies, creating a strong interactive mechanism. Active public participation can enhance policy credibility and ensure that policies are dynamically adjusted to meet market demands. On this basis, the government should regularly publish assessment reports on policy implementation effectiveness and solicit feedback through forums and public hearings, ensuring continuous optimization of policy design to remain relevant and aligned with current needs.

6.2. Research Limitations and Future Research Directions

This study has several limitations. First, it uses only the launch of open data platforms as a proxy for public data openness, which cannot fully capture the quality, depth, and actual utilization of open data. Second, the analysis is conducted at the level of prefecture-level cities and does not draw on micro-enterprise data to reveal differentiated mechanisms. Third, the spatial spillover effects and cross-regional interactions of public data openness have not been considered. Future research can be extended in three directions: first, constructing a multidimensional indicator system for the quality of public data openness to improve the measurement accuracy of the core variable; second, using micro-enterprise data to further explore heterogeneous impacts and micro-level mechanisms; third, introducing spatial econometric methods to investigate the spatial spillover effects and regional collaborative innovation effects of data openness.

Author Contributions

Conceptualization, M.W. and B.G.; methodology, M.W.; validation, M.W.; formal analysis, M.W.; investigation, M.W.; resources, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Social Science Fund Project of China (Grant No.: 25BJY112).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Deng, M.; Gao, J.; Yuan, G.; Ke, L.; Yao, J. Data-Driven Hiring in Firms: Evidence from the Launch of Chinese Public Data Platforms. Financ. Res. Lett. 2025, 86, 108282.
2. Peng, X.; Li, M. Public Data Openness and Regional Economic Development. Financ. Res. Lett. 2025, 86, 108496.
3. Lv, L.; Guo, B. Do Pilot Zones for Green Finance Reform and Innovation Policy Enhance China’s Energy Resilience? Sustainability 2025, 17, 5757.
4. Ban, G.; Chankoson, T.; Wang, Y. The Impact of Public Policy on Enterprise Innovation Performance: Panel Data on Financial Subsidy Policy. Heliyon 2025, 11, e41230.
5. Guo, Y.; Yan, J.; Zhuang, P. How Does Data Sharing Affect the Sustainable Development of Agribusiness? Evidence from Public Data Openness. Int. Rev. Econ. Financ. 2025, 97, 103785.
6. Wang, S.; Guo, B. Impact of Green Finance on Urban Ecological and Environmental Resilience: Evidence from China. Sustainability 2026, 18, 706.
7. You, W.; You, B.; Guo, Y. Can Big Data Development Drive Green Technology Innovation? The Spillover Role of Supply Chain Partners. Energy Econ. 2025, 150, 108811.
8. Ansari, B.; Barati, M.; Martin, E.G. Enhancing the Usability and Usefulness of Open Government Data: A Comprehensive Review of the State of Open Government Data Visualization Research. Gov. Inf. Q. 2022, 39, 101657.
9. Zong, H.; Wang, Y. Impacts of Government Public Data Openness and Digital Financial Development on Corporate Information Reliability. Int. Rev. Econ. Financ. 2025, 103, 104495.
10. Siddharth, L.; Luo, J. Data-Driven Innovation for Trustworthy AI. She Ji J. Des. Econ. Innov. 2025, 11, 261–283.
11. Taylor, L. What Is Data Justice? The Case for Connecting Digital Rights and Freedoms Globally. Big Data Soc. 2017, 4, 205395171773633.
12. Yang, Y.; Li, Y.; Liang, X. The Role of Data Trading Platforms (DTPs) in Digital Technology Innovation: Mechanisms and Evidence from China. J. Policy Model. 2025, 47, 1372–1396.
13. Stuart, D. The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. Online Inf. Rev. 2015, 39, 272.
14. Veldkamp, L.; Chung, C. Data and the Aggregate Economy. J. Econ. Lit. 2024, 62, 458–484.
15. Li, X.; Li, Y.; Yu, X.; Yuan, C. Public Data Openness and Trade Credit: Evidence from China. J. Empir. Financ. 2025, 83, 101636.
16. Jin, J.; Wang, Y. Can Public Data Openness Reduce Carbon Emissions of Listed Companies? Evidence from China. Energy Rep. 2025, 13, 5512–5524.
17. Wu, D.; Xie, Y. Unveiling the Impact of Public Data Access on Collaborative Reduction of Pollutants and Carbon Emissions: Evidence from Open Government Data Policy. Energy Econ. 2024, 138, 107822.
18. Guo, B.; Li, M. Does the Application of Industrial Robots Enhance Urban Energy Resilience? Evidence from China. Energies 2026, 19, 1555.
19. Zou, G.; Fan, C.; Yang, C. Research on the Impact of Public Data Openness on Urban-Rural Income Inequality. Financ. Res. Lett. 2025, 79, 107193.
20. Liao, W.; Zhuo, L. The Impact of Public Data Openness on Corporate Supply Chain Efficiency: A Quasi-Natural Experiment Based on a Local Government Data Openness Platform. Econ. Anal. Policy 2025, 87, 1557–1574.
21. Nie, S.; Wang, S.; Ji, Q. Break down Data Silos: Does Public Data Openness Improve Corporate ESG Performance? Int. Rev. Financ. Anal. 2025, 106, 104480.
22. Li, S.; Wang, D.; Wen, J. Value Effect of Public Data Access: A Perspective Based on Manufacturing Firms’ Markup. Manag. Decis. Econ. 2025, 46, 3486–3503.
23. Schmidthuber, L.; Ingrams, A.; Hilgers, D. Government Openness and Public Trust: The Mediating Role of Democratic Capacity. Public Adm. Rev. 2020, 81, 91–109.
24. Goldfarb, A.; Tucker, C. Digital Economics. J. Econ. Lit. 2019, 57, 3–43.
25. Dong, Z.; Wang, J. Does Public Data Access Stimulate the Efficiency of Corporate Green Innovation? Financ. Res. Lett. 2024, 65, 105560.
26. Xu, R.; Xu, C. How Government Open Data Platforms Affect Corporate ESG Performance. Sustainability 2025, 17, 9768.
27. Wang, X.; Feng, Y.; Qian, L.; Liang, F. Talent Introduction Policies, Optimal Labor Allocation, and Corporate Green Innovation. Sustainability 2025, 17, 1112.
28. Liu, Y.; Wang, Y.; Loh, L. Can Talent Policy Promote Green Technology Innovation? Res. Eval. 2024, 33, rvae056.
29. Li, J.; Wang, L.; Huang, Y. Open Government Data and Entrepreneurship: Evidence from China. Int. Rev. Econ. Financ. 2026, 106, 104957.
30. Tan, Z.; Li, F.; Liu, F.; Zhao, J. Public Data Openness and Enterprise Innovation Efficiency. Int. Rev. Financ. Anal. 2026, 109, 104752.
31. Stroila, I.; Steffens, P.; Plewa, C. Reimagining Entrepreneurial Ecosystems for Sustainable Development. J. Bus. Res. 2026, 210, 116132.
32. Han, N.; Liu, P.; Zhong, F.; Zhao, D. Does Public Data Access Improve Fiscal Transparency? On a Quasi-Natural Experiment from Government Data Platform Access. Socio-Econ. Plan. Sci. 2025, 98, 102184.
33. Zhang, S.; Wang, L. The Influence of Government Transparency on Governance Efficiency in Information Age: The Environmental Governance Behavior of Guangdong, China. J. Enterp. Inf. Manag. 2020, 34, 446–459.
34. Xiao, H.; Wang, X. Fiscal Transparency Practice, Challenges, and Possible Solutions: Lessons from Covid 19. Public Money Manag. 2024, 44, 196–207.
35. Wang, X.; Wang, M.; Hu, J.; Hu, H.; Guo, B. The Impact of Climate Adaptive Pilot Cities Policy on Urban Pollution Reduction and Carbon Emission Mitigation. Front. Public Health 2026, 14, 1803462.
36. Bian, Z.; Luo, M. Impact of Climate Policy Uncertainty on Enterprises’ Green Technology Innovation: Based on Growth Option Theory. Sustain. Futures 2025, 10, 101305.
37. Liu, H.; Chen, Q.; Song, Y. Does the Carbon Emissions Trading Scheme Promote Green Technology Innovation? New Evidence from Chinese Cities. Econ. Anal. Policy 2025, 87, 1045–1057.
38. Liu, G.; Hu, E.; Jian, W.; Kong, L. Public Data Openness and Corporate Violations: Evidence from China. Econ. Anal. Policy 2025, 88, 437–451.
39. Yang, Y.; Peng, S.; Xie, J. Public Data Openness and Digital Finance Development. Financ. Res. Lett. 2025, 86, 108525.
40. Zhong, Y.; Lai, H.; Zhang, L.; Guo, L.; Lai, X. Does Public Data Openness Accelerate New Quality Productive Forces? Evidence from China. Econ. Anal. Policy 2025, 85, 1409–1427.
41. Reynolds, P.; Bosma, N.; Autio, E.; Hunt, S.; De Bono, N.; Servais, I.; Lopez-Garcia, P.; Chin, N. Global Entrepreneurship Monitor: Data Collection Design and Implementation 1998–2003. Small Bus. Econ. 2005, 24, 205–231.
42. Li, M.; Liu, T.; Wu, G.; Lin, J.; Guo, B. Has the Development of Artificial Intelligence Promoted Urban Pollutant and Carbon Emission Reduction? Evidence from China. Front. Public Health 2026, 13, 1739342.
43. Ling, L.; Hu, L.; Li, S.; Zhao, X.; Ye, X. How Does Open Public Data Affect Enterprise Green Transformation? Socio-Econ. Plan. Sci. 2025, 102, 102342.
44. Chen, K.; Zhang, S. How Does Open Public Data Impact Enterprise Digital Transformation? Econ. Anal. Policy 2024, 83, 178–190.
45. Wang, A.; Si, L.; Hu, S. Can the Penalty Mechanism of Mandatory Environmental Regulations Promote Green Innovation? Evidence from China’s Enterprise Data. Energy Econ. 2023, 125, 106856.
46. Chen, Z.; Zuo, W.; Xie, G. How Does Institutional Investor Preference Influence Corporate Green Innovation in China? Eur. J. Financ. 2024, 30, 1239–1269.
47. Shen, N.; Zhang, L.; Huang, H.; Zhang, G.; Zhang, J.; Zhou, J. Can Digitalization Break the Political Resource Curse? A Study on Political Connections and Corporate Green Innovation. J. Environ. Manag. 2025, 380, 124992.
48. Shen, N.; Zhang, L.; Wang, Q. The Empowerment of Smart City Construction on the Synergy Effect of Pollution Control and Carbon Reduce in China: Empirical Analysis with Double Machine Learning Method. J. Clean. Prod. 2025, 519, 145958.
49. Dong, Q.; Jia, J.; Zhao, W. The Impact of Government Data Disclosure on Corporate Green Innovation: A Quasi-Natural Experiment of Public Data Openness. Environ. Dev. Sustain. 2026.
50. Yin, Y.; Xing, Y.; Luo, X. Towards Low-Carbon Development: Could Public Data Openness Become an Accelerator for Reducing Carbon Emission Intensity in China? Int. J. Glob. Warm. 2025, 37, 56–80.

Figure 1. Distribution of Pearson Correlation Coefficients Between Variables.

Figure 2. Global interpretability of indicators. The vertical axis represents the control variables, with the indicators arranged in descending order of importance; the horizontal axis represents the SHAP values and the average SHAP value. Each point represents a sample, and a darker point color indicates a higher value of the corresponding indicator in that sample. The further a point is to the right, the larger the corresponding SHAP value in that sample. The width of the color bands indicates the sample density at that point.

Figure 3. SHAP dependence plots for control variables. The vertical axis represents the SHAP values of the samples on this indicator, with larger values indicating that the sample points are more likely to be classified as positive cases, and vice versa for negative cases. The horizontal axis represents the values of the indicator. The bar chart at the bottom of the SHAP dependence plot depicts the density distribution of this indicator.

Figure 4. Distribution of the treatment effects of public data openness on green technological innovation. The horizontal axis represents individual treatment effects, while the vertical axis indicates density. The green dashed line represents the median of the conditional average treatment effect, and the red dashed line denotes the mean of the conditional average treatment effect.

Figure 5. Conditional average treatment effects of public data openness. The horizontal axis represents the average treatment effect, while the vertical axis indicates different groups. The gray dashed line at 0 represents the reference point for the null effect (ATE = 0).

Table 1. Descriptive statistics of variables.

Var	Obs	Mean	Std.dev	Min	Median	Max
Innov	5824	4.1040	2.0203	0.0000	4.1109	10.3724
PData	5824	0.1537	0.3607	0.0000	0.0000	1.0000
Pflow_in	5824	−0.2059	0.8568	−7.2803	−0.0694	3.6033
Tten	5824	5.6649	1.3982	0.0000	5.5872	11.1319
Transparency	5819	44.1424	16.7625	0.0000	41.8500	100.0000
PD	5824	5.7070	0.9694	1.6531	5.7650	9.0886
Pgdp	5824	10.3290	0.8457	7.5417	10.4126	12.4565
Gov	5824	0.1799	0.1204	0.0135	0.1516	2.3522
Open	5824	0.2058	0.5050	−0.1446	0.0842	28.3684
Fdi	5824	2.3254	1.1953	0.2066	2.0136	21.2969
Ued	5824	6.7718	1.4946	0.4838	6.8330	11.9965
Hcl	5824	0.0181	0.0234	−0.0094	0.0093	0.1469
Fisc	5824	0.4635	0.2239	0.0464	0.4328	1.5413
Urban	5824	0.5059	0.1779	−0.0605	0.4944	1.0068
Pass	5824	8.1052	1.3723	−1.3755	8.2336	12.1838
Fre	5824	8.6358	1.2946	0.0000	8.7480	12.5860
Unemp	5824	0.0081	0.0962	−0.1651	0.0049	7.2634

Table 2. Benchmark regression results.

Var	(1)	(2)	(3)	(4)
	Innov	Innov	Innov	Innov
PData	0.0935 ***	0.0916 ***	0.0969 ***	0.0937 ***
	(0.0326)	(0.0322)	(0.0311)	(0.0309)
Constant	−0.0103	−0.0101	−0.0099	−0.0091
	(0.0073)	(0.0072)	(0.0086)	(0.0085)
Control	Yes	Yes	Yes	Yes
Control²	No	Yes	No	Yes
City FE	Yes	Yes	Yes	Yes
Time FE	Yes	Yes	Yes	Yes
DML	RF	RF	RF	RF
Obs	5824	5824	3787	3787