1. Introduction
Carbon markets have become an important policy tool for supporting low-carbon transition and reflecting climate-related costs in economic activities. As carbon trading becomes increasingly data-driven, accurate carbon price forecasting is gaining importance for market participants, regulators, and project developers. Carbon price expectations influence not only trading and compliance decisions, but also the estimation of carbon costs in engineering and infrastructure projects [
1,
2,
3,
4]. From an information-theoretic perspective, carbon price formation can be viewed as a heterogeneous and policy-sensitive market system in which different markets exhibit marked differences in information processing efficiency, signal-response patterns, and information integration capacity [
5]. In highly liquid and continuously traded markets, historical prices alone may already embed most of the relevant information. In contrast, in less mature or sporadically traded markets, trading volume, cross-market price linkages, and macroeconomic signals often play a more important role as additional predictive information [
6]. Therefore, cross-market carbon price forecasting is not merely a time-series modeling problem, but rather an information-structure alignment problem—that is, how to identify and integrate the subset of information that is predictively valuable for a given market [
7].
China’s carbon market shows strong regional heterogeneity. Before the launch of the national carbon market in 2021, several regional pilot markets had already been established. These regional carbon markets differ in sector coverage, trading activity, market maturity, and price behavior [
8,
9,
10]. Even after the national market was introduced, the regional pilots still provide valuable evidence for understanding carbon price formation under different market conditions. Compared with mature financial markets, these pilot markets often have lower liquidity, stronger policy influence, more irregular trading, and weaker information transmission [
11,
12]. As a result, carbon price forecasting in China should be understood not only as a technical problem, but also as a problem shaped by market heterogeneity [
13].
Previous studies have provided a useful basis for carbon price forecasting. Early research mainly relied on decomposition-based, econometric, and hybrid statistical approaches to address non-stationarity and noise in carbon price series [
14,
15,
16]. Later, machine learning and deep learning methods, including ensemble learning, LSTM-based models, CNN-LSTM, and hybrid forecasting frameworks, were increasingly used to capture nonlinear patterns in carbon markets [
17,
18,
19]. At the same time, external information such as trading volume, energy prices, macroeconomic indicators, news text, and cross-market signals was gradually incorporated into forecasting models [
20,
21]. This shifted carbon price forecasting from single-series analysis toward multi-source modelling. More recently, greater attention has also been paid to market heterogeneity, interpretability, and application-oriented forecasting [
22,
23]. Carbon price forecasting involves multiple sources of uncertainty, including market microstructure uncertainty (whether trading is continuous and liquid), external signal uncertainty (whether macroeconomic variables are promptly reflected in prices), and model uncertainty regarding the selection of information combinations (which features are genuinely predictive) [
24]. These uncertainties combine differently across markets, making it unlikely that a single information structure performs best in all markets [
25,
26]. Consequently, a cross-market forecasting framework needs to explicitly assess the marginal contribution of different information groups to predictive uncertainty, which is a core motivation for the feature ablation design in this paper.
Despite these advances, several issues remain insufficiently explored. First, many existing studies focus on model comparison in a single market or on a single dataset [
27]. Although this line of research has produced valuable methodological insights, it is less effective for explaining why the value of external information differs across markets. In China’s regional carbon markets, the predictive contribution of trading volume, cross-market linkage, and macroeconomic signals is likely to vary with market maturity, liquidity conditions, and trading continuity [
28]. A unified cross-market framework is therefore needed to compare different information structures under heterogeneous market conditions. From a market heterogeneity perspective, China’s regional carbon markets can be viewed as a cluster of heterogeneous regional markets. Each market has its own trading rules, participant structure, and information response function, leading to differentiated predictive values of the same set of external variables across markets [
29]. These differences are not noise but an intrinsic feature of how each market is structured [
30]. Therefore, an effective cross-market forecasting framework should not pursue a single universal model but rather provide a method that can adaptively align with market-specific information structures. Following this rationale, this paper uses a unified cross-market modeling framework and a hierarchical feature ablation design to reveal the adaptive boundaries of different information structures in complex market environments [
31].
Second, most existing studies focus mainly on forecast accuracy [
32]. This focus is both necessary and valuable. At the same time, the practical relevance of carbon price forecasting may extend beyond error reduction alone. Forecast carbon prices can support trading decisions for market participants and may also inform broader decision-making under carbon constraints. They can also be used to estimate future carbon costs in engineering projects that involve embedded emissions or life-cycle carbon accounting. Extending carbon price forecasting to decision-support tasks therefore has clear practical value. Moreover, from an uncertainty quantification perspective, decision-support tasks require not only point forecasts but also an understanding of the reliability and potential variability of predictions, an aspect that remains underexplored in the existing carbon price forecasting literature.
To address these aspects, this study develops a multi-market machine learning framework for carbon price forecasting across China’s regional carbon markets. The purpose is not to propose a new forecasting algorithm, but to compare the performance of different machine learning models across heterogeneous regional markets within a unified framework. The objective of this study is not to conduct an exhaustive benchmark of all recently proposed machine learning algorithms, but to establish a controlled and interpretable cross-market framework for examining how different information structures affect carbon price predictability. Therefore, representative tree-based ensemble models were selected as the modelling basis, rather than introducing a large number of additional model families that may confound market-structure effects with algorithmic complexity. The study also evaluates the added value of trading volume, cross-market features, and macroeconomic variables relative to historical price information. Unlike conventional approaches that assume a universal information structure, our framework explicitly recognizes that the predictive value of external signals is contingent on market-specific conditions—a perspective rooted in the understanding of carbon markets as heterogeneous regional markets. Furthermore, the forecast price paths are extended to two decision-support tasks, namely allowance-selling window identification and dynamic carbon-cost estimation. These tasks are particularly relevant for low-carbon infrastructure projects such as offshore wind farms, where future carbon prices may influence the economic evaluation of alternatives with different carbon footprints. It should be clarified that this study does not aim to provide a formal information-theoretic analysis of carbon-market complexity. 
The term “information structure” is used in an empirical and feature-based sense, referring to the composition and relative predictive usefulness of price, volume, cross-market, and macroeconomic variables in the forecasting task. Therefore, the analysis focuses on market-dependent feature relevance rather than on formal entropy, mutual information, transfer entropy, or other complexity metrics.
This study makes four main contributions. First, it develops a hierarchical feature ablation framework with five configurations to evaluate the marginal contribution of different feature groups across markets. Second, it provides a unified model comparison across seven regional carbon markets in China, offering a consistent basis for analysing market heterogeneity. Third, it shows that the usefulness of external information is strongly market-dependent, and that no single information structure consistently performs best across all markets. This finding challenges the conventional assumption of a universal feature set and supports the view that adaptive information selection is essential under market heterogeneity. Fourth, it extends carbon price forecasting to two decision-support tasks: allowance-selling window identification and dynamic carbon-cost estimation for offshore wind engineering applications. Within the engineering carbon-cost module, a carbon price threshold is further derived to illustrate potential design-switching conditions under changing carbon market conditions.
The remainder of this paper is organized as follows.
Section 2 introduces the data sources, feature construction strategy, and ablation configurations.
Section 3 presents the forecasting methodology, model settings, and the extension to decision-support tasks.
Section 4 reports the empirical results and discusses market heterogeneity.
Section 5 applies the framework to allowance-selling window identification and dynamic carbon-cost estimation.
Section 6 concludes the paper.
2. Data and Feature Construction
This study examines seven representative regional carbon markets in China: Shanghai, Beijing, Guangdong, Hubei, Tianjin, Fujian, and Shenzhen. These pilot markets were selected because they provide relatively complete trading records and show clear differences in trading continuity, market activity, and price behavior. As shown in
Table 1, the seven markets differ in launch year, sample size, and general trading characteristics. These differences provide a suitable basis for testing a unified forecasting framework under heterogeneous market conditions.
For each market, the sample period starts in the launch year and ends in December 2023. Because the launch dates are different, the number of retained observations also varies across markets, ranging from about 1600 to 2500. The target variable is the closing price of carbon emission allowances (CEA). In this study, however, the model predicts the next-day price change rather than the next-day price level. The future price level is then reconstructed from the predicted change series. Price changes were used instead of returns for three reasons. First, several regional carbon markets experienced low-price regimes during part of the sample period, and return-based measures may become unstable or excessively amplified when the price base is low. Second, the downstream decision-support modules, including allowance-selling revenue and carbon-cost estimation, require forecast outputs expressed in CNY/t; price changes can be directly reconstructed into price levels without additional transformation. Third, using absolute price changes provides a consistent and interpretable prediction target for comparing market-specific price paths across heterogeneous regional markets.
As shown in
Figure 1, the framework starts from three sources of information: target-market trading data, data from other regional carbon markets, and macroeconomic or energy-related variables. These inputs are then organised into four feature groups and transformed into model-ready predictors. The prediction target is the next-day price change, and the price level is reconstructed from the predicted changes. In this way,
Figure 1 summarises the link between raw inputs, feature construction, and the final forecasting task.
2.1. Data Preprocessing
The market data were collected from original trading records for the seven regional carbon markets. During preprocessing, the date, price, and, where available, trading-volume fields were identified and standardised across datasets. Price values were converted into numeric form, duplicate daily records were aggregated, and a complete daily calendar was constructed between the first and last available observations. Missing price values were filled by linear interpolation applied to the complete daily calendar. We acknowledge that this procedure may introduce a limited look-ahead risk, because future observed prices can be used when interpolating missing values before lagged predictors are constructed. However, the proportion of missing price observations is low in all seven markets, and the same preprocessing protocol is applied consistently across markets and feature configurations. Therefore, the reported results should be interpreted as a controlled empirical comparison under a consistent preprocessing setting, rather than as a fully leakage-free operational forecasting evaluation. A strictly time-causal imputation strategy, such as forward filling within each training, validation, and test segment, should be adopted in future work. Missing trading-volume values were set to zero to indicate the absence of recorded trading activity on those dates.
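To make the look-ahead caveat concrete, the following minimal Python sketch contrasts two-sided linear interpolation on a complete daily calendar with a time-causal forward fill. The function names and the toy dates, prices, and volumes are illustrative assumptions and do not reproduce the preprocessing code used in this study.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def daily_calendar(first: date, last: date) -> List[date]:
    """All calendar days from first to last, inclusive."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def interpolate_linear(values: List[Optional[float]]) -> List[Optional[float]]:
    """Two-sided linear interpolation between observed neighbours.

    Each filled value depends on the NEXT observed price, which is the
    source of the look-ahead risk discussed in the text.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Time-causal alternative: carry the last observed price forward."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


# Toy records (hypothetical): two trading days with a two-day gap.
prices: Dict[date, float] = {date(2023, 1, 2): 50.0, date(2023, 1, 5): 56.0}
volumes: Dict[date, float] = {date(2023, 1, 2): 1200.0, date(2023, 1, 5): 800.0}

days = daily_calendar(min(prices), max(prices))
raw_prices = [prices.get(d) for d in days]

interpolated = interpolate_linear(raw_prices)  # uses the 5 January price
causal = forward_fill(raw_prices)              # uses past prices only
# Missing volume is set to zero, meaning no recorded trading on that day.
volume_filled = [volumes.get(d, 0.0) for d in days]
```

On the gap days, the interpolated series already moves toward the later observation, whereas the forward-filled series stays at the last traded price until a new trade is recorded.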
Two external data sources were used in the baseline framework. The first was a cross-market dataset built from the other regional carbon markets. For each target market, price series and, where available, trading-volume series from the other pilot markets were aligned by date and merged into the modelling dataset. To avoid target leakage, the target market’s own cross-market columns were removed during training. The second was a daily macro feature table containing 16 macroeconomic and energy-related variables. These variables were merged with the cross-market price and volume data before model fitting.
2.2. Feature Groups
Based on the processed market data and external datasets, the predictor set was organised into four feature groups. Depending on the market and the availability of trading-volume data, the final input space contains several dozen variables. As illustrated in
Figure 1, these four groups together define the information space used in the forecasting framework.
The first group (G1) contains historical price information and technical indicators derived from the target market itself. These variables include lagged price changes, lagged price levels over multiple short- and medium-term horizons, rolling statistics such as mean, standard deviation, minimum, and maximum, mean-reversion signals, RSI, MACD-related indicators, Bollinger Band position, and calendar variables. Together, these features describe the internal time pattern of the target market and form the core information basis of the model.
The second group (G2) contains trading-volume features. For markets with available volume records, the framework generates lagged volume terms, rolling volume statistics, a volume-surge indicator, and a price–volume divergence signal. These variables are used to capture short-term changes in trading activity and market participation. Their contribution may differ across markets because liquidity conditions and trading continuity vary substantially among China’s regional carbon markets.
The third group (G3) contains cross-market linkage features. To reflect interactions across regional markets, the framework incorporates both cross-market prices and cross-market trading volumes from the other regional markets. These external series are aligned by date and then transformed into lagged and change-based signals. In addition, spread features are constructed for cross-market price series as the difference between each external market price and the target market price. This allows the model to capture both short-term co-movement and relative price positions across regional carbon markets.
The fourth group (G4) contains macroeconomic variables. In the baseline setting, the macro feature table includes 16 variables. As shown in
Table 2, these variables fall into three categories: economic activity, energy and power, and derived or momentum indicators. These variables provide additional information on broader economic conditions and energy-market movements that may affect short-term carbon price dynamics.
A key principle of the feature design is that external information enters the model mainly in transformed form rather than as raw series alone. This helps reduce scale differences across heterogeneous data sources and improves consistency between the predictors and the next-day price-change target. As indicated in
Figure 1, lag terms, first differences, spread signals, rolling statistics, and calendar variables are used to convert raw inputs into model-ready features.
The same macroeconomic and energy-variable pool was used for all seven markets to ensure cross-market comparability. Since the objective of this study is to examine how the same information structure performs under heterogeneous market conditions, adopting market-specific covariate sets at the preprocessing stage would introduce an additional source of subjective selection bias. Instead, market heterogeneity is evaluated through the ablation design in
Section 2.3, where the marginal contribution of volume, cross-market, and macroeconomic information is compared across markets. Therefore, the unified 16-variable macroeconomic pool should be interpreted as a controlled common information set rather than as a claim that all variables are equally relevant to every market.
2.3. Ablation Configuration
To assess the contribution of each feature group in a systematic way, this study adopts a hierarchical ablation design with five configurations, as shown in
Table 3: M1 (Full Model, including all four feature groups), M2 (without volume features), M3 (without cross-market features), M4 (without macroeconomic variables), and M5 (Price-Only, retaining only the target market’s own historical price information and technical indicators). The baseline full-feature setting is aligned with M1 so that the ablation results directly reflect the marginal effect of removing a given feature group, without being affected by additional differences in model setting. As shown in
Table 3, G1 is retained in all five configurations because it represents the core internal information of the target market. By contrast, G2, G3, and G4 are removed one at a time (M2 to M4) or jointly (M5) to test the added value of trading-volume information, cross-market linkage, and macroeconomic factors. This design provides a clear basis for comparing the role of different information groups across markets under a consistent modelling framework. Comparing each of M2 to M4 with M1 isolates the marginal contribution of a single feature group, whereas comparing M5 with M1 measures the combined contribution of all external information. Consistent with the empirical, feature-based use of “information structure” in this paper, the design evaluates the “empirical predictive contribution” of different information sources under heterogeneous market conditions rather than a formal information-theoretic quantity.
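The five configurations can be expressed as a simple mapping from configuration label to retained feature groups. The sketch below is a minimal Python illustration under stated assumptions: the column names are hypothetical placeholders, and the helper functions are not taken from the study's code.

```python
from typing import Dict, List, Tuple

# Feature groups retained in each ablation configuration, as described in the text.
ABLATION_CONFIGS: Dict[str, Tuple[str, ...]] = {
    "M1": ("G1", "G2", "G3", "G4"),  # full model
    "M2": ("G1", "G3", "G4"),        # without volume features
    "M3": ("G1", "G2", "G4"),        # without cross-market features
    "M4": ("G1", "G2", "G3"),        # without macroeconomic variables
    "M5": ("G1",),                   # price only
}


def select_columns(columns_by_group: Dict[str, List[str]], config: str) -> List[str]:
    """Predictor columns used under one configuration.

    Groups that are absent for a market (e.g. no volume records) are skipped.
    """
    return [
        col
        for group in ABLATION_CONFIGS[config]
        for col in columns_by_group.get(group, [])
    ]


def rmse_change_vs_full(rmse_by_config: Dict[str, float]) -> Dict[str, float]:
    """RMSE of each ablated configuration minus the M1 RMSE.

    A positive value means that removing the group(s) increased the error.
    """
    full = rmse_by_config["M1"]
    return {c: r - full for c, r in rmse_by_config.items() if c != "M1"}


# Hypothetical column names, for illustration only.
columns_by_group = {
    "G1": ["price_change_lag1", "rsi_14"],
    "G2": ["volume_lag1"],
    "G3": ["spread_other_market"],
    "G4": ["macro_indicator_diff"],
}
m2_columns = select_columns(columns_by_group, "M2")
m5_columns = select_columns(columns_by_group, "M5")
```

Keeping the configurations in one table-like structure makes it straightforward to run the same model settings over every market and configuration pair.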
3. Methodology
3.1. Forecasting Models and Ensemble Strategy
This study adopts a unified forecasting framework built on three tree-based predictive models: XGBoost, LightGBM, and Random Forest. Tree-based models are chosen because they naturally handle mixed-type predictors, are robust to feature-scale differences across heterogeneous data sources, and are well suited to medium-sized tabular datasets. In addition, all three models provide feature importance scores, which can be used for post hoc inspection of model behavior across different regional markets.
XGBoost, LightGBM, and Random Forest were selected because they represent two widely used tree-based ensemble-learning mechanisms: boosting and bagging. XGBoost and LightGBM provide sequential boosting structures that are effective for capturing nonlinear relationships, whereas Random Forest provides a bagging-based benchmark with different variance-reduction behaviour. This combination offers model diversity while keeping the comparison transparent and consistent across seven markets and five feature configurations [
33].
XGBoost minimises a regularised objective function that combines a differentiable loss term with explicit L1 and L2 penalties on tree weights, which helps control overfitting in a high-dimensional feature space. LightGBM uses a leaf-wise tree growth strategy together with histogram-based binning, which improves computational efficiency when the feature set becomes large after cross-market and macroeconomic variables are included. Random Forest builds an ensemble of independently grown trees through bootstrap aggregation, and its structural difference from the two boosting models provides useful model diversity in the final ensemble.
The main hyperparameter settings of the three base learners are summarised in
Table 4. To ensure consistency in cross-market comparison, the same settings are used across all seven regional carbon markets. These settings were determined through a coarse validation-set grid search on two representative markets, Shanghai and Hubei, which differ in trading activity and market characteristics. The search focused on the main complexity-control parameters of the tree-based models, including tree depth, number of leaves, minimum child samples, learning rate, number of boosting rounds, and regularisation strength. The selected settings were those that provided a reasonable balance between model flexibility and overfitting control under different liquidity conditions. Market-specific hyperparameter tuning was not performed, because the purpose of this study is to maintain a controlled basis for cross-market and cross-feature-configuration comparison. Although market-specific tuning may improve local predictive accuracy, it would introduce an additional source of variation and make it more difficult to attribute performance differences to feature-group effects. The potential benefit of market-specific tuning is further discussed in the limitations.
The three base models are further combined through a weighted ensemble. Rather than assigning fixed weights in advance, the framework searches a grid of convex combinations (non-negative weights that sum to one) and selects the combination that minimises the mean squared error on the validation set. The ensemble is therefore an optimization-based weighted combination rather than a simple arithmetic average. This lightweight strategy is intended to improve the adaptability of the ensemble while preserving the comparability of the cross-market experiment. The design does not assume that the ensemble always outperforms every individual model; it is intended to provide a more balanced benchmark across heterogeneous markets.
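The weight search described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the grid resolution (steps of 0.05), the function names, and the toy validation data are hypothetical and are not reported in the text.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def convex_weight_grid(n_models: int, steps: int) -> Iterator[Tuple[float, ...]]:
    """Non-negative weights summing to one, on a grid of resolution 1/steps."""
    for combo in product(range(steps + 1), repeat=n_models - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(c / steps for c in combo) + (last / steps,)


def best_ensemble_weights(
    y_val: Sequence[float], preds_val: List[Sequence[float]], steps: int = 20
) -> Tuple[Tuple[float, ...], float]:
    """Pick the convex combination with the lowest validation-set MSE."""
    best_w, best_err = None, float("inf")
    for w in convex_weight_grid(len(preds_val), steps):
        blended = [
            sum(wi * p[t] for wi, p in zip(w, preds_val))
            for t in range(len(y_val))
        ]
        err = mse(y_val, blended)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Toy validation targets (price changes) and three model predictions (hypothetical).
y_val = [0.4, -0.2, 0.1, 0.3, -0.5]
preds_val = [
    [0.5, -0.1, 0.0, 0.2, -0.4],
    [0.3, -0.3, 0.2, 0.4, -0.6],
    [0.1, 0.0, 0.1, 0.1, -0.2],
]
weights, val_mse = best_ensemble_weights(y_val, preds_val)
```

Because the single-model weightings (1, 0, 0), (0, 1, 0), and (0, 0, 1) lie on the grid, the selected combination can never have a higher validation MSE than the best individual model; this guarantee does not carry over to the test set.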
3.2. Training, Validation, and Evaluation Protocol
The observations are split chronologically into training, validation, and test sets at a fixed ratio of 75:12:13. This temporal split keeps validation and test observations out of model fitting, subject to the interpolation caveat noted in Section 2.1. The 75:12:13 ratio was chosen to balance three requirements: retaining sufficient historical observations for model training, reserving an independent validation period for early stopping of the boosting models and for ensemble-weight selection, and keeping a temporally separated test period for final out-of-sample evaluation. Because the seven markets have different launch dates and sample sizes, a fixed proportional split also ensures that the training, validation, and test periods are defined consistently across markets. The test set is not used for any fitting, early-stopping, or weighting decision.
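A chronological 75:12:13 split can be sketched as follows. The rounding convention (integer division, with the remainder assigned to the test set) and the sample size of 2000 are assumptions made for illustration; the text does not specify them.

```python
from typing import Tuple


def chronological_split(
    n_obs: int, train_pct: int = 75, val_pct: int = 12
) -> Tuple[range, range, range]:
    """Index ranges for train, validation, and test sets in time order.

    No shuffling is applied, so every validation index is later than every
    training index, and every test index is later than every validation index.
    """
    n_train = n_obs * train_pct // 100
    n_val = n_obs * val_pct // 100
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_obs)  # remainder, about 13%
    return train, val, test


# Hypothetical market with 2000 daily observations.
train_idx, val_idx, test_idx = chronological_split(2000)
```

Applying the same percentages to every market means that the calendar dates of the three periods differ across markets while their relative lengths stay fixed.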
In this framework, the prediction target is the next-day price change rather than the next-day price level itself. After the one-step-ahead change is predicted, the corresponding price level is reconstructed by adding the predicted change to the last observed price. All reported evaluation metrics are computed on reconstructed price levels rather than on predicted price changes. This setting is used to improve the handling of non-stationarity in price levels while preserving the practical interpretability of the forecasting results (see
Table 5 for a summary of the training and evaluation protocol).
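The relation between the change target and the reconstructed price level can be illustrated with a short Python sketch. The toy prices and predicted changes are hypothetical, and the function names are introduced here for illustration only.

```python
from typing import List, Sequence


def next_day_changes(prices: Sequence[float]) -> List[float]:
    """Training target: change from day t to day t+1, in CNY/t."""
    return [prices[t + 1] - prices[t] for t in range(len(prices) - 1)]


def reconstruct_levels(
    prices: Sequence[float], predicted_changes: Sequence[float]
) -> List[float]:
    """One-step-ahead level: last observed price plus predicted change."""
    return [p + d for p, d in zip(prices[:-1], predicted_changes)]


# Toy closing prices (CNY/t) and hypothetical model outputs.
observed = [50.0, 51.0, 49.5, 52.0]
true_changes = next_day_changes(observed)        # targets for days 2 to 4
predicted_changes = [0.5, -1.0, 2.0]             # hypothetical predictions
predicted_levels = reconstruct_levels(observed, predicted_changes)
actual_levels = observed[1:]                     # levels used for the metrics
```

Because each reconstructed level starts from the last observed price, one-step-ahead level metrics partly reflect price persistence as well as the quality of the predicted change.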
For multi-step future forecasting in the decision-support application, the framework uses recursive updating, in which each predicted price is appended to the historical series before the next step is computed. Forecast accuracy is expected to decline as the horizon extends due to error accumulation, and long-horizon results should therefore be interpreted with caution.
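Recursive updating can be sketched as a loop in which each forecast is fed back as if it were an observation. In the minimal Python illustration below, `toy_mean_reversion` is a hypothetical stand-in for the fitted model, and the feature pipeline is reduced to the price history alone; both are assumptions.

```python
from typing import Callable, List, Sequence


def recursive_forecast(
    history: Sequence[float],
    predict_change: Callable[[Sequence[float]], float],
    horizon: int,
) -> List[float]:
    """Multi-step forecast by appending each predicted price to the history."""
    path = list(history)
    forecasts = []
    for _ in range(horizon):
        next_price = path[-1] + predict_change(path)
        forecasts.append(next_price)
        path.append(next_price)  # later steps see the forecast, not real data
    return forecasts


def toy_mean_reversion(path: Sequence[float]) -> float:
    """Hypothetical predictor: move halfway toward the 3-day mean."""
    window = path[-3:]
    return 0.5 * (sum(window) / len(window) - path[-1])


forecast_path = recursive_forecast([50.0, 52.0, 54.0], toy_mean_reversion, horizon=3)
```

From the second step onward the predictor only sees its own earlier outputs, which is why errors can compound as the horizon extends.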
One additional feature engineering detail should be noted. Several macroeconomic variables are published at monthly frequency. When these series are converted into daily differenced signals, artificial spikes may appear at the beginning of each month. To reduce this problem, the framework applies a seven-day moving average to the differenced macroeconomic series before they are used in model training.
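The month-start spike and its smoothing can be reproduced on a toy series. The sketch below assumes a trailing (backward-looking) seven-day window, which is one possible reading of the text; a centred window would behave differently. The series values are hypothetical.

```python
from typing import List, Sequence


def first_difference(series: Sequence[float]) -> List[float]:
    """Daily change, with 0.0 on the first day to keep the length unchanged."""
    return [0.0] + [b - a for a, b in zip(series, series[1:])]


def trailing_mean(series: Sequence[float], window: int = 7) -> List[float]:
    """Moving average over the current and previous window-1 days."""
    out = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# A monthly indicator held constant within each month (hypothetical values):
# ten days at 100, then ten days at 107 after the new release.
daily_macro = [100.0] * 10 + [107.0] * 10

raw_diff = first_difference(daily_macro)   # single spike of 7.0 on day 10
smoothed_diff = trailing_mean(raw_diff)    # spread over seven days at 1.0
```

The smoothing preserves the cumulative change (seven days at 1.0 instead of one day at 7.0) while removing the isolated jump.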
Forecast performance is evaluated using four metrics: RMSE, MAE, MAPE, and R2. RMSE is treated as the primary metric because it penalises larger errors more heavily and is more sensitive to substantial deviations in reconstructed price levels. MAE provides an absolute-error complement, MAPE offers a scale-relative perspective, and R2 measures the proportion of price variation explained by the model. Because MAPE can be amplified in markets with relatively low price levels, it is treated as a secondary rather than primary criterion in those cases. The benchmark set in this study is limited to representative tree-based machine learning models and their weighted ensemble. Classical econometric benchmarks, such as ARIMA, GARCH, and random-walk-type models, are not included in the present version. Therefore, the empirical results should be interpreted as a controlled comparison within a machine-learning framework rather than as an exhaustive benchmark against the full carbon-price forecasting literature.
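The four metrics, and the reason MAPE is treated as secondary in low-price markets, can be illustrated with a short Python sketch. The function name and the two toy series are hypothetical; both series carry the same absolute error of 1 CNY/t at different price levels.

```python
import math
from typing import Dict, Sequence


def evaluate(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """RMSE, MAE, MAPE (%), and R2 on price levels (actual prices must be non-zero)."""
    n = len(y_true)
    errors = [p - a for a, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, y_true)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Same 1 CNY/t error at a higher and a lower price level (toy values).
high_price = evaluate([50.0, 52.0, 54.0], [51.0, 53.0, 55.0])
low_price = evaluate([5.0, 7.0, 9.0], [6.0, 8.0, 10.0])
```

In this toy case RMSE, MAE, and R2 are identical for the two series, while MAPE is several times larger for the low-price series.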
3.3. Decision-Support Methodology
Beyond one-step forecasting, the framework is extended to two decision-support tasks. These extensions are intended as application-oriented decision aids rather than fully validated decision-support demonstrations. First, the predicted future price path is used to identify favourable carbon allowance selling windows. Under a user-specified allowance quantity, forecast prices are converted into expected revenues, and both the single forecast-based best selling day and a set of high-return windows defined by a quantile threshold are identified over the forecast horizon.
Second, forecast prices are coupled with life-cycle carbon footprint estimates from engineering scenarios to quantify dynamic carbon cost under different design choices and to derive a threshold-based screening indicator for engineering design comparison. In the offshore wind energy context, this extension makes it possible to compare alternative scour protection strategies under changing carbon-price expectations. In this way, the role of the forecasting framework is extended from market prediction to engineering-oriented decision support.
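The first of the two tasks, selling-window identification, can be sketched as follows. This is a minimal illustration under stated assumptions: the quantile rule (nearest rank on the sorted revenues), the default quantile of 0.8, the grouping of consecutive days into windows, and the toy forecast path are hypothetical choices not specified in the text. Days are indexed from zero.

```python
from typing import List, Sequence, Tuple


def selling_windows(
    forecast_prices: Sequence[float], quantity_t: float, quantile: float = 0.8
) -> Tuple[int, float, List[Tuple[int, int]]]:
    """Best selling day, revenue threshold, and high-return windows."""
    revenues = [p * quantity_t for p in forecast_prices]
    best_day = max(range(len(revenues)), key=revenues.__getitem__)

    # Nearest-rank quantile of the forecast revenues (an assumed convention).
    ranked = sorted(revenues)
    threshold = ranked[min(len(ranked) - 1, int(quantile * len(ranked)))]

    # Group consecutive days at or above the threshold into (start, end) windows.
    windows, start = [], None
    for day, revenue in enumerate(revenues):
        if revenue >= threshold:
            if start is None:
                start = day
        elif start is not None:
            windows.append((start, day - 1))
            start = None
    if start is not None:
        windows.append((start, len(revenues) - 1))
    return best_day, threshold, windows


# Hypothetical 10-day forecast path (CNY/t) and 1000 t of allowances to sell.
toy_path = [60.0, 62.0, 65.0, 64.0, 61.0, 59.0, 63.0, 66.0, 67.0, 62.0]
best_day, threshold, windows = selling_windows(toy_path, quantity_t=1000.0)
```

Reporting windows in addition to the single best day is less sensitive to small errors in any one daily forecast.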
For multi-step forecasting, the empirical forecast range is constructed as the point forecast plus or minus 1.96 times the standard deviation of out-of-sample residuals. Under the assumption that forecast errors accumulate approximately independently across steps, the residual standard deviation is scaled by the square root of the forecast step number. This range should be interpreted as an empirical uncertainty band rather than a formal statistical confidence interval.
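The band construction above amounts to a half-width of 1.96 × σ × √h at forecast step h, where σ is the standard deviation of out-of-sample residuals. A minimal Python sketch follows; the use of the population standard deviation, the function name, and the toy residuals and forecasts are assumptions made for illustration.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def forecast_band(
    point_forecasts: Sequence[float], residuals: Sequence[float], z: float = 1.96
) -> List[Tuple[float, float, float]]:
    """(lower, point, upper) for each step h = 1, 2, ...

    The half-width grows with sqrt(h), which assumes that one-step errors
    accumulate approximately independently.
    """
    sigma = statistics.pstdev(residuals)  # assumed estimator of residual spread
    band = []
    for h, point in enumerate(point_forecasts, start=1):
        half_width = z * sigma * math.sqrt(h)
        band.append((point - half_width, point, point + half_width))
    return band


# Toy out-of-sample residuals (CNY/t) with standard deviation 1.0,
# and a hypothetical flat four-step forecast.
toy_residuals = [-1.0, 1.0, -1.0, 1.0]
band = forecast_band([50.0, 50.0, 50.0, 50.0], toy_residuals)
```

If forecast errors are positively correlated across steps, the square-root scaling understates the true spread, which is one reason the range is described as an empirical band rather than a confidence interval.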
The framework described above is applied to each regional market under the five feature configurations defined in Section 2.3, and the resulting performance is reported in Section 4.
The engineering case used in the carbon-cost extension is based on a refined O&M-LCA model developed by the authors for a 202 MW offshore wind farm in southeastern China. The model compares two representative scour protection strategies for monopile and high-pile cap foundations: S1, rock dumping scour protection, and S2, cement-stabilised soil scour protection. S1 involves periodic quarrying, land–sea transport, and offshore placement of crushed rock around foundations, with maintenance frequency depending on hydrodynamic conditions. S2 replaces rock with in situ seabed stabilisation using cementitious binders, with material demand depending on geological conditions and stabilisation efficiency. The LCA model was developed following ISO 14040/14044 [
34,
35] and the ReCiPe 2016 Midpoint method, with the functional unit defined as 1 MWh of net electricity delivered over a 25-year design life.
The LCA results show a clear environmental trade-off between the two strategies. S1 causes only a modest increase in global warming potential (GWP) relative to the baseline O&M scenario (from 4.36 to 4.55 kg CO2-eq/MWh under the medium-frequency scenario), but substantially increases air-pollution- and mineral-resource-related burdens due to large-scale quarrying and long-distance transport. S2 reduces mineral resource scarcity by 98% compared with S1, but raises GWP to 9.94 kg CO2-eq/MWh due to cement production and offshore treatment. Three scenarios were defined for each strategy to reflect site-condition uncertainty: for S1, low-frequency (2 interventions/25 yr), medium-frequency (5 interventions/25 yr, baseline), and high-frequency (8 interventions/25 yr) maintenance; for S2, fast-stabilisation, medium-stabilisation (baseline), and slow-stabilisation conditions.
Table 6A summarises the scenario-level life-cycle carbon emissions used in the decision-support extension.
For the engineering cost comparison, the direct cost difference between S1 and S2 is treated as a user-defined parameter rather than a single fixed value, because it depends on site-specific factors such as material prices, transport distance, vessel availability, and construction conditions. The carbon price threshold is calculated as follows:
$$P^{*} = \frac{\Delta C_{\mathrm{direct}}}{\Delta E}$$
where $P^{*}$ denotes the carbon price threshold, $\Delta C_{\mathrm{direct}}$ is the direct cost premium of S1 over S2, and $\Delta E$ is the emission difference between S2 and S1. In the baseline comparison (S1 mid vs. S2 mid), $\Delta C_{\mathrm{direct}}$ is set to 15 million CNY, representing the direct cost premium of S1 over S2.
The sensitivity of the carbon price threshold $P^{*}$ to $\Delta C_{\mathrm{direct}}$ is further examined in Section 5.2.3. The emission difference between S1 mid and S2 mid is 68,593 t CO2-eq, which yields the baseline threshold $P^{*}$ = 218.7 CNY/t as derived in Section 5.2.2.
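As an arithmetic check, the baseline threshold can be reproduced by dividing the direct cost premium by the emission difference. The Python sketch below assumes that the total cost of each strategy is its direct cost plus the carbon price multiplied by its life-cycle emissions; the function names and the alternative premiums in the sensitivity loop are hypothetical.

```python
def carbon_price_threshold(delta_c_direct_cny: float, delta_e_t: float) -> float:
    """Carbon price (CNY/t) at which the extra carbon cost of S2
    equals the direct cost premium of S1."""
    if delta_e_t <= 0:
        raise ValueError("emission difference must be positive")
    return delta_c_direct_cny / delta_e_t


def lower_cost_strategy(carbon_price: float, p_star: float) -> str:
    """S1 has the lower total cost once the carbon price exceeds P*.

    At exactly P* the two strategies cost the same; S2 is returned by convention.
    """
    return "S1" if carbon_price > p_star else "S2"


DELTA_E_T = 68_593             # t CO2-eq, S2 mid minus S1 mid (from the text)
BASELINE_PREMIUM = 15_000_000  # CNY, direct cost premium of S1 over S2 (from the text)

p_star = carbon_price_threshold(BASELINE_PREMIUM, DELTA_E_T)
print(f"baseline threshold: {p_star:.1f} CNY/t")  # baseline threshold: 218.7 CNY/t

# Sensitivity to the user-defined premium (hypothetical alternative values).
for premium in (10_000_000, 20_000_000):
    print(premium, round(carbon_price_threshold(premium, DELTA_E_T), 1))
```

Because the threshold is linear in the direct cost premium, doubling the premium doubles the carbon price required for S1 to become the lower-cost option.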
4. Results and Discussion
4.1. Baseline Model Performance Under the Full-Feature Configuration
This section presents the empirical results of the proposed multi-market forecasting framework. Under the full-feature configuration (M1) defined in
Section 2, all baseline models were trained on the same information space, including historical price indicators, trading volume, cross-market linkage features, and macroeconomic variables. This unified setting provides a controlled basis for comparing the three individual models and the weighted ensemble across the seven regional carbon markets.
As shown in Table 7 and Figure 2, the weighted ensemble achieves an average R² of 0.908 across the seven markets, accompanied by an average RMSE of 2.638 CNY/t, an average MAE of 1.567 CNY/t, and an average MAPE of 4.69%. These results indicate that the full-feature framework reconstructs carbon price paths with high accuracy, even under the substantial heterogeneity observed across regional market conditions.
A closer comparison of the individual models shows that no single algorithm dominates in all seven markets. Among the three base learners, LightGBM delivers the strongest average performance, with the lowest mean RMSE (2.638) and the highest mean R² (0.909), followed closely by XGBoost. Random Forest consistently performs worse than the two boosting models. A likely reason is that its bagging structure is less effective than sequential boosting in capturing the short-term nonlinear dependencies in carbon price series. The weighted ensemble does not outperform LightGBM in every market. However, it shows a more balanced performance profile across the full set of heterogeneous markets. For this reason, the ensemble is retained as the unified benchmark in the subsequent ablation analysis, rather than being treated as the universally best model.
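The weighting rule behind the ensemble is not restated in this section. As a minimal sketch, assuming weights inversely proportional to each base learner's validation RMSE (an illustrative choice, not necessarily the scheme used in this study), the combination step could look as follows. The function names and all numbers are hypothetical.

```python
def inverse_rmse_weights(validation_rmse):
    """Map each model name to a weight proportional to 1 / validation RMSE."""
    inverse = {name: 1.0 / value for name, value in validation_rmse.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_ensemble(predictions, weights):
    """Combine per-model forecast lists into one weighted forecast list."""
    horizon = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][t] for name in predictions)
        for t in range(horizon)
    ]


# Hypothetical validation errors (CNY/t) and forecasts, for illustration only.
example_rmse = {"lightgbm": 2.0, "xgboost": 2.5, "random_forest": 4.0}
example_predictions = {
    "lightgbm": [50.0, 51.0, 52.0],
    "xgboost": [49.0, 51.5, 52.5],
    "random_forest": [48.0, 50.0, 54.0],
}

example_weights = inverse_rmse_weights(example_rmse)
example_forecast = weighted_ensemble(example_predictions, example_weights)
print(example_weights)
print(example_forecast)
```

Under this rule the more accurate learners receive larger weights, yet every learner still contributes. That is one way an ensemble can look more balanced across markets without being the best model in each of them.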
Substantial cross-market heterogeneity is evident in the baseline results. The ensemble reaches R² values above 0.96 in Tianjin, Shenzhen, and Shanghai, indicating near-complete price reconstruction in these markets. Beijing also performs well, with an R² of about 0.94.
By contrast, Guangdong (R² ≈ 0.75) and Hubei (R² ≈ 0.82) are notably harder to forecast. This contrast likely reflects differences in market microstructure and price dynamics. To support this interpretation, Table 6B reports full-sample descriptive statistics of daily CEA prices and trading activity for Guangdong and Hubei. Guangdong is characterized by stronger price volatility: its price coefficient of variation reaches 69.0%, compared with 37.7% for Hubei; the standard deviation of daily log-returns is 7.73%, compared with 4.10% for Hubei; and 23.2% of its trading days record an absolute daily return above 5%, compared with 13.7% for Hubei. In addition, Guangdong shows a wider price range and larger single-day price jumps, with the maximum single-day price change reaching 49.75 CNY/t. These features indicate that Guangdong’s price path contains stronger dispersion and more frequent abrupt movements, which is consistent with its relatively larger reconstruction error and lower R² in the baseline results.
Hubei, by contrast, does not show the same level of continuous price volatility. Its irregularity lies mainly in uneven, burst-like trading activity around comparatively stable price levels. Although Hubei has a comparable average number of trading days per year, its trading-volume coefficient of variation reaches 255%, and the busiest 5% of trading days account for about 37% of total turnover. This suggests that Hubei alternates between relatively quiet periods and short episodes of concentrated trading, rather than exhibiting a smooth and continuous trading process. The remaining forecasting difficulty in Hubei therefore appears to be more closely associated with intermittent trading activity and occasional heavy-tailed price changes than with the persistent high volatility observed in Guangdong. These market-microstructure features create additional forecasting difficulty that the model cannot fully resolve.
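The descriptive statistics quoted for Guangdong and Hubei can be computed from daily price and volume series in a few lines. The sketch below is one possible implementation under stated assumptions: the sample standard deviation is used, the 5% threshold is applied to log-returns, and trading volume stands in for turnover. The exact definitions behind Table 6B may differ, and the toy series are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    return 100.0 * statistics.stdev(values) / statistics.fmean(values)


def log_returns(prices):
    """Daily log-returns of a strictly positive price series."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]


def share_of_large_moves(prices, threshold=0.05):
    """Percentage of days whose absolute log-return exceeds the threshold."""
    returns = log_returns(prices)
    return 100.0 * sum(abs(r) > threshold for r in returns) / len(returns)


def top_days_volume_share(volumes, top_fraction=0.05):
    """Percentage of total volume traded on the busiest top_fraction of days."""
    n_top = max(1, math.ceil(top_fraction * len(volumes)))
    busiest = sorted(volumes, reverse=True)[:n_top]
    return 100.0 * sum(busiest) / sum(volumes)


# Hypothetical toy series, for illustration only.
toy_prices = [100.0, 110.0, 110.0, 100.0]
toy_volumes = [1.0] * 19 + [81.0]

print(coefficient_of_variation(toy_prices))
print(statistics.stdev(log_returns(toy_prices)))
print(share_of_large_moves(toy_prices))
print(top_days_volume_share(toy_volumes))
```

A high volume coefficient of variation combined with a high top-days share is the signature of burst-like trading, whereas a high return standard deviation and a high share of large moves indicate persistent price volatility.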
4.2. Feature Ablation Analysis and Market-Dependent Information Value
To examine the contribution of each feature group, the five ablation configurations defined in Section 2.3 were evaluated across all seven markets. The weighted ensemble was used as the unified benchmark in this analysis, consistent with its role in Section 4.1. The ablation results suggest that the predictive value of external information is market-dependent, although the magnitude of the differences varies across markets. No single feature configuration achieves the lowest RMSE in all markets. Table 8 and Table 9 report the detailed and summary results, and Figure 3 shows the RMSE trajectories across configurations. It should be noted that some performance differences across feature configurations are numerically small. Therefore, the ablation results should be interpreted as indicative evidence of market-dependent information suitability rather than as statistically definitive proof of distinct information structures.
Several patterns can be identified. First, the full model (M1) performs best in Shanghai and Hubei. In Shanghai, however, the advantage is only marginal, indicating that performance is only weakly affected by feature selection. In Hubei, by contrast, M1 shows a clearer advantage, suggesting that external information provides meaningful additional predictive value.
Second, reduced configurations perform better in several markets. Beijing and Guangdong achieve the lowest RMSE under M4, indicating that macroeconomic variables do not improve short-term forecasting in these two markets and may instead introduce noise. Tianjin performs best under M3, although the differences across configurations are very small. Fujian performs best under M2, suggesting that volume-related signals are less informative in a market with discontinuous trading activity.
Third, Shenzhen shows a distinct pattern. The price-only model (M5) gives the lowest RMSE, while M3 also outperforms M1. This suggests that Shenzhen’s local price dynamics differ from those of the other markets, so additional external inputs, especially cross-market signals, may weaken rather than improve prediction.
Overall, the ablation results show that there is no universally optimal information structure across China’s regional carbon markets. The usefulness of volume, cross-market, and macroeconomic features depends on the characteristics of the target market. These findings support the main argument of this study: carbon price forecasting in China is shaped not only by model choice, but also by market structure and information suitability.
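The per-market selection logic behind this comparison can be sketched as follows. The mapping of M1 to M5 onto feature groups is inferred from the discussion above (the formal definitions are in Section 2.3), and the RMSE values are placeholders for a hypothetical market, not figures from Table 8.

```python
# Feature groups retained in each ablation configuration, as inferred from
# the discussion in this section (assumption).
ABLATION_CONFIGS = {
    "M1": {"price", "volume", "cross_market", "macro"},  # full model
    "M2": {"price", "cross_market", "macro"},            # without volume
    "M3": {"price", "volume", "macro"},                  # without cross-market
    "M4": {"price", "volume", "cross_market"},           # without macro
    "M5": {"price"},                                     # price-only
}


def best_configuration(rmse_by_config):
    """Return the configuration label with the lowest RMSE."""
    return min(rmse_by_config, key=rmse_by_config.get)


def relative_spread(rmse_by_config):
    """Gap between the worst and best RMSE, as a fraction of the best RMSE."""
    best = min(rmse_by_config.values())
    worst = max(rmse_by_config.values())
    return (worst - best) / best


# Placeholder RMSE values (CNY/t) for one hypothetical market.
example_rmse = {"M1": 1.92, "M2": 1.95, "M3": 1.90, "M4": 1.93, "M5": 1.88}

best_label = best_configuration(example_rmse)
spread = relative_spread(example_rmse)
print(best_label, sorted(ABLATION_CONFIGS[best_label]), f"spread = {spread:.1%}")
```

Reporting the relative spread next to the winning configuration makes it easier to tell a clear advantage from a numerically small one.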
These results point to a regularity at the level of market heterogeneity: a market’s information-processing structure appears to be related to its trading activity and institutional maturity. In highly liquid and continuously traded markets (e.g., Shanghai), all information groups appear to be absorbed effectively, and the full model (M1) performs best. In less mature or sporadically traded markets (e.g., Shenzhen, Tianjin), external signals may fail to help and can introduce “informational noise” that mismatches the local price formation mechanism. This suggests that predictive uncertainty in carbon markets arises not only from data noise but also from mismatches between information structure and market microstructure. A unified forecasting framework should therefore not pursue a universal feature combination, but should instead be able to identify such mismatches and adaptively select the relevant information subset.
Since formal forecast-comparison tests, such as Diebold–Mariano tests or bootstrap confidence intervals, are not implemented in this study, conclusions based on small RMSE or R² differences should be interpreted cautiously. The ablation analysis is therefore intended to provide an exploratory comparison of feature-group relevance under a unified modelling protocol, rather than a statistical test of dominance among feature configurations.
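For reference, a basic one-step-ahead Diebold–Mariano comparison could be set up along the following lines. This is a minimal sketch, not part of the reported analysis. It assumes squared-error loss, no autocorrelation in the loss differential (so only the lag-0 variance is used), and a normal approximation for the p-value. The error series are hypothetical.

```python
import math
import statistics


def diebold_mariano(errors_a, errors_b):
    """Return (DM statistic, two-sided p-value) for one-step-ahead forecasts.

    A positive statistic means forecast A has the larger squared-error loss.
    """
    differential = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(differential)
    mean_d = statistics.fmean(differential)
    var_d = statistics.pvariance(differential)  # lag-0 autocovariance only
    if var_d == 0.0:
        raise ValueError("Loss differential has zero variance.")
    dm_stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(dm_stat) / math.sqrt(2.0))
    return dm_stat, p_value


# Hypothetical forecast errors (CNY/t) from two feature configurations.
errors_full = [1.0, -1.0, 2.0, -2.0, 1.5]
errors_reduced = [0.5, -0.5, 1.0, -1.0, 0.5]

dm_stat, p_value = diebold_mariano(errors_full, errors_reduced)
print(f"DM = {dm_stat:.3f}, p = {p_value:.4f}")
```

For multi-step forecasts the variance term would need a long-run estimate that includes autocovariances, and small samples usually call for a finite-sample correction. Both are omitted here for brevity.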
4.3. Metric Interpretation and Cross-Market Synthesis
Among the four evaluation metrics, RMSE and R² are given the greatest weight. RMSE directly measures reconstruction error in CNY/t and is therefore the most relevant metric for the decision-support tasks developed in Section 5. R² complements RMSE by showing how much price variation is captured by the model. MAE serves as a useful absolute-error supplement. MAPE is reported only as a supplementary metric because percentage-based errors can become unstable when price levels are low or when markets experience prolonged low-price regimes. Therefore, the interpretation of model performance in this study relies primarily on RMSE and R² rather than on MAPE alone.
This issue is most visible in Shenzhen. Under the baseline configuration, Shenzhen shows a high R² but also a relatively high MAPE. These results are not contradictory. The model captures the overall price trajectory well, but percentage errors are amplified because Shenzhen had a low-price regime in the early sample period. The ablation results further show that Shenzhen performs best under the price-only configuration, indicating that the high MAPE should not be interpreted as weak overall model quality. Similar caution may also apply to other markets when price levels are low. Future work should complement MAPE with more robust alternatives, such as sMAPE, MASE, RMSLE, or QLIKE-type metrics, especially for markets with prolonged low-price regimes.
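The coexistence of a high R² and a high MAPE can be reproduced with a small synthetic example. The sketch below uses standard textbook definitions of the four metrics. The price series is hypothetical and only mimics a low-price regime followed by a higher-price regime, with the same absolute error of 1 CNY/t on every day.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Hypothetical prices (CNY/t): four low-price days, then four higher-price days.
actual = [2.0, 3.0, 2.0, 3.0, 40.0, 42.0, 41.0, 43.0]
# Every prediction is off by exactly 1 CNY/t.
predicted = [3.0, 2.0, 3.0, 2.0, 41.0, 41.0, 42.0, 42.0]

print(f"RMSE = {rmse(actual, predicted):.3f} CNY/t")
print(f"MAE  = {mae(actual, predicted):.3f} CNY/t")
print(f"MAPE = {mape(actual, predicted):.2f} %")
print(f"R2   = {r_squared(actual, predicted):.4f}")
```

The absolute error is identical on every day, yet the low-price days dominate MAPE because the same 1 CNY/t error is a large share of a 2 to 3 CNY/t price. RMSE and R² are unaffected by this scaling.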
The cross-market comparison also shows that model evaluation cannot be reduced to a single best specification. Beijing and Guangdong perform best without macroeconomic variables, Shanghai and Hubei perform best under the full model, Tianjin performs best without cross-market features, Fujian performs best without volume features, and Shenzhen performs best under a price-only setting. These results confirm that the predictive value of external information depends on market context. In practice, forecasting performance depends not only on algorithm design, but also on whether the information structure matches the characteristics of the target market.
6. Conclusions
This study presents a multi-market machine learning framework for carbon allowance price forecasting across seven regional carbon markets in China, and further demonstrates how forecast outputs can be translated into decision-support applications. The empirical results confirm that the proposed framework achieves strong predictive performance across most markets. Under the full-feature configuration, the weighted ensemble model attains an average RMSE of 2.638 CNY/t, MAE of 1.567 CNY/t, MAPE of 4.69%, and R² of 0.908. Among the individual models, LightGBM delivers the highest average accuracy, while the ensemble approach ensures more stable and consistent performance across markets with heterogeneous characteristics.
A central insight of this study is that the effectiveness of external information varies across markets. The feature ablation analysis indicates that no single predictor configuration dominates across all cases, highlighting the importance of market-specific information structures. Specifically, the full-feature model performs best in Shanghai and Hubei, while excluding macroeconomic variables improves performance in Beijing and Guangdong. Tianjin benefits from removing cross-market linkage features, Fujian from excluding trading volume, and Shenzhen from relying solely on price-based inputs. These findings suggest that carbon price dynamics in China are shaped not only by modeling techniques but also by differences in market structure and information relevance.
Beyond predictive accuracy, the framework extends carbon price forecasting into an application-oriented analytical paradigm. The forecasted price trajectories are utilized to identify favorable timing strategies for allowance selling and to evaluate dynamic carbon costs in offshore wind scour protection scenarios. This integration demonstrates how predictive models can support both market operations and engineering-related decision processes.
Despite these contributions, several limitations remain.
(1) The preprocessing procedure uses full-calendar linear interpolation for missing price values, which may introduce a limited look-ahead risk because future observations can be involved in the interpolation process. Although the proportion of missing price observations is low and the same preprocessing protocol is applied consistently across markets, the reported results should be interpreted as a controlled empirical comparison rather than as a fully leakage-free operational forecasting evaluation. Future work should adopt strictly time-causal imputation methods, such as forward filling within each training, validation, and test segment.
(2) The benchmark set in this study is limited to representative tree-based machine learning models and their weighted ensemble. Classical econometric and statistical forecasting benchmarks, such as ARIMA, GARCH, and random-walk-type models, are not fully examined. This limits the ability to quantify the incremental predictive gain of the proposed framework relative to traditional carbon-price forecasting approaches. Future work should incorporate these benchmarks into the same cross-market evaluation protocol.
(3) Some differences across feature-ablation configurations are numerically small. Since formal forecast-comparison tests, such as Diebold–Mariano tests, bootstrap confidence intervals, or rolling-window robustness checks, are not implemented, the ablation results should be interpreted as indicative evidence of market-dependent feature relevance rather than as statistically definitive proof of distinct information structures.
(4) Chinese ETS markets are strongly affected by regulatory announcements, compliance cycles, allowance-allocation rules, institutional interventions, and other policy shocks. These policy-related factors are not explicitly incorporated in the present feature set. As a result, the model may not fully capture abrupt price movements triggered by regulatory events. Future work could incorporate policy-event dummy variables, compliance-calendar indicators, allowance-allocation information, or text-based features from policy announcements.
(5) The multi-step forecasting strategy relies on recursive prediction, leading to error accumulation as the forecast horizon increases. While one-step-ahead forecasts show relatively strong accuracy within the tree-based benchmark considered in this study, the 30-day recursive predictions introduce greater uncertainty over time. Future work should quantify error propagation through multi-step RMSE analysis and determine effective forecasting horizons for different applications.
(6) Model hyperparameters are kept consistent across all markets to enable controlled comparison. Although this setting helps isolate cross-market and feature-group effects, market-specific tuning may further enhance performance, especially in structurally distinct markets such as Shenzhen and Guangdong. More complex hybrid ensemble models, metaheuristic optimization algorithms, and adaptive feature-selection strategies could also be incorporated in future work. However, these extensions should be evaluated carefully to avoid shifting the focus from market-dependent feature relevance to algorithmic benchmarking alone.
(7) The decision-support applications presented in this study are exploratory in nature and are not designed as fully optimized operational systems. In particular, the LCA-based offshore wind case is used to demonstrate how forecast carbon prices can be linked with life-cycle emission estimates, rather than to provide a complete standalone LCA assessment. Future research could further develop the LCA component as an independent study with more detailed engineering inventory data, site-specific uncertainty analysis, and broader engineering scenarios.
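The look-ahead risk noted in the first limitation can be illustrated with a toy series. The sketch below contrasts linear interpolation, which uses the next observed price to fill a gap, with forward filling, which uses only past observations. Both helper functions and the price values are hypothetical and are not the preprocessing code of this study.

```python
def linear_interpolate(series):
    """Fill interior gaps (None) on a straight line between observed neighbours.

    Each filled value depends on the NEXT observation, so it is not time-causal.
    Leading and trailing gaps are left as None.
    """
    filled = list(series)
    observed = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(observed, observed[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def forward_fill(series):
    """Fill gaps (None) with the most recent past observation (time-causal)."""
    filled = []
    last_observed = None
    for value in series:
        if value is not None:
            last_observed = value
        filled.append(last_observed)
    return filled


# Hypothetical daily prices (CNY/t) with two missing days.
prices = [50.0, None, None, 56.0]

print(linear_interpolate(prices))  # filled values drift towards the future price 56.0
print(forward_fill(prices))        # filled values repeat the last known price 50.0
```

With interpolation, the values on the two missing days already encode the later rise to 56.0 CNY/t, which a forecaster would not have known on those days. Forward filling avoids this, at the cost of flat segments in the series.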
In summary, this study provides a systematic cross-market evaluation of carbon price forecasting in China within a controlled tree-based machine learning framework. The results suggest that the usefulness of external information varies across regional carbon markets, and that no single feature configuration consistently gives the lowest prediction error across all cases. The decision-support extension further illustrates how forecast price paths can be translated into allowance-selling and engineering carbon-cost signals when relevant decision parameters are available. However, these outputs should be interpreted as exploratory decision-support signals rather than fully validated operational recommendations. More broadly, this study suggests that carbon price forecasting in heterogeneous markets requires attention not only to model selection, but also to the suitability of different feature groups under market-specific conditions.