The Optuna–LightGBM–XGBoost Model: A Novel Approach for Estimating Carbon Emissions Based on the Electricity–Carbon Nexus

Abstract: With the challenge posed by global warming, accurately estimating and managing carbon emissions has become a key step for businesses, especially power generation companies, in reducing their environmental impact. This paper proposes Optuna–LightGBM–XGBoost, a novel power and carbon emission relationship model that aims to improve the efficiency of carbon emission monitoring and estimation for power generation companies. By exploring the intrinsic link between power production data and carbon emissions, the model opens a new path for “measuring carbon through electricity”, in contrast to the emission factor method commonly used in China. Unit data from power generation companies are processed into structured tabular data, and a parallel processing framework is constructed with LightGBM and XGBoost and optimized with the Optuna algorithm. A multilayer perceptron (MLP) is used to fuse features, enhancing prediction accuracy by capturing characteristics that the individual models cannot detect. Simulation results show that Optuna–LightGBM–XGBoost outperforms existing methods. The mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and coefficient of determination (R²) of the model are 0.652, 0.939, 0.136, and 0.994, respectively. This not only helps governments and enterprises to develop more scientific and reasonable emission reduction strategies and policies, but also lays a solid foundation for achieving global carbon neutrality goals.


Introduction
Against the backdrop of increasingly severe global warming and climate change, the estimation and management of corporate carbon emissions have become key issues in scientific research and policy-making [1]. As the primary tools for assessing and predicting corporate carbon emissions, carbon emission estimation models play a crucial role in formulating emission reduction strategies and achieving carbon neutrality goals [2,3]. As the world's largest carbon market, China requires efficient and accurate carbon emission data collection for the success of large-scale carbon trading. Accurate carbon emission estimation is fundamental for establishing and maintaining a carbon trading system [4,5]. It not only helps governments and enterprises to develop more scientific and reasonable emission reduction strategies and policies, but also lays a solid foundation for achieving global carbon neutrality goals [6].
In general, the emission factor method and the online monitoring method are the two main methods for monitoring carbon emissions. The emission factor method estimates CO₂ emissions indirectly by analyzing operational data and emission coefficients of companies or facilities. The online monitoring method calculates the actual CO₂ emissions directly, and relies on continuous emission monitoring systems (CEMSs) to track the concentration of CO₂ in gas emissions and the velocity of flue gas in real time. In China, emission monitoring environments are complex and diverse, meaning that CEMSs cannot be readily applied in practice [7]. Currently, the emission factor method is the primary method used by third-party verification organizations in China to estimate carbon emissions [8]; however, this method is inefficient and its results lag behind production, because enterprises generally submit their carbon emissions data on a monthly or yearly basis.
At present, coal-fired power plants are one of the largest sources of carbon emissions in China. Coal composition varies widely across China's coal-fired power plants, and mixed combustion is common [9]. There is therefore an urgent need for a way to estimate carbon emissions in China's coal-fired power plants, and this paper proposes a novel carbon emission prediction method for power generation companies. Based on the production data and carbon emission data from 25 coal-fired power plants, the Optuna-LightGBM-XGBoost model, with a parallel processing framework, is proposed. The model aims to use artificial intelligence technology to establish the relationship between electricity consumption and carbon emissions. It uses real-time power data to measure carbon emissions from electricity, which removes the reliance of the current emission factor method on manual post-examination verification, and thereby improves the efficiency of carbon emission estimation and reduces costs. The MAE, MSE, MAPE, and R² of the model are 0.652, 0.939, 0.136, and 0.994, respectively. Compared with the best MAE, MSE, and MAPE achieved by the traditional machine learning models, these errors are further reduced by 26.8%, 47%, and 2.9%, respectively, and R² is increased by approximately 0.6 percentage points.
In summary, the main contributions of this study are encapsulated in the development and implementation of the Optuna-LightGBM-XGBoost model, which is a novel artificial intelligence framework that substantially enhances the accuracy and efficiency of carbon emission estimations. This model uniquely combines the LightGBM and XGBoost algorithms, which are optimized by the Optuna hyperparameter tuning framework. It represents a significant departure from traditional methods that primarily rely on post-analysis of emission factors, by utilizing real-time electricity consumption data to directly estimate emissions. This innovative approach not only improves the timeliness and reliability of emission reporting, but also significantly reduces operational costs and resource consumption through the automation of the estimation process. These advancements are vital for power generation companies navigating stringent environmental regulations and seeking cost-effective compliance strategies. Moreover, the enhanced accuracy and efficiency of our model provide governments and enterprises with improved tools for developing scientific and feasible emission reduction strategies, which are crucial for alignment with global carbon neutrality goals and for supporting the practical implementation of these policies in the power industry.
The rest of this paper is organized as follows: Section 2 introduces related research. Section 3 describes materials and our method. Section 4 describes the performance evaluation of the proposed method. Finally, Section 5 presents the conclusions.

Related Work
In recent years, research has identified two main methods for estimating carbon emissions: the emission factor method and the online monitoring method [10]. The emission factor method estimates CO₂ emissions indirectly by analyzing operational data and emission coefficients of companies or facilities [11]. The online monitoring method calculates the actual CO₂ emissions directly, relying on continuous emission monitoring systems (CEMSs) to track the concentration of CO₂ in gas emissions and the velocity of flue gas in real time [12]. Although both methods for estimating carbon emissions have been recognized internationally [13], the emission factor method is the main method used in China.
To further study the practicality of the direct measurement method in the CO₂ emission accounting of thermal power plants in China, Lin Yue-ting et al. [14] developed an online carbon emission monitoring and management system for coal-fired power plants. Wang Linhan [15] compared the monitoring system and accounting strategy of carbon emissions from thermal power plants. Tan Chao [16] focused on the study of the system design of the direct monitoring method for a certain 300 MW unit. Duan Zhijie and co-workers [17] analyzed the coal-fired power generation company's GHG emissions quantification method. Liu Tonghao et al. [18] proposed a strategy to strengthen the monitoring of GHG emissions from stationary sources in China based on a review of GHG emissions monitoring at home and abroad. All of these studies recommended the use of a continuous carbon dioxide emission monitoring system (CO₂-CEMS) as a basis for estimating carbon emissions, and noted that manual measurements are usually used as a reference method for calibrating the data of the online monitoring system under the current CEMS management framework.
The emission factor method, which estimates CO₂ emissions by combining operational data and specific emission factors, has been widely used in China and EU countries. Meanwhile, the real-time monitoring technique, which allows immediate monitoring of CO₂ emissions through the deployment of continuous emission monitoring systems (CEMSs), is more common in the United States. Nonetheless, both methods have obvious shortcomings. As reliance on annual reports of carbon emissions has become the norm, the emission factor approach suffers from problems including inefficient data collection and significant delays in results. The problem of timeliness is further exacerbated by the additional time required to review the reports once they are submitted. At the same time, the data collected are limited in terms of coverage and level of detail. Real-time monitoring technologies provide instantaneous data, but they risk compromising stability and accuracy because the monitoring equipment is vulnerable to environmental factors such as temperature and humidity [19]. In view of this, it is particularly important to explore new methods that can effectively track and estimate CO₂ emissions in real time.
In the power industry, carbon dioxide is the major greenhouse gas emitted, with sources including the combustion of fossil fuels, desulphurization processes, and purchased electricity [20]. Accounting tools usually rely on data such as the low-level calorific value of the coal combusted and the results of carbon per unit of heat assays. However, accurate carbon emission data are only available once a month, as the assaying step usually lags behind the production activities. In addition, considering the blended coal widely used as fuel in domestic power plants, there may be significant differences between the assay samples and the actual coal burned, which makes it challenging for power plants to realize real-time monitoring of carbon emissions.
Estimating carbon dioxide emissions from electricity consumption data is a well-established approach. Electricity consumption information is not only easily accessible in available real-time datasets, but can also be tracked in real time, which significantly improves the efficiency and timeliness of data collection [21]. Electricity data are considered more reliable due to their higher accuracy and stability. Numerous research results have confirmed that it is feasible to use electricity consumption to estimate CO₂ emissions. As an example, Lai et al. [22] proposed a carbon emission prediction model for the flat glass industry based on electricity consumption. By processing and analyzing the electricity data of China's flat glass industry, the study built an electricity-carbon model using support vector regression (SVR), and experimentally verified the validity and accuracy of the model, which showed that electricity data can be used effectively for carbon emission modeling. Xia et al. [23] proposed an innovative carbon emission estimation method based on the correlation between electricity and carbon emissions and non-intrusive load monitoring (NILM), which is dedicated to improving the accuracy and interpretability of carbon emission estimation in the field of electricity production. The core of the methodology is to decompose the total power consumption of the enterprise, specify the power consumption of each piece of key equipment, and calculate the carbon dioxide emissions accordingly. To strengthen the accuracy of the analysis, the study adopts a two-stage learning structure optimized by a deep learning algorithm, which is validated using actual data from a power plant in China. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and gated recurrent unit (GRU) networks were applied in the construction of the electricity-carbon model, demonstrating the potential of the method in accurately estimating carbon emissions.
Shuhan Zhang et al. [24] proposed a new method for monitoring carbon emissions of cement enterprises based on electricity data using machine learning algorithms. By building single-sample daily and multi-sample annual electricity-carbon monitoring models and comparing nine regression methods, they found that the Lasso model performs the best, and effectively reveals the key influence of the electricity emission factor of clinker production and the percentage of electricity consumption in the clinker section on carbon emissions. Chen et al. [25] proposed a method to monitor corporate carbon emissions using big data on electricity, and verified the effectiveness of the method through a practical case study of 810,000 enterprises in Beijing. Xu et al. [26] proposed an analysis method based on automated machine learning (AutoML) for predicting indirect carbon emissions caused by electricity consumption in multiple wastewater treatment plants.
Overall, although research on the relationship between electricity consumption and carbon emissions using AI techniques is still at a preliminary stage, its potential and broad application prospects have attracted extensive attention from both academia and industry. By accurately analyzing the complex relationship between electricity use and carbon emissions, these studies can help identify the key links between high energy consumption and high emissions and thereby guide the implementation of energy efficiency and emission reduction strategies. They also provide a scientific basis and technical support for responding to the challenges of global climate change and for promoting the goal of carbon neutrality.

Methods
With growing global concern for climate change, the power generation industry, as a major source of carbon emissions, has attracted widespread attention regarding the relationship between electricity consumption in production activities and carbon emissions. The study aims to establish a highly accurate electricity-carbon relationship prediction model by analyzing the production and carbon emission data from 25 coal-fired power generation enterprises and employing the Optuna-optimized LightGBM [27] and XGBoost [28] algorithms. This model not only improves prediction accuracy and robustness, but also offers an effective solution for "measuring carbon through electricity", creating a new path that is different from the commonly used emission factor method in China.
In this study, a comprehensive collection and exploratory analysis of data from 25 coal-fired power generation enterprises were first conducted, focusing on the types and sizes of generating units and related monthly indicators such as electricity generation, power supply, heating supply, and carbon emissions. Data preprocessing included handling missing values, eliminating outliers, and selecting features to ensure the quality and applicability of the data. Unit data from power generation companies were processed into structured tabular data, and a parallel processing framework was constructed with LightGBM and XGBoost. In this framework, hyperparameters are optimized by the Optuna algorithm [29]. The multilayer perceptron (MLP) [30] is used to fuse features to enhance prediction accuracy by capturing characteristics that individual models cannot detect. The electricity-carbon nexus Optuna-LightGBM-XGBoost model is shown in Figure 1.
LightGBM and XGBoost are gradient boosting frameworks that are widely known for their efficiency and effectiveness in handling large-scale data. LightGBM optimizes the traditional gradient boosting decision tree (GBDT) using a histogram-based algorithm for faster training and lower memory usage. XGBoost, on the other hand, provides a regularized model formalization to control over-fitting, which is crucial for predictive accuracy. LightGBM and XGBoost are widely used in various machine learning tasks due to their efficient processing speeds and excellent predictive performance. Optuna is an automated hyperparameter optimization framework that intelligently explores the best combination of hyperparameters by defining the search space and optimization objectives, thereby improving the model's prediction accuracy and robustness. For LightGBM, the selected hyperparameters for tuning included learning_rate, max_depth, n_estimators, num_leaves, and min_child_samples. The search range for Optuna-LightGBM hyperparameters is shown in Table 1. For XGBoost, the selected hyperparameters for tuning included learning_rate, max_depth, n_estimators, and subsample. The search range for Optuna-XGBoost hyperparameters is shown in Table 2.
After conducting hyperparameter tuning with Optuna, based on the search ranges in Tables 1 and 2, the optimal parameters obtained for LightGBM and XGBoost are shown in Tables 3 and 4, respectively. After the hyperparameter tuning was completed, to enhance the accuracy of the predictions and to integrate the strengths of the optimized XGBoost and LightGBM models, we employed a multilayer perceptron (MLP) as a model fusion tool. The motivation behind this method is to leverage the nonlinear learning capabilities of the MLP to learn and integrate the complex relationships between the prediction results of the two gradient boosting tree-based models. By feeding the prediction results of XGBoost and LightGBM into the MLP as new input features, we expect this combined model to capture data patterns that individual models cannot detect, thereby improving the overall predictive performance.
By integrating the LightGBM and XGBoost models, we were able to combine the strengths of these two powerful gradient boosting frameworks. Specifically, the efficient handling of large-scale data and excellent handling of categorical features by LightGBM, along with the advantages of XGBoost in regularization and system optimization, were effectively integrated. Moreover, this fusion strategy not only enhanced the model's generalization ability for unseen data, but also reduced the risk of over-fitting.

Data Collection
The core of this study lies in exploring the relationship between power production and carbon emissions in coal-fired power generation enterprises. For this purpose, we selected 25 representative coal-fired power generation enterprises, and collected and compiled their production and carbon emission data, forming a dataset with 372 detailed data records. Each data record targets an individual power generation unit, ensuring the precision and detail of the research. In terms of data feature selection, this study comprehensively considered variables directly related to power production, including unit type, installed capacity, cooling method, and product type, while focusing on collecting key data such as power generation, power supply, and heating supply. Additionally, the model's predicted output variable, the unit's carbon emission, was included. These selected features not only cover the main aspects of power production, but also differ significantly from the features required by the emission factor method, the carbon emission estimation method commonly used in China. Our goal is to simplify the carbon emission estimation process by constructing an electricity-carbon relationship model for power generation enterprises that directly uses power-related data to estimate carbon emissions. This reduces the cumbersome calculations and the extensive manpower and material resources required by the traditional emission factor method, and makes subsequent verification and validation work more efficient.
Through such a data collection and feature selection strategy, this study aims to break through the limitations of traditional carbon emission estimation methods and provide a new, more direct, and efficient approach to carbon emission estimation. Furthermore, the application potential of this method is not limited to coal-fired power generation enterprises; it can be extended to a wider range of energy production and industrial production fields in the future, contributing to low-carbon development and environmental protection in China and globally.

Data Preprocessing
In this study, we preprocessed the collected data from coal-fired power plant units to ensure the accuracy and reliability of subsequent analyses. The preprocessing steps mainly included checking for missing values, detecting outliers, confirming data types, and checking for duplicates, followed by data normalization before inputting the data into the model.
We first checked for missing values, as data completeness is crucial for ensuring the reliability of analysis results. Through a comprehensive examination of the dataset, we confirmed that all key fields, including power generation, power supply, heating supply, and unit emissions, had no missing values. This check ensured that each piece of data in the 372 unit records was complete, providing a solid foundation for further analysis. However, we found 45 records of unit emissions with zero values and, upon closer inspection, determined that these units did not engage in production activities during the monthly statistical period. Therefore, these data were considered irrelevant and were excluded.
Based on statistical descriptive analysis, we evaluated potential outliers in the data. Specifically, we focused on the maximum and minimum values of numerical fields such as power generation, power supply, heating supply, and unit emissions to identify possible extreme cases in the data. While some extreme values were present, considering the significant differences in the collected unit data volumes, we chose to retain these records to avoid mistakenly deleting valid data points.
We checked the data types of columns in the dataset to ensure they were suitable for subsequent analysis. Major numerical variables such as installed capacity, power generation, power supply, heating supply, and unit emissions were stored in appropriate numerical types, while categorical variables like unit type, cooling method, and product type were represented as object types, requiring no additional data type conversions. Finally, we checked the dataset for duplicate records. The absence of duplicates confirmed the uniqueness of the data.
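The four checks described above can be sketched with pandas on a toy table. The column names, category labels, and values below are hypothetical and chosen only for illustration; the study's actual schema is not given in the text.

```python
import pandas as pd

# Hypothetical column names and toy values (not the study's data).
df = pd.DataFrame(
    {
        "unit_type": ["conventional", "conventional", "other", "conventional"],
        "power_generation": [120.0, 0.0, 85.5, 240.0],
        "power_supply": [113.0, 0.0, 80.1, 228.0],
        "heating_supply": [5000.0, 0.0, 0.0, 12000.0],
        "unit_emissions": [98.0, 0.0, 71.2, 201.5],
    }
)

# 1. Missing values per column.
missing_per_column = df.isna().sum()

# 2. Drop records whose emissions are zero (units idle during the period).
active = df[df["unit_emissions"] != 0].reset_index(drop=True)

# 3. Min/max summary of numeric fields, used to inspect extreme values.
extremes = active.select_dtypes("number").agg(["min", "max"])

# 4. Column dtypes and number of fully duplicated rows.
dtypes = active.dtypes
n_duplicates = int(active.duplicated().sum())

print(int(missing_per_column.sum()), len(active), n_duplicates)
print(extremes)
```

In this sketch extreme values are only inspected, not removed, which mirrors the decision above to retain them.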
After completing the above steps, the data were normalized to ensure that numerical variables were on the same scale during the analysis, reducing the impact of differences in dimensions among variables on the analysis results. The study applied Z-score normalization to key numerical variables in the dataset, including power generation, power supply, heating supply, and unit emissions. Z-score normalization is a common data preprocessing technique designed to transform raw data into a dataset with zero mean and unit variance. The calculation of the Z-score is as shown in Equation (1):
$$z = \frac{x - \mu}{\sigma} \tag{1}$$
where x is the raw value of a variable, μ is the mean of that variable, and σ is its standard deviation.
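As a minimal sketch of this normalization, the function below subtracts each column's mean and divides by its standard deviation. The toy matrix is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption, since the text does not specify which estimator was used.

```python
import numpy as np


def z_score(values: np.ndarray) -> np.ndarray:
    """Standardize each column to zero mean and unit variance."""
    mean = values.mean(axis=0)
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    std = values.std(axis=0, ddof=0)
    return (values - mean) / std


# Toy matrix: rows are records, columns are hypothetical numeric features.
raw = np.array([[100.0, 5.0], [200.0, 15.0], [300.0, 25.0], [400.0, 35.0]])
scaled = z_score(raw)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

A constant column has zero standard deviation and would cause a division by zero, so such columns need to be dropped or handled separately before scaling.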

Exploratory Data Analysis
We conducted an exploratory analysis of the 327 records selected during the data preprocessing stage, after removing 45 records of units that did not participate in production activities.
First, we performed a basic statistical description of the data. Based on the fundamental statistical analysis of the power plant unit data, we find that the average installed capacity of the units is about 270.38 megawatts, with a median of 300 megawatts. This indicates that the installed capacity of most units is concentrated around this level. However, the range of installed capacities (12 to 700 megawatts) reflects a greater diversity. In terms of power generation, power supply, and heating supply, although the average values are 99,212.87 kWh, 94,019.92 kWh, and 548,923.90 GJ, respectively, the distribution of the data is right-skewed. This means that a few higher values have raised the average, especially for heating supply, where the maximum value reached 8,533,608 GJ, indicating that some units have particularly strong heating capabilities.
Subsequently, we conducted a visualization analysis of univariate data. For continuous variables such as installed capacity, power generation, power supply, heating supply, and unit emissions, we used histograms to observe their distribution. The distribution of continuous variables is shown in Figure 2, where the blue line represents the kernel density estimation (KDE) curve.
For categorical variables, such as unit type and product type, bar charts are used to understand the frequency of each category. The distribution of categorical variables is shown in Figure 3. Among all the types of units considered, conventional coal-fired units are far more numerous than the other types, accounting for an absolute majority. Furthermore, from the perspective of product type, the vast majority of power generation companies choose cogeneration as their main product form, a trend that is very common in the current field of energy production. This finding not only reveals the mainstream technology trends in the current power industry, but also points to possible future directions, namely, further optimizing cogeneration technology to improve energy efficiency and reduce environmental impact.
For bivariate analysis, which explores the relationship between input features and the target variable (unit emissions), scatter plots are used to identify the relationships between them, as shown in Figure 4. Given the close connection between power generation and power supply determined by generation efficiency, the numerical closeness of these two variables explains their highly similar impact on unit emissions. In the case of heating supply, a clear proportional effect on unit emissions is observed, indicating that as heating supply increases, unit emissions also rise.
For bivariate analysis, which explores the relationship between input features and the target variable (unit emissions), scatter plots are used directly to identify the relationships between them, as shown in Figure 4. Because power generation and power supply are closely connected through generation efficiency, the two variables are numerically close and have a highly similar impact on unit emissions. In the case of heat supply, a clear proportional effect on unit emissions is observed: as heat supply increases, unit emissions also rise.

In the correlation analysis of the main numerical characteristics of power plant units, the correlation graph presented in Figure 5 reveals an extremely strong positive correlation (correlation coefficient of 0.9997) between power generation and power supply. The two variables rise and fall almost in lockstep, which is consistent with the high degree of similarity in their impact on unit emissions shown in Figure 4. This finding aligns with the conventional understanding that as electricity generation increases, electricity supply increases accordingly. The installed capacity also shows a strong positive correlation with the amount of electricity generated and supplied (correlation coefficients of 0.8641 and 0.8598, respectively), suggesting that units with larger installed capacity generate and supply more electricity. The correlation between installed capacity and emissions (correlation coefficient of 0.3825) is not as strong as the correlation between electricity generation and electricity supply, but it still shows that units with larger installed capacity tend to have higher emissions, which suggests that environmental protection must be taken into account when increasing power generation capacity.

Comparatively, the correlation between heat supply and both electricity generation and supply is relatively weak, with correlation coefficients of 0.3557 and 0.3653, respectively. However, a strong positive correlation exists between heat supply and unit emissions, as evidenced by a correlation coefficient of 0.9181 shown in Figure 4. This indicates that emissions increase with heating. The results highlight the challenges in balancing production efficiency with environmental protection, as increasing heat supply significantly impacts emissions. The moderate positive correlation between unit emissions and electricity generation and supply (correlation coefficients of 0.6820 and 0.6886, respectively) further emphasizes the need for effective measures to control emissions and protect the environment as electricity production increases. This analysis not only reveals the interdependence of key indicators of power plant unit operation, but also emphasizes the importance of considering the environmental impact of power production while pursuing its efficiency, thus providing data support for optimizing production processes and formulating appropriate environmental policies.
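As a minimal sketch of how a coefficient such as those quoted above can be computed, the following assumes the reported values are Pearson correlation coefficients (this section does not name the coefficient type). The function name `pearson` and the sample values are hypothetical and are not taken from the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, NOT the study's data: supply tracks generation closely,
# so the coefficient comes out very close to 1.
generation = [100.0, 220.0, 310.0, 405.0, 520.0]
supply = [94.0, 207.0, 292.0, 381.0, 490.0]
r = pearson(generation, supply)
print(f"r = {r:.4f}")
```

A coefficient near 1, as in this toy case, means one variable carries almost the same information as the other, which is the situation described for power generation and power supply.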

Feature Engineering
In power plant unit data analysis, effective feature engineering is a crucial step towards achieving high-quality model predictions. This section describes the feature engineering strategy implemented in this study, including the encoding of categorical variables, the creation of new features, and feature selection, all of which were aimed at enhancing the model's predictive accuracy and interpretability.
"Unit type", "cooling method", and "product type" are categorical variables within the dataset, and using these non-numeric features directly could limit the effectiveness of certain algorithms. Hence, we employed one-hot encoding to convert these categorical variables into numerical form, transforming each category into a new binary feature that the model can process more easily.
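The encoding step can be illustrated with a small sketch. This is a plain-Python illustration of one-hot encoding, not the authors' implementation (the library they used is not stated in this section). The helper `one_hot_encode`, the category labels "type A" and "type B", and the numeric values are hypothetical.

```python
def one_hot_encode(records, column):
    """Return copies of the records with `column` replaced by binary indicators."""
    # Collect the distinct categories in a stable (sorted) order
    categories = sorted({rec[column] for rec in records})
    encoded = []
    for rec in records:
        # Copy every field except the categorical one
        row = {key: value for key, value in rec.items() if key != column}
        # Add one 0/1 field per category
        for cat in categories:
            row[f"{column}={cat}"] = 1 if rec[column] == cat else 0
        encoded.append(row)
    return encoded


# Hypothetical records, NOT the study's data
units = [
    {"unit type": "type A", "power generation": 120.0},
    {"unit type": "type B", "power generation": 340.0},
    {"unit type": "type A", "power generation": 95.0},
]
encoded_units = one_hot_encode(units, "unit type")
for row in encoded_units:
    print(row)
```

Each record ends up with exactly one indicator set to 1, so no artificial ordering is imposed on the categories.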
In the process of feature engineering, we identified the power supply efficiency of the unit as a key indicator for predicting the unit's emissions. We therefore constructed a new feature named "power supply efficiency", calculated as the ratio of "power supply" to "power generation". Power supply efficiency reflects the capability to convert produced electricity into electricity that is available for end-users. High efficiency indicates optimized energy utilization, whereas low efficiency may suggest energy losses during conversion or transmission. The calculation of power supply efficiency is shown in Equation (2).

$$\text{power supply efficiency (\%)} = \frac{\text{power supply}}{\text{power generation}} \times 100 \qquad (2)$$

Furthermore, we constructed a feature known as "equivalent full load hours", which is derived by dividing the electricity generation by the installed capacity. The result can be interpreted as the equivalent full load operating time, that is, the duration for which a power generating facility would need to operate at its maximum capacity (installed capacity) without interruption to produce the actual amount of generated electricity. Equivalent full load hours provide a means to assess the efficiency and utilization of a power generating facility over a certain period. The calculation is shown in Equation (3).

$$\text{equivalent full load hours} = \frac{\text{power generation}}{\text{installed capacity}} \qquad (3)$$

Finally, we selected features for model training. Because the "cooling method" is "open water cooling cycle" for almost all units, this feature contributes minimally to model training and was excluded. The remaining features were retained. The two newly constructed features, "power supply efficiency" and "equivalent full load hours", were also included, with the aim of enhancing the predictive performance of the model.

Performance Analysis
To assess the validity and generalization ability of the proposed model, we divided the collected dataset in the ratio of 8:2. Specifically, 80% of the data were randomly selected as the training set for model training and optimization, while the remaining 20% were held out as the test set for independent validation of the model's prediction performance. Evaluating on this independent test set checks whether the model's performance on unseen data matches its performance during training. Our proposed model was compared against support vector machine (SVM), K-nearest neighbors (KNN), random forest, multilayer perceptron (MLP), LightGBM, and XGBoost, as well as their versions optimized using Optuna.

The target values, namely unit emissions, have a large variance, which can adversely affect distance-based machine learning models such as KNN and SVM and can lead to gradient vanishing or explosion during the model fusion stage using MLP. To prevent this, we expressed unit emissions in units of "ten thousand tons". This makes the scale of the data more uniform, which facilitates model processing and stabilizes model training. It also avoids the step of denormalizing the model outputs, so prediction results can be displayed directly in "ten thousand tons", which simplifies post-processing and makes the results easier to interpret. Model comparisons are based on the following metrics: mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and coefficient of determination (R²), as shown in Table 5.

In this comparison, our Optuna-LightGBM-XGBoost model outperforms the standalone SVM, KNN, random forest, MLP, LightGBM, and XGBoost models as well as their Optuna-optimized versions. Specifically, our model achieves an MAE of 0.652, about 27% lower than the 0.891 of the closest competitors, Optuna-LightGBM and Optuna-XGBoost. Its MSE of 0.939 is about 47% lower than the 1.772 of Optuna-LightGBM. The MAPE is only 0.136, an improvement of about 16.6% over the 0.163 of Optuna-LightGBM. In terms of R², our model achieves 0.994, which is 0.006 (about 0.6 percentage points) higher than the 0.988 of the second-best model, Optuna-LightGBM.
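The four metrics and the relative improvements quoted above can be made concrete with a short sketch. It uses the standard textbook definitions, with MAPE returned as a fraction; whether the paper reports MAPE as a fraction or as a percentage is not stated here, so that choice is an assumption. The function names and the toy values are hypothetical, while the figures passed to `relative_reduction` are the ones quoted in the text.

```python
def regression_metrics(y_true, y_pred):
    """MAE, MSE, MAPE (as a fraction), and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE requires every true value to be nonzero
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "MAPE": mape, "R2": r2}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Toy values, NOT the study's data
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [11.0, 19.0, 32.0, 38.0])
print(metrics)

# Reproducing the relative improvements from the figures quoted in the text
print(round(relative_reduction(0.891, 0.652), 1))  # MAE vs. Optuna-LightGBM
print(round(relative_reduction(1.772, 0.939), 1))  # MSE vs. Optuna-LightGBM
print(round(relative_reduction(0.163, 0.136), 1))  # MAPE vs. Optuna-LightGBM
```

Because MSE squares each error, a few large misses on extreme values inflate it far more than MAE, which is one reason the two metrics can show different relative improvements for the same pair of models.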
The improvement is even more pronounced when compared to conventional models; for example, compared to the original SVM model, our MAE and MSE are improved by 63.8% and 90.5%, respectively, and MAPE and R² are also significantly improved. This substantial gain highlights the advantages of our model in terms of accuracy, error control, and ability to account for data variability. Compared to models such as KNN and random forest, our model also demonstrates a clear lead, both in accuracy and in capturing complex data structures.
By combining the strengths of LightGBM and XGBoost and using Optuna for hyperparameter fine-tuning, our model not only improves prediction accuracy and reduces errors, but also explains more of the variability in the data, demonstrating the potential of ensemble learning and parameter optimization for improving performance on machine learning tasks. This advantage holds across all four metrics and in direct comparison with both the traditional and the optimized models, providing an efficient and reliable solution to difficult prediction problems.
To visualize the predictive results, we used scatter plots to show the relationship between model predictions and actual values. By comparing the closeness of the points to the perfect prediction line (y = x), we assessed each model's predictive accuracy. The prediction results of the models listed in Table 5 are shown in Figure 6.

The traditional SVM, KNN, random forest, LightGBM, and XGBoost models exhibit wider distributions of data points around the perfect prediction line, especially in the prediction of extreme values. This suggests variability in performance across different data segments. In addition, MLP exhibits deviations within smaller data segments, potentially affecting its accuracy in specific scenarios. The Optuna-SVM, Optuna-KNN, Optuna-random forest, Optuna-LightGBM, and Optuna-XGBoost models show both tighter aggregation of predictions around the line y = x and a reduced spread of outliers, highlighting the benefits of hyperparameter optimization. The data points of the Optuna-LightGBM-XGBoost model lie closer to the perfect prediction line than those of the baseline models (SVM, KNN, random forest, MLP, LightGBM, and XGBoost) and of their Optuna-optimized versions. The points are highly concentrated around the line and closely coincide with it, indicating excellent predictive accuracy. In contrast, the scatter plots of the baseline models and their optimized versions show a wider distribution of data points and more outliers, with greater fluctuations in the predictions, especially for extreme values. The Optuna-LightGBM-XGBoost model not only matches the overall trend better, but also demonstrates superior consistency, stability, and accuracy.
The experimental results validate the performance of our proposed Optuna-LightGBM-XGBoost fusion model in establishing the relationship between electricity and carbon emissions. The model achieved the best results across all key performance indicators, including mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and the coefficient of determination (R²), highlighting its precision and generalization capability. In the comparison of the predicted results to the perfect prediction line, our model adheres closely to this line across both lower and higher target value ranges, further confirming its predictive accuracy and robust generalization. Hence, estimating the carbon emissions of coal-fired power plants using electricity production data is not only feasible but also effective, underscoring the tight link between electricity production and carbon emissions. This is crucial for enhancing the accuracy of carbon emission estimates, optimizing the estimation process, and supporting sustainable development goals.

Conclusions
As artificial intelligence technology rapidly advances, its potential applications in environmental protection and sustainable development are increasingly recognized. In an in-depth analysis of production and carbon emission data from 25 coal-fired power plants, this study not only explored the complex connections between electricity generation, power supply, heat supply, and carbon emissions, but also enhanced the predictive accuracy and robustness of the model through the optimization and fusion of machine learning models. Specifically, by finely tuning hyperparameters through the Optuna framework and using a multilayer perceptron (MLP) as the fusion module to integrate LightGBM and XGBoost, we not only improved the performance of individual models, but also successfully built a comprehensive predictive model, providing a practical solution for "estimating carbon emissions from electricity consumption".
While this study has achieved certain successes in constructing electric-carbon relationships, it also faces some non-negligible limitations. A significant limiting factor is that the data collection process relies primarily on manual summarization, allowing us to collect only slightly more than 300 data records. Such a limited dataset poses a challenge to supporting the construction of more complex analysis models and is a major reason we could not use deep learning methods in our research. This issue is especially prominent against the backdrop of the increasing importance of data-driven research.
However, with the continuous progress of information technology and the acceleration of digital transformation in enterprises, it is expected that data acquisition will become more convenient in the future, with significant increases in both breadth and depth. This will not only help overcome the existing data limitations but also provide strong data support for the use of more complex artificial intelligence models to deeply explore the relationship between electricity consumption and carbon emissions.
We will optimize the data collection process in future research and explore automated data collection methods to significantly improve the efficiency and quality of data collection. Additionally, we will consider applying advanced modeling techniques such as deep learning, aiming to construct more accurate and higher-performance enterprise electric-carbon relationship models. With these efforts, we hope to provide enterprises with a more accurate carbon emission estimation model, thereby contributing to the realization of more sustainable development goals.

Figure 2. Distribution of continuous variables.

Figure 5. Correlation analysis chart for power plant units.

Figure 6. Comparison of predictive results and actual emissions across different models.

Table 5. Results of predictive performance comparative analysis among different models.