Classifying the Level of Energy-Environmental Efficiency Rating of Brazilian Ethanol

The present study aimed to assess and classify energy-environmental efficiency levels to reduce greenhouse gas emissions in the production, commercialization, and use of biofuels certified by the Brazilian National Biofuel Policy (RenovaBio). The parameters of the level of energy-environmental efficiency were standardized and categorized according to the Energy-Environmental Efficiency Rating (E-EER). The rating scale ranged from lower efficiency (D) to highest efficiency (A+). The J48 decision tree and naive Bayes classification algorithms were used to build the prediction models. Both the J48 decision tree and the naive Bayes classifier produced models capable of estimating the efficiency level of Brazilian ethanol producers and importers certified by RenovaBio. The rules generated by the models can assign the efficiency scores to classes according to the scale discretized into high efficiency (Classification A), average efficiency (Classification B), and standard efficiency (Classification C). These results might be used to generate an ethanol energy-environmental efficiency label for end consumers and resellers of the product, to assist purchase decisions based on its performance. Naive Bayes was the best classification model when compared to the J48 decision tree. The classification of the Energy-Environmental Efficiency Rating levels using the naive Bayes algorithm produced a model capable of estimating the efficiency level of Brazilian ethanol to create labels.


Introduction
Developing renewable energy is one of the leading global interests in promoting sustainability and environmental quality, and modern electricity grids worldwide have begun to rely more heavily on renewable energy sources [1][2][3]. Ethanol is a renewable fuel produced by the fermentation of sugarcane extract and molasses. The product has a lower carbon footprint, is biodegradable, and has greater energy-environmental efficiency than oil due to the sustainability of its production chain, which makes better use of natural resources [4][5][6][7][8][9][10]. Ethanol is one of the main biofuels consumed in Brazil, where it partially (or entirely) replaces fossil fuels in flex-fuel vehicle engines [11]. The addition of 27% anhydrous ethanol to gasoline (gasoline C) has been mandatory in Brazil since 2015 [8,12,13,14]. This initiative expanded the country's consumption of biofuels and increased the sustainability of the energy matrix. It also supported the goal of reducing GHG emissions by 37% by 2025, compared to 2005 levels [2,12,15,16]. By contrast, some European Union countries also include electromobility (electric cars) in their GHG emission reduction forecasts [17].

Classifying the Biofuel Energetic-Environmental Efficiency
The biofuel energy-environmental efficiency levels were classified based on the Energy-Environmental Efficiency Rating (E-EER). This rating results from the certificates of efficient production or import of biofuels, is reported through the RenovaCalc calculation, and is linked to the volume of biofuel produced and commercialized, generating Decarbonization Credits (CBio) in the RenovaBio program. The energy-environmental efficiency level parameters were standardized according to Table 1. The dataset was categorized according to the Energy-Environmental Efficiency Rating (RenovaBio) [12], with the rating scale ranging from lower efficiency to highest efficiency (high efficiency +). An example of the design of the energy-environmental efficiency label for the performance of a certificate is given in Figure 1. The E-EER score of the certification for efficient production of biofuel, made available by the ANP, was discretized into level categories (classes) during the pre-processing of the dataset. These classes were then used for the classification (data mining) and for creating one of the energy-environmental performance labels (post-processing) of Brazilian ethanol. This labeling system allowed ethanol to be classified into five classes (A+, A, B, C, and D), so that consumers can differentiate the ethanol they consume by producer, region, or state.
The RenovaBio Program allows producers and importers to declare the energy-environmental efficiency of their product, which is economically attractive for decarbonization and for the competitiveness of biofuels in the oil market, with a complex and solid structure (Figure 2) [2,12,18,33,34]. The label may be shown to consumers at fuel pumps with a validity of one to three years, the same validity applied to the Certification of Efficient Production of Biofuels when approved by the ANP. It can also endorse the information and increase transparency in the biofuel market at the consumer level, helping consumers make a purchase decision. The objective of energy-environmental labeling is to encourage Brazilian sugar and alcohol industries to develop innovations and improvements beyond the minimum levels of efficiency. It is expected that more ethanol producers will adhere to the ANP certifications of RenovaBio [2,12,19,33,34] and, consequently, that the labeling system can be improved with the inclusion of more data on the platform.

Classification of Model Prediction
In this study, data mining techniques (algorithms) were applied to classify the levels (classes) of energy-environmental efficiency. These techniques search for strategic information and extract the implicit information that exists in databases, contributing to the identification and classification of new patterns [35][36][37]. The steps of the data mining method were selection, pre-processing, data mining, and post-processing (knowledge filtering, interpretation and explanation, evaluation, and knowledge integration) for knowledge discovery from the classifiers [36,37]. The results obtained could be used in information management, information processing, decision making, and process control.
The data contained in the databases could be used to learn a specific target concept [35][36][37][38]. Classification, one of the tasks performed by data mining and machine learning techniques, builds models that can be applied to unclassified data to assign them to classes, relating the meta attribute (whose value is to be predicted) to a set of predictor attributes [35][36][37][38].
The data were obtained from the ANP database for the registration of certificates of efficient production or import of biofuels approved and included in the RenovaBio program in 2019 [12,33,34]. We considered only anhydrous ethanol and hydrated ethanol products, which generated two products for the same biofuel producer or importer. The data were pre-processed in Excel spreadsheets for further processing in the data mining software Weka (Waikato Environment for Knowledge Analysis), Version 3.8.4 [39][40][41][42]. The attributes used to build the predictive model were: "biofuel-type", "state", "eligible-volume (%)", "emission factor", "Energy-Environmental Efficiency Rating", and "LER" (Level of Efficiency Rating). Figure 3 presents the modeling process used to classify the Energy-Environmental Efficiency Rating. A classifier is a mapping from unlabeled instances to (discrete) classes. Classifiers have a form (for example, a classification tree) plus an interpretation procedure (including how to handle unknown values). Most classifiers can also provide probability estimates (or other likelihood scores), which can be thresholded to yield a discrete class decision, thereby taking into account a cost/benefit or utility function [43,44].
During the pre-processing of the data, the dataset was extracted from the RenovaBio Program (ANP) platform, and only the data on the ethanol product were selected and organized in a spreadsheet. The supervised filter Resample was applied to maintain the distribution of classes in the subsample and to approach a uniform distribution, for comparison with the data not submitted to the filter (noResample). The Resample filter produces a random subsample of a dataset using sampling either with or without replacement. It can be configured to maintain the class distribution in the subsample or to bias the class distribution toward a uniform distribution.
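The resampling idea described above can be illustrated with a minimal Python sketch. It is not the Weka implementation and does not reproduce the settings used in the study, which are not reported here. The function name `resample`, its parameters, and the class counts in the usage lines are assumptions made for illustration only. The sketch samples with replacement inside each class and interpolates each class's share between its original proportion and a uniform share.

```python
import random
from collections import Counter, defaultdict

def resample(rows, labels, bias_to_uniform=1.0, sample_fraction=1.0, seed=1):
    """Draw a class-aware random subsample with replacement.

    bias_to_uniform = 0.0 keeps the original class proportions;
    bias_to_uniform = 1.0 gives every class the same number of instances.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    total = round(len(rows) * sample_fraction)
    uniform_share = 1.0 / len(by_class)
    sampled_rows, sampled_labels = [], []
    for label in sorted(by_class):
        members = by_class[label]
        original_share = len(members) / len(rows)
        # Interpolate between the original and the uniform class share.
        share = (1 - bias_to_uniform) * original_share + bias_to_uniform * uniform_share
        count = round(total * share)
        sampled_rows.extend(rng.choices(members, k=count))  # with replacement
        sampled_labels.extend([label] * count)
    return sampled_rows, sampled_labels

# Hypothetical, imbalanced toy labels (NOT the study's data).
labels = ["A"] * 14 + ["B"] * 9 + ["C"] * 4
rows = list(range(len(labels)))
_, balanced = resample(rows, labels, bias_to_uniform=1.0)
print(Counter(balanced))
```

With the bias set to 1.0, the minority class in the toy data is oversampled until the three classes have equal counts, which is the effect the text attributes to biasing the class distribution toward a uniform one.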
In supervised learning, each data input object is preassigned a class label. The main task of supervised algorithms (J48 and naive Bayes) is to learn a model that produces the same labeling for the provided data [43][44][45]. The decision tree is a widely used classification algorithm that uses attribute values to partition the decision space into smaller subspaces iteratively, and the decision process can be represented graphically as a tree. Each possible decision is covered and represented as a branch, and a complete decision process is essentially a path from the root node to a leaf [43,44]. Naive Bayes is a classification algorithm widely used due to its simplicity, effectiveness, and robustness. It is a probabilistic approach based on the assumptions that the attributes are independent of each other and that their weights are equally important [46]. Bayesian classifiers can better represent the complex relationships between input variables found in real problems [46]. Probabilistic inference can be studied as an approach based on the assumption that decision variables follow probability distributions. The essence of a Bayesian classifier is to estimate the probabilities of all alternative models or hypotheses, given data as evidence, and then to find the most probable classification to be assigned to each new input [44,46].
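The naive Bayes principle described above (pick the class with the highest probability given the attribute values, treating attributes as independent) can be sketched in a few lines of Python. This is a simplified illustration, not the Weka classifier used in the study: it handles only categorical attributes, uses add-one (Laplace) smoothing as an assumed choice, and the attribute values and labels in the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return the class with the highest log-posterior for one instance."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing avoids zero probabilities for unseen values.
            score += math.log((count + 1) / (n + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy instances: (E-EER band, state) -> class. NOT the study's data.
rows = [("high", "SP"), ("high", "MS"), ("high", "GO"),
        ("low", "SP"), ("low", "MS"), ("low", "GO")]
labels = ["A", "A", "A", "B", "B", "C"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("low", "SP")))
```

Working in log space keeps the product of many small probabilities numerically stable, which is a common design choice for this kind of classifier.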
In the post-processing step, the performance metrics of the algorithms were used for filtering, interpretation and explanation, evaluation, and integration of the knowledge discovered by the classifiers. Post-processing procedures usually include various pruning routines, rule quality processing, rule filtering, rule combination, model combination, or even knowledge integration [47].
The last step was to evaluate the prediction, and this analysis was based on the performance values obtained by testing the prediction model [48]. The performance evaluation measures used for the prediction models were the confusion matrix, accuracy, precision, recall, and, for testing with the resample filter, the correlation coefficient between classes (Matthews Correlation Coefficient (MCC)). The Kappa statistic measured the learning capacity of the algorithm.
The confusion matrix presents the results obtained during the test phase of the model, and it is used with models built by classification algorithms. The confusion matrix of a hypothesis offers an adequate measure of the classification model, by showing the number of actual classifications versus the predicted classifications for each class, over a set of instances. The number of correct answers for each class is located on the main diagonal of the matrix, and the other elements represent errors in the classification.
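As an illustration of the confusion matrix described above, the following minimal Python sketch tallies actual versus predicted classes and reads the accuracy off the main diagonal. The function names and the toy labels are assumptions for illustration and are not the study's data.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of instances on the main diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical toy labels (NOT the study's data).
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "C", "B", "C"]
cm = confusion_matrix(actual, predicted, ["A", "B", "C"])
for row in cm:
    print(row)
print(accuracy(cm))
```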
The precision represents the proportion of instances assigned to a class that were classified correctly. The counts of correctly classified and incorrectly classified instances determine the predicted accuracy, since they display the correct and incorrect classifications obtained by the algorithm (Equation (1)).
\[ \text{Precision} = \frac{TP}{TP + FP} \quad (1) \]
where TP = True Positive; FP = False Positive. The sensitivity (recall) is the proportion of the instances of a class that the model identified correctly. Like accuracy and precision, its value varies from 0 to 1, with values closer to 1 indicating a prediction model with good performance (Equation (2)).
\[ \text{Recall} = \frac{TP}{TP + FN} \quad (2) \]
where TP = True Positive; FN = False Negative. The Kappa statistic is a metric that compares the observed agreement with the agreement expected by random chance. It is a measure used to deal with multi-class and unbalanced class problems. The Kappa statistic can be defined as a measure of the degree of agreement between two categorized datasets. The Kappa result typically varies between 0 and 1 (negative values indicate agreement below chance). The higher the Kappa value, the stronger the agreement [49] (Equation (3)).
\[ \kappa = \frac{P_O - P_E}{1 - P_E} \quad (3) \]
where $P_O$ = proportion of observed agreements; $P_E$ = proportion of agreements expected by chance. The Matthews Correlation Coefficient (MCC) is a correlation coefficient between the actual and predicted classes and represents a measure of quality. Unlike accuracy, precision, and sensitivity, its value ranges from −1 to 1, where values closer to −1 indicate a poor prediction model, values equal to 0 indicate that the prediction model is entirely random, and values closer to 1 indicate a prediction model with good performance (Equation (4)).
\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (4) \]
where TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative.
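The four measures defined above (precision, recall, Kappa, and MCC) can be computed from a multi-class confusion matrix as in the sketch below. The per-class values use a one-vs-rest reading of TP, FP, FN, and TN, which is an assumption about how the per-class figures were obtained. The function names and the toy matrix are illustrative and are not taken from the study's tables.

```python
import math

def per_class_metrics(matrix, k):
    """One-vs-rest precision, recall, and MCC for class index k.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually otherwise
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc

def kappa(matrix):
    """Kappa = (P_O - P_E) / (1 - P_E) over the whole matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(n)) / total
    p_e = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
              for i in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy matrix for classes A, B, C (NOT the study's results).
toy = [[5, 0, 0],
       [0, 3, 1],
       [0, 1, 2]]
print(per_class_metrics(toy, 2))
print(kappa(toy))
```

The guards against zero denominators return 0.0 by convention when a class has no predicted or no actual instances, a case that matters for small minority classes.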

Classification of the Energy-Environmental Efficiency Level for Biofuels
The If-Then classification rules for the energy-environmental efficiency level of biofuels are presented in this section.
The decision tree generated by the J48 algorithm presented the following classification rules (Figure 4): If the Energy-Environmental Efficiency Rating (E-EER) was higher than 60.3, then the classification was A (high efficiency). If the E-EER was less than or equal to 60.3, then the classification depended on the state where the ethanol was produced. If the state was Goiás (GO), then the classification was C (standard efficiency). If the state was São Paulo (SP), then the classification was B (average efficiency). If the state was Mato Grosso do Sul (MS), then the classification was B (average efficiency). The results indicated that the model classified with precision above 60. The performance of the classifiers in predicting the classes of the E-EER level using the J48 decision tree algorithm (Table 2) showed 74.07% of instances classified correctly and 25.93% classified incorrectly, and the learning capacity of the algorithm was 0.56 by the Kappa statistic. Class A showed the best True Positive (TP) rate (100%), with a False Positive (FP) rate of 0.07, a precision of 0.93, and high recall. However, for Class C, the precision was zero, as no instance of this class was classified correctly. The performance of the naive Bayes algorithm showed 81.48% of instances classified correctly and 18.52% classified incorrectly, with a learning capacity of 0.70 by the Kappa statistic. It classified Classes A and B with high performance, with an accuracy of 1.00 and 0.73, respectively (Table 3). Comparing the two algorithms, naive Bayes presented better performance indexes than the J48 decision tree. The decision tree generated by the J48 algorithm for the dataset with the application of the resample filter (Figure 5) presented the following classification rules. If the Energy-Environmental Efficiency Rating (E-EER) is higher than 60.1, then the classification is A (high efficiency).
If the Energy-Environmental Efficiency Rating (E-EER) is less than or equal to 60.1, then the classification depends on the eligible volume (%) of the biomass. If the eligible volume is higher than 97.43, then the classification is B (average efficiency). If the eligible volume is less than or equal to 97.43, then the classification is C (standard efficiency) (Figure 5). The results indicated that the use of the resample filter gave greater weight to the eligible volume attribute during the evaluation of the E-EER and that ethanol production depended directly on this volume. The performance of the classifiers in the prediction of the classes of the E-EER level using the J48 decision tree algorithm with the application of the resample filter (Table 4) presented 81.48% of instances classified correctly and 18.52% classified incorrectly, and the learning capacity of the algorithm was 0.70 by the Kappa statistic. Classes A and B showed high prediction with values of 1.00 and 0.78, respectively. However, the value for Class C was 0.40.
Table 4. Classifier performance for scale class prediction models of the level of the environmental efficiency of biofuel with the resample filter.
The performance of the naive Bayes algorithm showed 77.78% of instances classified correctly and 22.22% classified incorrectly, with a learning capacity of 0.64 by the Kappa statistic (Table 5). It classified Classes A and B with high performance, with an accuracy of 0.77 and 0.90, respectively (Table 5). The J48 decision tree showed better performance indexes when compared to naive Bayes. The confusion matrix for the J48 decision tree algorithm presented results with positive gains for the classes when the resample filter was applied, going from 20 to 22 hits, specifically for Class C (Table 6). This gain with the resample filter improved the balance, or distribution, of the dataset, as shown in Table 4.
The results of the Matthews Correlation Coefficient (MCC) confirmed this gain, with values of 1.00 for A, 0.85 for B, and 0.79 for C.
Table 6. Confusion matrix of the classification of the level of energy-environmental efficiency with the J48 decision tree.
The confusion matrix for the naive Bayes algorithm also showed results with positive gains for the classes when the resample filter was applied, going from 20 to 21 hits, specifically for Class C, with two hits (Table 7). However, this gain was not accompanied by good accuracy when we observed the results of the Matthews Correlation Coefficient (MCC) (Table 5), which presented values of 0.80 for A, 0.64 for B, and 0.41 for C; the J48 decision tree algorithm performed better under the same analysis condition. The results showed that the classification using the naive Bayes algorithm was better than J48 in the approach without the resample filter. It classified the minority class (C) better and presented a higher degree of agreement (Kappa statistic). It also indicated higher performance than the J48 decision tree algorithm by the Matthews correlation coefficient. However, the use of the resample filter in both algorithms improved the distribution of classes in the confusion matrix (Tables 6 and 7), especially for Class C.
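The If-Then rules reported in this section for the two J48 trees can be written directly as small functions. The thresholds (60.3, 60.1, and 97.43) and the state rules come from the text; the function names and the handling of states that the text does not list (returning None) are assumptions made for this sketch.

```python
def classify_without_resample(e_eer, state):
    """Rules of the J48 tree built without the resample filter (Figure 4)."""
    if e_eer > 60.3:
        return "A"  # high efficiency
    rules_by_state = {"GO": "C", "SP": "B", "MS": "B"}
    # States not listed in the text have no rule here (assumption: None).
    return rules_by_state.get(state)

def classify_with_resample(e_eer, eligible_volume):
    """Rules of the J48 tree built with the resample filter (Figure 5)."""
    if e_eer > 60.1:
        return "A"  # high efficiency
    # Below the E-EER threshold, the eligible volume (%) decides the class.
    return "B" if eligible_volume > 97.43 else "C"

print(classify_without_resample(58.0, "GO"))
print(classify_with_resample(58.0, 99.0))
```

Written this way, the difference between the two trees is easy to see: below the E-EER threshold, the first tree splits on the producing state, while the second splits on the eligible volume.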

Discussion
The decision tree indicated the importance of the producing state. The E-EER value was higher for the most productive states (SP, MS). However, the number of certified Brazilian ethanol producers was still small relative to the number of producing states and the volume of ethanol produced in Brazil. The classification of the energy-environmental efficiency of Goiás state did not reflect its status as one of the states that produces the most ethanol in Brazil [50]. The number of certified producers was still lower than it should be. The RenovaBio public policy was implemented in 2018 and has so far seen little adherence by producers and importers in the producing states.
The comparison of the J48 decision tree and naive Bayes algorithms, applied to the dataset without using filters, showed differences in the performance indicators. The naive Bayes algorithm, when compared to the J48 decision tree, presented better performance results, predicted Class C better (50% higher in the TP rate), and had better learning capacity (higher Kappa statistic value), indicating that the prediction model performed well with this algorithm. However, the sensitivity of the model decreased. When the Resample filter was applied in the pre-processing of the data, the J48 decision tree algorithm showed the best performance compared to naive Bayes, as the Kappa statistic was high and the MCC was higher for all classes, indicating that the prediction model performed well.
Data mining techniques are also applied to the environmental impacts of sugarcane production, to predict the energy produced and the environmental impacts. Artificial intelligence methods, artificial neural networks, and adaptive neuro-fuzzy inference system models are also used to predict the life cycle environmental impacts and the energy output of sugarcane production on planted farms [51].
Integrated approaches based on complex systems, which forecast the growth of sugarcane from meteorological parameters using extreme learning machines and neural networks, were able to provide a more generalized forecasting model for the growth of sugarcane, bringing benefits to industry and the community [52].
Brazil is a major producer of sugarcane and a major consumer of ethanol, with intensive production to meet the demand for biofuel (clean energy) and to reduce the use of fossil fuel (oil). However, several public policy efforts must work in synergy regarding the growing consumption of biofuel and the sustainable development of the sugarcane chain, mainly regarding its energy-environmental efficiency, use of inputs, and agricultural processes [5]. The ethanol energy-environmental performance labels are applicable in this context, as they would help consumers understand how the chain is evaluated, mainly in terms of environmental impact, and would also contribute to the transparency of public sector policies.
The most evaluated and used energy efficiency labels are those for buildings, appliances, equipment, and lamps. These labeling systems are part of public energy-saving policies; since their implementation, there have been improvements in the evaluation standard, and they have also impacted consumption behavior [53][54][55][56][57][58][59][60]. The implementation of energy efficiency measures can guarantee a sustainable economy. In this context, the energy efficiency labeling program for buildings is generally designed with performance processes and standards, with a rigorous database. Eco-labeling is another system, with an approach based on the environmental performance of products, that also influences consumer choice [61]. Batista et al. [57] investigated the contribution of labeling to reducing the electricity consumption of buildings and noted that, in conventional buildings, measures such as painting walls and ceilings white and using smoked glass were sufficient to raise the rating to an A level.
The evaluation of buildings generally includes energy classification schemes and shows the difference between the Brazilian scheme and those applied by other countries to improve the labeling method [62]. Other studies have included a review of international building energy efficiency codes and labeling schemes to establish standards for the assessment and classification of buildings in terms of energy performance [63]. Lopes et al. [60] reviewed studies on energy efficiency policies and regulations for buildings, highlighting how the Brazilian program can be improved compared to the American and Portuguese programs; however, this labeling system does not inform about the reduction of GHG emissions.
Another study evaluated two new proposals for an energy efficiency label and a new method for assessing the energy efficiency of public lighting systems. The main difference between the proposals was the number of parameters evaluated. The current evaluation system considers only one parameter (energy efficiency index), whereas the study's proposals recommend five parameters: lamps, energy efficiency index, light pollution, renewable energy contribution, and light dimming [64].
All of these studies demonstrated that a labeling system can be implemented and improved so that it is consolidated and contributes actively to consumer behavior. It could also increase the participation of the ethanol production sector in the goal of reducing greenhouse gases and energy use. The biofuel sector could benefit from the implementation of an energy-environmental labeling system. With a successful approach, it could increase adherence to the RenovaBio program and consequently increase the sector's decarbonization credits.

Conclusions
The Brazilian National Biofuel Policy (RenovaBio) is one of the main strategies to encourage the reduction of pollutants in the renewable energy sector from sugarcane. For this reason, the implementation of simple labeling can impact consumer behavior and increase the transparency of the incentive program to reduce environmental impacts.
After testing the two classifiers, the best model evaluated was naive Bayes without the resample filter, compared to the J48 decision tree, including when the resample filter was used. The classification of the Energy-Environmental Efficiency Rating (RenovaBio) levels using a Bayesian classifier, the naive Bayes algorithm, produced a model capable of predicting the efficiency level of certified Brazilian ethanol producers and importers in order to create labeling. The rules generated by the models were capable of estimating the classes according to the scale discretized into high efficiency (Classification A), average efficiency (Classification B), and standard efficiency (Classification C), with more accurate forecasts for the observed classes. These results could generate an ethanol energy-environmental efficiency label for the end consumers and resellers of the product.
However, RenovaBio's ethanol database was small in terms of the efficiency scores registered in the program, as adherence to the program is still voluntary and its implementation is recent, which hindered deeper learning in the classification of labels. A more in-depth analysis could also improve the model's forecasts in the generation of energy-environmental labels for biofuels.