Development of an Improved Model to Predict Building Thermal Energy Consumption by Utilizing Feature Selection

Abstract: Humans spend approximately 90% of the daytime in buildings, and greenhouse gases (GHGs) emitted by buildings account for approximately 20% of total GHG emissions. As the energy consumed during building operation from a building life-cycle perspective amounts to approximately 70–90% of the total energy, it is essential to accurately predict the energy consumption of buildings for their efficient operation. This study aims to optimize a model for predicting the thermal energy consumption of buildings by (i) first extracting major variables through feature selection and deriving significant variables in addition to the collected data and (ii) predicting the thermal energy consumption using a machine learning model. Feature selection using random forest was performed, and 11 out of 17 available data were selected. The accuracy of the prediction model was significantly improved when the hour of day variable was added. The prediction model was constructed using an artificial neural network (ANN), and the improvement in the prediction accuracy was analyzed by comparing different cases of variable combinations. The ANN prediction accuracy was improved by 15% using the feature selection process compared to when all data were used as input data, and 25% coefficient of variation of the root mean square error (CVRMSE) accuracy was achieved.


Background
Greenhouse gas (GHG) emissions have increased steeply, by approximately 82.5% since 1979 [1], and have been a major contributor to global climate change. This has prompted considerable global efforts to reduce emissions. As part of these efforts, South Korea announced a target of reducing GHG emissions by 37% compared to Business As Usual (BAU) by 2030 in the Paris Agreement of the Intergovernmental Panel on Climate Change (IPCC) in 2016 [2].
Humans, who are mostly responsible for GHG emissions, spend approximately 90% of the daytime in buildings [3]. The estimated GHG emissions from buildings are as high as 20% of the total emissions [4]. Therefore, many studies have been conducted to reduce energy consumption in buildings in an effort to reduce GHG emissions from buildings.
A large number of candidate input data exist for building energy consumption prediction. Feature selection is a method for increasing the prediction accuracy of a model when there are many candidate input data. Removing ineffective candidates and reducing the size of the input set enables better learning and inference by a prediction model, leading to improved accuracy. The purpose of feature selection is to find the minimal subset of the entire set that preserves the major information, to facilitate future analysis. Oveis Abedinia et al. significantly reduced the training time, forecast time, and daily mean absolute percentage error (DMAPE) by selecting seven out of 41 features [24]. Ping Jiang et al. revealed that high-quality feature selection improves the prediction accuracy and training speed, and minimizes modeling complexity. In other words, feature selection is a major element for successful prediction and plays an important role in predicting the load [25]. Liu et al. performed feature selection based on Pearson's correlation analysis to predict the load of a building and created a prediction model using an improved Elman neural network (IENN). In their study, feature selection optimized the weights of the model and produced better prediction results [26]. Zhao et al. used a support vector regression (SVR) model for building energy consumption prediction. They reduced the data set using feature selection because building energy consumption is complicated and affected by many factors. When the SVR was trained on subsets selected from three data sets, the accuracy of the model improved and the runtime was reduced [27]. Sooyoun et al. selected electric energy as a subset of total building energy consumption and identified the variables that contribute to electric energy use. The K-means and density-based spatial clustering of applications with noise (DBSCAN) clustering techniques were used to select the features directly affecting the consumption among 16 features, and the prediction results using only the main variables were compared with the results using all variables. They explained that selecting and using sensors in this way also makes it possible to find the most significant measurement points, because the data can be used to obtain clustering results for correlation analysis [28]. Aurora et al. dealt with multivariate time-dependent series of data points for energy forecasting in smart buildings. They applied different types of feature selection methods for regression tasks. Their experimental results show that the proposed methodology effectively reduces both the complexity of the forecast model and its RMSE and mean absolute error (MAE) [29]. Therefore, the use of feature selection in predicting energy consumption can improve the prediction accuracy and reduce the runtime by extracting the most important input data set.

Study Objectives
Despite the development of technologies for predicting energy consumption due to advances in AI technology, many previous studies performed prediction using the collected raw data as they were, or did not systematically consider the variable selection process when constructing prediction models. Dynamic data have been accumulated in large quantities due to the development of internet of things (IoT) technology and information and communication technology (ICT), and methods for processing such data have been established [30]. When an energy consumption prediction model is implemented, it is necessary to secure the prediction accuracy by removing data of low importance that interfere with prediction and selecting only the major variables. Moreover, it is possible to improve the prediction accuracy of a model if variables with significant relevance are added to the model in addition to the data collected by conventional sensors [31]. Therefore, this study was conducted to optimize a model for predicting the thermal energy consumed for heating and domestic hot water (DHW) in buildings during winter. In this study, a model for accurately predicting thermal energy consumption was developed by (i) collecting sensor data to determine the indoor and outdoor conditions of the chosen building (e.g., indoor temperature, indoor humidity, outdoor temperature, and irradiation), (ii) extracting important variables through feature selection, (iii) deriving significant variables in addition to the collected sensor data, and (iv) constructing the ANN model using the selected input data. The performance of the final optimized model was then analyzed and compared with previous prediction models.

Building and Sensor Descriptions
The Jincheon eco-friendly energy town is located in the Chungbuk Innovation City of South Korea. The center of the town is at 36.9° N, 127.5° E, and its Köppen climate classification is humid subtropical (Cwa) [32]. The experimental period, from 1 December 2017 to 30 April 2018, had a temperature range of −16.2 to 34.2 °C and a humidity range of 15.4% to 99.3%. For this town, located in the central inland basin area, the wind speed is relatively weak, the weather is usually clear, and the sunshine time is long. The average annual temperature is 12.5 °C.
There are six types of public buildings in the town: a central machine room, a high school, a youth center, a library, a health care center, and a daycare center. A high school building located in the Jincheon town was used for this study. The town has a large-capacity thermal storage tank for storing solar heat, which supplies the stored thermal energy to public buildings in the town for heating and DHW. Eight hundred square meters of two types of solar collectors (i.e., flat-plate type and evacuated type) are connected in series, and the collected heat is stored in the large-capacity thermal storage tank throughout the year. The large-capacity thermal storage tank is a solar heat storage system, where a storage tank of 25.2 m × 17.2 m × 9.6 m (L × W × H) is installed on the ground. If the stored solar heat is not sufficient, it is supplemented by a heat pump in the central machine room of the town. Therefore, the large-capacity thermal storage tank with solar collectors acts as solar district heating for a net-zero energy community in the town [33].
Table 1 shows the details of the building, and Figure 1 shows the front view and the floor plan of the building. There is also a machine room with a distribution system that receives heat from the thermal storage tank and delivers it to each room of the building. After the thermal energy is supplied from the central large-capacity thermal storage tank to the machine room of the subject building, it is supplied to each room by fan coil units (FCU). Technical information on the chosen building is presented in Table 2. As this study focused on winter, the cooling energy in summer was not considered.
Table 3 presents information on the three rooms, obtained from a field survey conducted in the high school. The chosen building has three types of occupants: support staff, teachers, and students. Each type of occupant has a different daily activity pattern, and each pattern is fairly regular. Support staff tend to stay continuously in the office room (HS3) during working hours except for lunch time and break times. Teachers spend most of their working hours teaching in classrooms, not in the teachers' room (HS8). The occupancy rate of the teachers' room is, therefore, relatively lower than that of the other rooms. In contrast, students rarely leave the classroom (HS10), which results in a high occupancy rate. Accordingly, the office room, the teachers' room, and a classroom were selected for the study as representative of the three main occupant types. In particular, a classroom with the size and orientation most similar to those of the office room and teachers' room was selected. Consequently, indoor environment data were collected from the office room, the teachers' room, and the classroom, considering the occupancy rate and occupancy density.
Table 4 lists the data collected to create the thermal energy consumption prediction model. Figure 1b shows the locations of the three rooms where the indoor environment sensors were installed. Figure 2 shows the electricity energy sensor, indoor environment sensor, and central heating sensor installed in the building. The data from the sensors were stored in chronological order at one-hour intervals.
As shown in Table 5, three types of sensors were used for monitoring in this experiment: a watt-hour sensor, an indoor environment sensor, and a calorimeter sensor. All sensors could communicate over Wi-Fi networks. The watt-hour sensor had a voltage and current measurement error of up to ±0.2% and a power measurement error of ±0.1%. The indoor environment sensor could measure room temperature, humidity, and CO2 concentration. The measuring range was 0-50 °C for room temperature and 0-95% for humidity. CO2 could be measured from 0 to 10,000 ppm, with a measurement error of ±5%. The calorimeter measured temperature and flow rate, with a temperature measuring range of 0-135 °C and a flow rate measuring range of 10.0-250 m³/h; the respective error ranges were ±5% and ±2 m³/h.
For the indoor environment, the temperature, humidity, and CO2 concentration were measured. The data were used as variables to identify the behavioral pattern of the occupants as well as the use schedules of the rooms. Since building energy is significantly affected by the outdoor environment [34], outdoor environment data were used to secure the accuracy of the prediction model. For the outdoor environment data, the temperature, humidity, and solar radiation data provided by the Korea Meteorological Administration were used. The central thermal data included the supply water temperature from the center, the return water temperature, and the flow rate. These sensor data were measured to represent how much of the hot water supplied from the central machine room was consumed in the building. The electricity energy data were composed of the lighting and plug loads used in the building. Based on these data, the thermal energy consumption (kWh) of the chosen building was predicted.
Table 6 shows the operation plan of the studied building. In winter, heating was operated from 0700 to 2200 hours. Heating was operated until the late hour of 2200 for the students' after-school self-study. There were 590 users of the building, including 510 students and 80 employees and teachers. Heating was operated from November to April, and the indoor set-point temperature was 20 °C in winter.
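The central heating data described above (supply water temperature, return water temperature, and flow rate) can be related to delivered heat through the standard sensible-heat relation Q = ṁ · c_p · (T_supply − T_return). The sketch below illustrates that relation only. The paper does not state how its calorimeter derives energy, so the function name `thermal_power_kw`, the water density, and the specific heat are assumptions, and the inputs are round values taken from the ranges reported in the Overview of Data.

```python
def thermal_power_kw(flow_l_per_min, t_supply_c, t_return_c,
                     density_kg_per_m3=990.0, cp_kj_per_kg_k=4.18):
    """Heat delivered by a hot-water loop, in kW (sensible heat only).

    density_kg_per_m3 and cp_kj_per_kg_k are assumed properties of
    water at roughly 45 degrees C; they are not taken from the paper.
    """
    flow_m3_per_s = flow_l_per_min / 1000.0 / 60.0      # L/min -> m^3/s
    mass_flow_kg_per_s = flow_m3_per_s * density_kg_per_m3
    # kg/s * kJ/(kg K) * K = kJ/s = kW
    return mass_flow_kg_per_s * cp_kj_per_kg_k * (t_supply_c - t_return_c)


# Illustrative inputs: about 1000 L/min with a 2 K supply/return difference.
power_kw = thermal_power_kw(1000.0, 47.0, 45.0)

# With hourly logging, the energy for one interval is power times one hour.
energy_kwh = power_kw * 1.0
print(f"{power_kw:.1f} kW, {energy_kwh:.1f} kWh per hour")
```

Because the supply and return temperatures differ by only a few kelvin, small temperature measurement errors translate into comparatively large relative errors in the computed heat.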

Prediction Process
For the construction of the machine learning model for predicting the building thermal energy consumption, the steps shown in Figure 3 were taken.

In Step 1, data were collected and analyzed. In this step, the characteristics and the environment of the building were analyzed using the collected data to create the optimal energy prediction model. In addition, similar types of data were grouped.
Step 2 is the feature selection step. In this step, unnecessary data or data that may act as noise were identified and removed so that only major variables were used. Random forest was used as a method for selection, and variables that significantly affect the thermal energy consumption to be predicted were extracted. In this instance, variables with high importance for the output value were considered as major variables.
In Step 3, a model for predicting the thermal energy consumption was constructed. For the prediction model, ANN was used. The model for predicting the building's thermal energy consumption was constructed using the variables selected in Step 2 as input data. The ANN model was constructed according to three cases. Case 1 was a model that utilized all data as input data without feature selection. Case 2 was a model that applied feature selection and used only major variables as input data. Case 3 was a model that added variables determined to be significant from an analysis of the major variables selected in Case 2.
In Step 4, the machine learning models were evaluated. In this step, the accuracy of the Case 1, Case 2, and Case 3 models was compared to derive an improved model for thermal energy consumption prediction.

Feature Selection
Random forest is a data-driven method based on the basic properties of a decision tree. It is an ensemble learning methodology that relies on the combination of several decision trees via a voting scheme. The particularity of random forest is that its tree-based components are grown with a certain amount of randomness [35]. Based on this idea, random forest is defined as a generic principle of randomized ensembles of decision trees. Using a random selection of features to split each node makes the method more robust with respect to noise [36]. The training procedure of a randomly generated forest can be summarized as follows [23]: first, a bootstrap sample is built from the training dataset; second, a tree is grown for each bootstrap sample, with the best split at each node selected among a randomly selected subset of input variables; third, each tree is fully grown until no further splits are possible, and the procedure is repeated until all trees are grown. Random forest is a high-dimensional nonparametric method that works well on large numbers of variables [37]. It has been shown that the method is extremely accurate in a variety of applications [38].
Random forests can be used to rank the importance of each variable in a data analysis or prediction. Once random forest has created a lot of trees, the importance of variables that affect the output value is calculated. In general, the feature importance provided by random forest consisting of a large number of trees is more reliable than a simple decision tree method provided by one tree. In a random forest model, Gini importance is used as a measure for quantifying the importance of a feature [36]. Gini importance is derived from the Gini impurity value [39]. Gini importance is the averaged value of the total decrease in impurity over all individual decision trees in the random forest. As the value of Gini importance increases, the feature is considered more important. Gini importance is represented by a number between 0 and 1. A value closer to 1 means that a variable is more important among the data.
In this study, feature selection using random forest was applied to improve the accuracy of the model and to reduce the run time. When a group of variables contained extremely low or high Gini importance values, the major variables of that group, that is, those with a significant effect on the prediction of energy consumption, were selected based on the 70th percentile of Gini importance.
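The importance ranking described above can be sketched with scikit-learn, the library the paper reports using for its ANN. This is an illustration under assumptions, not the authors' code. The data are synthetic, the feature names are placeholders, and the number of trees is arbitrary. Note also that `RandomForestRegressor` measures impurity by squared error rather than the Gini index, so its `feature_importances_` attribute is the regression analogue of the Gini importance discussed here. As in the text, the values lie between 0 and 1, and they sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, not the paper's measurements):
# the target depends strongly on feature_a, weakly on feature_b,
# and not at all on feature_c.
rng = np.random.default_rng(0)
n_samples = 500
feature_names = ["feature_a", "feature_b", "feature_c"]
X = rng.normal(size=(n_samples, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Each tree is grown on a bootstrap sample; the forest averages the
# impurity decrease attributed to each feature over all trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Because the importances are normalized to sum to 1, they are relative within the set of variables passed to the forest, which is why a single dominant variable can push all the others toward zero.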

ANN
In this study, an ANN-based model was used to predict the thermal energy used in a building. The ANN is a supervised learning model that originates from the work of Warren S. McCulloch and Walter H. Pitts in 1943 [40]. It is modeled on the human central nervous system so that complex operations and calculations are possible. In particular, it is well suited to the analysis and prediction of variables with nonlinear relationships. The ANN model for this study was constructed using the Python Scikit-learn library.
The data used for machine learning were preprocessed so that their ranges matched and each variable could be fairly reflected by the model. The values were normalized to range from 0 to 1 using the following equation:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

where $x_{norm}$ is the normalized value, $x$ is the raw value, $x_{max}$ is the maximum value of the data, and $x_{min}$ is the minimum value of the data. The data were split into training data for the ANN model and test data for validation. Raw data were divided based on a 7:3 ratio; 70% was used as training data and the remaining 30% as test data. To compare the cases under the same conditions, the ANN model had the same structure and the same hyper-parameters in every case. The structure of the final ANN model is shown in Table 7.
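A minimal sketch of this workflow (min-max normalization, a 7:3 split, and an ANN regressor) is given below, together with the CVRMSE and MAE metrics that the Evaluation section uses. Several choices are assumptions made for illustration: the data are synthetic, `MLPRegressor` is assumed as the scikit-learn estimator, the hidden-layer sizes and iteration limit are placeholders rather than the Table 7 settings, and the scaler is fitted on the training portion only, which the paper does not specify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler


def cvrmse(predicted, actual):
    """Coefficient of variation of the RMSE, in percent."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rmse = np.sqrt(np.mean((predicted - actual) ** 2))
    return 100.0 * rmse / np.mean(actual)


def mae(predicted, actual):
    """Mean absolute error, in the units of the target."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(predicted - actual))


# Synthetic stand-in data (an assumption, not the paper's measurements).
rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 30.0, size=(1000, 4))
y = 50.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=1000)

# 7:3 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Min-max normalization to [0, 1], fitted on the training data only.
x_scaler = MinMaxScaler()
X_train_s = x_scaler.fit_transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_min, y_max = y_train.min(), y_train.max()
y_train_s = (y_train - y_min) / (y_max - y_min)

# Placeholder architecture; identical settings would be reused across cases.
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000,
                     random_state=0)
model.fit(X_train_s, y_train_s)

# Undo the target scaling before computing the error metrics.
y_pred = model.predict(X_test_s) * (y_max - y_min) + y_min
print(f"CVRMSE = {cvrmse(y_pred, y_test):.1f}%, "
      f"MAE = {mae(y_pred, y_test):.2f}")
```

Keeping the estimator settings and the `random_state` of the split fixed is one way to hold conditions constant when comparing input sets such as Case 1, Case 2, and Case 3.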

Evaluation
The coefficient of variation of the root mean square error (CVRMSE) and the mean absolute error (MAE) were used to evaluate the prediction results of the ANN. Both the CVRMSE and MAE verify the accuracy of a model by comparing the measured values with the predicted values, and for both metrics the accuracy is higher the closer the result is to zero. According to the criteria specified in the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) guideline, an hourly prediction is considered acceptably accurate if the CVRMSE is less than 30% [41].

$$\mathrm{CVRMSE} = \frac{1}{\mu} \sqrt{\frac{\sum_{i=1}^{N} (P_i - X_i)^2}{N}} \times 100\ (\%)$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| P_i - X_i \right|$$
where $P_i$ is the predicted value of the ANN model, $X_i$ is the actual value, $N$ is the number of actual data points, and $\mu$ is the mean of the actual data.

Overview of Data
Figure 4 shows histograms of the data used to construct the thermal energy consumption prediction model. The outdoor environment, indoor environment, central heating, and electricity energy data were collected from the building. The data were collected from 1 December 2017 to 30 April 2018. The data were collected on an hourly basis by the sensors, and a total of 3576 hours of data were collected.
The indoor temperature was maintained at 16-17 °C when the rooms were not occupied and at nighttime, and at 18-23 °C when the rooms were occupied and during the daytime. The indoor humidity was maintained at 36% on average, which was quite comfortable. The average indoor CO2 concentration was approximately 252.7 ppm. From the data patterns, the CO2 concentration varied in proportion to the number of occupants, indicating that the occupancy of a room can be estimated from the data.
The outdoor temperature was 4.8 °C on average. The hours of the day when the temperature was lower than 0 °C were fewer than those when it was higher than 0 °C, indicating that extremely cold weather was not experienced even in winter. The outdoor humidity ranged from 45% to 75%, fluctuating severely depending on the weather conditions, such as snow. The frequency of zero values for solar radiation was very high because the data were also collected during the night, when solar radiation cannot be measured.
The heat supplied from the central machine room was transported to the subject building by the hot water supply. The supply water temperature was maintained at 45-48 °C, and the return water temperature was maintained at 43-46 °C. Because a constant water volume supply system was adopted, the flow rate was maintained at about 1000 L/min when the rooms were occupied.
Thermal energy consumption was based on the energy used for heating and DHW in the building. When the rooms were not occupied, the value of the data was zero because no energy was consumed. For the lighting and plug electricity energy consumption data, a number of data points with low electricity energy consumption existed because standby electricity power was required during the non-occupancy period. During the occupancy period, however, constant electricity consumption was recorded.

Feature Selection by Random Forest
The purpose of feature selection is to eliminate data that are likely to interfere with the prediction. Random forest was used to extract major variables. The Gini importance of random forest is a coefficient for judging how strongly a variable influences the output value. Because the importance of each variable is evaluated relative to the others, an extremely high importance for certain variables tends to cause the importance of the remaining variables in the same set to be neglected. To avoid this issue, random forest was performed in this study after grouping the raw data according to the characteristics of the variables. Raw data were divided into the following four groups: (i) outdoor environment data, (ii) indoor environment data, (iii) central heating supply data, and (iv) electricity energy consumption data. After grouping, the distribution of Gini importance within each group was checked. If a group did not contain an extremely low or high coefficient, all of its variables were used as input data for the ANN. On the other hand, if extremely high or low Gini importance existed in a group, major variables were selected based on the 70th percentile.
Figure 5 shows the results of performing random forest for the outdoor environment data. The corresponding data were solar radiation, outdoor temperature, and outdoor humidity, which were representative variables for the outdoor environment. All of them exhibited a Gini importance of 0.3 or higher for the building thermal energy consumption, which was the output value, and it was judged that they evenly and significantly affected the output value. Therefore, for the outdoor environment data, solar radiation, outdoor temperature, and outdoor humidity were utilized as input data for the ANN model.
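The group-wise rule described above can be sketched as a small function. This is a hedged illustration rather than the authors' procedure. The paper does not quantify what counts as "extremely" high or low, so the `spread_ratio` test is a hypothetical stand-in, and keeping variables at or above the 70th percentile is one reading of "selected based on the 70th percentile". The importance values and variable names are invented for the example and are not the values shown in Figures 5 and 6.

```python
import numpy as np


def select_major_variables(importances, spread_ratio=10.0, percentile=70.0):
    """Pick the input variables of one group from its importance values.

    importances:  dict mapping variable name -> importance within the group
    spread_ratio: hypothetical threshold; the group is treated as having
                  extreme values when max >= spread_ratio * min
    percentile:   cut-off applied only when extreme values are present
    """
    names = list(importances)
    values = np.array([importances[name] for name in names], dtype=float)

    # Evenly distributed importance: keep every variable in the group.
    if values.max() < spread_ratio * values.min():
        return names

    # Otherwise keep only the variables at or above the percentile cut-off.
    threshold = np.percentile(values, percentile)
    return [name for name, value in zip(names, values) if value >= threshold]


# Invented example groups (placeholders, not results from the paper).
even_group = {"v1": 0.34, "v2": 0.35, "v3": 0.31}
uneven_group = {"x1": 0.02, "x2": 0.03, "x3": 0.04, "x4": 0.05, "x5": 0.06,
                "x6": 0.08, "x7": 0.12, "x8": 0.25, "x9": 0.35}

print(select_major_variables(even_group))    # all three variables kept
print(select_major_variables(uneven_group))  # only the top variables kept
```

Running the forest separately per group means each group's importances sum to 1 on their own, so a dominant variable in one group cannot suppress the apparent importance of variables in another.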

Indoor Environment
There were more variables related to the indoor environment than in the other groups. Representative rooms with high occupancy density were selected from the building, and each room provided data on temperature, humidity, and CO2 concentration. The three selected rooms had the same number of sensors for measuring indoor conditions. Figure 6 shows the results of random forest for the indoor environment data. Because this group contained several extremely low and high Gini importance values, the major variables were selected based on the 70th percentile, which is approximately 0.1 for the indoor environment group, as shown in Figure 6. The variables judged to have a significant impact on the building thermal energy consumption were the indoor temperature and CO2 concentration of HS10 (classroom) and the indoor temperature of HS3 (office room). As shown in Table 3, both rooms had high occupancy densities and occupancy rates, which explains the high importance of their variables. HS8, which was not selected, was a teachers' room. Because teachers moved to the classrooms for classes, its occupancy rate was reduced, and this appears to have lowered the importance of its variables. Consequently, for the indoor environment data, the temperature and CO2 concentration of the classroom and the temperature of the office were used as input data.

Central Heating Supply Data
The heating supply data included the temperature and flow rate of the central heat source supplied to the building. Figure 7 shows the results of feature selection by random forest. Because the Gini importance of every variable exceeded 0.25 and there were no extreme coefficient values in this group, all of the variables were used as input data. In general, when a building starts operating, the supply flow rate increases, and the changes in the return water temperature and the supply water temperature become larger. Owing to these correlations, the Gini importance was quite high for all the variables.

Electricity Energy Consumption Data
The electricity energy consumption data were used to determine the occupancy status of the building, because the use of equipment and lighting is closely related to the occupancy rate. As shown in Figure 8, both equipment and lighting significantly affected the building thermal energy consumption. Therefore, both variables were used as input data for the ANN model.
As a result, 11 variables with high Gini importance out of the 17 available variables were used as the input data of the ANN model, as shown in Figure 9.

Case 1
In Case 1, the ANN model was constructed using all sensor data as input variables. The model used 17 variables, identical to the raw data list in Figure 9. When the prediction accuracy of the Case 1 model was evaluated, the CVRMSE was approximately 40% and the MAE was approximately 11. Because the prediction accuracy was not high, the model needed to be improved. Figure 10 compares the empirical data collected from the sensors in the building with the predictions of the ANN model for the week of March 16 to 22, 2018. Although the Case 1 model reproduced the overall pattern of the measured building energy consumption, its predictions for individual time periods were not accurate. In particular, the prediction was inaccurate in the early morning immediately before the occupants entered the building and at lunch time. In the morning, energy consumption increased rapidly as students began to enter the building. Conversely, it dropped sharply as the occupants left the classrooms at lunch time. The prediction accuracy was low because the existing data combination could not reflect these rapidly changing situations, or because it acted as noise when such rapid changes in energy consumption occurred.
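For readers who want to reproduce the two accuracy indices quoted for each case, the following minimal sketch computes MAE and CVRMSE from paired measured and predicted values. It assumes the common definitions (RMSE divided by the mean of the measured values, expressed as a percentage, with n in the denominator); some formulations divide by n − p instead, and the exact variant used in the study is not stated in this section. The sample numbers are arbitrary and are not data from the study.

```python
import math
from typing import Sequence


def mae(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the measured values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def cvrmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of variation of the RMSE, in percent.

    ASSUMPTION: the squared errors are averaged over n samples.
    """
    n = len(measured)
    rmse = math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)
    return 100.0 * rmse / (sum(measured) / n)


# Arbitrary illustrative values (not measurements from the building).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"MAE    = {mae(measured, predicted):.2f}")
print(f"CVRMSE = {cvrmse(measured, predicted):.2f} %")
```

Because CVRMSE normalizes the error by the mean consumption, it allows the cases to be compared on a percentage scale, whereas MAE remains in the unit of the measured energy.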

Case 2
In Case 2, the major variables were selected by feature selection using random forest, and an ANN model was then constructed using the selected variables as input data. The 11 selected variables were the same as those in the input data list of Figure 9. For Case 2, the CVRMSE was approximately 35% and the MAE was approximately 10, an improvement of approximately 5 percentage points in CVRMSE compared to Case 1. In particular, the accuracy in the time period when the occupants were in the rooms was significantly improved.
Comparing Case 2 with Case 1 shows that the energy consumption was predicted more accurately with the small number of variables extracted through feature selection than with all the data. This indicates that using all the building data is not necessarily beneficial for prediction and that the combination of variables must be considered to secure high prediction accuracy. The prediction accuracy of Case 1 was lower even though all data were used because some of the data acted as noise that interfered with the prediction. Therefore, it is important to increase the prediction accuracy of the model by finding the optimal variable combination.
However, as shown in Figure 11, the prediction accuracy was still low for the early morning. In particular, the model performed poorly for the time period immediately before people began to enter the rooms. This appears to be because the input data used to train the ANN model contained no variable that could serve as a criterion for the occupancy status.

Case 3
Case 3 is an ANN model created by adding a significant variable to the input data used in Case 2. The chosen building was a school with regular occupancy schedules on both weekdays and weekends. For schools, the schedule of school time, break time, lunch time, and home time is fixed, and the occupancy rate is therefore very predictable. As a result, the energy consumption patterns of school buildings tend to be regular and clear. Because the raw data of the existing sensors could not be used as a criterion for the occupancy hours, the prediction accuracy was low in the intervals where the energy consumption changed rapidly. To address this problem, the hour of the day was used as an input variable that reflects the regularity of the building schedule.
In Case 3, the ANN model was constructed with 12 input variables by adding the hour of the day to the 11 variables used in Case 2. For Case 3, the CVRMSE was approximately 25% and the MAE was 6.88, the highest accuracy among all cases. The predicted values were also less dispersed than in Case 1 and Case 2. In particular, as shown in Figure 12, the consumption pattern was predicted far more accurately than in Case 1 and Case 2. Because the hour of the day provided a criterion for the occupancy hours, the prediction accuracy during the occupancy period was significantly improved. Moreover, the prediction accuracy was significantly improved for the morning and lunch time, when the energy consumption changed rapidly, making it possible to construct a model with high prediction accuracy.
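Deriving the hour of the day from the logging timestamps is straightforward, as the minimal sketch below illustrates. The timestamp format, the function name `add_hour_of_day`, and the choice to append the hour as a plain number from 0 to 23 are assumptions made for illustration; the section does not specify how the variable was encoded before being passed to the ANN.

```python
from datetime import datetime
from typing import List, Sequence


def add_hour_of_day(
    timestamps: Sequence[str],
    rows: Sequence[Sequence[float]],
    fmt: str = "%Y-%m-%d %H:%M",  # ASSUMPTION: format of the logged timestamps
) -> List[List[float]]:
    """Append the hour of the day (0-23) as an extra column to each row."""
    if len(timestamps) != len(rows):
        raise ValueError("timestamps and rows must have the same length")
    return [
        list(row) + [float(datetime.strptime(ts, fmt).hour)]
        for ts, row in zip(timestamps, rows)
    ]


# Two hypothetical samples with two sensor values each (placeholders).
timestamps = ["2018-03-16 08:00", "2018-03-16 12:30"]
rows = [[1.0, 2.0], [3.0, 4.0]]

extended = add_hour_of_day(timestamps, rows)
print(extended)  # each row gains one column holding the hour
```

Applied to rows holding the 11 selected sensor variables, the same function would yield the 12-column input described for Case 3.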

Discussion
In this study, models were constructed using various combinations of variables to increase the accuracy of the prediction model. Figure 13 compares the results of all cases using CVRMSE and MAE as accuracy evaluation indices, and Table 8 presents the accuracy of each case against the evaluation criterion based on ASHRAE Guideline 14. The results show that Case 2, which applied feature selection to the input data, exhibited a higher prediction accuracy than Case 1, which used all data, and that Case 3, which added a variable with a favorable influence on the output value, exhibited a higher prediction accuracy than Case 2. Therefore, an excellent prediction model for the thermal energy consumption of a building can be constructed by extracting major variables through feature selection and adding significant variables to the input data, rather than by using all raw data as input data.
In Case 2 of this study, random forest was used to perform feature selection for extracting major variables. Random forest creates models with various combinations of variables, and the importance of the variables is determined in the process. Therefore, the importance of each input variable to the output value can be clearly determined through the importance index. The results of this study also showed that such feature selection was helpful in predicting energy consumption and improving the prediction accuracy.
Moreover, the results of Case 3 show that the prediction accuracy can be significantly improved by adding significant variables based on data analysis, rather than by using only the raw data collected from sensors in the building. If the input data alone are not sufficient for predicting the output value, it is helpful to add variables that may produce better results when combined with the input data. In this study, because the existing sensor data contained no variable that could serve as a criterion for the occupancy status of the building, the hour of the day variable was added, which significantly improved the prediction accuracy of the model.
Figure 14 compares the accuracy of all cases; the R² values of Case 1, Case 2, and Case 3 were 0.9492, 0.9779, and 0.9877, respectively. In particular, the slope of the fitted line for Case 3 was close to 1, indicating a high prediction accuracy. Unlike the dispersed prediction results of Cases 1 and 2, the prediction results of Case 3 were densely clustered, indicating that an excellent model was constructed. Moreover, Case 3 met the 25% CVRMSE criterion of ASHRAE, indicating that reliable results were obtained.
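The slope and R² of a measured-versus-predicted scatter plot can be obtained from an ordinary least-squares line, as in the minimal sketch below. It assumes that R² is the squared Pearson correlation between measured and predicted values and that the slope is that of the line fitted to predicted against measured values; the section does not state how the values in Figure 14 were computed, so this is an illustration rather than the study's procedure. The sample numbers are arbitrary.

```python
from typing import Sequence, Tuple


def fit_line(
    measured: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float, float]:
    """Least-squares fit of predicted = slope * measured + intercept.

    Returns (slope, intercept, r_squared). ASSUMPTION: r_squared is the
    squared Pearson correlation between the two series.
    """
    n = len(measured)
    mean_x = sum(measured) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Arbitrary illustrative values (not data from the study).
slope, intercept, r_squared = fit_line([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
print(slope, intercept, r_squared)
```

A slope near 1 together with an R² near 1 means that the predictions track the measurements without systematic scaling, which is the reading applied to Case 3 above.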

Conclusions
In this study, a method for creating an optimal variable combination was used to construct a model for predicting the thermal energy consumption of a high school building. High-quality input data were created from a reduced data set obtained through feature selection, and the prediction accuracy was further improved by adding a significant variable to the input data combination. Feature selection for extracting major variables was performed using the Gini importance of random forest, and the prediction model for building thermal energy consumption was implemented using an ANN. The accuracies of three prediction models were compared to identify the optimal variable combination.
When a model for predicting the energy consumption of a building is implemented, the variable combination should be carefully considered. Many current prediction models are constructed in the same manner as Case 1, using all the raw data available. In addition, when raw sensor data collected from an actual building are used, it is often difficult to predict the energy consumption because the data are empirical. In such cases, a good prediction model can be obtained through the following steps: (i) extraction of major variables through appropriate feature selection and (ii) addition of significant input data that may have a favorable influence on the output value. When this method was used in this study, the Case 3 prediction model was found to be more accurate than the prediction model that used all raw data. If combinations that produce a synergistic effect between input data are used, as in this study, a highly accurate model for predicting the thermal energy consumption of a building can be implemented.
Traditional methods for predicting energy consumption in buildings, such as computer simulation and statistical methods, have difficulty finding correlations between non-linear variables. It is also difficult for machine learning models that use empirical data collected by many sensors to predict with high accuracy because of the uncertainty of the data. However, if the derivation strategy suggested by this research is applied, an ANN model can improve the quality of prediction considerably. Although this research focused on eliminating data that act as noise, the improved model achieved higher accuracy than the other cases with only a small number of variables. In addition, if this strategy is applied to actual buildings, economical monitoring systems can be built efficiently by installing sensors only where necessary.
However, this study used only the data obtained during a short period of approximately four months, from December to April, and the prediction model is limited to that period. Therefore, a prediction model covering a longer period needs to be implemented. Moreover, in Case 3, the accuracy was improved by adding only the hour of the day variable, but there are likely many other variables that can improve the prediction results. Further studies on input variable combinations that improve the prediction accuracy will make it possible to construct a model that predicts the thermal energy consumption of a building more effectively. In addition, whereas this study focused only on a high school building, other public buildings can be considered in future work to make feature selection more systematic according to building type and to build higher-quality models.