Prediction for Overheating Risk Based on Deep Learning in a Zero Energy Building

The Passive House standard has become the reference standard in many countries for the construction of Zero Energy Buildings (ZEBs). Korea also adopted the standard and has achieved great success in building energy savings. However, some issues remain with ZEBs in Korea. Among them, this study discusses overheating. Field measurements were carried out to analyze the overheating risk for a library built as a ZEB. A data-driven overheating risk prediction model was developed that requires only a small amount of data and extends the analysis throughout the year. The main factors causing overheating during both the cooling season and the intermediate seasons are also analyzed in detail. The overheating frequency exceeded 60% of days in July and August, the midsummer season in Korea. Overheating also occurred during the intermediate seasons when air conditioners were off, such as in May and October in Korea. Overheating during the cooling season was caused mainly by unexpected increases in occupancy rate, while overheating in the intermediate seasons was mainly due to an increase in solar irradiation. This is because domestic ZEB standards define the reinforcement of insulation and airtightness, but there are no standards for insolation through windows or for internal heat generation. The results of this study suggest that a fixed performance standard for ZEBs that does not reflect the climate or cultural characteristics of the region in which a ZEB is built may not result in energy savings at the operational stage and may not guarantee the thermal comfort of occupants.


Introduction
The building construction sector is responsible for more than one-third of global final energy consumption and nearly 40% of total direct and indirect CO2 emissions [1,2]. The Zero-Energy Building (ZEB) has been proposed in several countries as a realistic solution to reduce energy demand and mitigate CO2 emissions in the building sector [3,4]. A ZEB is a building with zero net energy consumption, i.e., its total annual energy consumption equals the amount of renewable energy created [5,6]. ZEBs involve two design implementation strategies: minimizing the need for energy use in buildings (especially for heating and cooling) through energy-efficient measures, and adopting renewable energy and other technologies to meet the remaining energy needs [7]. Energy-efficient strategies for ZEBs can be categorized as passive or active. Passive strategies, as the risk in summer. In addition, the actual occupancy rate may be higher than the predefined ZEB guidelines [31], which may cause high internal heat gain.
Therefore, overheating issues should be analyzed and countermeasures prepared at the building design stage in order to improve indoor thermal comfort and avoid unnecessary energy consumption. Under occupied conditions, the overheating risk can also be reduced through prediction and optimal air-conditioning control strategies [39].
Some papers dealing with overheating issues adopt the criteria of the Passivhaus Planning Package (PHPP) and Chartered Institution of Building Services Engineers (CIBSE) [25,40,41]. The representation of overheating as a percentage of the year in PHPP is shown to distort the effect of overheating, while the volume-weighted mean indoor temperature value overlooks variations in zonal temperature [42].
Although some studies on overheating issues in ZEBs have analyzed the overheating risk using a simulation approach during the building design stage [43,44], the application scope of the models is narrow, and the accuracy of their predictions remains to be verified. The most precise method is to conduct field measurements [26,34], but this method requires a great deal of time and money.
The aim of this study is to identify overheating problems that can arise in ZEBs using measured and predicted data. The method used to predict overheating from the measured data is as follows. First, the short-term measured variables that affect the indoor temperature of the building are ranked through correlation analysis. The ranked variables are then subjected to clustering analysis to detect faulty data and maximize the similarity between data samples within each cluster. The clustered data are used as input to a prediction model, which is selected by comparing the accuracy of several candidate models. Finally, based on the most accurate prediction model, the long-term overheating frequency is predicted and the influencing factors are evaluated.
This paper is structured as follows. In Section 2, the field measurement methods for the reference building and the pre-processing of the measured data are described. The principles of clustering analysis, three data mining algorithms, and the idea behind developing a simple model are also presented. In Section 3, the overheating frequency is analyzed for the cooling season and the intermediate seasons, and the factors influencing overheating are discussed. Conclusions are presented in Section 4.

Methods
The methodology is divided into two main parts. The first presents the measurement data and the overheating issues identified through field measurements in the target building. The second is an overheating prediction method that uses the measured data, clustered after a correlation evaluation.
First, data are collected through field measurements in the target building and from the building energy management system (BEMS). The measured data are indoor temperature, indoor humidity, and indoor carbon dioxide. Among them, indoor temperature and indoor humidity are used to analyze overheating, while carbon dioxide is used in the overheating prediction model, both to evaluate the prediction accuracy together with the indoor temperature and to evaluate its association with overheating. The BEMS data include weather data (temperature, humidity, solar radiation, and wind speed) as well as the return water temperature, water flow rate, and pump power measured at the building operation stage. Overheating was assumed to be an indoor condition that exceeds the operating temperature of 25 °C and is outside the comfort range, considering both temperature and humidity.
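The overheating definition above (above 25 °C and outside the comfort range) can be illustrated with a minimal sketch. The function `overheating_fraction`, the stand-in predicate `toy_comfort_zone`, and the sample records are hypothetical; in particular, the toy predicate is not the ASHRAE 55 comfort zone, whose boundaries are not reproduced here.

```python
def overheating_fraction(records, in_comfort_zone, t_limit=25.0):
    """Share of (temperature [deg C], relative humidity [%]) records that
    exceed t_limit and also fall outside the comfort zone."""
    if not records:
        raise ValueError("records must not be empty")
    overheated = sum(
        1 for t, rh in records
        if t > t_limit and not in_comfort_zone(t, rh)
    )
    return overheated / len(records)


def toy_comfort_zone(t, rh):
    # Hypothetical rectangular stand-in, NOT the ASHRAE 55 comfort polygon.
    return t <= 26.5 and rh <= 65.0


# Four made-up hourly records: (indoor temperature, indoor relative humidity).
hourly = [(24.0, 50.0), (25.5, 55.0), (27.0, 60.0), (26.0, 70.0)]
share = overheating_fraction(hourly, toy_comfort_zone)
print(share)
```

Passing the comfort-zone check as a function keeps the counting logic separate from whichever comfort criterion is adopted.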
The second step is the overheating prediction model. In this study, the data clustering method is used to increase the overheating prediction model's accuracy. The evaluation of the influence between the predicted variable, which is the indoor temperature, and the measured data is performed with the distance-weighted Pearson correlation coefficient. Clustering is performed using the self-organizing mapping (SOM) method by using variables of large influences that affect the predicted value, such as outside temperature, solar radiation, indoor humidity, and outdoor humidity. The most accurate prediction of indoor temperature and carbon dioxide is selected by comparing machine learning models using clustered data. In this study, the long short-term memory with SOM (SOM-LSTM) Sustainability 2020, 12, 8974 4 of 20 method was found to be the most accurate. The following sections describe the methodology of this study in more detail.

Analyzed Building
In this study, a Korean ZEB, Asan Library, which received a Level 5 ZEB certification as a public building, is analyzed. ZEB certification requires three criteria to be met. The first is a building energy efficiency rating of 1++ or higher, which corresponds to a primary energy consumption of less than 90 kWh/m²·yr for residential buildings and less than 140 kWh/m²·yr for commercial buildings. The second is the energy self-sufficiency rate, that is, the ratio of renewable energy production to the total energy consumed by the building. Finally, a BEMS or remote metering system, which measures and manages energy consumption in real time, must be installed. Level 5 ZEB certification requires an energy self-sufficiency rate of 20% or more and less than 40%. The levels of ZEB certification are shown in Table 1. The analyzed building has four floors above ground and one basement floor. Specifications of the analyzed building are shown in Table 2. The exterior walls have a high insulation performance with triple-glazed windows. The heat transfer coefficient (U-value) of the building envelope was planned to be lower than that required by the Korean Building Design Criteria for Energy Savings (BDCES), a mandatory regulation for new construction in Korea [45]. The analyzed building was equipped with a BEMS to monitor and control the heating, ventilation, and air conditioning (HVAC) system and energy consumption efficiently. The analyzed space was the reading room located on the second floor of the target building. This space is exposed to the outside on the south wall. The east and north walls face the corridor. The west wall is shared with adjacent study rooms. Except for the south-facing wall, all of the walls are in contact with air-conditioned space. The analyzed space has a high occupancy density, contains 50 desks, and is generally used for study purposes.
During the measurement period, the space was used under normal conditions without any change in use. All windows were closed and the sun-shading device was not in operation. The indoor set temperature was 25 °C on a schedule of 9:00-22:00, and the energy recovery ventilator (ERV) was operated on the same schedule. Room temperature, humidity, and CO2 were measured at a total of six positions, shown in Figure 1.

Measurement Descriptions
In order to analyze the overheating issues in a ZEB in Korea, the indoor and outdoor thermal environments were monitored during a two-week period in the summer. Specifically, three kinds of data were collected: local meteorological data, indoor thermal environmental data, and operational data. The local meteorological data consisted of outdoor temperature, relative humidity, solar irradiation, and wind speed. The indoor thermal environmental data consisted of indoor temperature, relative humidity, and CO2 density. Operating data from the air-conditioning system included the flow rate, return water temperature, and pump power of the fan coil unit (FCU).
Meteorological data were obtained from a weather station, the operating data of the air-conditioning system were collected by the BEMS, and the indoor thermal environmental data were monitored using various sensors and measuring instruments. The details of the measurement system are shown in Table 3. To ensure the accuracy and availability of the experimental data, the indoor measuring points for indoor temperature and relative humidity were set according to ISO 7726 [46], a standard for the measurement of indoor thermal environments. A total of 243 sets of sample data were obtained after smoothing the system operation data and aggregating the data to account for the difference in sampling interval between the meteorological data and the indoor thermal environmental data.
The measurements were conducted during the cooling season in Korea. Therefore, the air-conditioning system was operated from 9:00 to 22:00. The data sampling period and sampling interval were whole-day sampling (0:00-24:00) and 1 h, respectively, in order to analyze the passive heat dissipation performance and ventilation as well as infiltration of the building at night.
The distribution of the raw data after normalization is shown in Figure 2. This normalization method turns all variables into z-values with a mean value of 0 and standard deviation of 1. The values shown in Figure 2 are the 1st and 3rd quartiles of the measured parameters.
Data preparation is important in model-based methods due to the unreliability of measurements [47]. To improve the reliability of the clustering and prediction, low-quality data should be removed. In this study, outliers are the main concern in data preparation; therefore, the quartile method is used to clean the outliers from the raw dataset: a point is removed if it is greater than the 3rd quartile, or less than the 1st quartile, by more than 1.5 times the distance between the 1st and 3rd quartiles. In addition, the range of the raw data determines the valid range of the prediction model. The outliers in the raw data were caused by communication errors in the BEMS data; in the case of solar radiation in particular, the outliers appeared as excessively high measured values.
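The quartile rule and the z-value normalization described above can be sketched as follows. This is a minimal illustration, not the processing code used in the study: the helper names, the sample solar values, the "inclusive" quantile interpolation, and the use of the population standard deviation are all assumptions.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the points that lie within the quartile fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if lower <= v <= upper]


def z_scores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up solar irradiation readings with one implausibly large value.
solar = [120.0, 340.0, 410.0, 515.0, 630.0, 705.0, 820.0, 9999.0]
clean = remove_outliers(solar)
normalized = z_scores(clean)
print(len(solar), len(clean))
```

In this sketch the outliers are removed before normalization so that a single extreme value does not distort the mean and standard deviation used for scaling.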

Overheating Issues in the Analyzed ZEB
Existing overheating criteria are given in CIBSE Guide A [48], ASHRAE 55 [49], and EN 15251 [50]. Operative temperature, a thermal comfort indicator, is used in existing overheating criteria to evaluate the overheating risk and calculate the number of overheating hours exceeding the comfort range. In this study, the ASHRAE 55 method is used to evaluate the overheating risk of the analyzed ZEB because its criteria reflect the relationship among various factors more comprehensively [51]. The ASHRAE 55 standard presents a comfort range for typical indoor environments through its Graphic Comfort Zone Method. This method determines the range of operative temperature and humidity within which 80% of the occupants are satisfied, for environments in which the metabolic rate is between 1.0 and 1.3 met and the clothing insulation is between 0.5 and 1.0 clo. In this study, overheating was defined as a condition outside the comfort range. Since the cooling period was considered, temperature conditions below the comfort range were excluded from overheating.
Based on the comfort zone from ASHRAE Standard 55, the overheating frequency during the measurement period was calculated, as shown in Figure 3. The analysis indicated that the ZEB was overheated for 33% of the analyzed period (10 July 2019-24 July 2019). The results shown in Figure 3 are limited to a two-week period; however, it is necessary to review overheating issues throughout the year. A data-driven prediction model enables yearly analysis with a limited amount of measurement data. Therefore, a simple data-driven model was developed in this study to extend the overheating risk analysis throughout the summer and intermediate seasons, producing a greater volume of data while avoiding difficult, expensive, and time-consuming large-scale field studies.
(1) Concept used in the simple model
With the penetration and integration of artificial intelligence (AI), the use of AI, machine learning, and data-driven methods for building environment analysis and optimization has become increasingly important [52,53]. Deep learning algorithms are based on representation learning in machine learning, which aims at finding better representations and creating better models to learn these representations from large amounts of unlabeled data. In simple terms, a deep learning neural network is a neural system that mimics the human brain and constructs a non-linear relationship between input and output.

Prediction
The fundamental purpose of this paper is to propose a generalized simple model based on a deep learning algorithm that can accurately predict the overheating risk of a ZEB with a small number of input variables. This study further investigates the potential of the combination of unsupervised algorithms and supervised deep learning in predicting indoor thermal comfort.
Initially, the output variables are defined as indoor temperature and CO2 density. Indoor temperature is the critical index used to evaluate the overheating risk of the building. Indoor CO2 density, which represents occupancy, has a high impact on indoor temperature and overheating risk and is therefore also selected as an output. CO2 density is difficult to measure directly and generally varies with human activity, so it is essential to predict it with the proposed model. The CO2 density reflects the density of occupants and thus the amount of internal heat generated in the room. Internal heat generation directly affects the indoor temperature and is one of the factors that cause overheating in a ZEB. Furthermore, the forecast period runs from 1 May to 31 October because of the climate characteristics of Korea. Usually, building design standards and indoor thermal environmental standards only specify the hygrothermal parameters of buildings in summer and winter and neglect the intermediate seasons (spring and autumn). However, there is a large temperature difference between day and night in the intermediate seasons, for example in May and October in Korea.
Figure 4 presents the basic process used to establish the prediction model. Before the prediction model is established, the raw data are pre-processed. A box plot is used to detect and remove outliers from the experimental raw data, thus avoiding interference caused by physical errors in the modeling. The pre-processed data are then randomly divided into a training dataset and a testing dataset, and only the training dataset is used in the modeling process. After that, the feature variables are selected through Pearson correlation analysis, and the set of feature variables for modeling is determined. The first step of modeling is to use unsupervised deep learning to add operational pattern identification tags as model inputs. The second step is to apply supervised deep learning to develop the prediction models. Finally, the output results are validated with the testing dataset.

[Figure 4. Flowchart of the prediction process: initial dataset collected from the ZEB; data pre-processing; pattern identification (operation pattern classification, fault detection and diagnosis, fault-free dataset); prediction model for the indoor environment (input selection, algorithm selection, model training); assessment of indoor thermal comfort (assess indoor thermal comfort, identify reasons for low thermal comfort, define the most influential variable, early warning for overheating risk).]

(2) Steps of model estimation
(a) Input variables selection
The purpose of input selection is as follows: (1) to find the most effective and correlated variables in the entire dataset; (2) to identify variables with low redundancy and high correlation in order to save computational time; (3) to select easily obtained variables so as to improve the applicability and robustness of the model.
In statistics, there are three commonly used correlation coefficients: Pearson correlation coefficient, Spearman correlation coefficient, and Kendall correlation coefficient. Among the three correlation coefficients, the Pearson correlation coefficient is used to measure the degree of linear correlation in this study. Spearman and Kendall are rank correlation coefficients [54] used to reflect the degree of rank correlation.
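As a minimal sketch of how candidate inputs can be ranked by their Pearson correlation with indoor temperature, the following example computes the coefficient directly from its definition. The function names and the sample values are hypothetical and are not taken from the measured dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Sort feature names by the absolute value of their correlation."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up hourly samples, for illustration only.
indoor_temp = [24.0, 24.5, 25.1, 25.8, 26.4, 26.0]
candidates = {
    "outdoor_temp": [22.0, 24.0, 27.0, 30.0, 32.0, 31.0],
    "wind_speed": [2.0, 1.0, 3.0, 1.5, 2.5, 1.0],
}
ranking = rank_features(candidates, indoor_temp)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

Ranking by absolute value treats strong negative correlations as equally informative as strong positive ones.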
The Pearson correlation indexes [55] are shown in Table 4. The FCU return water temperature and indoor relative humidity show the highest correlation with indoor temperature, followed by solar irradiation, outdoor temperature, pump power (operation variable), and outdoor relative humidity. Since the operation variables of cooling equipment can only be obtained by installing specific sensors, using these variables as input data would limit the broad applicability of the prediction model. Hence, the indoor relative humidity, solar irradiation, outdoor temperature, and outdoor relative humidity are ultimately chosen as the input variables of the prediction model for indoor temperature.
Similarly, as shown in Table 5, the correlation values with CO2 density rank as follows: indoor temperature, return water temperature, indoor relative humidity, solar irradiation, outdoor temperature, pump power, and outdoor relative humidity. Therefore, indoor temperature is added as a new input variable for predicting CO2 density in order to achieve higher accuracy.
(b) Clustering algorithm selection
A clustering analysis is used in this study to detect faulty data and to identify the indoor environment mode. Cluster analysis maximizes the similarity between data samples in the same cluster and minimizes the similarity between data objects in different clusters in the final partition. The data are categorized to differentiate their patterns and to derive stronger rules for the prediction model. A self-organizing mapping (SOM) neural network, also known as a Kohonen network, is an unsupervised competitive learning network proposed by Kohonen et al. in 1981. As a nonlinear unsupervised clustering algorithm, it has been applied widely in artificial neural networks [56]. The algorithm groups similar samples into the same category according to their distance to achieve data clustering.
In the learning process of this network, the competition among neurons is unsupervised. During training, the network automatically finds regularities in the distribution characteristics and topology of the input vectors, adaptively adjusts the weights among its nodes, and finally completes the clustering of the input data. Therefore, this method has been used widely in clustering analysis, signal processing, data dimension reduction, and other fields [57].
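The competitive learning process described above can be illustrated with a small one-dimensional SOM. This is a sketch under several assumptions and not the network configuration used in the study: the map has only two nodes, the learning rate and neighborhood radius decay linearly, the weights are initialized from evenly spaced samples, and the six two-dimensional samples are made up.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to sample x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(samples, n_nodes=2, epochs=50, lr0=0.5, radius0=0.5, seed=0):
    """Train a 1-D SOM and return the node weight vectors."""
    rng = random.Random(seed)
    # Assumption: initialize node weights from evenly spaced samples.
    step = (len(samples) - 1) / max(n_nodes - 1, 1)
    weights = [list(samples[round(i * step)]) for i in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        radius = max(radius0 * decay, 1e-3)
        for x in rng.sample(samples, len(samples)):  # shuffled pass
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighborhood: nodes near the winner move more.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Two made-up groups of normalized (temperature, solar) samples.
samples = [(-1.0, -0.9), (-0.9, -1.1), (-1.1, -1.0),
           (1.0, 0.9), (0.9, 1.1), (1.1, 1.0)]
weights = train_som(samples)
labels = [best_matching_unit(weights, x) for x in samples]
print(labels)
```

The resulting node index of each sample can then serve as the cluster tag that is passed to the prediction model as an additional input.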
(c) Prediction algorithm selection
Three machine learning methods are selected in this study for building the simple model: the back propagation (BP) neural network, the radial basis function (RBF) neural network, and long short-term memory (LSTM). BP is a classical feed-forward neural network, RBF is a special neural network based on radial basis functions, and LSTM is a feedback (recurrent) neural network.
The long short-term memory neural network (LSTM) [58] is a special type of recurrent neural network (RNN) that can learn dependencies over a time series. It fits this research because LSTM can process not only single data points but also entire sequences of data or historical states. It is suitable for chronologically ordered numerical sequences of indoor temperature, for multivariable, strongly coupled, and severely nonlinear relationships, and for situations in which the relationships are difficult to describe with explicit functions. The LSTM neuron structure is shown in Figure 5. There are three gate structures in the neuron: the input gate, the output gate, and the forget gate. The first step is to decide, through the forget gate, which information will be discarded from the cell state. The second step is to determine, in the input gate, which information will be stored in the cell state, and the third step is to set the output value in the output gate.
The BP neural network [59] has very good nonlinear fitting ability and can be used to identify complex and nonlinear systems. In particular, a BP neural network can build a relatively good functional relationship between input signals and output signals by using the original samples to train the network, so it is more suitable for short-term prediction.
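The three-gate sequence of the LSTM neuron described above can be sketched for a single hidden unit as follows. This is the standard textbook LSTM update written in plain Python, not the network trained in the study; the parameter layout, the gate names used as dictionary keys, and the numerical values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name to (input weights, recurrent weight, bias).
    """
    def gate(name, activation):
        w, u, b = params[name]
        z = sum(wi * xi for wi, xi in zip(w, x)) + u * h_prev + b
        return activation(z)

    f = gate("forget", sigmoid)        # step 1: what to discard from the cell state
    i = gate("input", sigmoid)         # step 2: how much new information to store
    g = gate("candidate", math.tanh)   # candidate values for the cell state
    o = gate("output", sigmoid)        # step 3: how much of the cell state to output
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters for two inputs (e.g., outdoor temperature, solar).
params = {
    "forget": ([0.1, -0.2], 0.3, 0.5),
    "input": ([0.4, 0.1], -0.1, 0.0),
    "candidate": ([0.3, 0.2], 0.2, 0.0),
    "output": ([-0.1, 0.5], 0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [(0.2, 0.1), (0.5, 0.4), (0.9, 0.8)]:  # a short normalized sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the hidden state `h` and cell state `c` are carried from one step to the next, each output depends on the whole preceding sequence rather than on the current sample alone.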
The RBF plays an important role in the field of neural networks. For example, RBF neural networks have the unique best approximation property. As a kernel function, a radial basis function can map input samples to a high-dimensional feature space and thereby solve some problems that are originally linearly inseparable.
The prediction model proposed in this study is shown in Figure 6. The whole prediction process is divided into two layers. Outdoor temperature, solar irradiation, outdoor relative humidity, and indoor relative humidity are the input variables of the first prediction model, and indoor temperature is added as a new input to estimate CO2 density in the second layer. The accuracy of the second layer is determined by the first layer. That is, the prediction accuracy of CO2 density depends on the accurate prediction of indoor temperature.
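The two-layer structure can be sketched as a simple cascade in which the temperature predicted by the first model is appended to the inputs of the second. The two linear functions below are hypothetical stand-ins with made-up coefficients, used only to show the data flow; in the study the models are trained neural networks.

```python
def cascade_predict(temp_model, co2_model, inputs):
    """Predict indoor temperature first, then feed it to the CO2 model."""
    results = []
    for x in inputs:
        t_hat = temp_model(x)
        co2_hat = co2_model(list(x) + [t_hat])  # predicted temperature as extra input
        results.append((t_hat, co2_hat))
    return results


def toy_temp_model(x):
    # Hypothetical stand-in for the first-layer model.
    t_out, solar, rh_out, rh_in = x
    return 20.0 + 0.2 * t_out + 0.002 * solar


def toy_co2_model(x):
    # Hypothetical stand-in for the second-layer model; uses the last input.
    t_in = x[-1]
    return 400.0 + 50.0 * (t_in - 24.0)


# One made-up sample: outdoor temperature, solar irradiation,
# outdoor relative humidity, indoor relative humidity.
inputs = [(30.0, 500.0, 60.0, 50.0)]
predictions = cascade_predict(toy_temp_model, toy_co2_model, inputs)
print(predictions)
```

The sketch also makes the dependency visible: any error in the first-layer temperature estimate is passed directly into the second-layer CO2 estimate.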

(d) Evaluation index illustration
Three evaluation metrics, root mean square error (RMSE), mean square error (MSE), and rsquared (R 2 ), are used to evaluate the performances of those prediction models. RMSE [60], known as the standard error reflects the average deviation between the predicted values and the real value. MSE [61] refers to the average value of the relative error, which is used to compare the reliability of the prediction model. R-squared (R 2 ) [62] is a statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model.  The BP [59] has very good nonlinear fitting ability, which can be used to identify complex and nonlinear systems. In particular, BPNN can build a relatively good functional relationship between input signals and output signals using original samples to train the network, so it is more suitable for short-term prediction.
The RBF plays an important role in the field of neural networks. For example, RBF neural networks have the unique best approximation property. As a kernel function, a radial basis function can map input samples to high-dimensional feature space and solve some problems that are originally linear and inseparable.
The prediction model proposed in this study is shown in Figure 6. The whole prediction process is divided into two layers. Outdoor temperature, solar irradiation, relative humidity, and indoor humidity are input variables for the first prediction model, and indoor temperature is added as a new input to estimate CO 2 density in the second loop. The accuracy of the second layer is determined by the first layer. That is, the prediction accuracy of CO 2 density is guaranteed by the accurate prediction of indoor temperature. The BP [59] has very good nonlinear fitting ability, which can be used to identify complex and nonlinear systems. In particular, BPNN can build a relatively good functional relationship between input signals and output signals using original samples to train the network, so it is more suitable for short-term prediction.
The RBF plays an important role in the field of neural networks. For example, RBF neural networks have the unique best approximation property. As a kernel function, a radial basis function can map input samples to high-dimensional feature space and solve some problems that are originally linear and inseparable.
The prediction model proposed in this study is shown in Figure 6. The whole prediction process is divided into two layers. Outdoor temperature, solar irradiation, relative humidity, and indoor humidity are input variables for the first prediction model, and indoor temperature is added as a new input to estimate CO2 density in the second loop. The accuracy of the second layer is determined by the first layer. That is, the prediction accuracy of CO2 density is guaranteed by the accurate prediction of indoor temperature.

Figure 6. Prediction model process.

(d) Evaluation index illustration
Three evaluation metrics, root mean square error (RMSE), mean square error (MSE), and R-squared (R²), are used to evaluate the performance of the prediction models. RMSE [60], sometimes called the standard error, reflects the average deviation between the predicted values and the measured values. MSE [61] is the average of the squared errors between the predicted and measured values and is used to compare the reliability of the prediction models. R² [62] is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and a dependent variable, R² explains to what extent the variance of one variable explains the variance of the second variable.
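As a minimal sketch of the three metrics, the following uses their standard textbook definitions. The function names and the sample values are illustrative assumptions, not data from this study.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between measured and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same unit as the measured variable."""
    return math.sqrt(mse(actual, predicted))


def r_squared(actual, predicted):
    """Share of the variance of the measured values explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative indoor temperatures in degrees C (made-up values).
measured = [24.0, 25.0, 26.0, 27.0]
predicted = [24.5, 25.0, 25.5, 27.0]

print(mse(measured, predicted))
print(rmse(measured, predicted))
print(r_squared(measured, predicted))
```

Lower MSE and RMSE and an R² closer to 1 indicate a better fit, which is the sense in which the models are ranked.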

Prediction Model Evaluation
Each of the predictive models mentioned above has its own advantages. To compare prediction accuracy, the LSTM model without data clustering and the SOM-BP, SOM-RBF, and SOM-LSTM models with data clustering were evaluated. The performances of the four models are summarized in Table 6 and Figures 7 and 8. SOM-LSTM produces the most accurate results among the prediction models in this study; SOM-BP also performs well. The predictive performance of the SOM-RBF model decreased over time, and the LSTM model without data clustering showed deviations that varied with the prediction interval.
Figure 7a,b compare the predicted results with the actual (measured) values for the four models, for indoor temperature and CO2 density, respectively. SOM-LSTM was the closest to the actual data, with the highest prediction accuracy. Because the SOM-LSTM model determines and uses the influence of past predicted values over time, accurate prediction remains possible even as time elapses.
For predictive models, stability under large fluctuations of the dataset is as important as accuracy. Therefore, boxplots of the accuracy results are shown in Figure 8 to analyze the stability of each prediction model.

Figure 8. Comparison of the prediction accuracy by different simple models: (a) indoor temperature prediction accuracy; (b) CO2 density prediction accuracy.

The SOM-LSTM model shows the best prediction performance, with an accuracy of over 95% for the prediction of indoor temperature and an acceptable accuracy of around 90% for the prediction of CO2 density. The results also demonstrate the feasibility of forecasting CO2 density by introducing indoor temperature as an input variable in the second layer.
Table 6 shows the results of the four models for the three evaluation indexes: MSE, RMSE, and R². In predicting the indoor temperature and CO2 density, the SOM-LSTM method outperforms the LSTM, SOM-BP, and SOM-RBF methods. Thus, the proposed model using the LSTM algorithm with the SOM clustering method (SOM-LSTM) can reliably predict the indoor temperature and CO2 density from 1 May to 31 October. Further thermal comfort assessment and association analysis can be performed based on this predicted dataset.

Overheating Frequency
Based on the data obtained from measurements and the simple model, overheating risk in the analyzed ZEB occurs during the summer and the intermediate seasons (Table 7 and Figure 9). The SOM-LSTM model, which had the highest accuracy among the previously evaluated models, was used. Among the input values, meteorological observation data from around the target building were used for the external weather data, and BEMS data were used for indoor humidity.

Table 7. Overheating assessment during the period from May to October in the Korean ZEB according to prediction model.

Month                    May    June   July   August   September   October
Overheating frequency    33%    36%    64%    62%      33%         10%

The overheating frequency exceeded 60% in July and August, the midsummer season in Korea. This result suggests that during the nighttime, when the air conditioner is not operating, the indoor temperature and humidity did not decrease because of the high insulation level and airtightness of the ZEB. Overheating also occurred during periods when air conditioners were off, such as May and October. Overheating frequency was generally lower during the non-cooling season, although there was no significant difference between May (33%) and June (36%). July showed the highest frequency of overheating, and the lowest occurred in October.
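The monthly frequencies can be read as the share of days in each month that are flagged as overheated. The sketch below shows that arithmetic only; the function name, the input layout, and the sample flags are assumptions, and the criterion that marks a day as overheated is taken as given rather than reproduced here.

```python
from collections import defaultdict


def monthly_overheating_frequency(daily_flags):
    """Return the percentage of overheated days per month.

    daily_flags: iterable of (month_name, overheated) pairs, one per day,
    where `overheated` is True if the day met the overheating criterion.
    """
    total_days = defaultdict(int)
    overheated_days = defaultdict(int)
    for month, overheated in daily_flags:
        total_days[month] += 1
        if overheated:
            overheated_days[month] += 1
    return {m: 100.0 * overheated_days[m] / total_days[m] for m in total_days}


# Made-up flags for illustration: 10 days per month instead of a full month.
flags = ([("July", True)] * 6 + [("July", False)] * 4
         + [("October", True)] * 1 + [("October", False)] * 9)

print(monthly_overheating_frequency(flags))
```

Applied to the predicted dataset for May to October, the same counting yields one percentage per month, as reported in Table 7.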

Contribution Rate of the Influencing Factors to Overheating Risk
The contribution rates of influencing factors to overheating risk are shown in Tables 8 and 9. Table 8 presents the results for the cooling season (June to September) and Table 9 presents the results for the intermediate season (May and October).
In the cooling season, the occupancy rate (CO2 concentration) had the highest impact on overheating, with a 60% contribution rate for the analyzed ZEB. Solar insolation followed with a 44% contribution rate. Outdoor temperature, outdoor humidity, and wind speed had contribution rates of 38.65%, 34.67%, and 25.61%, respectively. These results indicate that the indoor temperatures exceeded the thermal comfort zone when the number of occupants or the amount of solar insolation increased. In this situation, the cooling system may not be able to maintain a comfortable indoor thermal environment in the analyzed ZEB.
Even though the overheating frequency was lower in the intermediate season than in the cooling season, overheating still occurred at a frequency of about 10 to 30%. Solar insolation significantly affects overheating in the intermediate season, and the influence of occupancy rate was also high. The effect of solar insolation is strong because south-facing spaces receive more solar insolation in spring and autumn than in summer, owing to the lower solar altitude angle. More sunlight therefore enters the room in spring and autumn than in summer in Korea. The building code regulates window performance only through the heat transfer rate (U-value); there is no regulation on the amount of solar irradiation admitted (solar heat gain coefficient, SHGC). ZEBs focus on strengthening the insulation and airtightness of the building envelope. Therefore, heat gain caused by solar insolation is the main cause of the overheating risk in ZEBs in Korea.

Overheating Risk under Different Conditions
Overheating is caused by the interaction of several factors rather than by a single factor (Tables 8 and 9). Therefore, it is important to analyze the degree of overheating in situations in which two factors are combined. According to the results of this study, excessive solar insolation and occupancy density increase the indoor air temperature and cause overheating. The outdoor temperature also has a great influence on indoor overheating. However, overheating is easy to anticipate when all of these factors are high. Therefore, this study analyzed the effect of two factors on overheating in situations where one factor has a high value but the other does not (Table 10). Three sets of conditions are discussed: (1) high occupancy rate with low solar insolation (condition 1), (2) low occupancy rate with high solar insolation (condition 2), and (3) high solar insolation with high outdoor relative humidity (condition 3). Condition 3 occurs in the morning or evening, when the outside air temperature is low and the relative humidity is high but strong solar radiation enters from the east or west.
The analyzed results are shown in Table 10 and Figure 10. When the occupancy is high but the solar insolation is low, the overheating frequency is about 24%. When the occupancy is low and the solar insolation is high, the overheating frequency is about 26%. This indicates that solar insolation and occupancy level can each be a key factor affecting overheating in ZEBs. Because a ZEB is highly insulated and airtight, solar insolation through the windows and the heat generated by the occupants raise the indoor temperature during the day. Compared to general buildings, the indoor temperature in a ZEB also stays relatively high during the night because less heat is lost through the building envelope. In particular, in the analyzed ZEB, the window-to-wall ratio of about 40%, which was chosen to reduce the heating load, contributed to an increase in overheating frequency during the cooling period. However, the overheating frequency is only 8% when the solar insolation and outdoor relative humidity are both high. This suggests that overheating rarely occurs in the morning or evening hours.
The factors that most affect indoor overheating in the analyzed ZEB are solar insolation and occupancy level. This may be due to the lack of a standard for solar radiation through the windows in the Korean ZEB standard, and to the fact that the design standard for occupancy density is lower than the occupancy of an actual building. In addition, since the ZEB standard is set to minimize the heating load and to strengthen insulation and airtightness performance, it is difficult for indoor heat to be discharged to the outside. This is the cause of overheating in ZEBs in Korea.
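The condition-based comparison amounts to computing the overheating frequency within the subset of records that satisfy each condition. The following is a minimal sketch under stated assumptions: the record fields, the thresholds `HIGH_CO2_PPM` and `HIGH_SOLAR_W_M2`, and the sample records are all hypothetical and are not the values used in this study.

```python
# Hypothetical thresholds separating "high" from "low" (assumed for illustration).
HIGH_CO2_PPM = 1000.0     # CO2 level used here as a proxy for occupancy
HIGH_SOLAR_W_M2 = 400.0   # solar insolation


def conditional_overheating_frequency(records, condition):
    """Percentage of overheated records among those satisfying `condition`."""
    selected = [r for r in records if condition(r)]
    if not selected:
        return None  # the condition never occurred in the data
    return 100.0 * sum(1 for r in selected if r["overheated"]) / len(selected)


def condition_1(r):
    """High occupancy with low solar insolation."""
    return r["co2_ppm"] >= HIGH_CO2_PPM and r["solar_w_m2"] < HIGH_SOLAR_W_M2


def condition_2(r):
    """Low occupancy with high solar insolation."""
    return r["co2_ppm"] < HIGH_CO2_PPM and r["solar_w_m2"] >= HIGH_SOLAR_W_M2


# Made-up records for illustration only.
records = [
    {"co2_ppm": 1200, "solar_w_m2": 100, "overheated": True},
    {"co2_ppm": 1100, "solar_w_m2": 150, "overheated": False},
    {"co2_ppm": 1300, "solar_w_m2": 50, "overheated": False},
    {"co2_ppm": 1250, "solar_w_m2": 120, "overheated": False},
    {"co2_ppm": 500, "solar_w_m2": 700, "overheated": True},
    {"co2_ppm": 450, "solar_w_m2": 650, "overheated": False},
]

print(conditional_overheating_frequency(records, condition_1))
print(conditional_overheating_frequency(records, condition_2))
```

Conditioning on one factor being high while the other is low isolates the effect of a single driver, which is the purpose of the three conditions in Table 10.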

Conclusions
This study aimed to discuss overheating issues in a high-performance ZEB in Korea. Field measurements were carried out to analyze the overheating risk for a zero-energy library building. A data-driven model for the prediction of overheating risk was developed, requiring only a small amount of measurement data and extending the analysis throughout the year. The main factors causing overheating during both the cooling season and the intermediate seasons were also analyzed in detail. The results of this study are as follows:
A simple model based on a data-driven approach can accurately forecast overheating conditions throughout the year with easily obtained data from local weather stations and a few easily accessed indoor parameters, such as indoor temperature and CO2 density.
The SOM-LSTM model shows the best prediction performance, with a high accuracy (over 95%) for the prediction of indoor temperature and an acceptable accuracy (around 90%) for the prediction of CO2 density.
The overheating frequency exceeded 60% in July and August, the midsummer season in Korea. Overheating also occurred during the intermediate seasons when air conditioners were off, such as in May and October.
In the cooling season, the occupancy rate (CO2 concentration) showed the highest impact on overheating, with a 60% contribution rate for the analyzed ZEB. In contrast, solar insolation significantly affects overheating in the intermediate season. This is because south-facing spaces receive more solar insolation in spring and autumn than in summer, owing to the solar altitude angle. More sunlight enters the room in spring and autumn than in summer in Korea.
The factors that most affect indoor overheating in the analyzed ZEB are solar insolation and occupancy level. This is due to the lack of a standard for solar radiation through the windows in the Korean ZEB standard, and to the fact that the design standard for occupancy density is lower than the occupancy of an actual building. In addition, since the ZEB standard is generally set to minimize the heating load, it is difficult for indoor heat to be discharged to the outside, causing overheating in ZEBs.
This study evaluated the problems of the ZEB performance standard in Korea through a predictive model. Because the Korean ZEB performance standard is based on the Passive House performance level, it may not reflect the local climate or building usage characteristics. As the overheating problem shown in this study illustrates, a ZEB may not lead to energy savings at the operational stage and may not guarantee the occupants' thermal comfort. A ZEB performance standard is therefore required that considers both energy savings and the thermal comfort of building users in all seasons.

Funding:
This research was funded by the Ministry of Science and ICT of the Korean government, grant number 2019M3E7A1113080.