Use of Machine Learning Methods for Indoor Temperature Forecasting

Improving the energy efficiency of the building sector has become an increasing concern in the world, given the alarming reports of greenhouse gas emissions. The management of building energy systems is considered an essential means for achieving this goal. Predicting indoor temperature constitutes a critical task for the management strategies of these systems. Several approaches have been developed for predicting indoor temperature. Determining the most effective has thus become a necessity. This paper contributes to this objective by comparing the ability of seven machine learning (ML) algorithms and a thermal gray box model to predict the indoor temperature of a closed room. The comparison was conducted on a set of data recorded in a room of the Laboratory of Civil Engineering and geo-Environment (LGCgE) at Lille University. The results showed that the best prediction was obtained with the artificial neural network (ANN) and extra trees regressor (ET) methods, which outperformed the thermal gray box model.


Introduction
Improving buildings' energy efficiency is a priority area for progress. The design and implementation of efficient energy management strategies that balance energy consumption and occupant comfort are of particular interest in this domain. The indoor temperature is key to such strategies, being one of the most critical parameters affecting energy consumption and personal comfort. In this context, predicting the indoor temperature is an essential task.
Temperature forecasting has been considered an interesting subject, widely studied in the literature [1][2][3][4]. Moreover, it has also been integrated into predictive control models, developed to optimize energy devices [5,6].
The estimation of indoor temperature has been tackled with different approaches, which fall into two main categories according to their foundations: the physical approach and the data-driven approach [7]. Physical modeling uses detailed equations based on physical engineering principles [8]. This approach requires thorough knowledge of the overall structure of the building, its components, and its energy systems, and it has a reasonably high computational cost [8][9][10]. The data-driven approach expresses the dynamic system through purely mathematical relations that give the output data as a function of the input data. When the adopted mathematical functions have a physical meaning, the model is a gray box model; when they carry no physical meaning, the model is known as a black box. The black box model forgoes the need for detailed input data of the simulated building and focuses on learning from the available historical data [11]. This approach and NN methods gave accurate results. Aguilera et al. [36] showed the accuracy of a thermal model based on a DT algorithm using the weather data and occupants' feedback to predict the indoor temperature.
The gray box model involves both physical and black-box modeling [37]. This approach is based on the thermal modeling of buildings by analogy with an electrical resistance-capacity circuit [38]. The buildings are modeled by a set of dynamic differential equations representing conduction, convection, and capacitive (heat storage) phenomena. Several scholars used this approach in research about building energy efficiency. Berthou et al. [17] tested the capacity of four gray box models to predict the heating and cooling demands of a multi-zone occupied office building in order to determine the best model architecture. The results showed that a second-order model represented the thermal behavior of the office building well. Cui et al. [3] developed a hybrid model to predict the average temperature in two-story houses. Tests conducted on the 24 h data horizon gave satisfactory results. Ogunsola et al. [39] created a time-series model to estimate the real-time cooling load. Two gray box models were combined for the building envelope and the internal thermal mass. The relevance of the model was checked on light, medium, and heavy constructions. A reasonably high degree of precision was obtained for the studied cases.
The studies mentioned above focused on using either machine learning techniques or the gray box approach for the thermal modeling of buildings. This paper compares the performance of a set of data-driven models from both categories.
The remainder of this paper is organized as follows: Section 2 outlines the research methodology and material; Section 3 presents and discusses the prediction results; Section 4 summarizes the conclusions and highlights the primary outcome of this research.

Methodology
This research aimed to compare the ability of different ML algorithms and a gray box model to predict indoor temperature. During this investigation, data on the thermal environment were first collected from a heating experiment in a closed room in the LGCgE laboratory using an intelligent monitoring system. The recorded datasets served as a basis for the development and training of ML algorithms, and the evaluation of their predictive performance in terms of root mean square error (RMSE) and coefficient of determination (R²). For detailed information about the dataset, see the Supplementary Materials. A gray box model was also established and compared to the ML algorithms to cover the statistical and hybrid aspects of data-driven modeling. Figure 1 summarizes the methodology applied in this study. More explicit descriptions of the experiment, intelligent algorithms, and evaluation criteria are presented below.

Material
The study was conducted in an unoccupied closed room in the LGCgE Laboratory at Lille University. The closed room has an area of 9 m² and a height of 2.3 m. It is furnished and does not have a facade or windows (Figure 2). To model the thermal environment of the room, an intelligent monitoring system composed of a wireless sensor network connected to a micro-computer (Raspberry Pi) was implemented.
The main objective of these sensors was to track indoor comfort parameters. They provided measurements of four environmental variables: temperature, humidity, luminosity, and noise (THLN). In our work, we focused on the temperature readings. Sensors were installed as shown in Figure 2 on the internal and external faces of the walls of the room, and another sensor was suspended at the center to assess the indoor temperature. A standard methodology for monitoring cannot be found in the scientific literature. Therefore, the number of sensors and their positions are usually based on empirical approaches [40]. However, several studies have developed models to determine the optimal location of sensors to control energy consumption and thermal comfort [41][42][43]. In this research, the position of the thermal sensors was determined based on a study carried out in the LGCgE laboratory about the optimal sensor position that can provide representative data of the indoor room environment. Accordingly, a sensor was suspended by a wire in the center of the room at a height of 1.5 m above the ground. The position of the sensors recording the temperature of the internal and external faces of the walls was determined based on the manufacturer's recommendations [44]. Two sensors per wall were installed, one on the internal face and one on the external face, in a neutral zone at the same height above the ground (1.5 m).
Reliability analysis of the sensors was carried out before their use. A set of sensors was placed at the same position. Based on the obtained temperature profiles, the sensors were classified into four groups (Figure 3). The maximum temperature difference between these groups is shown in Figure 4.
The closed room was heated using a 2000 W radiator for several hours. The temperatures at the center and on the walls, recorded at 10 min intervals during the experiment, served as the dataset for the applied thermal algorithms.
Data were checked before their use in numerical modeling for the identification of missing data or abnormal values. Missing data were identified easily since data were recorded at a given time interval. Abnormal values were identified if they exceeded maximum expected values. In these two cases, data were identified and reported as unacceptable data. Since our experiments were conducted in controlled conditions, collected data were exempt from missing data or abnormal values. In the future, techniques based on machine learning will be used to identify and treat missing data and abnormal values.
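The two checks described above can be sketched as follows. This is a minimal illustration and not the procedure actually used in the study: the function name `check_readings`, the upper bound `t_max`, the dates, and the sample readings are all hypothetical. Only the 10 min recording interval comes from the experiment.

```python
from datetime import datetime, timedelta

def check_readings(readings, step=timedelta(minutes=10), t_max=50.0):
    """Return (missing_timestamps, abnormal_readings) for a time-sorted series.

    readings: list of (timestamp, temperature) tuples sorted by time.
    step:     expected sampling interval (10 min in the experiment).
    t_max:    hypothetical upper bound on plausible temperatures.
    """
    missing, abnormal = [], []
    for (t_prev, _), (t_next, _) in zip(readings, readings[1:]):
        # Every expected timestamp strictly between two records is a gap.
        expected = t_prev + step
        while expected < t_next:
            missing.append(expected)
            expected += step
    for stamp, value in readings:
        # A reading above the expected maximum is flagged as abnormal.
        if value > t_max:
            abnormal.append((stamp, value))
    return missing, abnormal

# Hypothetical readings: two records are missing and one value is abnormal.
start = datetime(2021, 1, 1, 8, 0)
series = [
    (start, 19.5),
    (start + timedelta(minutes=10), 19.8),
    (start + timedelta(minutes=40), 85.0),
    (start + timedelta(minutes=50), 20.6),
]
missing, abnormal = check_readings(series)
print(missing)
print(abnormal)
```

Because the data are recorded on a fixed grid, gaps can be found by comparing consecutive timestamps, which is why missing data are easy to identify in this setting.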
The variation of the recorded temperatures, together with the heating period, is illustrated in Figure 5.

Selection of Predictive Models
In this study, a set of AI-based algorithms and a gray box model were compared to identify the most suitable model to predict the indoor temperature of the room. Furthermore, these models were evaluated according to their forecast accuracy and their performance. A detailed description of the adopted models will be presented below.

ML Methods
A variety of ML algorithms are found in the literature. Some of these algorithms (Table 1) have been frequently used and have shown reliable results in predicting buildings' thermal and energy variables.
An artificial neural network (ANN) is a system whose functioning is inspired by the neurons of the human brain. The multi-layer perceptron (MLP) is the most popular structure among feed-forward ANNs and has been the subject of numerous studies. An MLP has an input layer, an output layer, and a hidden layer in which each neuron is connected to the input and output layers. This architecture has been used as a powerful method to predict the indoor temperature and energy consumption of buildings [22,45,51] and to assess the occupants' thermal comfort [52,53]. This research used an MLP model with one hidden layer of four neurons. This number was selected after a set of tests in which the number of neurons ranged from 4 to 10. The study conducted in [51] also supports this number. The training process was carried out using the Levenberg-Marquardt algorithm, which has proven effective in converging towards a minimal root mean square error [52,53]. A sigmoid transfer function was used for the hidden layer, while a linear transfer function was used for the output layer. Several tests were carried out to obtain a reliable prediction. These tests were characterized by similar training times. The best prediction was obtained for a test with 32 epochs and four neurons in the hidden layer. Appendix A summarizes the different tests conducted to estimate the number of neurons and to obtain the best prediction performance.
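The forward pass of such a network (one hidden layer of four neurons with a sigmoid transfer function and a linear output neuron) can be illustrated with the short sketch below. The weights, biases, and inputs are hypothetical, untrained values chosen only for illustration. The Levenberg-Marquardt training step, performed in this study with the MATLAB toolbox, is not reproduced here.

```python
import math

def sigmoid(z):
    """Logistic transfer function used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with sigmoid activation, then a linear output neuron."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical, untrained parameters: 3 inputs -> 4 hidden neurons -> 1 output.
w_hidden = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.2, 0.2], [-0.3, 0.1, 0.4]]
b_hidden = [0.0, 0.1, -0.1, 0.05]
w_out = [1.0, -1.0, 0.5, 2.0]
b_out = 0.25

x = [0.5, 0.2, 0.8]  # hypothetical scaled inputs
y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(y)
```

The linear output layer leaves the prediction unbounded, which suits a regression target such as temperature.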
Multiple linear regression (MLR) is a mathematical regression method that extends simple linear regression. It has demonstrated its ability to solve complex problems, in particular a building's energy balance and energy planning [9], daily peak demand and consumption [46], and annual energy consumption [54].
A decision tree (DT) is a technique based on partitioning the dataset into groups in the form of a flowchart. This technique has been widely used in predicting buildings' energy consumption [14,55] and user comfort indices [47], as well as modeling buildings' energy demands [56].
Ensemble learning has also been applied in monitoring building energy performance, especially bagging and boosting algorithms.
Random forest (RF) and extra trees (ET) are representative techniques of the bagging family, which combine a multitude of decision trees. These algorithms have proven their efficiency in predicting a building's cooling and heating loads [48] and energy consumption [49,57], as well as personal thermal comfort [58,59].
Gradient boosting (GB) and extreme gradient boosting (XGB) methods also belong to the ensemble learning family. Their basic idea is to combine several simple models, called weak learners, to obtain a strong model with a reduced prediction error. These methods have emerged as a promising alternative in the domain of building energy efficiency. Several studies have confirmed their effectiveness in predicting energy consumption [50,60] and building energy loads [48,61], establishing predictive energy models [62], and detecting faults in HVAC systems [63].
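The idea of combining weak learners can be illustrated with a minimal sketch of gradient boosting for the squared-error loss, in which each weak learner is a one-split decision stump fitted to the current residuals. This is an illustration of the principle only and not the GB or XGB implementations used in this study. The function names, the learning rate, the number of rounds, and the temperature values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising the squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then add shrunken stumps fitted to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For the squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical heating curve: time step index versus temperature.
steps = list(range(10))
temps = [20.0, 20.4, 21.1, 21.9, 22.6, 23.2, 23.7, 24.1, 24.4, 24.6]
model = fit_boosting(steps, temps)
baseline_mse = mse(temps, [sum(temps) / len(temps)] * len(temps))
boosted_mse = mse(temps, [predict_boosting(model, s) for s in steps])
print(baseline_mse, boosted_mse)
```

Each stump alone is a poor predictor, but the sum of many shrunken stumps follows the heating curve far more closely than the constant baseline.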
These supervised ML algorithms were selected in this research due to their popularity. The dataset was divided into two subsets to train and test the chosen algorithms. The 70% and 80% training proportions are most often used in the literature [46][47][48][64][65].
To determine the most appropriate ratios for the dataset, values ranging from 50% to 80% were tested in this study. The results confirmed the use of the two proportions mentioned above. Similar performances in terms of RMSE and R² were observed for these proportions (see Appendix B).
ANN modeling was conducted using the neural network toolbox in MATLAB, considering a dataset divided into 70% for training, 15% for validation, and 15% for testing. All the other algorithms were developed in Python. Their hyper-parameters were kept at their default values, considering a dataset distribution of 70% for training and 30% for testing.
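The dataset divisions mentioned above can be sketched with a small helper. The function `split_dataset` is hypothetical, and the random shuffling with a fixed seed is an assumption, since the text does not state whether the split was random or chronological.

```python
import random

def split_dataset(samples, fractions, seed=0):
    """Shuffle the samples and cut them into consecutive subsets.

    fractions: proportions of each subset, for example (0.7, 0.3).
    The last subset receives whatever remains after the earlier cuts.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed random split
    parts, start, cumulative = [], 0, 0.0
    for fraction in fractions[:-1]:
        cumulative += fraction
        end = round(cumulative * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])
    return parts

samples = list(range(100))  # placeholder sample indices

# 70% training and 30% testing, as used for the Python algorithms.
train, test = split_dataset(samples, (0.7, 0.3))

# 70% training, 15% validation, and 15% testing, as used for the ANN.
ann_train, ann_val, ann_test = split_dataset(samples, (0.7, 0.15, 0.15))
print(len(train), len(test), len(ann_train), len(ann_val), len(ann_test))
```

The same helper can reproduce the tested training proportions from 50% to 80% by changing the `fractions` argument.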
The input and output variables used for the models are summarized in Table 2. The temperature history is a matrix of parameters with a difference of 30 min between its successive columns. For example, if the temperature was recorded at a time t, the history corresponds to t-0.5h, t-1h, t-1.5h, and t-2h.
The accuracy of these forecasting models was evaluated, and their performances were compared based on the following criteria:
The root mean square error (RMSE), which provides information on the magnitude of the deviations [3,65,66]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
The coefficient of determination (R²), which measures the adequacy between the predicted and the observed data [16,52,55]:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where ŷ is the predicted vector, y is the reference vector, ȳ is the mean of the reference values, and n is the number of observations.
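The construction of the temperature history and the two evaluation criteria can be sketched as follows. With measurements every 10 min, a 30 min difference between columns corresponds to lags of 3, 6, 9, and 12 samples. The function names and the sample values are hypothetical, and the metrics are written in their usual textbook form.

```python
import math

def lagged_history(series, lags=(3, 6, 9, 12)):
    """Build rows [T(t-0.5h), T(t-1h), T(t-1.5h), T(t-2h)] with target T(t).

    With one sample every 10 min, a lag of 3 samples equals 30 min.
    """
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(series)):
        rows.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return rows, targets

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical temperature series sampled every 10 min.
series = [20.0 + 0.1 * i for i in range(15)]
rows, targets = lagged_history(series)
print(rows[0], targets[0])

# Hypothetical reference and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

The first two hours of the record cannot be used as targets, because their history would reach before the start of the measurements.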

Gray Box Model (GBM)
Hybrid models have been the subject of numerous studies. They have been widely used in the field of predictive control [67][68][69] as well as in predicting building thermal load [70,71] and indoor temperature [3]. The most common method for creating this model is applying a resistance-capacity (RC) form based on physical and statistical approaches [38][72][73][74]. The thermal resistance R represents the ability of a component to resist the heat flux, and the thermal capacity C describes its heat storage capacity.
In this work, a simplified RC model (Figure 6) was developed to thermally model the considered room.
(T1, T2) are the respective outdoor and indoor temperatures of the first wall, (T5, T4) the respective outdoor and indoor temperatures of the second wall, and T3 the indoor air temperature. Qh is the heat source power.
The model's parameters (R1, C1), (R2, C2), and (R3, C3) respectively designate the thermal resistance and capacity of the first wall, the indoor air, and the second wall. The model can be expressed as a linear stochastic differential equation written in matrix form for state-space representation by applying Kirchhoff's balance laws to the circuit [75]. It includes a state equation and an output equation:
$$\frac{dT}{dt} = A\,T + B\,U$$
$$Y = C\,T + D\,U$$
The T vector contains the node temperatures, U the controllable inputs and disturbances, and Y the measured output; the A, B, C, and D matrices contain the RC parameters to be identified.
The parameters of the model were identified using the greyest function in MATLAB. The initial values of (R2, C2) and (R1, R3) were selected by applying the French thermal code (RT 2005-2012), while those of (C1, C3) were estimated based on the equations characterizing the walls mentioned in the building thermal code [66,76] (see Appendix C).
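To illustrate how a state-space RC model produces a temperature trajectory, the sketch below integrates a state equation of the form dT/dt = A T + B U with the forward-Euler method. It is applied to a hypothetical single-node room (one resistance and one capacity), not to the three-node model of this study. The values of R and C, the outdoor temperature, and the function name are assumptions. Only the 2000 W heat source and the 10 min time step come from the experiment. The output is taken as the state itself, and parameter identification is not covered.

```python
def simulate_state_space(A, B, T0, inputs, dt):
    """Forward-Euler integration of dT/dt = A T + B U; returns all states."""
    T = list(T0)
    history = [list(T)]
    for U in inputs:
        dT = [
            sum(a * t for a, t in zip(row_a, T))
            + sum(b * u for b, u in zip(row_b, U))
            for row_a, row_b in zip(A, B)
        ]
        T = [t + dt * d for t, d in zip(T, dT)]
        history.append(list(T))
    return history

# Hypothetical single-node room: C dT/dt = (T_out - T) / R + Q_h
R = 0.01     # thermal resistance in K/W (assumed)
C = 5.0e5    # thermal capacity in J/K (assumed)
A = [[-1.0 / (R * C)]]
B = [[1.0 / (R * C), 1.0 / C]]    # inputs U = [T_out, Q_h]
dt = 600.0                         # 10 min time step in seconds
inputs = [[18.0, 2000.0]] * 200    # constant outdoor temperature and heat power
history = simulate_state_space(A, B, [18.0], inputs, dt)
print(history[1][0], history[-1][0])
```

With these assumed values, the time constant R C is 5000 s, so the explicit scheme is stable for the 600 s step and the temperature settles at T_out + R Q_h.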

Results and Discussion
The performance of the tested algorithms was evaluated using the coefficient of determination (R²) and the root mean square error (RMSE). The obtained values are given in Table 3. This part focuses on the results of the prediction of the temperature at the center of the room only, since similar results were obtained for the prediction of the temperature of the internal faces of the walls. The ML algorithms used have been sorted in decreasing order of performance in each experiment, in other words, by increasing RMSE and decreasing R², as shown in Figure 7.
The proposed algorithms have shown their efficiency in predicting the indoor temperature of the room, given the values of the performance indices (RMSE < 1 and R² > 0.8) [47,77]. Although all of these algorithms perform well, they do not have the same prediction accuracy. The best results for predicting the indoor temperature were provided by the ANN (RMSE = 0.081 and R² = 0.99965) and ET (RMSE = 0.159 and R² = 0.99864) algorithms. The boosting algorithms (GB and XGB) showed fairly close performance. DT, RF, and MLR performed less well than the previous algorithms, although their performance criteria remained acceptable.
The RC model also exhibited acceptable values of the performance criteria (RMSE < 1 and R² > 0.8). These are compared to those of the ML algorithms in Figure 8, which illustrates the ranking of the gray box model against the lowest- and best-performing ML algorithms. This figure shows that the AI-based algorithms outperformed the gray box model in predicting the indoor temperature. Even the lowest-performing algorithm, MLR (Figure 6), showed better performance criteria values (RMSE = 0.332 and R² = 0.99415) than the RC model (RMSE = 0.842 and R² = 0.96237).
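The ranking and the acceptance thresholds discussed above can be checked with a few lines of code using the four pairs of values reported in the text. The dictionary layout and the function name `is_acceptable` are illustrative choices. The values for DT, RF, GB, and XGB are not quoted in the text and are therefore omitted.

```python
# RMSE and R² values reported in the text for four of the models.
reported = {
    "ANN": {"rmse": 0.081, "r2": 0.99965},
    "ET": {"rmse": 0.159, "r2": 0.99864},
    "MLR": {"rmse": 0.332, "r2": 0.99415},
    "RC": {"rmse": 0.842, "r2": 0.96237},
}

def is_acceptable(scores, rmse_max=1.0, r2_min=0.8):
    """Apply the acceptance thresholds RMSE < 1 and R² > 0.8."""
    return scores["rmse"] < rmse_max and scores["r2"] > r2_min

# Sort by increasing RMSE, that is, from best to worst.
ranking = sorted(reported, key=lambda name: reported[name]["rmse"])
print(ranking)
print(all(is_acceptable(scores) for scores in reported.values()))
```

Sorting by decreasing R² gives the same order for these four models, so both criteria agree on the ranking.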
The results of this research were compared to other investigations in the literature. The study in [65] compared the performance of 20 families of ML methods in predicting the indoor temperature of an intelligent building. The ET algorithm provided the best performances. This research partially agrees with that study: the ET method was among the best-performing methods, but the ANN model outperformed it. Wang and Chen [78] compared three data-driven models, a linear black-box model (ARX), a non-linear black-box model (ANN), and a gray box model, in predicting the indoor temperature of a single-zone house. The performance of the gray box model was intermediate between the other two models. Our research also confirms the improved performance of the ANN and the linear black-box models over the gray box model. Indeed, in our study, even the simple MLR model outperformed the gray box model.
Our study compared the two aspects of the data-driven approach (black and gray box models) on their abilities to provide a reliable prediction of indoor temperature. It employed emerging predictive models in a straightforward manner, using a limited number of input parameters necessary to achieve accurate prediction results. The obtained results were based on a heating experiment conducted in a closed room in a laboratory environment. This comparison is helpful as it provides a preliminary idea of the most relevant model in indoor temperature prediction that can be employed in energy system management strategies aimed at improving the energy performance of existing buildings.
Although the obtained results are encouraging and some are confirmed by other research, this research has some limitations, which are related to the conditions of the experimentation. Indeed, the prediction of the indoor temperature was limited to the use of the following input parameters: heat power, outdoor wall temperatures, and indoor temperature history. The case study in this work was carried out in a room that was unoccupied and has no facade. This is a limitation of the study, since the influence of additional input parameters, such as the occupants' behavior and the building exposure, on the models remains to be investigated. In the future, the use of variables related to the occupancy and exposure of the room might be necessary to establish a more generalized approach.

Conclusions
Forecasting indoor temperature in buildings constitutes a central task in optimizing the energy control of buildings and ensuring comfort and health conditions for users. This prediction combines technical parameters such as building characteristics and their energy system, environmental parameters such as the outdoor temperature and humidity, and social parameters. Considering these techno-social issues in the thermal modeling of buildings requires advanced methods such as machine learning methods. In addition, the consideration of complex building assets and the integration of unstructured data such as those recorded by cameras require the use of Big Data tools. This paper contributes to the first objective by comparing AI-based techniques and a gray box model to predict indoor temperature. This subject is helpful for the assessment of thermal comfort conditions and for reducing energy consumption.
The analysis was conducted on temperature datasets collected in a closed room of the LGCgE laboratory at Lille University, using simulations in MATLAB and Python. The adopted models exhibited a favorable prediction capacity in terms of root mean square error (RMSE < 1) and coefficient of determination (R² > 0.8). Among these models, ANN and ET emerged as the most suitable algorithms for indoor temperature forecasting, surpassing the other ML algorithms and the gray box model. These algorithms were followed by the boosting algorithms, which exhibited approximately similar behavior. This research shows that a simple AI-based model could provide accurate forecasting of indoor temperature. It also offers an idea of the effective predictive models to be used in energy management strategies.
However, more efforts should be considered in the future to improve the research findings. In addition, this research should be extended to other data collected from other experimentations operating under various conditions with additional parameters such as occupancy and buildings' exposure. The use of these different data will help generalize the results of this research and their use in practical applications.
Furthermore, we suggest extending this research to the prediction of the operative temperature. Indeed, although the air temperature is the commonly used parameter in the control of energy systems, the international standards use the operative temperature for the thermal comfort control.

Data Availability Statement: Not applicable; the study does not report any data.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix C
The heat balance at each node of the room model is described using the following set of first-order differential equations. The initial values of (R1, C1), (R2, C2), and (R3, C3) were determined using the following equations:
$$R_2 = \frac{1}{h_{int}\,S_{int}}$$
where h_int is the coefficient of internal convection, and S_int is the internal exchange surface.
$$C_2 = \rho_{air}\,C_{air}\,V_{int} + Mob\,S_h$$
where ρ_air is the air density, C_air is the air mass capacity, V_int is the indoor air volume, Mob is the impact of the furniture on the air capacity, and S_h is the heated surface.
$$R_{1,3} = R_{si} + \frac{e}{\lambda\,S} + R_{se}$$
where R_si and R_se are the wall's inner and outer surface resistances, respectively, e is the depth of the wall, S its surface, and λ its thermal conductivity.
$$C_{1,3} = m\,C_p$$
where m is the mass of the wall and C_p is its specific heat.
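The order of magnitude of such initial values can be sketched as follows, assuming the usual forms of a convective resistance, 1/(h S), of an air heat capacity with an additive furniture term, and of a wall heat capacity, m Cp. These forms and the function names are assumptions based on the symbol definitions above. The room dimensions (9 m² and 2.3 m) come from the Material section, the air properties are typical textbook values, and the remaining inputs are hypothetical.

```python
def convective_resistance(h_int, s_int):
    """Thermal resistance (K/W) of a convective exchange: 1 / (h * S)."""
    return 1.0 / (h_int * s_int)

def air_capacity(rho_air, c_air, v_int, mob=0.0, s_h=0.0):
    """Heat capacity (J/K) of the indoor air plus an assumed furniture term."""
    return rho_air * c_air * v_int + mob * s_h

def wall_capacity(mass, c_p):
    """Heat capacity (J/K) of a wall of given mass and specific heat."""
    return mass * c_p

area, height = 9.0, 2.3   # room dimensions from the Material section
volume = area * height    # indoor air volume in m³

# Typical air properties (assumed): 1.2 kg/m³ and 1005 J/(kg K).
c_air_only = air_capacity(1.2, 1005.0, volume)
print(c_air_only)

# Hypothetical convection coefficient of 8 W/(m² K) over a 10 m² surface.
print(convective_resistance(8.0, 10.0))
```

Such values serve only as starting points for the identification, which then adjusts them to fit the measured temperatures.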