Article

Research on Feature Variable Set Optimization Method for Data-Driven Building Cooling Load Prediction Model

Di Bai, Shuo Ma, Liwen Wu, Kexun Wang and Zhipeng Zhou

1 School of Environmental Science and Engineering, Tianjin University, Tianjin 300072, China
2 School of Energy and Safety Engineering, Tianjin Chengjian University, Tianjin 300192, China
3 China Mobile Energy Technology (Beijing) Co., Ltd., Beijing 100080, China
* Author to whom correspondence should be addressed.
Buildings 2025, 15(19), 3583; https://doi.org/10.3390/buildings15193583
Submission received: 28 July 2025 / Revised: 7 September 2025 / Accepted: 11 September 2025 / Published: 5 October 2025
(This article belongs to the Section Building Energy, Physics, Environment, and Systems)

Abstract

Short-term building cooling load prediction is crucial for optimizing building energy management and promoting sustainability. While data-driven models excel in this task, their performance heavily depends on the input feature set. Feature selection must balance predictive accuracy (relevance) and model simplicity (minimal redundancy), a challenge that existing methods often address incompletely. This study proposes a novel feature optimization framework that integrates the Maximum Information Coefficient (MIC) to measure non-linear relevance and the Maximum Relevance Minimum Redundancy (MRMR) principle to control redundancy. The proposed MRMR-MIC method was evaluated against four benchmark feature selection methods using three predictive models in a simulated office building case study. The results demonstrate that MRMR-MIC significantly outperforms other methods: it reduces the feature dimensionality from over 170 to merely 40 variables while maintaining a prediction error below 5%. This represents a substantial reduction in model complexity without sacrificing accuracy. Furthermore, the selected features cover a more comprehensive and physically meaningful set of attributes compared to other redundancy-control methods. The study concludes that the MRMR-MIC framework provides a robust, systematic methodology for identifying essential feature variables, which can not only enhance the performance of prediction models, but also offer practical guidance for designing cost-effective data acquisition systems in real-building applications.

1. Introduction

Short-term building cooling load prediction is a critical technical foundation for the refined control of building energy management systems, playing a vital role in improving heating, ventilation, and air conditioning (HVAC) operational efficiency, reducing peak energy consumption, and facilitating demand-side grid response [1,2]. Traditional mechanistic models face limitations in addressing non-linear challenges such as dynamic heat transfer through building envelopes and the stochastic nature of occupant behavior. In contrast, data-driven models offer a promising alternative by uncovering latent patterns from historical operational data [3,4]. With advancements in smart metering and artificial intelligence, machine learning-based data-driven methods have demonstrated superior accuracy and adaptability [5,6], emerging as a research hotspot in building load forecasting.
The construction of an optimal feature variable set is pivotal to enhancing the performance of data-driven load prediction models. Building cooling loads are influenced by multi-source heterogeneous factors, including meteorological parameters, operational conditions, and occupancy patterns, which often exhibit complex non-linear couplings [7]. Irrelevant or redundant features introduce noise, while the omission of key features leads to underfitting. Thus, establishing a scientific feature evaluation framework is essential for optimizing feature sets [8]. This process not only improves prediction accuracy but also reduces computational complexity and enhances model interpretability [9], offering dual benefits for engineering applications. Existing research on feature selection primarily focuses on two aspects: (1) data acquisition for load-influencing factors and (2) feature evaluation and screening.
Data acquisition for load-influencing factors. External disturbances affecting cooling loads mainly stem from outdoor environmental conditions. Current monitoring techniques for outdoor parameters are relatively mature. Meteorological data such as dry-bulb temperature, relative humidity, solar irradiance, wind speed, wind direction, and cloud cover can be accurately measured and utilized as input variables for load prediction models, often combined with weather forecasts [10]. Internal disturbances, including occupancy, equipment, and lighting, are more challenging to quantify. For instance, while occupant count and movement significantly impact loads, reliable real-time monitoring—even for basic metrics like occupancy—remains elusive [11]. Calendar data [12] or fixed schedules [13] are often used as model inputs but provide limited dynamic information on internal disturbances. Advanced sensing technologies, such as cameras [14], passive infrared sensors [15], and mobile positioning data [16], offer richer insights into occupancy patterns. However, unlike predictable schedule-based data, these measurements often require historical modeling to forecast future states. For example, Tekler et al. [17] identified carbon dioxide (CO2) concentration and Wi-Fi-connected device counts as effective proxies for occupancy among 35 potential variables and used deep learning to predict their future values. Additionally, historical load sequences implicitly embed internal disturbance patterns and are commonly employed in time-series models [18]. Beyond direct usage, some studies extract latent features from historical data to enhance prediction. Yang et al. [19] applied K-shape clustering to daily load profiles, demonstrating that extracted load-pattern features improved the accuracy of support vector machine models.
Feature evaluation and screening. Selecting influential features from numerous candidates reduces model complexity and improves learning efficiency, while eliminating irrelevant or redundant variables enhances prediction accuracy. Correlation analysis is a widely adopted feature selection method. Kapetanakis et al. [20] used Pearson correlation to identify strong linear relationships between cooling load and variables like outdoor temperature and room temperature, while solar radiation and wind speed showed negligible correlations. Ling et al. [21] applied Spearman correlation (threshold > 0.3) to select outdoor temperature, solar radiation, and historical loads as key inputs for hourly thermal load prediction. However, different correlation metrics yield varying results, and threshold settings further influence selection. Alternative approaches leverage model prediction errors for feature selection. For instance, Ding et al. [22] employed ANN and SVM to screen 15 variables, including temperature, humidity, and occupancy, based on accuracy changes. Gao et al. [23] used random forest to rank the importance of 11 features, ultimately selecting prior-hour temperature, humidity, solar radiation, load, and schedules as inputs. Nevertheless, model-dependent methods may produce inconsistent results across algorithms.
Despite the progress outlined above, current research on feature selection for load forecasting faces two primary limitations, which this study aims to address. (1) Limitations of Previous Studies: First, evaluation methods often lack comprehensiveness. Widely used metrics such as the Pearson correlation coefficient (PCC) capture only linear relationships, while model-specific error-driven approaches (e.g., importance ranking with ANN or random forest) lack generalizability across algorithms. More critically, many studies inadequately consider feature redundancy, which increases model complexity, may cause multicollinearity (especially in linear models), and ultimately degrades generalization performance. Second, owing to case-specific constraints, most studies optimize features under limited data conditions, compromising the general applicability of their findings for broader engineering practice. (2) Justification and Significance of the Proposed Methodology: To bridge these gaps, this study proposes a systematic and generalized feature evaluation framework. The proposed hybrid Maximum Relevance Minimum Redundancy (MRMR) and Maximal Information Coefficient (MIC) method is specifically designed to (a) overcome the linearity assumption by employing MIC, a statistic capable of detecting both linear and complex non-linear associations, and (b) explicitly control feature redundancy by adhering to the MRMR principle, thereby ensuring the selection of a concise and non-redundant feature subset. Because the approach is not tied to a specific prediction model, its robustness and generalizability are enhanced. (3) Contribution and Structure: The primary significance of this work lies in providing a methodological reference for determining essential yet non-redundant feature variables that can enhance model performance and inform cost-effective data acquisition strategies in practice. The remainder of this paper is structured as follows: Section 2 details the proposed methodology, Section 3 presents the case study results, and Section 4 provides a discussion, followed by conclusions in Section 5.

2. Methodology

The research framework of this study is shown in Figure 1, which consists of three main steps. The first step constructs a candidate feature variable set for building cooling load prediction models from two dimensions: attribute features and temporal features. The second step proposes a feature subset evaluation and optimization method combining Maximum Relevance Minimum Redundancy (MRMR) and the Maximal Information Coefficient (MIC). First, the forward search method is used to determine various feature subsets, and then these subsets are evaluated and optimized based on both relevance and redundancy. The third step employs three different types of prediction model algorithms to validate the effectiveness of the proposed feature optimization method, and demonstrates the advancement of the proposed method by comparing it with different feature variable optimization methods.

2.1. Construction of Candidate Feature Variable Set

As shown in Figure 2, all potential feature variables affecting load variations should be considered from two aspects: attribute features and temporal features. Attribute Features: This dimension examines which categories of feature variables influence load changes. Based on load generation mechanisms, external disturbances (outdoor meteorological factors) and internal disturbances (occupants, equipment, and lighting) are fundamental sources of load. As listed in Table 1, ten outdoor meteorological features (X1–X10) are selected as potential factors describing load variations. Internal disturbance factors are more complex. For occupants, factors such as the number of people, activity levels, and occupant types all influence heat and moisture transfer, thereby affecting internal load variations. However, accurately sensing and predicting these factors is often impractical. Thus, this study primarily considers occupant count as the key feature variable for describing load changes. Similarly, lighting power and equipment usage power are selected as representative variables for lighting and equipment-related load variations.
Temporal Features: Building loads form a time series with periodic characteristics. Incorporating historical loads into the candidate feature set can enhance the description of load variations. The temporal dimension should consider the values of different attributes at specific timestamps as features for predicting the target load. This is because buildings exhibit significant thermal inertia, requiring careful determination of time-lagged effects on load variations. As shown in Table 1, all external and internal disturbance factors include values at the prediction time and up to 24 preceding time steps as potential feature variables. Given the daily and weekly periodic patterns of buildings, historical load data from the past week (i.e., 24 × 7 data points) are also included in the candidate feature set. The one-week horizon is specifically chosen to capture the distinct load patterns between weekdays and weekends, which is common and critical for operational scheduling in building energy management systems [24].
The construction of this high-dimensional candidate set is a deliberate methodological choice, notwithstanding its apparent size. This approach is justified by two key principles:
Avoidance of Premature Exclusion: The primary goal of this study is to develop a data-driven feature selection framework. Relying solely on domain knowledge to pre-select a small number of features (e.g., assuming only dry-bulb temperature and solar radiation are important) risks introducing selection bias and potentially omitting non-intuitive yet influential variables or their specific time-lagged effects. By starting with an "over-complete" set that generously encompasses all potential influencers and their temporal dynamics, we allow the data-driven algorithm itself to identify the truly critical variables and their relevant time horizons objectively.
Capturing Heterogeneous Time-Lagged Effects: Different physical phenomena exhibit different response times. Conductive heat gain through walls has a longer time constant than convective gain from ventilation or radiative gain from solar radiation through windows. Including a wide range of time lags allows the model to capture these heterogeneous dynamics. The proposed MRMR-MIC method is specifically designed to handle such scenarios by efficiently identifying and eliminating redundant variables, including highly correlated time-lagged ones.
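As an illustration of how such an over-complete candidate set can be assembled from hourly building data, the following sketch builds the lagged columns with pandas. The column names (e.g., `dry_bulb`, `cooling_load`) and the exact lag convention are assumptions made for illustration; the convention chosen here (24 hourly lags per disturbance attribute plus one week of load history) reproduces the 480 candidate variables referred to in Section 2.2, but the authors' own preprocessing may differ in detail.

```python
import pandas as pd

def build_candidate_set(df: pd.DataFrame) -> pd.DataFrame:
    """Assemble an over-complete candidate feature set from hourly data.

    Each disturbance attribute (X1-X13) contributes 24 hourly lags and the
    historical cooling load (X14) contributes 24 x 7 lags, giving 480
    candidate variables in total. Column names are illustrative placeholders.
    """
    disturbances = ["dry_bulb", "dew_point", "wet_bulb", "humidity_ratio",
                    "rel_humidity", "wind_speed", "sky_temp", "ir_horizontal",
                    "solar_diffuse", "solar_direct", "occupants",
                    "lighting_power", "equipment_power"]
    features = {}
    for col in disturbances:
        for lag in range(1, 25):                 # 24 preceding hourly values
            features[f"{col}_lag{lag}"] = df[col].shift(lag)
    for lag in range(1, 24 * 7 + 1):             # one week of load history
        features[f"cooling_load_lag{lag}"] = df["cooling_load"].shift(lag)
    candidates = pd.DataFrame(features, index=df.index)
    return candidates.dropna()                   # keep rows with a full lag history
```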

2.2. Optimal Feature Subset Selection Method

The candidate feature variable set determined based on domain knowledge in the previous section is intentionally comprehensive and inevitably contains a large number of features that are irrelevant or redundant for describing load variations. Using all these feature variables as model inputs would significantly increase learning difficulty and can lead to the “curse of dimensionality”—not strictly in the sense of having statistically independent variables, but in the practical machine learning sense, where an excessively large number of predictors (including highly correlated ones like sequential time lags) increases the risk of overfitting, makes the learning process computationally expensive, and can degrade the generalization performance of the model. Moreover, from the perspective of data collection, processing irrelevant data would unnecessarily complicate data acquisition, transmission, and storage. Therefore, it is necessary to perform subset selection on the candidate feature variables to identify the truly essential features for building cooling load prediction models.
From the implementation perspective, feature selection generally involves two key steps: subset search and subset evaluation. The most comprehensive subset search method is exhaustive search, which evaluates all non-empty subsets of the candidate feature set to identify the optimal feature subset. However, for the candidate feature set containing 480 variables mentioned above, this would generate 2^480-1 non-empty subsets. Such a brute force approach is clearly computationally infeasible. For subset evaluation, it is crucial to employ appropriate evaluation metrics that match the characteristics of the problem. Otherwise, irrelevant or redundant features may persist in the selected subset, failing to provide necessary and sufficient feature information for load prediction.

2.2.1. Forward Search-Based Subset Selection Method

A forward search strategy is employed in this study. As a suboptimal yet efficient greedy search method, forward search is widely used for high-dimensional feature selection problems [25]. It starts with an empty set and iteratively adds the most contributive feature, providing a practical solution to navigate the vast search space.
Given a feature variable set $\{X_1, X_2, \ldots, X_d\}$, the process begins by treating each individual feature as a candidate subset. These $d$ single-feature subsets are evaluated, and the optimal single-feature subset is selected (e.g., $\{X_2\}$). Subsequently, one previously unselected feature is added to the current optimal subset to form two-feature candidate subsets. Through evaluation, if a subset such as $\{X_2, X_6\}$ proves superior to the other two-feature combinations and shows improvement over the previous single-feature subset $\{X_2\}$, it is selected as the new optimal subset. This iterative process continues by adding one new feature at each step. The algorithm terminates when the evaluation result of the optimal subset at the $(k+1)$th iteration fails to improve upon that of the $k$th iteration. The feature subset selected at the $k$th iteration is then taken as the final optimal feature subset.
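The sketch below illustrates this greedy search logic in Python, with the subset-evaluation criterion left as a pluggable function (the MRMR-MIC criterion of Section 2.2.2 can be passed in as `evaluate`). It is a minimal illustration of forward selection under those assumptions, not the authors' exact implementation.

```python
from typing import Callable, List

def forward_search(features: List[str],
                   evaluate: Callable[[List[str]], float]) -> List[str]:
    """Greedy forward selection.

    Starts from an empty subset and, at each iteration, adds the candidate
    feature that maximizes the evaluation score. Terminates when no remaining
    feature improves on the current optimal subset.
    """
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_feature = max(scored)
        if score <= best_score:          # no improvement: stop searching
            break
        best_score = score
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```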

2.2.2. MRMR-MIC-Based Subset Evaluation Method

Selecting influential features from numerous candidates reduces model complexity and improves learning efficiency, while eliminating irrelevant or redundant variables enhances prediction accuracy. For building cooling load prediction, which involves high-dimensional data with complex, non-linear relationships, conventional linear correlation measures like the Pearson correlation coefficient (PCC) are insufficient. Similarly, model-specific feature importance rankings can lack generalizability across different algorithms.
To overcome these limitations, this study employs a hybrid approach combining the Maximum Relevance Minimum Redundancy (MRMR) principle with the Maximal Information Coefficient (MIC). This combination was specifically chosen for its demonstrated efficacy in high-dimensional feature selection problems across various domains, including bioinformatics and energy engineering. The MRMR framework is particularly suited for this task, as it provides a principled way to select features that are maximally informative about the target variable (cooling load) while being minimally redundant with each other. This is crucial for avoiding overfitting, improving model generalization, and reducing computational cost.
However, traditional MRMR implementations typically rely on mutual information (MI) to quantify relevance and redundancy. While powerful, MI estimation for continuous variables can be challenging and sensitive to discretization methods. This is a significant drawback for building energy data, which is predominantly continuous. Therefore, we integrate the Maximal Information Coefficient (MIC) into the MRMR framework. MIC is a powerful statistic that measures the strength of relationships between variables without being limited to specific functional forms (linear, exponential, etc.), making it ideal for capturing the complex, non-linear drivers of building cooling loads. Its superiority in capturing generic associations and its equity (ability to compare relationships across different types) have been demonstrated in various domains. By leveraging MIC, our proposed MRMR-MIC method is better equipped to handle the continuous and non-linear nature of building energy data, providing a more robust and accurate feature selection criterion for this application.
(1) Correlation Evaluation Criterion
The non-linear relationship between feature variables and target variables can be assessed through information-theoretic measures. A typical approach is mutual information (MI), which essentially represents the difference in information entropy—specifically, the reduction in uncertainty (measured by information entropy) of random variable Y when feature variable X is known. Mutual information equals zero if and only if the two random variables are independent. Higher mutual information values indicate stronger correlation between variables, making it particularly suitable for characterizing the intrinsic relationships of random phenomena. However, the mutual information metric is primarily applicable to discrete feature variables, while its computation for continuous variables presents significant challenges. Although continuous variables can be discretized for MI calculation, the results are highly sensitive to the discretization method employed. Different discretization approaches may lead to substantially divergent analytical outcomes, potentially compromising the reliability of the evaluation.
Reshef et al. [26] proposed the Maximal Information Coefficient (MIC), which effectively overcomes two major limitations of conventional mutual information methods in correlation assessment. The MIC approach can directly process continuous features without discretization, making it particularly suitable for feature selection in regression modeling problems. Therefore, this study employs MIC as the criterion for measuring the correlation between feature variables and the target load variable. The definition of MIC is as follows: let $D$ denote a two-dimensional dataset of sample pairs of variables $X$ and $Y$. The two-dimensional space is partitioned into $x$ bins along the $X$-axis and $y$ bins along the $Y$-axis, forming an $x \times y$ grid $G$. For a given binning resolution $(x, y)$, multiple possible grid configurations exist; let $\Omega$ denote the set of all possible grid partitioning schemes. Then
$$I^{*}(D, x, y) = \max_{G \in \Omega} I(D|_{G})$$
where $x$ and $y$ represent the number of partitioning intervals along the $X$-axis and $Y$-axis, respectively, $D|_{G}$ denotes the distribution of dataset $D$ on grid $G$, and $I(D|_{G})$ represents the mutual information of $D|_{G}$. All maximum normalized mutual information values calculated from dataset $D$ under different partitioning resolutions are organized into a characteristic matrix $M(D)$. Each element of this characteristic matrix is defined as
$$M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min(x, y)}$$
where $\min(x, y)$ denotes the smaller of $x$ and $y$. The Maximal Information Coefficient (MIC) is then formally defined as
$$\mathrm{MIC}(D) = \max_{x y < B(n)} M(D)_{x,y}$$
In the equation, $n$ represents the total number of samples in the dataset, and $B(n)$ denotes the upper limit on the number of grid partitions, typically taken in the range from $\omega(1)$ to $O(n^{1-\varepsilon})$ with $0 < \varepsilon < 1$. This established parameterization allows MIC to effectively capture complex variable relationships in building cooling load prediction while remaining computationally feasible for practical engineering applications. The $B(n) = n^{0.6}$ rule has become a standard convention in MIC implementations, supported by extensive empirical evidence across multiple research domains.
The MIC measures the correlation between different feature variables by calculating the maximum normalized mutual information under various grid partitions. Compared with conventional mutual information calculation methods, this approach offers better universality and fairness, making it suitable for processing both continuous and discrete feature variable data simultaneously. For building thermal load prediction problems that involve numerous continuous variables in their feature sets, the maximal information coefficient can relatively accurately measure the correlations between feature variables.
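In practice, MIC can be computed with an off-the-shelf estimator. The sketch below uses the third-party minepy package (an assumption for illustration, not a dependency stated by the authors), with `alpha=0.6` corresponding to the conventional grid-size limit $B(n) = n^{0.6}$.

```python
import numpy as np
from minepy import MINE  # third-party MIC estimator (assumed available)

def mic(x: np.ndarray, y: np.ndarray) -> float:
    """Maximal Information Coefficient between two 1-D samples."""
    mine = MINE(alpha=0.6, c=15)   # alpha=0.6 corresponds to B(n) = n^0.6
    mine.compute_score(x, y)
    return mine.mic()

# Example: a noisy quadratic association that Pearson correlation misses.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 2000)
y = x ** 2 + rng.normal(0.0, 0.1, 2000)
print(round(mic(x, y), 3))         # high MIC despite near-zero Pearson correlation
```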
(2) Redundancy Evaluation Criterion
A commonly used feature subset selection strategy that accounts for feature redundancy is the Max-Relevance Min-Redundancy (MRMR) method [27]. As discussed above, MRMR aims to identify a subset of variables that exhibits high correlation with the target variable while maintaining low inter-correlation among the variables within the subset. In traditional MRMR implementations, mutual information is typically employed to measure the correlation between variables. When accounting for redundancy among feature variables, the importance of a feature variable relative to the target variable can be expressed as follows:
$$f_{\mathrm{MRMR}}(X_i) = I(Y, X_i) - \frac{1}{|S|} \sum_{X_s \in S} I(X_s, X_i)$$
In the formula, $I(Y, X_i)$ represents the mutual information between feature variable $X_i$ and target variable $Y$, and $\frac{1}{|S|} \sum_{X_s \in S} I(X_s, X_i)$ is the average mutual information between feature variable $X_i$ and all variables $X_s$ in the current feature subset $S$. This dual-criterion approach systematically balances feature relevance and redundancy during the selection process.
In summary, the optimal feature variable selection method proposed in this study adopts MRMR as the basic framework, where the Maximal Information Coefficient (MIC) replaces mutual information to measure the correlations between feature variables and the target variable, as well as among the feature variables themselves. The forward search method is employed to analyze and process all candidate feature subsets, ultimately selecting the optimal feature subset that exhibits maximum relevance and minimum redundancy. For ease of reference and comparative analysis, the proposed method is designated as MRMR-MIC.
The optimization criterion for selecting the feature subset is to maximize the overall relevance of the features to the target cooling load while minimizing redundancy among the features themselves. This is quantitatively evaluated for any candidate subset by calculating the difference between the average MIC value of all features in the subset with the target (relevance) and the average MIC value between all pairs of features within the subset (redundancy). During the forward search process, the algorithm iteratively selects the feature that provides the greatest increase to this criterion—i.e., the feature that most significantly increases the subset’s collective relevance without excessively adding to its internal redundancy. The search terminates when adding any remaining feature no longer leads to a substantial improvement in this balance, thus identifying the most parsimonious yet informative feature set.
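Expressed as code, this subset criterion might look like the sketch below, which reuses the `mic` helper above and can be supplied to the `forward_search` skeleton of Section 2.2.1 as its `evaluate` argument. It is a sketch under those assumptions rather than the authors' released implementation.

```python
import itertools
import numpy as np
import pandas as pd

def mrmr_mic_score(subset: list, data: pd.DataFrame, target: np.ndarray) -> float:
    """Average MIC relevance to the target minus average pairwise MIC
    redundancy within the candidate subset."""
    relevance = np.mean([mic(data[f].to_numpy(), target) for f in subset])
    if len(subset) < 2:
        return relevance                    # a single feature has no redundancy term
    redundancy = np.mean([mic(data[a].to_numpy(), data[b].to_numpy())
                          for a, b in itertools.combinations(subset, 2)])
    return relevance - redundancy

# Hypothetical usage with the forward search skeleton:
# selected = forward_search(list(candidates.columns),
#                           lambda s: mrmr_mic_score(s, candidates, load))
```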

2.3. Validation of Method Effectiveness

2.3.1. Performance Analysis of Feature Subset Optimization Method

The objective of this validation step is to evaluate the intrinsic quality of the feature subsets themselves, not to deploy a forecast model. Therefore, we focus on the empirical error (training error) achieved by the models when using different feature subsets. In the framework of computational learning theory, the empirical error reflects the expressive power and sufficiency of the feature set. A feature set that enables a model to achieve a lower empirical error on the entire available dataset is a necessary foundation for achieving a low generalization error in subsequent applications. The assessment of temporal generalization on a hold-out test set, while crucial for a complete forecasting pipeline, is considered a separate step that depends on additional factors and is beyond the primary feature-selection focus of this study.
To quantitatively evaluate the effectiveness of the proposed feature subset optimization method, this section employs three widely used prediction models: a multiple linear regression (MLR) model, an artificial neural network (ANN) model, and a random forest regression (RFR) model. These models were selected for validation based on their distinct characteristics: MLR represents linear models that are transparent but prone to multicollinearity, ANN is a powerful non-linear universal function approximator, and RFR is a robust ensemble method. Using this diverse set of models ensures that the evaluation of the selected feature subsets is comprehensive and not biased towards a specific type of algorithm [28]. These models are used to assess the quality of the selected feature subsets. The implementation details of the three algorithms are as follows:
(1) MLR: The MLR model used in this study incorporates historical load variables, making its functionality similar to that of an ARX model. On one hand, it achieves autoregressive capability through historical load variables; on the other hand, by incorporating both external and internal disturbance variables, it performs conventional multiple linear regression. The regression coefficients are obtained by least-squares fitting, and the model is implemented with the sklearn package in Python 3.13.7.
(2) ANN: The ANN model employed in this study uses a three-layer architecture consisting of an input layer, a hidden layer, and an output layer. The number of input neurons corresponds to the number of feature variables selected by the feature selection method, and the output layer uses a single neuron to generate hourly load predictions. Key hyperparameters are determined systematically: the number of hidden neurons is optimized through grid search over 10–100 in increments of 10, the learning rate over 0.001–0.01 in increments of 0.001, and the optimizer is selected from Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent (MBGD), Adagrad, and Adam. The model is trained for a maximum of 1000 epochs and is implemented using Python's Keras package.
(3) RFR: The random forest algorithm belongs to the bagging family of ensemble learning methods. It combines multiple weak models (typically decision trees, although other learners such as SVMs can also serve as base models) into a stronger composite model that significantly outperforms the individual weak models. Training consists of two key phases: (1) randomly sampling multiple subsets of training data with replacement from the original dataset, and (2) constructing a decision tree model for each sampled subset. During prediction, the outputs of all constituent trees are averaged. In this study, the random forest model employs 100 CART decision trees as base models with a maximum depth of 10 per tree and is implemented using Python's sklearn package.
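For reference, the three validation models can be instantiated roughly as follows with scikit-learn and Keras. The hyperparameter values mirror those stated above; the grid-search loop for the ANN is omitted for brevity, and the hidden-layer size shown is an assumed placeholder within the stated search range rather than the authors' final setting.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from tensorflow import keras

def build_models(n_features: int):
    """Instantiate the three validation models (MLR, ANN, RFR)."""
    mlr = LinearRegression()                              # least-squares MLR

    ann = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),          # one input per selected feature
        keras.layers.Dense(50, activation="relu"),        # hidden size: placeholder from the 10-100 grid
        keras.layers.Dense(1),                            # single neuron for hourly load output
    ])
    ann.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")

    rfr = RandomForestRegressor(n_estimators=100, max_depth=10)  # 100 CART trees, depth 10
    return mlr, ann, rfr
```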
The coefficient of variation of root mean square error (CV-RMSE) is adopted as the evaluation metric for assessing feature subset quality, with its calculation formula expressed as follows:
$$\mathrm{CV\text{-}RMSE} = \frac{\sqrt{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 / n}}{\bar{y}} \times 100\%$$
where $y_i$ and $\hat{y}_i$ are the actual and predicted hourly cooling loads, $\bar{y}$ is the mean of the actual loads, and $n$ is the number of samples.
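A direct implementation of this metric is shown below as a minimal sketch; `y_true` and `y_pred` are illustrative argument names for the measured and predicted load series.

```python
import numpy as np

def cv_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of variation of the RMSE, in percent."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true) * 100.0
```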

2.3.2. Comparative Analysis of Different Feature Selection Methods

To thoroughly validate the effectiveness of the proposed feature selection method (MRMR-MIC), this study conducts comparative analyses with four widely used feature selection approaches: the Pearson correlation coefficient (PCC) [29], the distance correlation coefficient (DCC) [30], Max-Relevance Min-Redundancy (MRMR), and the fast correlation-based filter (FCBF). PCC and DCC represent feature selection methods that consider only relevance, whereas MRMR and FCBF consider both relevance and redundancy. They differ from the proposed MRMR-MIC approach in that traditional MRMR uses mutual information as its metric and FCBF employs normalized mutual information (symmetric uncertainty, SU) as its criterion; the specific calculation methods are detailed in reference [31].
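For context, the two relevance-only baselines amount to simple threshold filters, as sketched below. The PCC variant uses scipy, the DCC variant assumes the third-party dcor package is available, and the threshold of 0.3 is a placeholder; the benchmark implementations used in the study may differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import dcor  # third-party distance-correlation package (assumed available)

def pcc_filter(data: pd.DataFrame, target: np.ndarray, threshold: float = 0.3):
    """Keep features whose absolute Pearson correlation with the target exceeds the threshold."""
    return [f for f in data.columns
            if abs(pearsonr(data[f].to_numpy(), target)[0]) > threshold]

def dcc_filter(data: pd.DataFrame, target: np.ndarray, threshold: float = 0.3):
    """Keep features whose distance correlation with the target exceeds the threshold."""
    return [f for f in data.columns
            if dcor.distance_correlation(data[f].to_numpy(), target) > threshold]
```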

2.4. Case Study Description

To validate the effectiveness of the proposed feature selection methodology, a case study based on a simulated office building model is conducted. The detailed configuration of the building energy simulation model is summarized in Table 2. A typical office building in Tianjin, China, is selected as the prototype, with its key architectural and operational parameters (e.g., envelope thermal properties, internal load schedules, HVAC setpoints) set according to the ASHRAE Standard [32] and common design practices. This version was selected as the baseline for building energy simulation because of its widespread adoption and well-established reference status in numerous scholarly studies. It is acknowledged that ASHRAE standards are periodically updated (e.g., the 2013 and subsequent versions introduced more stringent requirements and refined adaptive thermal comfort models). However, the core objective of this study is to propose and validate a generic feature selection methodology. The simulation environment based on the ASHRAE Standard [32] provides a consistent, stable, and recognized benchmark for this purpose, ensuring that the performance comparison of feature selection methods is isolated from variations introduced by different standard editions.
The annual hourly cooling load profile of the building is simulated using EnergyPlus version 9.4. The weather data input is the Typical Meteorological Year (TMY) data for Tianjin. To better mimic real-world measurement uncertainties, Gaussian noise with a signal-to-noise ratio (SNR) of 5 dB is introduced to the simulated cooling load data. The analysis focuses on the cooling season from May 20 to September 20, yielding a total of 2928 hourly data points for model training and testing. The resulting hourly cooling load time series is illustrated in Figure 3. This simulated dataset provides a robust and controllable environment to benchmark the performance of the proposed MRMR-MIC feature selection method against other alternatives before applying it to more complex real-world data.
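The noise injection step can be reproduced as in the sketch below, where the 5 dB signal-to-noise ratio sets the noise variance relative to the simulated signal power. This follows one common SNR convention and is an illustrative assumption rather than the authors' exact procedure.

```python
import numpy as np

def add_gaussian_noise(signal: np.ndarray, snr_db: float = 5.0, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise at a prescribed signal-to-noise ratio (dB)."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```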

3. Results

3.1. Feature Subset Selection Results

Five distinct feature selection methods were applied to identify optimal feature subsets for the simulated case study, with results presented in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8. The PCC and DCC methods, which exclusively consider feature–target correlations while ignoring inter-feature redundancy, yielded the most extensive subsets. Both methods selected features across 12 attribute dimensions, ultimately identifying 173 and 175 features, respectively, when accounting for various time-lagged variables.
The MRMR approach, incorporating redundancy considerations, demonstrated improved selectivity by identifying a reduced subset of 31 features spanning 5 attribute dimensions. FCBF exhibited the most aggressive dimensionality reduction, selecting merely 5 features from just two attribute dimensions (equipment power and historical loads). In contrast, the proposed MRMR-MIC method established an optimal feature set comprising 40 features across 8 attribute dimensions—exhibiting broader attribute coverage than traditional MRMR while maintaining tighter temporal feature selection.
Notably, the PCC and DCC results exhibited significant redundancy, exemplified by the concurrent selection of strongly correlated variables like X1 (dry-bulb temperature) and X7 (sky temperature) due to their individual load correlations. Conversely, MRMR and FCBF’s reliance on discrete variable-appropriate metrics (mutual information and symmetric uncertainty, respectively) led to important feature omissions—particularly evident in FCBF’s exclusion of all external meteorological disturbance factors despite their established physical relevance to cooling load dynamics.

3.2. Evaluation Results of Feature Subset Selection Methods

Five distinct feature subsets obtained through different selection methods were employed as input variables for the three prediction models. Figure 9 illustrates the prediction results achieved by these models when utilizing the various feature subsets, while Figure 10 presents the corresponding error metrics for comparative analysis.
The evaluation results demonstrate that the feature subsets selected by the PCC and DCC methods exhibit substantial consistency and achieve comparable performance across all three prediction models. These comprehensive subsets, containing the most elements (173 and 175 features, respectively), yielded models with relatively low fitting errors (CV-RMSE between approximately 3.5% and 4.5% across the three models, as shown in Figure 10). Although MRMR and FCBF methods effectively reduce feature dimensionality (to 31 and 5 features, respectively), their reliance on suboptimal metrics leads to prediction models with higher fitting errors. The degree of error increases with more omitted important variables: the CV-RMSE for MRMR rose to a range of 4.5% to 6.5%, while FCBF, which omitted all external meteorological factors, resulted in the highest errors, ranging from 6.5% to over 9% (Figure 10). In contrast, the proposed MRMR-MIC method not only significantly reduces the feature subset size (to only 40 features), but also maintains complete feature information integrity. It delivered comparable and even superior fitting performance across all models (CV-RMSE was maintained below 4.0% for all models), despite selecting substantially fewer features than PCC/DCC. Particularly for MLR models, which are susceptible to multicollinearity effects, those using PCC/DCC subsets exhibit higher fitting errors (~4.5% for MLR) than MRMR-MIC implementations (~3.8% for MLR), clearly demonstrating MRMR-MIC’s superior capability in eliminating feature redundancy.
In summary, the proposed MRMR-MIC feature selection method, which employs the maximal information coefficient (MIC) as its evaluation metric, effectively preserves critical feature variables while significantly reducing feature set redundancy. It therefore serves as a robust approach for identifying optimal, comprehensive feature sets for building cooling load prediction models and provides reliable guidance for establishing a necessary and sufficient set of feature variables.

4. Discussion

4.1. Interpretation of Key Findings and Methodological Advantages

The core finding of this study is the successful development and validation of a feature selection framework that effectively balances informativeness with parsimony. The MRMR-MIC method’s superiority, as evidenced in Figure 9 and Figure 10, stems from its dual capability to handle complex non-linear relationships while rigorously controlling for redundancy. This addresses a critical gap in conventional methods: the Pearson (PCC) and distance (DCC) correlation coefficients, by considering only relevance, led to a bloated feature set (over 170 variables) riddled with multicollinearity (e.g., selecting both dry-bulb temperature and highly correlated sky temperature). Conversely, while MRMR and FCBF reduced dimensionality, their reliance on metrics less suited for continuous data (Mutual Information and Symmetric Uncertainty) caused them to overlook critical non-linear relationships, leading to the omission of physically meaningful meteorological variables and consequently higher prediction errors. The MRMR-MIC framework, by leveraging the generality of MIC, navigates these pitfalls, achieving high accuracy with a concise and physically interpretable feature set of only 40 variables. This demonstrates that the choice of association metric is as crucial as the selection paradigm itself.

4.2. Novelty and Accuracy of the Proposed MRMR-MIC Framework

The primary novelty of this study is the introduction of the MRMR-MIC framework, which is specifically designed to address the dual challenges of non-linear relationship capture and feature redundancy control in a unified manner. The accuracy of this proposed methodology is conclusively demonstrated by the results in Section 3.2 (Figure 9 and Figure 10). The MRMR-MIC method not only achieved predictive accuracy (CV-RMSE < 5%) on par with the full feature set but did so with a dramatically reduced subset of only 40 features. This reduction in dimensionality, from over 170 to 40, underscores the method’s efficiency in identifying a parsimonious yet highly informative feature set. Furthermore, its superior performance compared to other feature selection methods (PCC, DCC, MRMR, FCBF) across three distinct model types (MLR, ANN, RFR) provides robust evidence of its generalizability and accuracy.

4.3. Analysis of Influential Parameters on Cooling Load

Beyond selecting an optimal feature subset, analyzing the relative importance of the chosen parameters provides deeper insights into the drivers of building cooling load. The proposed MRMR-MIC framework not only selects features but also provides a natural ranking through the forward search process and the MIC values themselves. Based on the analysis of the final selected subset (Figure 8) and the underlying MIC metrics, the parameters can be categorized by their impact:
Maximum-Impact Parameters: The features exerting the greatest influence on the cooling load were consistently equipment power (X13) and outdoor dry-bulb temperature (X1). Their high MIC values with the target load indicate a very strong non-linear relationship. Equipment power is a direct internal heat gain, while outdoor temperature is the primary driver of external heat gain through conduction and infiltration. The time-lagged historical cooling load (X14) also emerged as a dominant parameter, underscoring the critical role of thermal inertia in building dynamics: a high load at time $t-1$ strongly predicts a high load at time $t$.
High-Impact Parameters: Parameters such as direct solar radiation (X10), occupant count (X11), and sky temperature (X7) showed high impact. Solar radiation directly adds sensible heat gain through windows and walls, while occupant count contributes to both sensible and latent loads. Sky temperature, as discussed, governs radiative heat loss from the roof, which can significantly reduce the cooling load during nighttime hours, thus playing a key role in its minimization.
Moderate-Impact Parameters: Variables like lighting power (X12), wind speed (X6), and humidity-related parameters (X2, X4, X5) showed moderate but significant influence. Humidity parameters, in particular, are crucial for determining the latent load component of the total cooling load.
This hierarchy of feature importance, quantified by our method, aligns perfectly with fundamental building thermodynamics principles. It provides data-driven validation that internal heat gains (equipment, occupants) and external climatic conditions (temperature, solar radiation) are the primary levers influencing cooling load, while also highlighting the often-overlooked significance of historical thermal memory and radiative effects.

4.4. Practical Implications for Building Energy Management

The practical value of this work extends beyond a boost in predictive modeling performance. By identifying a minimal set of salient features, this research provides actionable guidance for the design of cost-effective building monitoring systems. For building operators and engineers, the results indicate that investing in sensors for a targeted set of variables (e.g., key meteorological parameters, equipment power, and detailed historical load data) is sufficient for accurate load forecasting, potentially yielding significant savings on sensor installation, data transmission, and storage costs. Furthermore, the resulting models are not only more computationally efficient but also more interpretable. A simpler model with fewer inputs allows engineers to better understand the dominant factors driving the cooling load in a specific building, thereby facilitating more informed decision-making for control and optimization. The methodology itself is generalizable and can be applied as a foundational data preprocessing step to determine the necessary and sufficient sensing requirements for any specific building project before implementation.

4.5. Limitations and Study Scope

It is important to contextualize these findings within the study’s scope. The validation was conducted on a simulated model of a standard office building. While simulation provides a controlled environment for the precise methodological benchmarking presented here, it inherently assumes the availability of a complete and clean dataset for all 480 candidate features. In real-world applications, data availability and quality (e.g., missing values, sensor drift) for certain variables could pose a challenge. Furthermore, the specific optimal feature subset of 40 variables is intrinsically linked to the building type, HVAC system, and simulated climate. We emphasize that the primary contribution is the MRMR-MIC methodology—a tool to identify the optimal feature set—rather than a universal one-size-fits-all list of variables. The optimal set for a hospital, a retail mall, or a building in a humid climate will differ, and the proposed method can be applied to discover it.

4.6. Future Research Directions

Building upon these limitations and findings, several promising avenues for future work emerge. First and foremost, the proposed methodology must be applied and validated across a diverse portfolio of real buildings (e.g., residential, commercial, educational) in different climate zones to thoroughly assess its generalizability and to build a knowledge base on how optimal feature sets vary. Second, research should investigate the robustness of the framework to real-world data imperfections, developing integrated pipelines that handle missing data and noise before feature selection. Third, there is a significant opportunity to integrate this offline feature selection framework with online learning algorithms, creating adaptive predictive models that can not only learn from data but also periodically re-evaluate and adapt their input feature set in response to changing building operation patterns or seasons. Finally, it is important to note that this study focused on evaluating the feature selection methodology through the lens of empirical error. A critical direction for future work is to apply the selected optimal feature subsets to build forecasting models and rigorously evaluate their generalization error on temporal hold-out test sets under different operational conditions.

5. Conclusions

This study addressed the critical challenge of feature selection for data-driven building cooling load prediction by proposing and validating a novel hybrid MRMR-MIC framework. The core findings and contributions of this work are summarized as follows:
(1) Development of a Superior Feature Selection Methodology: The integration of the Maximal Information Coefficient (MIC) within the Maximum Relevance Minimum Redundancy (MRMR) principle proved to be a robust solution. This hybrid approach successfully overcomes the key limitations of conventional methods: it captures both linear and complex non-linear relationships (a weakness of PCC/DCC) while explicitly and effectively controlling for feature redundancy (a weakness of MRMR and FCBF, which use less suitable metrics for continuous data).
(2) Quantifiable Performance Excellence: The case study results provide compelling evidence for the effectiveness of the proposed method. The MRMR-MIC framework achieved a 76% reduction in feature dimensionality, successfully identifying a concise yet highly predictive subset of only 40 variables from an initial set of over 170. Most importantly, this dramatic simplification was accomplished without sacrificing predictive accuracy, maintaining a low prediction error (CV-RMSE) below 5% across three different machine learning models (MLR, ANN, RFR).
(3) Comprehensive and Physically Meaningful Feature Selection: Unlike other redundancy-control methods (especially FCBF), which omitted critical variables, the MRMR-MIC selected subset retained a physically interpretable combination of features spanning 8 attribute dimensions, including key meteorological parameters and internal loads. This underscores the method's ability to preserve essential information while eliminating noise and redundancy.
(4) Provision of a Generalizable Framework: The primary output of this research is not a universal feature list, but a powerful and generalizable methodology. The MRMR-MIC framework provides a systematic, data-driven approach to determining the necessary and sufficient sensing and data collection requirements for any specific building, thereby offering practical guidance for the development of cost-effective and efficient building energy management systems.
It is important to acknowledge one limitation of this study: the validation was conducted on a single simulated office building case. While this provides a controlled and rigorous proof of concept for the proposed MRMR-MIC method, the generalizability of the specific optimized feature subset (e.g., the exact 40 features) to buildings with drastically different architectures, HVAC systems, or climates cannot be directly asserted. However, the core contribution of this work is the methodological framework itself, rather than a one-size-fits-all feature set. The proposed MRMR-MIC approach is designed to be generic and can be applied to any building dataset to identify its own unique optimal feature subset. Future work will focus on applying this methodology to a diverse portfolio of real buildings (e.g., commercial, residential, educational) across different climate zones to further validate its robustness and explore the variability of optimal feature sets under different conditions.

Author Contributions

Conceptualization, S.M.; methodology, D.B.; software, Z.Z.; validation, D.B.; formal analysis, K.W.; investigation, L.W.; resources, S.M.; data curation, D.B.; writing—original draft preparation, D.B.; writing—review and editing, S.M.; visualization, D.B. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the financial support of National Youth Science Foundation Project (52208120).

Data Availability Statement

Due to confidentiality issues related to the project, the author does not have permission to disclose the data.

Conflicts of Interest

Author Liwen Wu was employed by the company China Mobile Energy Technology (Beijing) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zheng, R.; Lei, L. A hybrid model for real-time cooling load prediction and terminal control optimization in multi-zone buildings. J. Build. Eng. 2025, 104, 112120. [Google Scholar] [CrossRef]
  2. Abdel-Jaber, F.; Dirks, K.N. A review of cooling and heating loads predictions of residential buildings using data-driven techniques. Buildings 2024, 14, 752. [Google Scholar] [CrossRef]
  3. Zhong, F.; Yu, H.; Xie, X.; Wang, Y.; Huang, S.; Zhang, X. Short-term building cooling load prediction based on AKNN and RNN models. J. Build. Perform. Simul. 2024, 17, 742–755. [Google Scholar] [CrossRef]
  4. Huang, X.; Han, Y.; Yan, J.; Zhou, X. Hybrid forecasting model of building cooling load based on EMD-LSTM-Markov algorithm. Energy Build. 2024, 321, 114670. [Google Scholar] [CrossRef]
  5. Cakiroglu, C.; Aydın, Y.; Bekdaş, G.; Isikdag, U.; Sadeghifam, A.N.; Abualigah, L. Cooling load prediction of a double-story terrace house using ensemble learning techniques and genetic programming with SHAP approach. Energy Build. 2024, 313, 114254. [Google Scholar] [CrossRef]
  6. Havaeji, S.; Anganeh, P.G.; Esfahani, M.T.; Rezaeihezaveh, R.; Moghadam, A.R. A comparative analysis of machine learning techniques for building cooling load prediction. J. Build. Pathol. Rehabil. 2024, 9, 119. [Google Scholar] [CrossRef]
  7. Myat, A.; Kondath, N.; Soh, Y.L.; Hui, A. A hybrid model based on multivariate fast iterative filtering and long short-term memory for ultra-short-term cooling load prediction. Energy Build. 2024, 307, 113977. [Google Scholar] [CrossRef]
  8. Da, T.N.; Cho, M.Y.; Thanh, P.N. Hourly load prediction based feature selection scheme and hybrid CNN-LSTM method for building’s smart solar microgrid. Expert Syst. 2024, 41, e13539. [Google Scholar] [CrossRef]
  9. Lu, Y.; Peng, X.; Li, C.; Tian, Z.; Kong, X.; Niu, J. Few-sample model training assistant: A meta-learning technique for building heating load forecasting based on simulation data. Energy 2025, 317, 134509. [Google Scholar] [CrossRef]
  10. Xue, P.; Jiang, Y.; Zhou, Z.; Chen, X.; Fang, X.; Liu, J. Multi-step ahead forecasting of heat load in district heating systems using machine learning algorithms. Energy 2019, 188, 116085. [Google Scholar] [CrossRef]
  11. Ardakanian, O.; Bhattacharya, A.; Culler, D. Non-intrusive occupancy monitoring for energy conservation in commercial buildings. Energy Build. 2018, 179, 311–323. [Google Scholar] [CrossRef]
  12. Powell, K.M.; Sriprasad, A.; Cole, W.J.; Edgar, T.F. Heating, cooling, and electrical load forecasting for a large-scale district energy system. Energy 2014, 74, 877–885. [Google Scholar] [CrossRef]
  13. Wong, S.L.; Wan, K.; Lam, T. Artificial neural networks for energy analysis of office buildings with daylighting. Appl. Energy 2010, 87, 551–557. [Google Scholar] [CrossRef]
  14. Kwok, S.; Lee, E. A study of the importance of occupancy to building cooling load in prediction by intelligent approach. Energy Convers. Manag. 2011, 52, 2555–2564. [Google Scholar] [CrossRef]
  15. Oliveira-Lima, J.A.; Morais, R.; Martins, J.F.; Florea, A.; Lima, C. Load forecast on intelligent buildings based on temporary occupancy monitoring. Energy Build. 2016, 116, 512–521. [Google Scholar] [CrossRef]
  16. Pang, Z.; Xu, P.; O’Neill, Z.; Gu, J.; Qiu, S.; Lu, X.; Li, X. Application of mobile positioning occupancy data for building energy simulation: An engineering case study. Build. Environ. 2018, 141, 1–15. [Google Scholar] [CrossRef]
  17. Tekler, Z.D.; Chong, A. Occupancy prediction using deep learning approaches across multiple space types: A minimum sensing strategy. Build. Environ. 2022, 226, 109689. [Google Scholar] [CrossRef]
  18. Sarwar, R.; Cho, H.; Cox, S.J.; Mago, P.J.; Luck, R. Field validation study of a time and temperature indexed autoregressive with exogenous (ARX) model for building thermal load prediction. Energy 2017, 119, 483–496. [Google Scholar] [CrossRef]
  19. Yang, J.; Ning, C.; Deb, C.; Zhang, F.; Cheong, D.; Lee, S.E.; Sekhar, C.; Tham, K.W. K-Shape clustering algorithm for building energy usage patterns analysis and forecasting model accuracy improvement. Energy Build. 2017, 146, 235–247. [Google Scholar] [CrossRef]
  20. Kapetanakis, D.S.; Mangina, E.; Finn, D.P. Input variable selection for thermal load predictive models of commercial buildings. Energy Build. 2017, 137, 13–26. [Google Scholar] [CrossRef]
  21. Ling, J.; Dai, N.; Xing, J.; Tong, H. An improved input variable selection method of the data-driven model for building heating load prediction. J. Build. Eng. 2021, 44, 103255. [Google Scholar] [CrossRef]
  22. Ding, Y.; Zhang, Q.; Yuan, T.; Yang, F. Effect of input variables on cooling load prediction accuracy of an office building. Appl. Therm. Eng. 2018, 128, 225–234. [Google Scholar] [CrossRef]
  23. Gao, Z.; Yang, S.; Yu, J.; Zhao, A. Hybrid forecasting model of building cooling load based on combined neural network. Energy 2024, 297, 131317. [Google Scholar] [CrossRef]
  24. Amasyali, K.; El-Gohary, N.M. A review of data-driven building energy consumption prediction studies. Renew. Sustain. Energy Rev. 2018, 81 Pt 1, 1192–1205. [Google Scholar] [CrossRef]
  25. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  26. Reshef, D.N.; Reshef, Y.A.; Finucane, H.K.; Grossman, S.R.; McVean, G.; Turnbaugh, P.J.; Lander, E.S.; Mitzenmacher, M.; Sabeti, P.C. Detecting Novel Associations in Large Data Sets. Science 2011, 334, 1518–1524. [Google Scholar] [CrossRef]
  27. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
  28. Liu, H.; Liang, J.; Liu, Y.; Wu, H. A Review of Data-Driven Building Energy Prediction. Buildings 2023, 13, 532. [Google Scholar] [CrossRef]
  29. Cohen, I.; Huang, Y.; Chen, J.; Benesty, J. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4. [Google Scholar]
  30. Bhattacharjee, A. Distance correlation coefficient: An application with bayesian approach in clinical data analysis. J. Mod. Appl. Stat. Methods 2014, 13, 23. [Google Scholar] [CrossRef]
  31. Senliol, B.; Gulgezen, G.; Yu, L.; Cataltepe, Z. Fast Correlation Based Filter (FCBF) with a different search strategy. In Proceedings of the 2008 23rd International Symposium on Computer and Information Sciences, Istanbul, Turkey, 27–29 October 2008; pp. 1–4. [Google Scholar]
  32. Spielvogel, L.G. A Standard of Care for Energy. Consult.-Specif. Eng. 2004, 36, 15–23. [Google Scholar]
Figure 1. Research framework.
Figure 2. Schematic diagram of candidate feature variable set for building cooling load prediction model.
Figure 3. The time series of hourly building cooling load with Typical Meteorological Year (TMY) data.
Figure 4. Optimal feature variable subset with PCC.
Figure 5. Optimal feature variable subset with DCC.
Figure 6. Optimal feature variable subset with MRMR.
Figure 7. Optimal feature variable subset with FCBF.
Figure 8. Optimal feature variable subset with MRMR-MIC.
Figure 9. Predicted vs. actual cooling load: performance comparison across three models with five feature subsets.
Figure 10. Prediction errors of three models with five feature subsets.
Table 1. Candidate feature variable set of building cooling load.

No. | Attribute | Time Lag Duration
X1 | Outdoor air dry-bulb temperature (°C) | 24
X2 | Outdoor air dew-point temperature (°C) | 24
X3 | Outdoor air wet-bulb temperature (°C) | 24
X4 | Outdoor air humidity ratio (kg/kg) | 24
X5 | Outdoor relative humidity (%) | 24
X6 | Wind speed (m/s) | 24
X7 | Sky temperature (°C) | 24
X8 | Horizontal infrared radiation intensity (W/m2) | 24
X9 | Diffuse solar radiation intensity (W/m2) | 24
X10 | Direct solar radiation intensity (W/m2) | 24
X11 | Occupant count (persons) | 24
X12 | Lighting power (W) | 24
X13 | Equipment power (W) | 24
X14 | Historical cooling load (kW) | 24 × 7
Table 2. Building simulation model configuration parameters.

Parameters | Values | Units
Geographic location | Tianjin, 39.12° N, 117.20° E | --
Building area | 27,400 | m2
Window-to-wall ratio | South: 0.4; North: 0.3; East: 0.3; West: 0.3 | --
Indoor design temperature | Summer: 26; Winter: 20 | °C
Indoor design humidity | 60 | %
External wall heat transfer coefficient | 0.41 | W/(m2·K)
Window heat transfer coefficient | 1.5 | W/(m2·K)
Roof heat transfer coefficient | 0.48 | W/(m2·K)
Floor heat transfer coefficient | 0.59 | W/(m2·K)
Air infiltration rate | 0.5 | h−1
Fresh air volume | 30 | m3/person
Occupant density | 4 | m2/person
Lighting power density | 11 | W/m2
Equipment power density | 20 | W/m2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
