Article

Synthetic Residential Building Energy-Consumption Dataset Generation Through Parametric Simulation for Hot–Arid Egypt

by
Hossam Wefki
1,
Emad Elbeltagi
2,†,
Mohamed T. Elnabwy
3 and
Mohamed ElAgroudy
4,*
1
Civil Engineering Department, Faculty of Engineering, Port Said University, Port Said 42526, Egypt
2
Department of Civil Engineering, College of Engineering, Qassim University, Buraydah 51452, Saudi Arabia
3
School of Architecture and Built Environment, Northumbria University, Newcastle upon Tyne NE1 8ST, UK
4
School of Leadership, Management and Marketing, De Montfort University, Leicester LE1 9BH, UK
*
Author to whom correspondence should be addressed.
†
Deceased.
Buildings 2026, 16(5), 976; https://doi.org/10.3390/buildings16050976
Submission received: 20 December 2025 / Revised: 18 February 2026 / Accepted: 23 February 2026 / Published: 2 March 2026
(This article belongs to the Special Issue Building Energy Performance and Simulations)

Abstract

Buildings account for a substantial share of global energy demand, and decisions made during conceptual design strongly influence long-term operational consumption. This study presents an open, simulation-derived dataset to support early-stage estimation of residential energy use in a hot–arid context (New Cairo, Egypt). A parametric Rhino/Grasshopper workflow coupled with EnergyPlus was used to generate 12,000 annual simulations. The simulations were produced by systematically sampling key geometric, envelope, glazing, and operational variables, including building dimensions, orientation, window-to-wall ratio, envelope construction options, glazing properties, internal loads (lighting and equipment), and thermostat setpoints. For each case, annual end-use outputs (heating, cooling, lighting, and equipment energy) are reported alongside the corresponding input features, enabling design-space exploration, sensitivity analysis, and the development of surrogate and machine-learning models for rapid decision support. Verification checks and plausibility screening were applied to confirm successful simulation execution and consistent data extraction. In addition, dataset-level sampling diagnostics (marginal balance and correlation screening) are reported to support robust reuse in surrogate and machine-learning studies. The resulting dataset and documentation provide a reusable resource for researchers and practitioners investigating energy-informed residential design under hot-climate boundary conditions.

1. Introduction

Global warming and climate change are closely linked to energy consumption, with the building sector accounting for 37% of global energy consumption [1]. Given these sustainability concerns and rising energy costs, optimised, energy-efficient building design has become increasingly important. A review of prior studies indicates that a building’s energy performance is strongly influenced by decisions made during the conceptual design phase [2]. In conceptual design, proper determination of key design elements can yield potential energy savings of 30–40% without additional construction costs [3]. These elements include building form, orientation, and basic envelope characteristics. Many of the parameters that govern building energy demand are defined at this stage. Early decisions play a crucial role in reducing operational energy use. However, a recognised gap persists between the availability of robust energy-simulation tools and their practical adoption during early design phases [4].
At early design stages, measured or monitored energy consumption data are unavailable because the building is not yet constructed. Even when utility data exist for comparable buildings, the associated metadata are often insufficient. Detailed geometric, envelope, system, and operational information are typically missing, which limits systematic linkage between early-stage design parameters and energy outcomes. Accordingly, this study adopts a simulation-derived (synthetic) dataset approach. Key conceptual inputs (e.g., geometry, envelope, glazing, and setpoints) are varied in a controlled and repeatable manner under consistent boundary conditions. The resulting labelled input–output pairs support predictive modelling and decision support.
Parametric design provides a structured approach for proposing alternative design options. Variables such as building form, orientation, openings, and glazing ratios can be evaluated and adjusted to optimise energy consumption. Parametric design links geometric variables to energy simulation outputs to evaluate and compare design alternatives [5]. The advantage of parametric modelling lies in the ability to establish clear relationships, rules, and parameters that define the model rather than directly specifying the geometry. By manipulating design parameters, designers can explore alternatives and assess their energy implications. This reduces effort and supports performance-oriented decision-making. This approach clarifies how changes in key variables influence building energy behaviour. It also provides a practical framework for generating datasets that support energy analysis and performance-based design [6].
One of the most flexible and adaptable platforms for parametric modelling is Rhinoceros 3D (Rhino). Using Rhino with its visual programming plug-in (Grasshopper) enables the development of complex parametric relationships and geometry through a node-based interface. This reduces the need for traditional text-based programming skills. Due to its flexibility and extensibility, Rhino/Grasshopper is widely used to implement parametric energy analysis and workflow automation [7,8,9,10,11].
During the design stages, designers face significant challenges, particularly when projects require critical energy decisions. Traditional modelling methods often rely on lengthy manual calculations or simplified rules of thumb [12]. These approaches may fail to capture the complex interplay among factors that drive building energy consumption. These limitations have spurred the development of computational approaches that support comprehensive energy-consumption databases and facilitate more informed design decisions. The combination of parametric design tools, such as Grasshopper, and advanced energy simulation engines, such as EnergyPlus, is effective [13]. This integration enables designers to develop, analyse, and optimise building energy models using visual programming, while reducing the need to switch between software environments and improving continuity between design and analysis phases [14]. Large-scale simulation datasets further support data-driven design by enabling systematic comparison of alternatives that are difficult to evaluate using manual workflows.
This study leverages the integration of Grasshopper 5.0 and EnergyPlus 24.2 to generate an energy-consumption dataset during the conceptual design stage, with modelling assumptions aligned with Egyptian thermal material standards and local climate conditions. The workflow is selected due to its growing adoption and its potential to address four challenges: (1) delayed energy analysis in conventional workflows; (2) barriers imposed by simulation complexity; (3) interoperability limitations between design and analysis tools; and (4) the time required to set up and run detailed simulations relative to conceptual-stage decision cycles. A wide range of Building Energy Modelling (BEM) tools is used in research and practice, including EnergyPlus-based environments (e.g., OpenStudio, DesignBuilder), integrated commercial platforms (e.g., IES VE), legacy engines (e.g., DOE-2/eQuest), and specialist tools (e.g., TRNSYS, ESP-r, IDA ICE). These platforms differ in the level of required input detail and in their integration with design environments. In this context, the Grasshopper–EnergyPlus workflow supports rapid parametric variation and automated large-scale simulation runs for dataset generation at the conceptual stage.

Research Contributions

Motivated by the limitations of existing early-design energy research, often case-based, workflow-dependent, or optimisation-oriented rather than producing broadly reusable datasets, this study positions its contribution around interoperability, reuse, and decision-relevance. This study addresses the recognised theoretical issue that the mapping between parametric design representations and simulation representations is not neutral: undocumented translation rules (e.g., zoning logic, geometric simplification, default constructions, schedules, and HVAC abstractions) can embed modelling artefacts and reduce scientific comparability.
Accordingly, this research contributes the following:
  • An interoperable parametric-to-simulation workflow for early-stage residential energy analysis. The study formalises an integrated pipeline that connects Rhino/Grasshopper parametric modelling to EnergyPlus-based annual simulation (via DIVA for Grasshopper), enabling automated generation of consistent design alternatives and their energy labels.
  • An open, labelled residential energy dataset for a hot–arid Egyptian context. The dataset contains 12,000 simulation cases for New Cairo/Cairo boundary conditions and is publicly released through Zenodo. It is explicitly structured as input–output pairs linking conceptual design variables (geometry, orientation, façade WWR, glazing properties, setpoints, and discrete envelope options) to annual end-use outputs (heating, cooling, lighting, and equipment), supporting reproducible benchmarking and downstream surrogate/ML research.
  • A documented scope with consistent boundary conditions to isolate early-stage effects. To ensure comparability across alternatives, the dataset is generated under controlled assumptions (a single Cairo/New Cairo EPW, a residential operational profile, and fixed baseline internal gains and HVAC-related coefficients). Therefore, the observed energy variance is attributable primarily to the declared early-stage variables.
  • A quality-assurance layer for dataset credibility and reuse. The study introduces systematic verification and plausibility benchmarking, including parameter-bound checks, geometry sanity checks, run-completeness/error screening, output integrity checks, and confirmation of expected physical trends for cooling-dominated hot–arid conditions.
  • Empirical evidence on the relative influence of early-stage variables in hot–arid housing. Beyond releasing the dataset, the paper reports sensitivity/interpretability signals showing that cooling setpoint and building dimensions are the dominant drivers of annual energy variance, followed by glazing solar gains (SHGC). At the same time, other envelope-related parameters exhibit smaller effects within the tested, code-aligned ranges.
Together, these contributions provide (i) a reusable methodological bridge between parametric design and robust simulation, (ii) a climate-specific open dataset that fills a documented gap in hot–arid residential contexts, and (iii) actionable evidence about which conceptual-stage choices most strongly shape annual energy demand under the stated assumptions.
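The quality-assurance checks listed among the contributions can be illustrated with a minimal sketch. The field names, the bounds, and the helper `screen_case` are hypothetical and chosen only for illustration; the rule that cooling should not fall below heating is likewise an assumed stand-in for the "expected physical trends" check, not the authors' exact criterion.

```python
import math

# Hypothetical parameter bounds (illustrative only, not the dataset's declared ranges).
BOUNDS = {
    "length_m": (5.0, 30.0),
    "width_m": (5.0, 30.0),
    "wwr": (0.1, 0.9),
    "cooling_setpoint_c": (20.0, 28.0),
}
# Hypothetical column names for the four annual end-use outputs.
END_USES = ("heating_kwh", "cooling_kwh", "lighting_kwh", "equipment_kwh")


def screen_case(case):
    """Return a list of human-readable problems found in one simulation record."""
    problems = []
    # Parameter-bound check: each input must lie within its sampled range.
    for name, (low, high) in BOUNDS.items():
        value = case.get(name)
        if value is None or not (low <= value <= high):
            problems.append(f"{name} out of bounds: {value}")
    # Output integrity check: each end use must be present, finite, and non-negative.
    for name in END_USES:
        value = case.get(name)
        if value is None or not math.isfinite(value) or value < 0:
            problems.append(f"{name} invalid: {value}")
    # Plausibility check (assumed rule): in a cooling-dominated climate,
    # annual cooling energy is expected to be at least the heating energy.
    if not problems and case["cooling_kwh"] < case["heating_kwh"]:
        problems.append("heating exceeds cooling (unexpected for hot-arid climate)")
    return problems
```

A record that returns an empty list passes all three checks; any other result names the fields that would need inspection before the case is kept.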

2. Literature Review

Several variables shape parametric building models and, through them, a building’s energy consumption behaviour, including building geometry, orientation, envelope construction, and materials [15]. Coupling Grasshopper with EnergyPlus has become a common way to generate energy-consumption datasets: design alternatives are produced parametrically and then simulated in EnergyPlus to build databases of performance metrics mapped to the corresponding design configurations.
As highlighted by Peng et al. [12], these databases can be used to develop optimisation methods for building design that consider multiple objectives, such as energy efficiency, carbon emissions reduction, and structural stability. Their research used parametric modelling and an optimisation algorithm in the Grasshopper platform to generate building designs, considering factors such as building structure, materials, and orientation.
To make early-stage energy analysis more accessible, researchers have increasingly explored integrating parametric modelling tools with building energy simulation engines to create more user-friendly workflows. A promising approach involves coupling Grasshopper, a visual programming interface for Rhinoceros 3D that has gained widespread adoption among architects, with EnergyPlus, a validated whole-building energy simulation engine developed by the U.S. Department of Energy.
Many researchers use the term “parametric energy analysis” for workflows in which design inputs are varied and the resulting alternatives are generated, simulated, and compared. Roudsari et al. [16] explain in the introduction to the Ladybug plugin for Grasshopper that the approach provides “instantaneous feedback on design modifications” by running simulations directly within the design environment. This parametric approach offers several key advantages for conceptual design:
  • It enables systematic exploration of design spaces by varying key parameters such as building dimensions, orientation, window-to-wall ratios, and basic envelope properties.
  • It enables the generation of comprehensive energy consumption databases that capture the relationships between design decisions and performance outcomes.
  • It provides visual feedback directly within the design environment, helping architects understand performance implications without switching between different software platforms.
  • It supports data-driven decision-making during the critical early phases when the potential for cost-effective optimisation is highest.
Sarkar [8] employed the Grasshopper Optimisation Algorithm (GOA) to design a net-zero-energy residential building, combining Revit for 3D modelling and MATLAB for energy simulation. By optimising passive design strategies and integrating solar panels based on local climate data, the approach substantially reduced energy consumption. Han and Vartosh [9] proposed a two-objective optimisation model for integrated energy systems (IES) that incorporates demand response and accounts for both economic and energy-efficiency benefits. The model employs both linear and nonlinear constraints to enhance realism, and a multi-objective GOA uses Pareto optimisation and fuzzy theory to select the best solution. Peng et al. [12] explored optimisation methods for designing low-carbon, structurally stable earthen buildings using parametric modelling and algorithms in Grasshopper. By optimising design factors such as building orientation, structure, and materials, the resulting designs reduced energy consumption and carbon emissions while improving structural stability, as indicated by lower stress levels, reduced displacement, and increased safety.
Ramirez et al. [17] presented a synthetic dataset specifically designed to analyse residential energy consumption in cold climates, with a focus on the combined effects of winter heating demand and building ageing. This dataset provides a comprehensive view of critical factors, including indoor and outdoor temperatures, humidity, energy use, and solar radiation, illustrating the ageing factor’s effect on energy consumption. The geometry was modelled in SketchUp and the energy performance simulated in EnergyPlus, providing a solid foundation for machine-learning applications and energy/thermal forecasting. Wu et al. [18] introduced a multi-objective optimisation framework combining Bayesian optimisation with XGBoost (BO-XGBoost) and the NSGA-II algorithm to enhance energy efficiency, daylighting, and thermal comfort in residential buildings. Using Grasshopper and Latin hypercube sampling (LHS), a dataset was generated to train the BO-XGBoost model, which accurately predicted building performance (achieving R² values of 0.997, 0.960, and 0.994 for energy use, thermal comfort, and daylighting, respectively).
Waqas et al. [19] addressed the critical challenge of improving energy efficiency in buildings, which account for 40% of EU energy use and 36% of CO2 emissions, by proposing a novel 3D BIM-based modelling technique to optimise thermal comfort and reduce overheating risks. Using CAD-reconstructed 3D building models and Ladybug/Honeybee plugins in Grasshopper, the research analyses solar radiation and environmental performance. Alammar and Jabi [20] explored the use of a decision tree (DT) machine-learning model as a surrogate modelling approach to rapidly predict the hourly cooling loads of adaptive façades (AF), offering a faster alternative to traditional building performance simulation (BPS). Since real-world data were unavailable, a parametric model of an office tower with an AF shading system was simulated using Honeybee (Rhino/Grasshopper) to generate synthetic training data. Peronato et al. [21] introduced a Grasshopper plugin for CitySim to bridge the gap between parametric design and urban-scale BPS, offering a faster, more efficient alternative to traditional BPS tools that are often computationally intensive and require excessive input parameters. The developed interface leverages CitySim’s simplified yet accurate urban-scale algorithms to reduce simulation time and input complexity while maintaining reliability. It combines this with Grasshopper’s parametric flexibility to enable semi-automated manipulation of building geometries and energy parameters.
Optimising energy efficiency during the early design stages is crucial for creating sustainable residential buildings. However, this task is challenging given the many design, energy, and simulation parameters involved [22]. Multiple simulations based on parametric analysis are often employed to address this challenge. The simulations serve as analytical tools for better understanding alternatives and their corresponding energy use. With the help of the analysed data, key stakeholders can make informed decisions to improve energy efficiency and sustainability. The dataset created will facilitate the development of decision-support and predictive models in the ever-evolving architecture, engineering, and construction (AEC) industry [23].
Because building-level data are difficult to obtain, synthetic data are becoming a practical complement to real-world data for energy management, demand response, and energy planning. It is expected that by 2030, synthetic data will surpass real-world data in training AI models. The goal is to establish a basis for further investigation: a resource that enables experiments on energy optimisation and on the impact of residential building envelope features on energy use, and that also serves as an industry decision-support mechanism [17].

Critical Synthesis and Research Gap

Previous studies demonstrate the value of building energy simulation and parametric workflows for improving early design decisions, and optimisation-based approaches are practical when the primary goal is identifying high-performing solutions under explicit objectives. However, much of the existing work remains case-based (limited design alternatives), workflow-dependent (requiring expert setup and manual model preparation), or optimisation-oriented (producing Pareto-optimal solutions rather than broad, general-purpose datasets). In addition, many published datasets and studies are not tailored to hot–arid climates or do not provide consistent, structured input–output pairs that link early-stage design variables to end-use energy results at scale.
Interoperability constitutes an additional theoretical gap: the mapping between a parametric design representation and a simulation representation is not neutral. Choices related to zoning logic, geometric simplification, surface discretisation, default constructions, schedules, and HVAC abstractions can silently alter the semantics of “the same” design option. When these translation rules are not formalised and documented, datasets become workflow-specific, and the learned relationships may reflect modelling artefacts rather than true design–performance dependencies, limiting scientific comparability and external reuse.
Addressing these limitations, the present study contributes an interoperable parametric simulation workflow that enables automated large-scale simulation and provides a labelled dataset (12,000 residential cases) for the New Cairo hot–arid context. The study adopts a dataset-first paradigm: (i) defining a clear conceptual-stage variable set and ranges, (ii) enforcing consistent boundary conditions and modelling assumptions, (iii) generating a sufficiently large and diverse sample to support learning and sensitivity analysis, and (iv) structuring outputs as end-use energy indicators suitable for ML training and research-field database contribution. Based on the identified gap, the following section presents the dataset-generation methodology, workflow, and dataset specifications.

3. Research Methodology

This section describes the dataset specifications, software workflow, and simulation procedure used to generate the 12,000-run residential dataset, and notes that verification checks and plausibility benchmarking were applied to support dataset credibility.

3.1. Data Specifications

Table 1 provides a comprehensive overview of the dataset used to analyse energy consumption and efficiency in residential buildings. Rhino/Grasshopper was used to generate 12,000 simulations, which were stored in XLSX files. The parameters include building orientation, dimensions, materials used, and climate conditions. The simulations represent New Cairo City, Egypt (30.0363° N, 31.4758° E), and the dataset is available in the Zenodo repository. The dataset is climate-specific and reflects hot–arid conditions based on New Cairo/Cairo weather data; transferability to other climates requires rerunning the workflow with climate-appropriate EPW files. The dataset is a valuable tool for designers, architects, engineers, and researchers for analysing the impact of different designs on energy use, facilitating the development of predictive models and energy optimisation, and providing a foundation for training deep-learning models to predict future energy patterns.
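For readers planning to reuse the dataset in a surrogate or machine-learning study, the sketch below shows one way a row could be split into an input vector and the four end-use labels. The column names are hypothetical placeholders, not the headers of the released XLSX files, and would need to be replaced with the actual ones.

```python
# Hypothetical column names; replace with the headers used in the released files.
INPUT_COLUMNS = ("length_m", "width_m", "orientation_deg", "wwr")
OUTPUT_COLUMNS = ("heating_kwh", "cooling_kwh", "lighting_kwh", "equipment_kwh")


def split_record(row):
    """Split one dataset row (a mapping) into (features, targets) lists of floats."""
    features = [float(row[name]) for name in INPUT_COLUMNS]
    targets = [float(row[name]) for name in OUTPUT_COLUMNS]
    return features, targets


def total_annual_energy(row):
    """Sum the four annual end uses of one row."""
    return sum(float(row[name]) for name in OUTPUT_COLUMNS)
```

Keeping the end uses as separate targets, rather than only their sum, preserves the option of training one model per end use.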

3.2. Data Generation, Analysis, and Results

The research methodology employs systematic computational analysis across different architectural configurations to generate accurate performance data. Figure 1 presents the structural scheme of the proposed framework for generating a building energy consumption dataset, delineating its key operational modules. The proposed workflow follows established parametric performance-analysis approaches in Rhino/Grasshopper and EnergyPlus-based simulation pipelines reported in previous studies (e.g., Roudsari et al. [16] for design-integrated analysis workflows, and EnergyPlus-related workflow literature), with additional automation and dataset structuring introduced in this work. The integrated framework comprises five core components that operate together in a coordinated sequence:
  • Conceptual design phase: defines the baseline building configuration and the initial set of design variables to be explored;
  • Parametric simulation: generates design alternatives by systematically varying key parameters within the modelling environment;
  • Energy modelling: translates each design alternative into an EnergyPlus-ready representation, including envelope, internal loads, and operational settings;
  • Simulation workflow: executes batch simulations and manages input/output handling, quality checks, and extraction of performance metrics;
  • Energy consumption dataset: compiles the labelled input–output pairs into a consistent dataset suitable for analysis and data-driven modelling.
Each component contributes to a comprehensive analytical pipeline that transforms initial design parameters into energy consumption datasets. The subsequent sections elaborate on the technical implementation and theoretical foundations of these interconnected modules.
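The parametric-simulation component can be pictured as drawing each design alternative from a declared design space. The sketch below is a minimal illustration using uniform random sampling with a fixed seed; the sampling scheme actually used in the workflow is not described in this section, and the variable names and continuous ranges are assumptions. Only the counts of discrete options (three wall, two roof, and two slab-on-grade assemblies) follow the paper.

```python
import random

# Hypothetical continuous ranges (illustrative assumptions).
CONTINUOUS = {
    "length_m": (8.0, 20.0),
    "width_m": (8.0, 20.0),
    "orientation_deg": (0.0, 360.0),
    "wwr": (0.1, 0.6),
}
# Discrete envelope options: counts follow the paper, labels are placeholders.
DISCRETE = {
    "wall_type": (1, 2, 3),
    "roof_type": (1, 2),
    "sog_type": (1, 2),
}


def sample_alternatives(n, seed=0):
    """Draw n design alternatives; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    cases = []
    for case_id in range(n):
        case = {"case_id": case_id}
        for name, (low, high) in CONTINUOUS.items():
            case[name] = rng.uniform(low, high)
        for name, options in DISCRETE.items():
            case[name] = rng.choice(options)
        cases.append(case)
    return cases
```

Each returned dictionary corresponds to one simulation run; storing the `case_id` alongside the inputs makes it straightforward to join them with the extracted end-use outputs later.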

3.2.1. Conceptual Design Phase

The conceptual design phase is the primary stage for establishing the key parameters that control the building’s energy performance and optimisation. The parameters are structured within a flexible digital framework to enable an iterative search for alternative designs that meet performance objectives. By turning early-stage decisions into quantifiable inputs, the research methodology ensures that subsequent energy simulations are grounded in architecturally relevant constraints [24]. This parametric approach achieves two objectives: first, it facilitates rapid design iteration, and second, it establishes traceable relationships between geometric configurations and the resulting energy outcomes. In this way, the parametric model links architectural design intent with a structured analytical approach, transforming qualitative design concepts into measurable performance variables. Table 2 lists the conceptual design parameters utilised for this study.

3.2.2. Parametric Simulation

Parametric design is a method that uses adjustable variables to control geometric properties. With this approach, a model takes on a specific form based on the selected parameter values. These parameters can be connected through relationships, allowing the design to update automatically whenever changes are made, even later in the process. Tools like Grasshopper, commonly used in architecture and design, let designers build digital models using algorithms and mathematical functions [25,26]. Simulations in the construction industry are often complex to implement because they require precise inputs, which is challenging when many variables remain uncertain during iterative design. However, with advances in parametric simulation and increased computing power, designers can now quickly explore a large number of design options. These tools provide valuable insights, helping teams make more informed decisions at the start of a project. While parametric simulation tools are specifically tailored for design, they still have limitations, such as compatibility with certain building types or the required input data [27]. The way simulation programmes are used can vary: sometimes designers themselves run quick checks (for example, an architect testing different window sizes), while at other times specialists are brought in to analyse specific aspects of a design.

3.2.3. Energy Modelling

Building Energy Modelling (BEM) helps estimate how much energy a structure will use and how much it could save compared to standard energy benchmarks. This analysis helps ensure that projects comply with local, national, and regional energy regulations. BEM forecasts energy performance using historical weather data (such as TMY files) and operational assumptions, meaning its accuracy depends largely on how realistic those inputs are. Design teams should carefully document and validate these assumptions to avoid misleading results. BEM can be thought of as a sophisticated calculator: the modeller feeds it data on weather patterns, building dimensions, HVAC details, and occupancy schedules, and it generates performance reports and compliance checks.
Energy modelling simulations require a comprehensive characterisation of building features that influence energy performance. Key determinants include building typology (e.g., residential versus commercial office spaces), spatial configurations, and climatic interactions. Architectural design parameters must be carefully selected and optimised according to functional requirements; their strategic integration can significantly reduce energy demand while maintaining optimal indoor environmental quality. As shown in Table 3, these modelling parameters and inputs vary in complexity and specificity.

3.2.4. Simulation Workflow

This study established a methodological workflow integrating multiple software tools to develop a parametric simulation model for generating an annual building energy use database. The comprehensive workflow incorporates building energy simulation through sequential processes: (1) generation of building information data (including geometric parameters, orientation, fenestration properties, and material specifications) using Rhino-based input; (2) data integration through Grasshopper as middleware; and (3) thermal performance analysis via DIVA for Grasshopper, which translates architectural parameters into EnergyPlus-compatible inputs for detailed energy simulation.
The workflow implementation required both adapting existing Grasshopper components and developing custom elements to address specific research requirements not supported by native software functionality. This integrated approach systematically bridges discrete analytical functionalities to achieve the study’s research objectives while maintaining interoperability across architectural design, parametric modelling, and energy simulation domains. The Rhinoceros/Grasshopper tool generates a detailed building model, which is then exported to a simulation tool such as EnergyPlus for energy analysis via applications supported in Grasshopper. The simulation yields energy consumption data, which are extracted and organised into an Excel dataset, along with the design and simulation parameter input data. The proposed roadmap was divided into four main stages covering the study’s distinctive characteristics: (1) construction material assembly, (2) building geometry modelling, (3) thermal simulation analysis, and (4) dataset generation (Figure 2).
  • Stage 1: Construction Material Assembly
Each component of the envelope can be broken down into multiple layers. The thermal capacity, absorption, and resistance of these layers influence how heat flows through the element. The materials for these components must be selected carefully according to energy standards or specific requirements. The Egyptian Residential Energy Code (EREC) specifies the permitted U-values or minimum insulation R-values for building envelope components. Additionally, it specifies the maximum permitted U-factor and Solar Heat Gain Coefficient (SHGC) for glazing, both of which depend on the window-to-wall ratio (WWR) [29]. The characteristics of the materials used for the building envelope components (walls, roof, and slab on grade (SOG)) according to the Egyptian Specifications for Thermal Insulation Work Item [30] are detailed in Table 4. The air gap is assumed to have no mass, with a thermal resistance of 0.15 m²·K/W.
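To make the role of layer resistance concrete, the following sketch computes a steady-state U-value for a layered element by summing resistances in series. Only the air-gap resistance of 0.15 m²·K/W comes from the text; the surface film resistances (0.13 and 0.04 m²·K/W) and the layer properties in the usage example are assumed placeholder values, not the Table 4 data.

```python
AIR_GAP_RESISTANCE = 0.15  # m2.K/W, massless air gap as stated in the text


def u_value(layers, r_inside=0.13, r_outside=0.04, air_gaps=0):
    """Steady-state U-value (W/m2.K) of a layered envelope element.

    layers: list of (thickness_m, conductivity_W_per_mK) tuples.
    r_inside, r_outside: surface film resistances (assumed typical values).
    air_gaps: number of massless air gaps in the assembly.
    """
    # Each solid layer contributes thickness / conductivity to the total resistance.
    r_layers = sum(thickness / conductivity for thickness, conductivity in layers)
    r_total = r_inside + r_layers + air_gaps * AIR_GAP_RESISTANCE + r_outside
    return 1.0 / r_total


# Placeholder layer properties (thickness in m, conductivity in W/m.K).
PLASTER = (0.02, 0.5)
BRICK = (0.12, 0.6)
single_wall = u_value([PLASTER, BRICK, PLASTER])
cavity_wall = u_value([PLASTER, BRICK, BRICK, PLASTER], air_gaps=1)
```

Under these placeholder values the cavity wall has the lower U-value, because both the second brick layer and the air gap add resistance in series.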
This section details the configurations of three wall types, two roof types, and two slab-on-grade floor types designed for typical residential buildings in Egypt. The wall types vary in thickness and insulation, with single or double layers of red brick and options for an air gap to enhance thermal resistance. Roof configurations include different thicknesses of cement tiles, mortar, sand, insulation, bitumen, and reinforced concrete to manage heat flow. The slab-on-grade types are similarly constructed to control heat transfer from the ground. Digital modelling of these configurations was achieved using the Gerilla plugin for Grasshopper. The Gerilla material maker component develops the material library based on the thermal and surface properties outlined in the Egyptian thermal insulation specification guidebook (Figure 3). Each layer within the construction element is assigned to a material using the Gerilla construction assembly component, as illustrated in Figure 4.
  • Wall Assembly: Three types of wall constructions are analysed, each designed to represent typical configurations in the Egyptian residential sector:
  • Wall Type 1 (Single Wall 125 mm): Composed of one layer of red brick, two layers of cement mortar, and two layers of plaster.
  • Wall Type 2 (Double Wall 250 mm): Includes a double layer of red bricks, two layers of cement mortar, and two layers of plaster.
  • Wall Type 3 (Double Red Brick Wall with Air Gap): Features two single red brick layers separated by an air gap, cement mortar, and plaster layers.
  • Roof Assembly: The roof, a critical component of the building envelope, is designed to minimise heat gain. Two types of roof constructions are specified:
  • Roof Type 1 (Slab 150 mm): Consists of a cement tile layer, a cement mortar layer, a clean sand layer, an insulation bitumen layer, and a reinforced concrete layer.
  • Roof Type 2 (Roof Floor Slab 200 mm): Like Roof Type 1 but with an increased slab thickness, enhancing its thermal resistance.
  • Slab on Grade: The slab on grade is another essential element for controlling heat transfer between the building and the ground. Two types are highlighted:
  • Slab on Grade Type 1 (SOG 150 mm): Comprises a cement tile layer, a cement mortar layer, a clean sand layer, and a reinforced concrete layer.
  • Slab on Grade Type 2 (SOG 200 mm): Like Type 1 but with increased thickness for improved insulation properties.
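The layered assemblies listed above correspond to a series thermal-resistance calculation. The following is a minimal sketch of that calculation, not the authors' Gerilla definition. The layer thicknesses and conductivities are placeholder values (the actual properties are those of Table 4), and the surface film resistances `r_si` and `r_so` are assumed typical values. Only the 0.15 m²·K/W air-gap resistance is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    thickness_m: float    # layer thickness in metres
    conductivity: float   # thermal conductivity in W/(m·K)

    @property
    def resistance(self) -> float:
        """Conductive resistance of the layer in m²·K/W."""
        return self.thickness_m / self.conductivity


AIR_GAP_R = 0.15  # m²·K/W, massless air gap as stated in the text


def u_value(layers, air_gaps=0, r_si=0.13, r_so=0.04):
    """U-value (W/m²·K) of a layered assembly.

    r_si and r_so are assumed surface film resistances, not values
    taken from the paper.
    """
    r_total = r_si + r_so + air_gaps * AIR_GAP_R
    r_total += sum(layer.resistance for layer in layers)
    return 1.0 / r_total


# Placeholder material properties (illustrative only, not Table 4 data).
plaster = Layer("plaster", 0.02, 0.70)
mortar = Layer("cement mortar", 0.02, 1.00)
brick = Layer("red brick", 0.125, 0.60)

# Wall Type 1: one brick layer, two mortar layers, two plaster layers.
wall_type_1 = [plaster, mortar, brick, mortar, plaster]
# Wall Type 3: two brick layers separated by an air gap.
wall_type_3 = [plaster, mortar, brick, brick, mortar, plaster]

u_wall_1 = u_value(wall_type_1)
u_wall_3 = u_value(wall_type_3, air_gaps=1)
```

With any positive layer properties, the cavity wall has the lower U-value because it adds both a second brick layer and the air-gap resistance to the series sum.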
  • Stage 2: Building Geometry Modelling
The building geometry involves modelling the building zone to be simulated using a series of Grasshopper definitions, including domain box, list items, panel, and rotate definitions. Geometry focuses on creating a virtual representation of a sample residential building by specifying parameters such as the building’s length, width, height, and orientation. Further, windows are modelled by drawing new surfaces on each façade using Grasshopper components and adjusting them to match the desired WWR.
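The window-sizing step can be illustrated with a short sketch. The paper does not describe how window surfaces are scaled to the target WWR, so the approach below (a centred window with the same aspect ratio as the façade) and the function name `window_from_wwr` are assumptions for illustration only.

```python
import math


def window_from_wwr(facade_width, facade_height, wwr):
    """Return (width, height) of a single window whose area equals
    wwr * facade area, keeping the facade's aspect ratio.

    Assumption: one window per facade, scaled uniformly in both directions.
    """
    if not 0.0 < wwr < 1.0:
        raise ValueError("WWR must lie strictly between 0 and 1")
    if facade_width <= 0 or facade_height <= 0:
        raise ValueError("facade dimensions must be positive")
    scale = math.sqrt(wwr)  # uniform scaling so that area scales by wwr
    return facade_width * scale, facade_height * scale


win_w, win_h = window_from_wwr(10.0, 3.0, 0.30)
achieved_wwr = (win_w * win_h) / (10.0 * 3.0)
```

Scaling both dimensions by the square root of the WWR guarantees that the window always fits inside the façade for any ratio below 1.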
  • Stage 3: Thermal Simulation Analysis
In thermal analysis, all combinations of the design parameters are simulated. Several factors influence energy use; this phase focuses on selecting the parameters that define the entire building simulation. Once the parameters are chosen, the levels of each variable must be established. For this research, the building envelope includes three types of walls, two types of roofs, and two types of slab-on-grade (SOG). These parameters have discrete values that represent the U-value for each building envelope component. Other parameters are treated as continuous, with defined minimum and maximum value ranges, using the default settings for the simulation tool, as shown in Table 5 and Table 6. The continuous input parameter values were randomised using uniform distributions within predefined ranges for each parameter. These ranges were selected to reflect feasible early-design choices consistent with typical Egyptian residential practice and the regulatory requirements of the Egyptian Residential Energy Code (EREC) for the New Cairo context. Real-world residential buildings may exhibit greater variability due to differences in construction quality, informal alterations, and occupant use. Therefore, the adopted ranges should be interpreted as defining the scope of a code-aligned hot–arid residential dataset intended for conceptual-stage modelling, rather than a complete statistical representation of all as-built conditions.
Given the complexity and high dimensionality of the design space (including building dimensions, orientation, envelope types, window-to-wall ratios, glazing properties, and temperature settings), random sampling is used to generate a diverse set of design configurations. This method was chosen for its simplicity and its effectiveness in exploring a wide range of parameter combinations without favouring particular designs. To ensure robust coverage, 12,000 simulations are performed, with 1000 for each of the 12 design options (combinations of wall, roof, and slab-on-grade types). Uniform random sampling was adopted because this work aims to generate a large and diverse dataset for downstream predictive modelling, rather than to identify optimal solutions through iterative search. Random sampling provides a simple, reproducible approach for mixed discrete–continuous variables and enables broad coverage of parameter ranges at scale. More formal space-filling and uncertainty-oriented strategies (e.g., Latin Hypercube Sampling or Monte Carlo-based designs) can improve coverage efficiency, may reduce the number of simulations needed, and support uncertainty quantification; however, they were not implemented in this release to maintain workflow simplicity and reproducibility, and because the target sample size (12,000 simulations) provides substantial variability for training and analysis.
These strategies are identified as extensions for future dataset releases. The large volume of simulations allowed parameter variability and interactions to be captured effectively. The computational workflow (using Grasshopper and EnergyPlus) was automated, enabling efficient handling of the considerable number of simulations. Thus, the trade-off between simulation volume and coverage was manageable without resorting to more complex sampling methods.
As mentioned, EnergyPlus is used to conduct thermal simulations. EnergyPlus generates scenarios based on input variables and their levels, making it easier to perform parametric analysis across multiple variables. This study computed twelve design possibilities for different wall, roof, and SOG construction type combinations. These combinations are as follows:
Design options = 3 (Walls) × 2 (SOG) × 2 (Roof) = 12 options
Every option comprises both continuous and discrete values. For each design option, 1000 simulations are conducted using random parameter values, resulting in 12,000 simulations that represent the calculated energy-consumption data for all design options. Once the building zone is established, the model is ready for simulation. The model setup uses the DIVA plugin (daylight performance analysis) and Viper (energy consumption analysis). Using the EnergyPlus simulation engine, the Viper component interfaces with Grasshopper to perform thermal analysis. Weather data, zone geometry, location, lighting, equipment loads, window assemblies, and occupancy density are among the inputs used by the Viper component. Depending on the simulation objective, the output metrics can include the monthly and annual electricity use for heating, cooling, lighting, and equipment.
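The stratified layout described above (12 envelope combinations, each with 1000 uniformly sampled cases) can be sketched as follows. This is an illustrative reconstruction, not the authors' Grasshopper definition. The parameter names and bounds in `CONTINUOUS_BOUNDS` are hypothetical placeholders; the actual ranges are those given in Table 5 and Table 6.

```python
import itertools
import random

WALLS = ("Wall Type 1", "Wall Type 2", "Wall Type 3")
ROOFS = ("Roof Type 1", "Roof Type 2")
SOGS = ("SOG Type 1", "SOG Type 2")

# Hypothetical bounds for illustration only (see Tables 5 and 6 for the
# ranges actually used in the dataset).
CONTINUOUS_BOUNDS = {
    "length_m": (8.0, 20.0),
    "depth_m": (8.0, 20.0),
    "height_m": (2.8, 3.6),
    "orientation_deg": (0.0, 360.0),
    "wwr": (0.10, 0.60),
    "shgc": (0.20, 0.80),
}


def sample_cases(n_per_option=1000, seed=42):
    """Draw n_per_option uniform random cases for every envelope option."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    cases = []
    # 3 walls x 2 roofs x 2 slabs-on-grade = 12 design options
    for wall, roof, sog in itertools.product(WALLS, ROOFS, SOGS):
        for _ in range(n_per_option):
            case = {"wall": wall, "roof": roof, "sog": sog}
            for name, (low, high) in CONTINUOUS_BOUNDS.items():
                case[name] = rng.uniform(low, high)
            cases.append(case)
    return cases


cases = sample_cases()
```

Looping over the discrete options and sampling only the continuous variables inside each one is what makes the envelope combinations exactly balanced, while the continuous inputs remain independently and uniformly distributed.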
Viper has two components: a window unit component with glazing parameters such as U-value, SHGC, and visible transmittance (VT), and a construction component that describes the assembly using layered materials. To address the limitations of EnergyPlus performing only up to 100 simulations per run, a Python-scripted component named “Run All Iterations” was developed within Grasshopper to extend its parametric simulation capabilities. Given that this study involves 12 options, each with 1000 simulations, this component iterates through a series of sliders that represent the variable parameters. The simulations are executed once all attributes are configured for analysis. Figure 5 and Table 7, Table 8 and Table 9 illustrate the simulation process in Grasshopper for building and window modelling, as well as for defining inputs and outputs using DIVA components. All simulations in this dataset were configured for residential operation. Extending the workflow to other typologies (e.g., office, educational, healthcare, or mixed-use) primarily requires updating typology-dependent inputs such as occupancy density and schedules, lighting and equipment power densities, ventilation requirements, zoning templates, and HVAC system type/control. These changes would produce different load profiles and therefore require reparameterization when generating non-residential datasets.
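The batching idea behind a component such as “Run All Iterations” can be sketched in plain Python. The code below is not the authors' Grasshopper script, whose internals are not given in the text. The function names and the `run_case` callback are hypothetical, and the batch size of 100 simply mirrors the per-run limit mentioned above.

```python
def batches(n_cases, batch_size=100):
    """Yield index ranges that split n_cases into chunks of batch_size."""
    for start in range(0, n_cases, batch_size):
        yield range(start, min(start + batch_size, n_cases))


def run_all_iterations(cases, run_case, batch_size=100):
    """Run every case in order, one batch at a time.

    run_case is a hypothetical callback that sets the parameter values for
    one case, triggers the simulation, and returns its outputs.
    """
    results = []
    for batch in batches(len(cases), batch_size):
        for index in batch:
            results.append(run_case(cases[index]))
    return results


# Example with a stand-in callback instead of a real simulation call.
demo_results = run_all_iterations(list(range(250)), lambda case: case * 2)
```

Because results are appended in case order, each output row stays aligned with the input record that produced it.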
  • Stage 4: Dataset Generation
EnergyPlus creates an output dataset during the simulation process that contains the results for each time step. These results can be reported at each time step or aggregated over longer periods. For every simulation/option, the TT Toolbox Grasshopper plugin collects input and output data. A total of 12,000 simulations were conducted, with 1000 simulations performed for each option. Parameters were varied between runs, and all inputs and outputs were recorded during the process. The results, simulated using the Cairo International Airport weather data, were stored in an Excel file, representing the annual energy consumed for heating, cooling, interior lighting, and equipment. The dataset is simulation-derived and reflects the stated boundary conditions; users should avoid extrapolation beyond the parameter ranges. The annual energy usage dataset is intended to support the development of a deep-learning model that predicts energy usage for a particular geographical area. The predicted Energy Use Intensity (pEUI) label in this dataset represents the annual total site energy use for heating, cooling, lighting, and equipment (kWh/year). If an area-normalised Energy Use Intensity is required, compute EUI = pEUI ÷ (Length × Depth), yielding units of kWh/m²·year.

3.3. Verification and Benchmarking

Because the dataset is simulation-derived and intended for early-stage decision support, verification was performed at three levels: (i) input/configuration verification, (ii) simulation execution verification, and (iii) extraction and plausibility benchmarking. These checks ensure that each record is internally consistent with the stated parameter ranges, modelling assumptions, and EnergyPlus reporting conventions used in the Rhino/Grasshopper–EnergyPlus workflow.

3.3.1. Input and Configuration Verification (Pre-Simulation)

All sampled variables were validated against the declared bounds and discrete options. Continuous parameters (geometry dimensions, orientation, façade WWR, glazing properties, and heating/cooling setpoints) were checked to confirm they lie within the specified minimum–maximum ranges, and discrete parameters (wall/roof/slab-on-grade options) were checked to confirm that each case belongs to one of the 12 envelope combinations. This verification aligns with the adopted random/uniform sampling strategy and the stated scope of a hot–arid, code-aligned residential dataset.
In addition, geometry sanity checks were applied to avoid invalid or degenerate models (e.g., non-positive dimensions, malformed façade surfaces, or infeasible window definitions when applying WWR rules). These checks ensure that the parametric model produces EnergyPlus-compatible building definitions before launching batch simulation.
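The pre-simulation checks described above can be expressed as a small validation routine. This is a sketch under stated assumptions: the column names, bounds, and option labels are hypothetical, and the authors' actual checks are implemented inside the Grasshopper workflow.

```python
def validate_case(case, bounds, allowed):
    """Return a list of problems found in one sampled case (empty if valid)."""
    problems = []
    # Continuous parameters must lie inside their declared [min, max] range.
    for name, (low, high) in bounds.items():
        value = case.get(name)
        if value is None or not (low <= value <= high):
            problems.append(f"{name} outside [{low}, {high}]: {value!r}")
    # Discrete parameters must be one of the declared options.
    for name, options in allowed.items():
        if case.get(name) not in options:
            problems.append(f"{name} is not a declared option: {case.get(name)!r}")
    # Geometry sanity: dimensions must be strictly positive.
    for dim in ("length_m", "depth_m", "height_m"):
        if dim in case and case[dim] is not None and case[dim] <= 0:
            problems.append(f"{dim} must be positive")
    return problems


# Hypothetical bounds and options, for illustration only.
demo_bounds = {"length_m": (8.0, 20.0), "depth_m": (8.0, 20.0),
               "height_m": (2.8, 3.6), "wwr": (0.10, 0.60)}
demo_allowed = {"wall": {"Wall Type 1", "Wall Type 2", "Wall Type 3"}}

good = {"length_m": 12.0, "depth_m": 10.0, "height_m": 3.0,
        "wwr": 0.3, "wall": "Wall Type 2"}
bad = {"length_m": -1.0, "depth_m": 10.0, "height_m": 3.0,
       "wwr": 0.9, "wall": "Wall Type 9"}

good_problems = validate_case(good, demo_bounds, demo_allowed)
bad_problems = validate_case(bad, demo_bounds, demo_allowed)
```

Returning a list of messages rather than a single boolean makes it easier to log why a case was rejected before the batch simulation is launched.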

3.3.2. Simulation Execution Verification (During and Post-Run)

The batch simulation process was monitored for completeness and consistency with the declared simulation settings. Each run used the same residential simulation type, annual run period, and timestep resolution, and produced the same set of annual end-use outputs (heating, cooling, lighting, and equipment). Run logs were inspected to identify failed simulations (e.g., severe errors) and to confirm that the weather file, constant operational assumptions (people density, lighting/equipment loads, COP values, infiltration, fresh air), and variable parameters were correctly applied.
Where warnings occurred, the warnings were reviewed to determine whether they were expected (standard EnergyPlus non-fatal warnings) or indicative of input definition problems that could bias results; cases judged to be invalid were excluded or flagged (depending on the chosen dataset-release policy).

3.3.3. Output Extraction Checks and Dataset Integrity (Post-Processing)

The dataset export pipeline was verified to ensure that the correct outputs were extracted and mapped to the correct input records. Output units and column consistency were checked, and missing values were screened. The resulting spreadsheets were tested for: (1) consistent column naming and data types, (2) one-to-one correspondence between input vectors and annual outputs, and (3) absence of duplicated records due to batch iteration logic. This aligns with the study’s stated approach: the dataset is exported in Excel format, and quality checks verify unit consistency and adherence to parameter bounds.
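The three spreadsheet tests listed above can be approximated with a short routine once the records are loaded into memory. The sketch below assumes each record is a dictionary keyed by column name. The column names are hypothetical, and the function is not part of the published workflow.

```python
def integrity_report(records, input_cols, output_cols):
    """Screen exported records for missing values and duplicated inputs."""
    all_cols = list(input_cols) + list(output_cols)
    # Rows with any missing column value.
    missing_rows = [i for i, rec in enumerate(records)
                    if any(rec.get(col) is None for col in all_cols)]
    # Rows whose full input vector repeats an earlier row; such repeats can
    # indicate that the batch iteration logic exported a case twice.
    first_seen = {}
    duplicate_rows = []
    for i, rec in enumerate(records):
        key = tuple(rec.get(col) for col in input_cols)
        if key in first_seen:
            duplicate_rows.append((first_seen[key], i))
        else:
            first_seen[key] = i
    return {"n_records": len(records),
            "missing_rows": missing_rows,
            "duplicate_rows": duplicate_rows}


demo_records = [
    {"length_m": 10.0, "wwr": 0.3, "cooling_kwh": 9000.0},
    {"length_m": 12.0, "wwr": 0.4, "cooling_kwh": None},     # missing output
    {"length_m": 10.0, "wwr": 0.3, "cooling_kwh": 9000.0},   # repeated inputs
]
report = integrity_report(demo_records, ["length_m", "wwr"], ["cooling_kwh"])
```

A clean export would return empty `missing_rows` and `duplicate_rows` lists and a record count equal to the number of simulations run.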

3.3.4. Plausibility Benchmarking Using Expected Physical Trends (Internal Benchmarking)

In addition to integrity checks, plausibility benchmarking was conducted by confirming that the dataset reproduces expected physical trends for hot–arid, cooling-dominated buildings. For example, cooling energy should respond strongly to changes in the cooling setpoint, while glazing solar properties (e.g., SHGC) and geometry should have a meaningful influence on annual energy outcomes. These expectations are consistent with the sensitivity analysis and the physical interpretation presented in this research (e.g., setpoints and geometry dominate variance in this climate context, while some envelope U-value effects are minor within code-compliant ranges).
This plausibility benchmarking does not replace empirical calibration; rather, it confirms that the dataset behaves consistently with the modelling assumptions and that directional effects are reasonable for downstream screening and surrogate modelling within the declared parameter ranges.

3.3.5. External Benchmarking

To provide an external benchmarking reference, an area-normalised Energy Use Intensity (EUI) was derived from the dataset outputs. In the released dataset, pEUI represents the annual predicted site energy total (kWh/year), equal to the sum of the annual end-use energies (heating + cooling + lighting + equipment). Because this value is not stored as an intensity, the conditioned plan area was computed from the sampled footprint geometry as follows:
$$A_i = \mathrm{Length}_i \times \mathrm{Depth}_i$$
The corresponding area-normalised intensity for each case $i$ was calculated as follows:
$$\mathrm{EUI}_i = \frac{\mathrm{pEUI}_i}{\mathrm{Length}_i \times \mathrm{Depth}_i}$$
Across the 12,000 parametric cases, the derived EUI spans 53.7–626.8 kWh/m²·year (median 122.3 kWh/m²·year; P5–P95: 65.5–242.8 kWh/m²·year). Published monitoring and literature evidence for hot–arid residential buildings in GCC/MENA contexts reports typical EUIs between 115 and 270 kWh/m²·year, and a four-year monitored residential case study reported an overall annual EUI of 181.8 kWh/m²·year under comparable climatic conditions [34]. The overlap between the derived EUI distribution and these reported values supports the physical plausibility and external consistency of the simulation outputs. While a small subset of high-EUI cases exceeds typical monitored ranges, these correspond to intentionally included extreme combinations of envelope/WWR/glazing and thermostat settings, retained to broaden the learning space for data-driven modelling.
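The derivation of the area-normalised EUI and its summary statistics can be reproduced with a few lines of Python. The sketch below assumes hypothetical column names for the four annual end uses and the footprint dimensions; it uses only the standard library and is not the authors' post-processing script.

```python
import statistics

END_USES = ("heating_kwh", "cooling_kwh", "lighting_kwh", "equipment_kwh")


def eui(record):
    """Area-normalised intensity: pEUI / (Length x Depth), in kWh/m²·year."""
    p_eui = sum(record[name] for name in END_USES)  # annual site total, kWh/year
    return p_eui / (record["length_m"] * record["depth_m"])


def summarise(values):
    """Minimum, P5, median, P95, and maximum of a list of EUI values."""
    # n=20 yields 19 cut points at 5% steps: index 0 is P5, index 18 is P95.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {"min": min(values), "p5": cuts[0],
            "median": statistics.median(values),
            "p95": cuts[18], "max": max(values)}


demo_record = {"heating_kwh": 1000.0, "cooling_kwh": 9000.0,
               "lighting_kwh": 1500.0, "equipment_kwh": 2500.0,
               "length_m": 10.0, "depth_m": 10.0}
demo_eui = eui(demo_record)
demo_summary = summarise(list(range(1, 102)))
```

Applied to all 12,000 records, `summarise` would give the same kind of range, median, and P5–P95 figures as reported above, although the exact percentile convention used by the authors is not stated.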

3.4. Data Significance

The dataset provides a structured set of 12,000 simulation-labelled cases that link explicit building-design inputs to annual energy outcomes under a consistent modelling boundary condition. Each record contains the sampled input variables (e.g., geometry and orientation, façade WWR, envelope and glazing properties, internal loads, and heating/cooling setpoints) paired with annual end-use outputs (heating, cooling, lighting, and equipment energy), exported in an organised spreadsheet format suitable for analysis and reuse.
This resource is valuable for researchers and practitioners who require consistent input/output pairs for data-driven workflows, including surrogate modelling, screening studies, investigations of feature importance/sensitivity, and benchmarking of early-stage parameter effects in a hot–arid residential context. Because the dataset was generated using fixed modelling conventions and clearly defined parameter ranges, it supports reproducible comparisons across configurations and facilitates transparent reporting of assumptions in downstream studies.
The dataset should be interpreted within its stated scope: results are most reliable within the provided parameter ranges and boundary conditions (climate file, residential operation, and fixed baseline assumptions for schedules/internal gains, where applicable). Users are therefore advised to avoid extrapolation beyond the sampled ranges and to re-parameterise operational profiles and system definitions when transferring the workflow to other climates or building typologies.

4. Methodological Contributions

The study’s methodological innovation lies in its systematic coupling of Grasshopper’s parametric capabilities with EnergyPlus’s simulation, enabling the generation of 12,000 distinct design configurations. This large-scale dataset captures the nuanced interplay between architectural parameters (e.g., building geometry, envelope properties, fenestration ratios) and energy outcomes, offering unprecedented granularity for predictive modelling. The workflow’s automation, facilitated by custom Python scripting to overcome EnergyPlus’s inherent simulation limits, exemplifies a scalable solution for high-throughput energy analysis. While the reliance on random sampling (as opposed to Latin Hypercube or Monte Carlo methods) may raise questions about design space coverage, the sheer volume of simulations ensures robust exploration of parameter interactions, a prerequisite for training accurate machine-learning models. The sensitivity analysis further validates this approach, revealing that cooling setpoints and building dimensions dominate the variance in energy consumption, while envelope U-values and orientation exhibit a marginal influence, a finding aligned with prior studies in hot climates [29,32].
  • Workflow Integration: The study bridges the gap between architectural design and energy performance analysis by creating a direct link between Rhino/Grasshopper and EnergyPlus. This feature allows designers to receive immediate performance feedback without switching between separate software platforms, addressing a well-documented barrier to energy-conscious design.
  • Scalable Simulation Framework: Through custom Python scripting and automation, the methodology overcomes computational limitations to execute 12,000 distinct simulations. This comprehensive dataset captures complex interactions between 18 key design parameters, providing unprecedented resolution for understanding how architectural decisions affect energy outcomes in hot climates.
  • Empirical Validation of Parameter Significance: The conducted sensitivity analysis provides quantitative evidence that cooling setpoints (sensitivity index: 0.112) and building dimensions (length: 0.081; depth: 0.085) dominate the variance in energy consumption in Egyptian residential buildings, while envelope properties exert a relatively minor influence. These findings provide a clear challenge to conventional assumptions about the importance of thermal mass in hot climates.

5. Study Discussion

The study advances the field of sustainable building design by introducing a robust parametric framework to generate clear, concise energy-consumption datasets for residential buildings in hot climates. By integrating parametric modelling and advanced energy simulation tools, this research sought to close the gap in design and associated decision-making. The study’s findings illustrate the transformative potential of data-driven approaches to bridge the gap between architectural design and energy performance analysis.

5.1. Sensitivity Analysis

A sensitivity analysis was conducted using the simulation results to evaluate the predictive capability and robustness of the proposed workflow. This analysis aimed to identify the most influential parameters affecting the energy consumption of residential buildings in Egypt, thereby quantifying the relative impact of each input variable on the model’s output. The sensitivity analysis was structured as follows. For each input parameter, values were systematically varied within ±1 standard deviation of the mean. All other variables were held constant at their respective means. The output energy consumption was then computed across multiple incremental steps above and below the mean. This procedure was repeated for every input to isolate its effect.
The results, as illustrated in Figure 6, revealed that energy consumption exhibited the highest sensitivity to the cooling set point (0.112). This underscores its critical role in energy demand. Building dimensions, specifically length (0.081), depth (0.085), and height (0.051), also demonstrated substantial influence. This was followed by the solar heat gain coefficient (SHGC) of glazing (0.054). In contrast, parameters such as wall type, roof type, and slab-on-grade construction (expressed as U-values) exhibited comparatively negligible effects. Window-to-wall ratio, building orientation, glazing U-value, visible transmittance (VT), and heating set point also exhibited comparatively negligible effects. Given that the residential sector accounts for a significant proportion of total energy consumption, these findings highlight the potential for meaningful energy savings. These savings can be achieved through targeted adjustments to cooling set points, glazing properties, and building geometry. This insight is particularly valuable for policymakers and designers seeking to optimise energy efficiency in Egypt’s built environment.
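The one-at-a-time procedure described in this section can be sketched as follows. The paper does not state how the sensitivity index is normalised, so the definition used here (output range across the ±1 standard deviation sweep, divided by the output at the mean point) is an assumption. The `toy_model` function and its coefficients are a hypothetical stand-in for whatever predictor maps inputs to annual energy.

```python
import statistics


def oat_sensitivity(model, samples, steps=5):
    """One-at-a-time sensitivity: sweep each input over mean ± 1 SD while
    holding every other input at its mean.

    The index is (max output - min output) / |output at the mean point|,
    which is an assumed normalisation, not the paper's stated definition.
    """
    names = list(samples[0].keys())
    means = {n: statistics.fmean(s[n] for s in samples) for n in names}
    sds = {n: statistics.pstdev([s[n] for s in samples]) for n in names}
    baseline = model(means)
    indices = {}
    for name in names:
        outputs = []
        for k in range(-steps, steps + 1):
            point = dict(means)  # all other inputs stay at their means
            point[name] = means[name] + (k / steps) * sds[name]
            outputs.append(model(point))
        indices[name] = (max(outputs) - min(outputs)) / abs(baseline)
    return indices


def toy_model(x):
    # Hypothetical linear response: energy falls as the cooling setpoint rises
    # and does not depend on visible transmittance.
    return 50000.0 - 2000.0 * (x["cooling_sp"] - 24.0)


demo_samples = [{"cooling_sp": 22.0 + i, "vt": 0.3 + 0.1 * i} for i in range(5)]
demo_indices = oat_sensitivity(toy_model, demo_samples)
```

In this toy case the setpoint receives a non-zero index and VT receives zero. Because only one input moves at a time, the procedure isolates main effects and does not capture interactions between inputs.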

5.2. Physical Mechanisms and Design Implications of the Results

The sensitivity results reflect the dominant physical drivers of residential energy demand in hot–arid climates. Cooling energy is strongly affected by cooling setpoints because small changes in the indoor temperature target shift the cooling load required to maintain comfort over long cooling seasons. Similarly, building dimensions (length, depth, and height) influence energy use by altering conditioned volume and envelope area, thereby affecting conductive heat transfer and the magnitude of solar-exposed surfaces. Fenestration-related parameters (e.g., WWR and glazing solar properties such as SHGC) influence cooling demand primarily by modifying solar heat gains; increasing WWR or SHGC increases transmitted solar radiation and can raise cooling loads.
Mechanistically, these rankings are consistent with a simplified zone heat-balance formulation. For annual cooling energy, a conceptual representation can be written schematically as follows:
$$E_{cool} \approx \frac{1}{COP_c} \int \left[ Q_{solar} + Q_{int} + UA\,(T_{out} - T_{set}) + \dot{m} c_p\,(T_{out} - T_{set}) \right]^{+} dt$$
where $Q_{solar} \propto SHGC \cdot A_{win} \cdot I_{solar}$, $UA$ scales with envelope/glazing conductance and exposed area, $\dot{m}$ captures ventilation/infiltration exchange, and $T_{set}$ is controlled by thermostat setpoints. In hot–arid conditions, $T_{set}$ directly shifts effective cooling degree-hours, while WWR and SHGC amplify solar gains through increased window area and transmittance. Geometry (length/depth/height) affects multiple terms simultaneously by scaling area and volume, which explains its consistently strong influence.
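A discrete-time version of the conceptual heat balance can make the role of each term concrete. The sketch below is illustrative only and is not how EnergyPlus computes loads. It assumes hourly time steps, loads in watts, and a constant cooling COP, and all numeric values are hypothetical.

```python
def annual_cooling_energy(hours, ua, m_dot_cp, t_set, cop):
    """Sum the positive part of the hourly zone load and divide by COP.

    hours     : iterable of dicts with t_out (°C), q_solar (W), q_int (W)
    ua        : envelope conductance UA in W/K
    m_dot_cp  : ventilation/infiltration heat capacity rate in W/K
    Returns electricity use in kWh, assuming one-hour time steps.
    """
    total_wh = 0.0
    for hour in hours:
        delta_t = hour["t_out"] - t_set
        load = hour["q_solar"] + hour["q_int"] + ua * delta_t + m_dot_cp * delta_t
        total_wh += max(load, 0.0)  # only positive loads require cooling
    return total_wh / cop / 1000.0


# Two hypothetical hours: a hot sunny hour and a cool night hour.
demo_hours = [
    {"t_out": 35.0, "q_solar": 500.0, "q_int": 500.0},
    {"t_out": 15.0, "q_solar": 0.0, "q_int": 100.0},
]
e_base = annual_cooling_energy(demo_hours, ua=150.0, m_dot_cp=50.0,
                               t_set=25.0, cop=3.0)
e_lower_setpoint = annual_cooling_energy(demo_hours, ua=150.0, m_dot_cp=50.0,
                                         t_set=23.0, cop=3.0)
```

Lowering the setpoint increases the temperature difference in both the conduction and ventilation terms at once, which is consistent with the strong setpoint sensitivity discussed in the text.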
In contrast, envelope U-values may show a more negligible relative influence within the tested ranges because cooling loads are often dominated by solar gains and setpoint-driven operation rather than conduction alone, particularly when construction options remain within code-compliant limits. From an early design perspective, these results suggest prioritising concept-stage decisions that reduce cooling loads, including selecting realistic comfort setpoints, managing massing and façade exposure, and optimising glazing/WWR to limit solar gains while maintaining daylighting needs.
To quantify these effects, dataset-derived marginal effects were estimated using a multivariate linear interpretability fit across the full 12,000 simulations (Table 10). The fit is used to report effect sizes (not as a proposed surrogate model). The resulting estimates confirm the mechanism-based interpretation: Cooling_SP has the largest magnitude effect, while SHGC and WWR produce strong positive energy penalties consistent with solar-gain dominance, and geometry produces large increases because it scales exposed area and conditioned volume. Interaction effects also appear as expected: when the mean WWR is high (top quartile), the mean annual label increases from 43,047 to 67,202 kWh/year when SHGC shifts from the low to high quartile (Δ ≈ 24,154 kWh/year), whereas under low-WWR designs, the corresponding increase is ≈12,454 kWh/year. This supports the practical implication that SHGC control becomes increasingly critical as glazing fraction increases. Visible transmittance (VT) shows negligible association with the annual energy label in this dataset, as expected, because lighting is represented by a fixed lighting power density (i.e., no daylight-responsive lighting control). Therefore, VT does not translate into lighting-energy reduction.
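The quartile-based interaction comparison reported above can be sketched as follows. The column names are hypothetical, a single `wwr` column stands in for the mean façade WWR, and the demonstration data are synthetic values generated from an assumed multiplicative WWR × SHGC term, not dataset results.

```python
import random
import statistics


def quartile_bounds(values):
    """Return the first and third quartile cut points."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3


def shgc_delta_by_wwr(records):
    """Mean label change from low-SHGC to high-SHGC quartile, computed
    separately for the high-WWR and low-WWR quartiles."""
    wwr_q1, wwr_q3 = quartile_bounds([r["wwr"] for r in records])
    shgc_q1, shgc_q3 = quartile_bounds([r["shgc"] for r in records])

    def delta(rows):
        low = [r["label_kwh"] for r in rows if r["shgc"] <= shgc_q1]
        high = [r["label_kwh"] for r in rows if r["shgc"] >= shgc_q3]
        return statistics.fmean(high) - statistics.fmean(low)

    high_wwr = [r for r in records if r["wwr"] >= wwr_q3]
    low_wwr = [r for r in records if r["wwr"] <= wwr_q1]
    return {"high_wwr": delta(high_wwr), "low_wwr": delta(low_wwr)}


# Synthetic demonstration data with an assumed WWR x SHGC interaction.
rng = random.Random(0)
demo = []
for _ in range(4000):
    wwr, shgc = rng.uniform(0.1, 0.6), rng.uniform(0.2, 0.8)
    demo.append({"wwr": wwr, "shgc": shgc,
                 "label_kwh": 20000.0 + 40000.0 * wwr * shgc})
deltas = shgc_delta_by_wwr(demo)
```

When the label contains a product of WWR and SHGC, the SHGC-driven increase is larger in the high-WWR group, which is the same qualitative pattern as the one reported for the dataset.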
The strong influence of cooling setpoints is consistent with occupant-behaviour evidence that thermostat preferences and user adjustments are among the strongest drivers of HVAC energy use. Accordingly, the dataset treats setpoints as a key early-stage operational parameter. An external EUI plausibility check against published hot–arid residential benchmarks is included in this study; more detailed calibration/benchmarking against measured Egyptian residential consumption remains future work to refine operational assumptions and quantify agreement with real-world behaviour.

5.3. Data Quality, Bias, and Uncertainty Considerations

The dataset was generated under controlled boundary conditions to ensure consistency across simulations. Quality checks were applied to verify unit consistency. These checks also confirmed that all sampled variables fall within the defined parameter bounds. Nevertheless, the dataset reflects modelling assumptions that may introduce bias and uncertainty when used for machine learning. Uniform random sampling can over-represent unlikely combinations relative to real practice. Fixed operational inputs, such as internal loads and schedules, also limit behavioural variability. In addition, the single-climate (New Cairo, Egypt) and residential-only scope constrain transferability. Therefore, machine-learning models trained on this dataset should be interpreted as predictors within the specified ranges and assumptions. They should not be treated as universal estimators. Robustness could be strengthened by incorporating probability-based sampling, stochastic occupant behaviour, additional climates, and empirical calibration.

5.4. Sampling Quality and Interaction-Bias Diagnostics

Given that the dataset is intended for downstream data-driven modelling, the sampling design was evaluated to determine (i) marginal balance across each input range, (ii) the extent of unintended dependence among inputs intended to be independently sampled, and (iii) global space coverage.

5.4.1. Marginal Balance

For continuous variables (geometry, orientation, façade WWRs, and glazing properties), 10-bin uniformity was quantified using two complementary indicators: $\Delta_{\max}$ (maximum absolute deviation from equal-frequency bins) and the coefficient of variation ($CV$) of bin counts. Table 11 indicates low imbalance for most variables (e.g., geometry and orientation $\Delta_{\max}$ ≈ 4.83–5.50% with $CV$ ≈ 0.027–0.031; WWR variables $\Delta_{\max}$ ≈ 3.75–4.92% with $CV$ ≈ 0.020–0.027). Glazing properties exhibit comparatively higher, yet still bounded, deviation (U-value $\Delta_{\max}$ = 11.75%, $CV$ = 0.072; SHGC $\Delta_{\max}$ = 10.42%, $CV$ = 0.055; VT $\Delta_{\max}$ = 11.83%, $CV$ = 0.047), which is consistent with stochastic clustering effects that can arise in random sampling and does not, by itself, indicate systematic bias. For discrete thermostat setpoints, level-balance checks (cooling: 11 levels; heating: 11 levels) show modest deviation (cooling $\Delta_{\max}$ = 5.6%, $CV$ = 0.031; heating $\Delta_{\max}$ = 3.42%, $CV$ = 0.024).
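The two balance indicators can be computed with a short routine. The exact normalisation of the maximum deviation is not spelled out in the text, so the sketch below assumes it is expressed as a percentage of the expected (equal-frequency) bin count, and the function name is hypothetical.

```python
import statistics


def marginal_balance(values, low, high, n_bins=10):
    """Return (delta_max_percent, cv) for equal-width bins over [low, high].

    delta_max is the largest absolute deviation of a bin count from the
    expected count, as a percentage of the expected count (assumed
    normalisation). cv is the population standard deviation of the bin
    counts divided by their mean.
    """
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        index = min(int((value - low) / width), n_bins - 1)  # clamp value == high
        counts[index] += 1
    expected = len(values) / n_bins
    delta_max = max(abs(count - expected) for count in counts) / expected * 100.0
    cv = statistics.pstdev(counts) / statistics.fmean(counts)
    return delta_max, cv


# A perfectly even grid gives zero imbalance on both indicators.
even_values = [i + 0.5 for i in range(100)]
even_delta, even_cv = marginal_balance(even_values, 0.0, 100.0)

# All values in a single bin gives the worst possible imbalance.
skew_delta, skew_cv = marginal_balance([5.0] * 50, 0.0, 100.0)
```

For a random uniform sample both indicators shrink as the sample size grows, so non-zero values of a few percent are expected from sampling noise alone.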

5.4.2. Interaction-Bias Screening

Potential interaction bias introduced by the workflow was assessed via pairwise Spearman rank correlations across input variables. The maximum absolute Spearman correlation per input (Table 11) remains low (max $|\rho|$ ≤ 0.027 across the listed variables), indicating that the generation pipeline does not introduce artificial coupling among the specified independent variables. The full pairwise dependence structure is visualised in Figure 7 as a Spearman correlation heatmap, enabling transparent inspection of correlation patterns.
Global space coverage and stratification. Beyond marginal and pairwise checks, global space coverage was characterised using the normalised nearest-neighbour distance in the continuous input space (median = 0.558; IQR = 0.088), supporting that samples are broadly distributed rather than concentrated in a small region of the design space. In addition, categorical envelope configurations were verified to be strictly stratified across the 12 Wall × Roof × SOG combinations (min = max = 1000 cases per level), ensuring that discrete construction options are not imbalanced. Collectively, these diagnostics address the risk of uneven sample distribution and unintended parameter dependence, supporting the dataset’s suitability for machine-learning applications within the defined bounds and modelling assumptions.

5.5. Theoretical and Practical Implications

Theoretically, this work challenges conventional energy modelling paradigms. It demonstrates how parametric workflows can democratise access to performance analytics during the conceptual design phase. By embedding simulation within the architect’s native Rhino/Grasshopper environment, the framework circumvents interoperability barriers. These barriers often relegate energy analysis to post-design validation. Practically, the dataset serves as a foundational resource for multiple stakeholders:
  • Architects: the framework enables real-time energy evaluation of design alternatives. It is particularly useful for massing, orientation, and fenestration decisions. The finding that the window-to-wall ratio affects cooling loads more significantly than wall insulation may shift design priorities.
  • Building code developers: the sensitivity results can be used to prioritise energy-efficiency measures. The strong influence of cooling setpoints suggests potential energy savings. These savings may be achieved through smart thermostat regulations or passive cooling strategies.
  • Machine-learning researchers: the dataset provides curated training data for surrogate models. This supports energy prediction in understudied regions where data scarcity is common.

5.6. Comparison with Optimisation-Driven and Conventional Simulation Workflows

This study aims to produce a dataset rather than to optimise a design. Therefore, the workflow is designed to generate a large, diverse set of building design configurations, along with their corresponding EnergyPlus simulation outputs, for downstream surrogate and machine-learning model training. To achieve broad coverage of the design space and reduce sampling bias across the declared parameter ranges, uniform random sampling was used for both continuous and discrete design variables. To verify that this stochastic design did not result in uneven coverage or unintended parameter dependence, marginal-balance, correlation-screening, and global space-coverage diagnostics are reported for the final 12,000-case dataset. These checks support the suitability of the generated dataset for subsequent data-driven applications within the specified bounds and modelling assumptions.
In contrast, optimisation approaches such as NSGA-II or BO-XGBoost-assisted frameworks typically evaluate candidate solutions iteratively and selectively, guided by explicit objective functions (e.g., energy use, thermal comfort, or daylight availability). Their primary aim is to identify Pareto-optimal solutions efficiently, rather than to construct a general-purpose dataset with wide variability. Similarly, conventional energy simulation studies often analyse a limited number of representative cases for performance comparison, rather than applying systematic large-scale sampling to construct datasets. Table 12 presents a conceptual comparison of the dataset-generation workflow, conventional simulation methods, and optimisation-based approaches.

5.7. Extension to Other Building Typologies

Although the current dataset targets residential buildings, the proposed parametric pipeline is not inherently residential-specific. For commercial and institutional buildings, adaptation mainly involves redefining typology-driven operational assumptions (schedules, internal gains, ventilation, zoning logic, and HVAC configuration), which are typically more diverse and schedule-intensive than those for residential use. Therefore, the exact sampling and simulation procedure can be retained while updating these inputs to generate typology-specific datasets.

5.8. Comparison with Existing Studies

The ranking observed in the sensitivity analysis is broadly consistent with prior evidence from hot-climate building-energy studies, where operational controls and solar gains tend to dominate cooling-dominated demand. In particular, the strong influence of cooling setpoints and the comparatively lower influence of envelope U-values within code-compliant ranges aligns with previous findings reported for hot climates [29,32].
The study’s results also reinforce the commonly reported importance of fenestration-related variables (WWR and glazing solar properties such as SHGC) in climates where incident solar radiation is a major driver of cooling loads. At the same time, this study stands out by quantifying these relationships using a large, systematically generated dataset (12,000 configurations) under a consistent boundary condition. In contrast, many earlier studies rely on case-based comparisons or optimisation-focused sampling that may not represent the broader early-design parameter space.
Open-access building-energy datasets differ substantially in measurement basis, variable definitions, and intended reuse. The present dataset is simulation-derived and climate-specific for the Cairo/New Cairo hot–arid boundary condition. It comprises 12,000 labelled residential cases generated through a Rhino/Grasshopper–EnergyPlus workflow, with explicit early-design inputs (geometry dimensions, orientation, façade WWR, glazing properties, and heating/cooling setpoints) and consistent annual end-use outputs (heating, cooling, lighting, and equipment).
In contrast, metered portfolio datasets such as Building Data Genome Project 2 (BDG2) provide hourly meter time series across many buildings and meter types, supporting prediction and anomaly-detection tasks but typically lacking the explicit parametric geometry/envelope inputs required for controlled early-stage design-space studies. Similarly, the ASHRAE Great Energy Predictor III dataset focuses on modelling metered building energy usage across multiple meter types and is distributed through Kaggle under competition rules, together with associated metadata and weather files. Physics-based stock-simulation datasets such as ResStock and ComStock (End-Use Load Profiles for the U.S. Building Stock) provide calibrated, validated 15-min end-use load profiles for U.S. building types and climate regions. However, they are structured to represent a national stock rather than a compact, conceptual-design parametric sweep under a single controlled boundary condition. Table 13 summarises these dataset-level differences in coverage, variable definitions, temporal resolution, and reuse considerations.

5.9. Modelling Assumptions, Simplifications, and Expected Systematic Effects

This study proposes a framework for generating building energy-consumption datasets through parametric simulation. The dataset was intentionally produced under a consistent set of modelling assumptions (scope-defining boundary conditions) to ensure reproducibility and to isolate the influence of early-stage design variables. These assumptions include: (i) the use of a single representative EPW weather file for the Cairo/New Cairo context; (ii) Egyptian thermal insulation standards and material specifications; (iii) a residential building typology with residential operational schedules; and (iv) fixed baseline values for internal loads and HVAC-related coefficients applied consistently across all runs. Under these controlled boundary conditions, the parametric variables (geometry, envelope and glazing properties, WWR, and setpoints) are systematically varied to generate a structured dataset of input–output pairs suitable for downstream predictive modelling and sensitivity analysis.
These controlled boundary conditions apply to all 12,000 simulations. Tables 7 and 8 summarise the constant simulation settings used throughout the batch execution, including people density, lighting and equipment power density, COP values, infiltration, and fresh-air rate. In the current dataset file (N = 12,000), Cooling_SP ranges from 18 to 28 °C and Heating_SP ranges from 8 to 12 °C.
Because EnergyPlus outputs reflect both physics and user-specified operational/system assumptions, several modelling choices can systematically shift absolute annual energy values. First, HVAC efficiency proxies (cooling/heating COP) directly scale reported heating and cooling energy consumption: for the same thermal loads, a higher COP reduces delivered energy, and a lower COP increases it. Second, fixed internal gains (occupancy, lighting, and equipment loads) systematically raise or lower cooling demand in a hot–arid climate by changing the internal heat balance; maintaining these at constant values across the dataset ensures label consistency but limits behavioural realism. Third, infiltration and ventilation assumptions influence sensible loads by altering outdoor air exchange; fixing infiltration and fresh-air values improves comparability across parametric design cases but may under- or over-estimate loads for buildings with different airtightness or ventilation strategies. Fourth, the geometry is parameterised at the thermal-zone scale, which supports conceptual massing exploration but abstracts from multi-zone distribution effects arising from internal partitions, room-level orientation, or zoning strategies. Finally, outputs are reported as annual end-use totals, which supports early-stage screening and ML surrogate modelling but does not capture intra-day or seasonal load-shape dynamics.
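The COP scaling mentioned above can be illustrated with a short, hedged sketch. It assumes that the reported heating and cooling energy equals the thermal load divided by a constant COP, which is a simplification and may not match every detail of the simulation workflow. The function name and the example label value are hypothetical.

```python
# COP values fixed across the dataset (Table 8).
DATASET_COOLING_COP = 3.0
DATASET_HEATING_COP = 4.0


def rescale_for_cop(energy_kwh, dataset_cop, new_cop):
    """Rescale an annual HVAC energy label to a different constant COP.

    Assumes energy = thermal load / COP with a constant COP, so the thermal
    load implied by the label is energy * dataset_cop.
    """
    if dataset_cop <= 0 or new_cop <= 0:
        raise ValueError("COP values must be positive")
    thermal_load_kwh = energy_kwh * dataset_cop
    return thermal_load_kwh / new_cop


# Hypothetical label: 9000 kWh/year of cooling energy simulated at COP 3.0,
# re-expressed for a system with an assumed COP of 4.0.
cooling_at_cop_4 = rescale_for_cop(9000.0, DATASET_COOLING_COP, 4.0)  # 6750.0
```

Under this assumption, a user whose target system differs from the dataset's efficiency proxies could rescale the heating and cooling labels before training, while lighting and equipment energy would be unaffected.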
To make these implications explicit for downstream users, Table 14 summarises each key assumption/simplification, the rationale for adopting it in a conceptual-design dataset, and the expected directional influence on the outputs. Unlike a generic “assumptions list,” Table 14 also reports the observed ranges and the category balance extracted from the released dataset, enabling readers to interpret the ML labels directly. Users training machine-learning models on this dataset should interpret predictions as valid within the stated boundary conditions and parameter ranges, rather than as universal estimators across different operational profiles, HVAC efficiencies, zoning strategies, or climates.

6. Study Limitations and Future Recommendations

6.1. Study Limitations

Despite the controlled generation process, several limitations affecting transferability should be acknowledged. First, the dataset reflects a single geographical and regulatory context (Cairo climate data and Egyptian standards), which may limit direct applicability in regions with different climates, construction practices, and codes. Moreover, Cairo’s hot–arid conditions may not reflect hot–humid climates (e.g., the coastal Middle East or tropical regions), where latent loads and moisture-related cooling demands can dominate, potentially altering the relative importance of certain design variables.
Second, the current dataset focuses on one building typology (residential). Applying the workflow to commercial, institutional, or mixed-use buildings may be non-trivial because typology-dependent assumptions such as occupancy density and schedules, lighting and equipment power densities, ventilation requirements, zoning strategy, and HVAC system type/control strongly influence load profiles and energy-use patterns.
Third, verification checks and plausibility benchmarking are provided to confirm internal consistency of the Grasshopper–EnergyPlus workflow and outputs. In addition, a minimal external plausibility check is included by comparing the derived area-normalised EUI values against published and monitored hot–arid residential benchmarks, showing that the simulated outputs fall within the reported ranges. However, full empirical calibration against measured Egyptian residential energy consumption is not conducted. More detailed empirical benchmarking using Egyptian utility-bill and/or monitoring datasets remains necessary to quantify agreement with observed operation and to refine behaviour-dependent operational assumptions (e.g., schedules, plug loads, and thermostat behaviour).
Fourth, and following from the previous point, future studies based on measured data should prioritise calibrating operational parameters (setpoints, schedules, and internal gains), because these behaviour-driven inputs can dominate cooling energy demand and strongly influence absolute annual consumption values.
Fifth, although marginal and pairwise diagnostics indicate limited unintended dependence among sampled inputs, uniform random sampling does not reflect how frequently parameter values occur in practice and may over-represent implausible parameter combinations relative to real housing stock; therefore, models trained on the dataset should be interpreted within the stated bounds rather than as population-level predictors.
Finally, generating 12,000 simulations, even with automation, may impose computational demands that challenge practitioners with limited computing resources, thereby reducing accessibility for small firms or institutions.

6.2. Future Recommendations

Several extensions would strengthen generalisability and practical applicability. Expanding the workflow to additional climates by repeating the same parametric process using multiple EPW files (e.g., hot–humid and cold/temperate climates) would enable cross-climate evaluation and broader relevance. Including additional Egyptian climate zones (e.g., coastal Alexandria and desert Aswan) and comparable international hot–arid/hot–humid contexts (e.g., Riyadh, Marrakech) would further test robustness.
The same dataset-generation workflow can be replicated for additional building typologies (e.g., office/administrative and mixed-use) by implementing typology-specific schedules, internal loads, zoning templates, ventilation requirements, and HVAC system definitions. Benchmarking and calibration using measured consumption records (utility bills and/or sensor-based monitoring) would bridge the gap between synthetic results and real-world operation and improve confidence in applicability.
To enhance realism and better align sample frequencies with practice (rather than purely uniform coverage), future studies may incorporate improved sampling strategies such as Latin Hypercube or stratified/adaptive sampling, as well as probability-based parameter distributions (e.g., realistic distributions for thermostat setpoints) rather than uniform randomisation.
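To make the Latin Hypercube suggestion concrete, the sketch below shows one simple way such a sampler could be written. It is an illustration only, not part of the published workflow: each variable's range is split into n equal-width strata, one value is drawn per stratum, and the strata are shuffled independently per variable. The variable names are hypothetical, and the ranges follow the declared WWR and cooling-setpoint bounds.

```python
import random


def latin_hypercube(n, bounds, seed=0):
    """Return n samples; each variable gets exactly one draw per equal-width stratum."""
    rng = random.Random(seed)
    columns = {}
    for name, (lo, hi) in bounds.items():
        # One point inside each stratum [i/n, (i+1)/n), then shuffle the
        # strata so they are paired randomly across variables.
        unit = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(unit)
        columns[name] = [lo + u * (hi - lo) for u in unit]
    return [{name: columns[name][i] for name in bounds} for i in range(n)]


# Hypothetical variable names; ranges follow the declared parameter bounds.
bounds = {"wwr_south": (0.0, 0.8), "cooling_sp_c": (18.0, 28.0)}
samples = latin_hypercube(10, bounds)
```

Compared with independent uniform draws, this guarantees even one-dimensional coverage for any sample size, which matters most when the simulation budget is small.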
Introducing stochastic occupant-behaviour models and plug-load variability could generate more realistic operational profiles, and integrating future weather projections would support climate-resilient design exploration. Finally, cloud-based execution or shared computing platforms could improve accessibility and facilitate broader adoption of large-scale parametric energy analysis.

7. Conclusions

This study makes three primary academic contributions to early-stage research on residential building energy in hot–arid Egypt. First, it formalises a reproducible parametric-to-simulation translation approach that connects conceptual design variables (massing, fenestration, glazing, envelope options, and thermostat setpoints) to standardised annual end-use energy outputs, supported by explicit verification and dataset integrity checks that improve scientific comparability and reuse. Second, it provides an open, labelled input–output dataset for the Cairo/New Cairo boundary condition (released via Zenodo), structured for design-space exploration, interpretability studies, and surrogate/ML model development, while clearly documenting the modelling scope and boundary conditions required for responsible downstream use.
Third and most importantly from a knowledge standpoint, it offers evidence-based prioritisation of early-stage variables in a cooling-dominated, hot–arid context: sensitivity results show that cooling setpoint and building dimensions are the dominant drivers of annual energy variance, followed by glazing solar gains (SHGC), whereas several envelope-category choices and orientation effects are comparatively minor within the tested, code-aligned ranges. These findings have direct implications for both design practice and research. For conceptual-stage decision-making in hot–arid housing, the results indicate that the largest, most reliable performance leverage comes from realistic operational targets (cooling setpoints) and massing decisions that scale exposed area and conditioned volume, complemented by solar-gain control through glazing selection and WWR management. For data-driven workflows, the dataset’s consistent schema enables transparent benchmarking of surrogate models and feature-importance methods, but it should be interpreted strictly within the stated parameter ranges and modelling assumptions.
Finally, the work’s scope also defines its limits: results reflect controlled boundary conditions (single climate file, residential operation, fixed internal gains, and system proxies). Therefore, absolute energy magnitudes and learned ML relationships should not be generalised without re-parameterisation and validation. Future extensions that would strengthen external validity include empirical benchmarking against measured Egyptian residential consumption, introducing stochastic occupant/plug-load variability, and regenerating comparable datasets for additional climates and building typologies using the same documented pipeline.

Author Contributions

Methodology, H.W. and M.T.E.; software, M.T.E.; validation, H.W. and M.E.; formal analysis, H.W. and M.E.; investigation, H.W., E.E., M.T.E., and M.E.; writing—original draft preparation, H.W. and M.T.E.; writing—review and editing, E.E. and M.E.; visualisation, H.W. and M.T.E.; supervision, E.E. and M.E.; project administration, M.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available at Zenodo under https://doi.org/10.5281/zenodo.13622940 (accessed on 22 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Awan, A.; Kocoglu, M.; Subhan, M.; Utepkaliyeva, K.; Hossain, M.E. Assessing energy efficiency in the built environment: A quantile regression analysis of CO2 emissions from buildings and manufacturing sector. Energy Build. 2025, 338, 115733.
  2. Kaloop, M.R.; Ahmad, F.; Samui, P.; Elbeltagi, E.; Hu, J.W.; Wefki, H. Predicting energy consumption of residential buildings using metaheuristic-optimised artificial neural network technique in early design stage. Build. Environ. 2025, 274, 112749.
  3. Elbeltagi, E.; Wefki, H. Predicting energy consumption for residential buildings using ANN through parametric modelling. Energy Rep. 2021, 7, 2534–2545.
  4. Pena, M.L.C.; Carballal, A.; Rodríguez-Fernández, N.; Santos, I.; Romero, J. Artificial intelligence applied to conceptual design. A review of its use in architecture. Autom. Constr. 2021, 124, 103550.
  5. Naboni, R.; Paoletti, I. Advanced Customization in Architectural Design and Construction; Springer International Publishing: Cham, Switzerland, 2015.
  6. Anton, I.; Tănase, D. Informed geometries. Parametric modelling and energy analysis in early stages of design. Energy Procedia 2016, 85, 9–16.
  7. Albaik, M.; Muhsen, R. Optimising Building Performance: A Grasshopper Modelling Case Study of the King Hussein Mosque. IEEE Access 2025, 13, 47244–47259.
  8. Sarkar, D. Application of Grasshopper Optimisation Algorithm for Design and Development of Net Zero Energy Residential Building in Ahmedabad, India. In Proceedings of the 2024 International Conference on Sustainable Energy: Energy Transition and Net-Zero Climate Future (ICUE), Ahmedabad, India, 21–23 October 2024; pp. 1–7.
  9. Bao, X.; Zhang, J. Multi-objective decision optimisation design for building energy-saving retrofitting design based on improved grasshopper optimisation algorithm. Int. J. Renew. Energy Dev. 2024, 13, 1058–1067.
  10. de Sousa Freitas, J.; Cronemberger, J.; Soares, R.M.; Amorim, C.N.D. Modelling and assessing BIPV envelopes using parametric Rhinoceros plugins, Grasshopper and Ladybug. Renew. Energy 2020, 160, 1468–1479.
  11. Elbeltagi, E.; Wefki, H.; Abdrabou, S.; Dawood, M.; Ramzy, A. Visualised strategy for predicting buildings’ energy consumption during early design stage using parametric analysis. J. Build. Eng. 2017, 13, 127–136.
  12. Peng, J.; Yang, Y.; Fu, X.; Hou, Y.; Ding, Y. Grasshopper platform-assisted design optimization of fujian rural earthen buildings considering low-carbon emissions reduction. Sci. Rep. 2024, 14, 18229.
  13. Gavaldà-Torrellas, O.; Monsalvete, P.; Ranjbar, S.; Eicker, U. The Urban Building Energy Retrofitting Tool: An Open-Source Framework to Help Foster Building Retrofitting Using a Life Cycle Costing Perspective: First Results for Montréal. Smart Cities 2025, 8, 17.
  14. Gaterell, M.R.; McEvoy, M.E. The impact of climate change uncertainties on the performance of energy efficiency measures applied to dwellings. Energy Build. 2005, 37, 982–995.
  15. Khan, H. Microclimatic architectural design by interfacing Grasshopper and Dynamo with Rhino and Revit. Meas. Sens. 2024, 33, 101143.
  16. Sadeghipour Roudsari, M.; Pak, M.; Viola, A. Ladybug: A parametric environmental plugin for Grasshopper to help designers create an environmentally-conscious design. In Proceedings of the Building Simulation 2013: 13th Conference of IBPSA, Chambery, France, 25–28 August 2013; pp. 3128–3135.
  17. Ramirez, J.P.D.; Nagarsheth, S.H.; Ramirez, C.E.D.; Henao, N.; Agbossou, K. Synthetic dataset generation of energy consumption for a residential apartment building in cold weather, considering the building’s ageing. Data Brief 2024, 54, 110445.
  18. Wu, C.; Pan, H.; Luo, Z.; Liu, C.; Huang, H. Multi-objective optimization of residential building energy consumption, daylighting, and thermal comfort based on BO-XGBoost-NSGA-II. Build. Environ. 2024, 254, 111386.
  19. Waqas, H.; Shang, J.; Munir, I.; Ullah, S.; Khan, R.; Tayyab, M.; Mousa, B.G.; Williams, S. Enhancement of the energy performance of an existing building using a parametric approach. J. Energy Eng. 2023, 149, 04022057.
  20. Alammar, A.; Jabi, W. Generation of a Large Synthetic Database of Office Tower’s Energy Demand Using Simulation and Machine Learning. In Proceedings of the International Symposium on Formal Methods in Architecture, Singapore, 25–27 May 2022; Springer Nature: Singapore, 2022; pp. 479–500.
  21. Peronato, G.; Kämpf, J.H.; Rey, E.; Andersen, M. Integrating urban energy simulation in a parametric environment: A Grasshopper interface for CitySim. In Proceedings of the PLEA 2017: 33rd PLEA International Conference on Passive and Low Energy Architecture, Edinburgh, UK, 2–5 July 2017; Available online: https://arodes.hes-so.ch/record/7711?v=pdf (accessed on 11 May 2025).
  22. Wang, X.; Teigland, R.; Hollberg, A. Identifying influential architectural design variables for early-stage building sustainability optimisation. Build. Environ. 2024, 252, 111295.
  23. Olu-Ajayi, R.; Alaka, H.; Sulaimon, I.; Sunmola, F.; Ajayi, S. Building energy consumption prediction for residential buildings using deep learning and other machine learning techniques. J. Build. Eng. 2022, 45, 103406.
  24. Mendes, V.F.; Cruz, A.S.; Gomes, A.P.; Mendes, J.C. A systematic review of methods for evaluating the thermal performance of buildings through energy simulations. Renew. Sustain. Energy Rev. 2024, 189, 113875.
  25. Cavieres, A.; Gentry, R.; Al-Haddad, T. Knowledge-based parametric tools for concrete masonry walls: Conceptual design and preliminary structural analysis. Autom. Constr. 2011, 20, 716–728.
  26. Lee, K.S.; Han, K.J.; Lee, J.W. Feasibility study on parametric optimization of daylighting in building shading design. Sustainability 2016, 8, 1220.
  27. Samuelson, H.; Claussnitzer, S.; Goyal, A.; Chen, Y.; Romo-Castillo, A. Parametric energy simulation in early design: High-rise residential buildings in urban contexts. Build. Environ. 2016, 101, 19–31.
  28. Wefki, H.; Elbeltagi, E.; Abdrabou, S.; Dawood, M.; Ramzy, A. Conceptual Design for Sustainable Buildings Considering Energy Consumption Using Simulation and ANN. Ph.D. Thesis, Mansoura University, Mansoura, Egypt, 2017.
  29. Attia, S.; Wanas, O. The Database of Egyptian Building Envelopes (DEBE): A Database for Building Energy Simulations. In Proceedings of the SimBuild Conference 2012: 5th Conference of IBPSA-USA, Madison, WI, USA, 1–3 August 2012; pp. 96–103.
  30. ESTIW. The Egyptian Specifications for Thermal Insulation Work Items; No. 176/1998; Ministry of Housing: Cairo, Egypt, 2017.
  31. Attia, S.; Gratia, E.; De Herde, A.; Hensen, J.L. Simulation-based decision support tool for early stages of zero-energy building design. Energy Build. 2012, 49, 2–15.
  32. Ihm, P.; Krarti, M. Design optimization of energy efficient residential buildings in Tunisia. Build. Environ. 2012, 58, 81–90.
  33. Assad, M.N. Towards Promoting Sustainable Construction in Egypt: A Life-Cycle Cost Approach. Master’s Thesis, The American University in Cairo, Cairo, Egypt, 2021. Available online: https://fount.aucegypt.edu/retro_etds/2445/ (accessed on 3 April 2025).
  34. Alajmi, A.F. Quantifying energy use intensity and peak demand in a hot-arid residential building: Insights from four years of high-resolution monitoring. Energy Rep. 2025, 14, 2204–2216.
Figure 1. High-level workflow for generating the energy-consumption dataset, with automation and dataset structuring introduced in this study [16].
Figure 2. Simplified Rhino/Grasshopper workflow for generating the EnergyPlus-labelled energy consumption dataset.
Figure 3. Gerilla “Material Maker” component in Grasshopper for defining EnergyPlus material properties.
Figure 4. Construction-assembly definition in Grasshopper using the Guerilla plugin.
Figure 5. Automated EnergyPlus simulation workflow in Grasshopper using the DIVA plugin.
Figure 6. The comparative influence of individual parameters.
Figure 7. Spearman correlation heatmap of sampled inputs showing minimal correlations.
Table 1. Dataset Specifications.
| Item | Description |
| --- | --- |
| Subject | Building Performance Analysis and Energy Engineering |
| Specific subject area | Energy consumption and efficiency in residential buildings |
| Data type | Synthetic dataset stored in .xlsx files. The dataset is simulation-derived to support early-stage design exploration, where measured consumption data and complete building metadata are typically unavailable. |
| How the data were acquired | A Rhino/Grasshopper workflow was used to generate 12,000 simulations, which were stored in XLSX files. The simulations cover different design parameters, such as building orientation, dimensions, materials used, and climate conditions. Data were generated for New Cairo City, Egypt (30.0363° N, 31.4758° E) and are available in the Zenodo repository. |
| Data format | Raw |
| Experimental factors | The simulations included diverse scenarios for building orientation, dimensions (width, depth, and height), material properties, and climatic conditions (indoor and outdoor). |
| Data source location | New Cairo City, Cairo, Egypt. Geographical coordinates: 30.0363° N, 31.4758° E |
| Dataset access | Repository name: Zenodo. Data identification number: 10.5281/zenodo.13622940. Direct URL to data: https://doi.org/10.5281/zenodo.13622940 (accessed on 22 February 2026). Instructions for accessing these data: none |
| Value of data | Useful in analysing the effect of different design factors on residential building energy use. Beneficial for designers, architects, engineers, and researchers in the development of energy optimisation. Supports the creation of energy optimisation and performance assessment models. It can be used for training deep-learning models and predicting future energy consumption patterns. |
Table 2. Parameter description.
| Parameter | Description |
| --- | --- |
| Wall Type | Different wall types |
| Roof Type | Different roof types |
| Slab-on-Grade (S.O.G) Type | Different S.O.G types |
| Building Length | Building dimensions, different lengths (m) |
| Building Width | Building dimensions, different widths (m) |
| Building Height | Building dimensions, different heights (m) |
| Building Orientation | Building orientations from the North in degrees |
| South Window-to-Wall Ratio (WWR) | Window-to-Wall ratio in (%) for South façade |
| East Window-to-Wall Ratio (WWR) | Window-to-Wall ratio in (%) for East façade |
| North Window-to-Wall Ratio (WWR) | Window-to-Wall ratio in (%) for North façade |
| West Window-to-Wall Ratio (WWR) | Window-to-Wall ratio in (%) for West façade |
| Glass U-value ¹ | Thermal conductance W/(m²·K) |
| Glass SHGC ² | Solar Heat Gain Coefficient (SHGC) |
| Glass VT ³ | Visible transmittance (VT) in (%) |
| Heating Setpoint Temperature | Indoor heating comfort temperature (°C) |
| Cooling Setpoint Temperature | Indoor cooling comfort temperature (°C) |

¹ Thermal conductance (W/(m²·K)) is a measure of how easily heat flows through a layer or assembly per unit area for each 1 Kelvin (or 1 °C) temperature difference across it.
² Glass SHGC (Solar Heat Gain Coefficient) is a dimensionless number (0 to 1) that indicates how much of the sun’s heat passes through a glazing system into the building.
³ Glass VT (Visible Transmittance) is a dimensionless value (0 to 1) that describes the fraction of visible light that passes through a glazing system.
Table 3. Classification of Model Variables and Data Inputs [28].
| Model Parameter | Input Information |
| --- | --- |
| Weather Data | Location, Latitude, and Longitude, and Temperatures |
| Building Geometry | Building shape, Building orientation, Principal building function, Total floor area, and Floor-to-floor height. |
| Envelope | Window-to-wall ratio, Glass (SHGC, U-value, VT), Wall, Roof, Slab on Grade, Thermal zoning, and Infiltration assumptions. |
| Internal Loads | Anticipated building occupancy, Lighting power density, and Plug-load density. |
| HVAC Equipment | Systems type (heating and cooling), distribution type, capacity, efficiency, and schedules of operation and control. |
Table 4. The characteristics of materials [30].
| Item | Conductivity [W/m·K] | Density [kg/m³] | Specific Heat [J/kg °C] |
| --- | --- | --- | --- |
| Red Brick | 0.60 | 1790.00 | 840.00 |
| Cement Mortar | 1.00 | 1570.00 | 896.00 |
| Plaster | 0.16 | 600.00 | 1000.00 |
| Reinforced Concrete | 1.44 | 2460.00 | 1000.00 |
| Cement Tiles | 1.50 | 2100.00 | 1000.00 |
| Sand | 0.33 | 1520.00 | 800.00 |
| Bitumen Damp Insulation | 0.15 | 1055.00 | 1000.00 |

Thermal properties are reported as conductivity (W/m·K), density (kg/m³), and specific heat (J/kg °C) and are adopted from the Egyptian Specifications for Thermal Insulation Work Items. The air gap is modelled as a massless layer with thermal resistance of 0.15 m²·K/W.
Table 5. Continuous parameter values.
| Parameter | Possibility | Min. | Max. |
| --- | --- | --- | --- |
| Building dimension | Length | 10 m | 30 m |
| | Depth | 10 m | 30 m |
| | Height | 3 m | 15 m |
| Building orientation | | | 360° |
| Window-to-wall ratio | North | 0% | 80% |
| | South | 0% | 80% |
| | East | 0% | 80% |
| | West | 0% | 80% |
| Glazing type | U-value | 0 | 1.2 |
| | SHGC | 0 | 1 |
| | VT | 0 | 1 |
| Temperature set point | Cooling | 18 °C | 28 °C |
| | Heating | 8 °C | 12 °C |
Table 6. Discrete parameter values.
| Parameter | Possibility | Parameter Value |
| --- | --- | --- |
| Building envelope | Wall | Type 1, Type 2, Type 3 |
| | Roof | Type 1, Type 2 |
| | SOG | Type 1, Type 2 |
| Lighting Load ¹ | | 7.3 W/m² |
| Equipment Load ² | | 7.0 W/m² |

¹ Lighting load is the amount of electrical power used for lighting in a space, typically expressed as a power density.
² Equipment load is electrical power associated with plug-in and installed equipment in a space (e.g., computers, appliances, office devices), usually expressed as power density.
Table 7. Thermal simulation settings.
| Setting | Attribute |
| --- | --- |
| Weather File | EGY_Cairo.Intl.Airport.623660_ETMY.epw |
| Simulation type | Residential |
| Run Period | Annual |
| Time Steps per Hour | 6 |
| Outputs | Heating Energy Consumption (Annual); Cooling Energy Consumption (Annual); Lights Energy (Annual); Equipment Energy (Annual) |
Table 8. Constant parameters for thermal simulation analysis.
| Setting | Attribute |
| --- | --- |
| Number of People (people/m²) | 0.033 people/m² [31] |
| Lighting Load (W/m²) | 7.3 W/m² [31] |
| Equipment Load (W/m²) | 7.0 W/m² [31] |
| Cooling COP | 3.0 [32] |
| Heating COP | 4.0 [33] |
| Infiltration Rate | 0.7 L/s/m² [32] |
| Fresh Air | 20 m³/h/person [31] |
Table 9. Thermal simulations’ variable parameters.
Parameters
Building dimensions (Thermal zone)
Building orientation
WWR (South, North, West, East)
U-value for glass
Solar Heat Gain Coefficient (SHGC) for glass
Visible Transmittance (VT) for glass
Cooling Set Point Temperature
Heating Set Point Temperature
Table 10. Dataset-derived marginal effect sizes for annual energy label (pEUI) using an interpretability fit (N = 12,000).
| Parameter | Change | ΔpEUI (kWh/Year) | 95% CI (kWh/Year) | p-Value |
|---|---|---|---|---|
| Cooling setpoint (Cooling_SP) | +1 °C | −4922.42 | [−4988.33, −4856.51] | <1 × 10⁻¹⁶ |
| Cooling setpoint (Cooling_SP) | 18 → 28 °C (Δ10 °C) | −49,224.22 | [−49,883.34, −48,565.11] | <1 × 10⁻¹⁶ |
| Glazing SHGC | +0.10 | +2689.96 | [+2622.74, +2757.18] | <1 × 10⁻¹⁶ |
| Mean façade WWR | +0.10 | +2584.68 | [+2419.53, +2749.84] | 1.29 × 10⁻²⁰⁶ |
| Glazing U-value | +0.10 W/(m²·K) | +294.87 | [+236.19, +353.55] | 6.88 × 10⁻²³ |
| Building length | +1 m | +2055.35 | [+2021.74, +2088.96] | <1 × 10⁻¹⁶ |
| Building depth | +1 m | +2083.59 | [+2050.28, +2116.89] | <1 × 10⁻¹⁶ |
| Building height | +1 m | +2429.04 | [+2367.18, +2490.91] | <1 × 10⁻¹⁶ |
| S.O.G type (1 vs. 0) | switch 0 → 1 | +1159.66 | [+797.46, +1521.86] | 3.49 × 10⁻¹⁰ |
| Wall type (2 vs. 0) | switch 0 → 2 | −575.70 | [−964.14, −187.25] | 3.68 × 10⁻³ |
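The point estimates in Table 10 can be combined into a back-of-envelope estimate of how pEUI shifts when several inputs change at once. The sketch below assumes the effects are linear and additive, ignores interactions and the confidence intervals, and uses hypothetical key names. It is a reading aid for the table, not a substitute for a simulation or a trained surrogate.

```python
# Per-unit effects in kWh/year, derived from the Table 10 point estimates.
# Key names are illustrative; effects reported per +0.10 are rescaled to per +1.0.
EFFECT_PER_UNIT = {
    "cooling_sp_degC": -4922.42,
    "shgc": 2689.96 / 0.10,
    "mean_wwr": 2584.68 / 0.10,
    "glazing_u_value": 294.87 / 0.10,
    "length_m": 2055.35,
    "depth_m": 2083.59,
    "height_m": 2429.04,
}


def estimated_delta_peui(changes):
    """Sum the linear effects of the requested input changes (additivity is assumed)."""
    return sum(EFFECT_PER_UNIT[name] * delta for name, delta in changes.items())


# Example: raise the cooling setpoint by 2 degC and lower SHGC by 0.20.
delta = estimated_delta_peui({"cooling_sp_degC": 2.0, "shgc": -0.20})
print(f"Estimated change in pEUI: {delta:,.2f} kWh/year")
```

The 18 → 28 °C row in the table is ten times the per-degree row (up to rounding), which is what a linear term implies.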
Table 11. Sampling coverage and independence diagnostics for continuous and discrete input variables.
| Input Variable | Sampling Type | Target Range/Levels | Coverage Metric | Δmax (%) | CV | Max. absolute Spearman ρ |
|---|---|---|---|---|---|---|
| Length | Uniform random | 10–30 | 10-bin uniformity | 5.08 | 0.027 | 0.017 |
| Depth | Uniform random | 10–30 | 10-bin uniformity | 5.5 | 0.031 | 0.018 |
| Height | Uniform random | 4–15 | 10-bin uniformity | 5.25 | 0.031 | 0.027 |
| Orientation | Uniform random | 0–360 | 10-bin uniformity | 4.83 | 0.028 | 0.016 |
| WWR (South) | Uniform random | 0.0000–0.7999 (≈0–80%) | 10-bin uniformity | 3.75 | 0.024 | 0.018 |
| WWR (East) | Uniform random | 0.0000–0.7999 (≈0–80%) | 10-bin uniformity | 4.92 | 0.027 | 0.012 |
| WWR (North) | Uniform random | 0.0000–0.8000 (≈0–80%) | 10-bin uniformity | 4.58 | 0.027 | 0.011 |
| WWR (West) | Uniform random | 0.0001–0.7999 (≈0–80%) | 10-bin uniformity | 3.75 | 0.02 | 0.019 |
| Glazing U-value | Uniform random | 0.01–1.2 | 10-bin uniformity | 11.75 | 0.072 | 0.016 |
| Glazing SHGC | Uniform random | 0.01–0.99 | 10-bin uniformity | 10.42 | 0.055 | 0.027 |
| Glazing VT | Uniform random | 0.01–0.99 | 10-bin uniformity | 11.83 | 0.047 | 0.019 |
| Cooling setpoint | Uniform random | 18–28 | Level balance (n = 11) | 5.6 | 0.031 | 0.013 |
| Heating setpoint | Uniform random | 8–12 | Level balance (n = 5) | 3.42 | 0.024 | 0.017 |
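The diagnostics in Table 11 can be approximated with a few lines of standard-library Python. The definitions used below are assumptions, since the exact formulas are not given in this section: Δmax is taken as the largest deviation of any bin count from the expected count, as a percentage of the expected count; CV is the population standard deviation of the bin counts divided by their mean; and Spearman ρ is computed as the Pearson correlation of average ranks. The synthetic inputs stand in for dataset columns.

```python
import random
import statistics


def bin_uniformity(values, low, high, n_bins=10):
    """Return (delta_max_percent, cv) for equal-width bin counts over [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        # Clamp so that value == high lands in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    expected = len(values) / n_bins
    delta_max = max(abs(count - expected) for count in counts) / expected * 100.0
    cv = statistics.pstdev(counts) / statistics.mean(counts)
    return delta_max, cv


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Synthetic stand-ins for two dataset columns (12,000 independent uniform draws).
rng = random.Random(0)
length = [rng.uniform(10.0, 30.0) for _ in range(12000)]
depth = [rng.uniform(10.0, 30.0) for _ in range(12000)]
delta_max, cv = bin_uniformity(length, 10.0, 30.0)
rho = spearman_rho(length, depth)
print(f"delta_max = {delta_max:.2f}%, CV = {cv:.3f}, Spearman rho = {rho:.3f}")
```

Under these assumed definitions, purely random sampling of 12,000 cases into 10 bins would give a CV of roughly sqrt(0.9/1200) ≈ 0.027 from sampling noise alone, which is a useful yardstick when reading the CV column.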
Table 12. Comparison of proposed workflow, conventional simulation, and optimisation approaches.
| Approach | Primary Goal | How Simulations Are Selected | Typical Output | Strength/Best Use-Case |
|---|---|---|---|---|
| Proposed workflow (this study) | Generate a large, structured dataset for ML training and analysis | Broad sampling of parameter space (e.g., uniform random sampling across ranges) | Dataset of inputs + EnergyPlus outputs across many configurations | Best when the goal is dataset availability and design-space coverage for predictive modelling |
| Conventional simulation (case-based) | Evaluate a small number of design alternatives | Manually defined scenarios; limited runs | Detailed results for a few cases | Best for project-specific analysis; limited suitability for ML training due to small sample size |
| NSGA-II (multi-objective optimisation) | Find Pareto-optimal designs under multiple objectives | Iterative evolutionary search based on objective evaluation | Pareto front/optimal candidate solutions | Best for optimisation and trade-off exploration, not primarily intended for producing general-purpose datasets |
| BO-XGBoost-assisted optimisation (surrogate + search) | Accelerate optimisation using surrogate models | Iterative sampling guided by Bayesian optimisation and surrogate learning | Pareto front and surrogate model | Best when simulation is expensive and the aim is faster convergence to good designs; the dataset is typically optimisation-focused rather than broadly representative |
Table 13. Dataset-to-dataset comparison with representative open-access building-energy datasets.
| Dataset | Dataset Type | Typical Scope | Geography/Climate Coverage | Temporal Resolution | Input Variables (Design/Metadata) | Output Variables |
|---|---|---|---|---|---|---|
| Present study. https://doi.org/10.5281/zenodo.13622940 (accessed on 22 February 2026) | Simulation-derived (EnergyPlus via Grasshopper) | 12,000 residential parametric cases | Single hot–arid boundary condition (Cairo/New Cairo EPW) | Annual end-use outputs | Explicit early-design variables (geometry, orientation, WWR, glazing properties, setpoints, and discrete envelope options) | Annual heating, cooling, lighting, and equipment |
| Building Data Genome Project 2 (BDG2). https://github.com/buds-lab/building-data-genome-project-2 (accessed on 26 January 2026) | Measured meter time series | 3053 meters from 1636 buildings | Portfolio-based (multi-building; not a controlled single climate boundary condition) | Hourly (2016–2017) | Building-level metadata; limited explicit parametric geometry/envelope variables compared with early-design sweeps | Multiple meter types (electricity, heating/cooling water, steam, etc.) |
| ASHRAE Great Energy Predictor III (GEPIII). https://www.kaggle.com/c/ashrae-energy-prediction (accessed on 26 January 2026) | Measured meter data for ML benchmarking | >1000 buildings; multiple meter types | Portfolio-based (multi-site); includes weather + building metadata | Hourly meter readings (multi-year) | Metadata + weather; not organised as a parametric early-design variable sweep | Metered usage for chilled water, electric, hot water, and steam |
| ResStock/End-Use Load Profiles (U.S.). https://resstock.nrel.gov/datasets (accessed on 28 January 2026) | Simulation-derived stock model (calibrated/validated) | U.S. residential building stock (portfolio/stock) | U.S. climate regions (multi-climate) | 15-min calibrated load profiles (EULP) | Stock/characteristic variables; not primarily conceptual massing variables under a single controlled boundary condition | End-use load profiles (time-series) |
| ComStock/End-Use Load Profiles (U.S.). https://comstock.nrel.gov/ and https://natlabrockies.github.io/ComStock.github.io/docs/data.html (accessed on 22 February 2026) | Simulation-derived commercial stock model | U.S. commercial building stock | U.S. climate regions (multi-climate) | 15-min calibrated load profiles (EULP) | Stock/typology descriptors; not structured as an early-stage parametric geometry/envelope sweep | End-use load profiles (time-series) |
Table 14. Key modelling assumptions/simplifications and their expected systematic influence on outputs.
| Assumption | Observed in Dataset (N = 12,000) | Systematic Influence/Interpretation |
|---|---|---|
| Dataset scale and balance (envelope stratification) | 12,000 cases; 12 envelope combos (3 walls × 2 roofs × 2 S.O.G); exactly 1000 cases per combo; Wall codes: 0/1/2 = 4000 each; Roof codes: 0/1 = 6000 each; S.O.G: 0/1 = 6000 each | Prevents training bias toward one construction category; supports fair ML benchmarking across envelope classes. |
| Conceptual massing geometry | Length 10–30 (mean 20.01); Depth 10–30 (mean 20.08); Height 4–15 (mean 9.47); Orientation 0–360 (361 discrete values) | Captures early-stage “massing-level” effects; outputs reflect conceptual abstraction (not room-level zoning). Therefore, labels should not be interpreted as multi-zone detailed design truth. |
| Façade WWR (four sides) | South ~0.00003–0.79988; East ~0.000004–0.79993; North (“Nourth”) ~0.000035–0.79996; West ~0.000058–0.79992 | WWR strongly drives solar gains → cooling (especially in hot–arid climates); interactions with orientation and SHGC are expected and are a “real signal” in ML training. |
| Glazing properties | UValue 0.01–1.20 (120 discrete levels); SHGC 0.01–0.99 (99 levels); VT 0.01–0.99 (99 levels) | These bounds define what the model can learn; outside these ranges, ML predictions become extrapolation. SHGC changes are expected to systematically shift cooling demand through solar-gain control. |
| Thermostat setpoints (as labels are conditioned on them) | Cooling_SP: 18–28 °C (mean 22.97); Heating_SP: 8–12 °C (mean 9.99) | Setpoints directly shift delivered energy totals. ML predictions are only valid under the setpoint ranges used. |
| Energy label definition (target variable) | pEUI min 6903.46, max 214,819.62, mean 51,612.50, median 45,336.51; P5–P95: 18,484.28–105,763.72 | Highlights label scale and outliers; supports sanity checks and helps future users choose normalisation/log transforms for ML. |
| Fixed boundary conditions (not stored as columns; constant across all runs) | Climate (single EPW), residential schedules, internal gains, infiltration/ventilation, and HVAC efficiency proxies are held constant. | These constants systematically shift absolute energy magnitudes; ML models should be interpreted as conditional on these fixed assumptions, not universal across other climates/schedules/HVAC efficiencies. |
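The envelope stratification reported in Table 14 (12 combinations with 1000 cases each) can be reproduced as a full-factorial plan. The sketch below is only an illustration of that balance, using the 0/1/2 and 0/1 codes quoted in the table. The variable names are hypothetical, and how the authors actually paired envelope combinations with the sampled continuous inputs is not specified here.

```python
from collections import Counter
from itertools import product

# Category codes as quoted in Table 14.
WALL_CODES = (0, 1, 2)
ROOF_CODES = (0, 1)
SOG_CODES = (0, 1)
CASES_PER_COMBO = 1000

# Full-factorial plan: every (wall, roof, S.O.G) combination repeated equally often.
envelope_plan = [
    combo
    for combo in product(WALL_CODES, ROOF_CODES, SOG_CODES)
    for _ in range(CASES_PER_COMBO)
]

# Marginal counts per category, which should be perfectly balanced.
wall_counts = Counter(wall for wall, _, _ in envelope_plan)
roof_counts = Counter(roof for _, roof, _ in envelope_plan)
sog_counts = Counter(sog for _, _, sog in envelope_plan)

print(len(envelope_plan), dict(wall_counts), dict(roof_counts), dict(sog_counts))
```

Running the same counts on the published envelope columns is a quick way to confirm that a downloaded copy of the dataset is complete.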
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Wefki, H.; Elbeltagi, E.; Elnabwy, M.T.; ElAgroudy, M. Synthetic Residential Building Energy-Consumption Dataset Generation Through Parametric Simulation for Hot–Arid Egypt. Buildings 2026, 16, 976. https://doi.org/10.3390/buildings16050976
