Finding Patterns of Construction Systems in Low-Income Housing for Thermal and Energy Performance Evaluation through Cluster Analysis



Introduction
In Brazil, the development of public policies aimed at low-income housing has contributed to reducing the housing deficit (which currently is approximately 5.9 million units, according to [1]), of which more than 87% correspond to families with monthly income lower than three minimum wages (i.e., lower than BRL 3960.00 per month). BRL stands for Brazilian Real; as of 6 July 2023, USD 1.00 equalled BRL 4.93. In the long term, this process will certainly impact the growth of the Brazilian construction industry, promoting more jobs and income, but also an increase in energy demand. Buildings represent a large share of energy consumption around the world: the building sector is estimated to account for around 30% of global energy consumption [2]. In Brazil, according to the National Energy Balance [3], the residential sector was responsible for 28.3% of Brazil's electricity consumption in 2021, second only to the industrial sector, which consumed 40.1%. Given this context, energy efficiency in low-income housing in Brazil represents an important research area.
Energy consumption in buildings depends on many factors. The combination of the materials that compose the structural systems and the envelope sealing is one of them. building with improvement measures by comparing them with a known, standardised envelope. The reference model should reflect the characteristics typically found in a given building typology and locality. In this way, the building with improvement measures (e.g., different materials or construction strategies) would present better thermal and energy performance than the reference model if it obtains better results in the considered performance indicators [24].
Many studies have been developed worldwide to determine reference models and evaluate their thermal and energy performance. Fumo et al. [25] applied reference models to simplify the process of estimating the hourly energy consumption of a building. They compared the hourly consumption of electricity and fossil fuel obtained from simulation with actual energy consumption from energy bills. They found an error of up to 10% on the hourly energy consumption estimation, proving it unnecessary to perform dynamic simulations of the building.
Theodoridou et al. [26] developed an analysis of the built stock in Greece, structured through the classification of existing data in national terms, categorising the inventory according to the age of the buildings. That allowed the authors to summarise a set of information related to the construction period, such as building typology, construction typology, and equipment related to energy performance. A sample of actual buildings was selected for an energy audit, thermal comfort, and air quality monitoring. Simulations considering user behaviour were conducted to evaluate energy consumption. The simulation results were compared with actual data on average annual energy consumption for heating. The authors affirm that the scenario mapped for the built stock of Greece represents a tool to identify the potential points of improvement of the energy performance of the buildings.
Attia et al. [27] developed reference models for the energy consumption of the residential sector in Egypt. A database on the profiles and patterns of energy use, equipment, construction, and size of a building sample was built. Two representative typologies of flats were developed based on the data on the intensities of internal loads and the energy end-use patterns. Results obtained from the simulation of the reference models were compared with the estimated monthly consumption averages. The comparison showed a good correlation between simulation data and survey data. The simulation of the reference models found a difference of about 2% for the total annual consumption.
Geraldi et al. [28] developed reference buildings based on information about school building stock shapes in Brazil, aiming to evaluate the impact of the shape on buildings' energy benchmarking. The reference buildings were simulated for different scenarios, and the simulation results were used to develop regression-based benchmarking models to reproduce the building stock performance. Results showed that the different shapes of reference buildings significantly impact regression-based benchmarking of buildings.
Palladino [29] defined reference models representative of the Italian residential building stock considering thermal properties and geometry. Parametric energy simulations of such reference models were performed to evaluate the theoretical deviation of the energy performance gap (i.e., the difference between the calculated and actual energy consumption). This approach allowed the quantification of the energy performance gap and could help develop national energy policies.
The studies presented above show that using reference models can be an efficient way to obtain indicators of thermal and energy performance on a large scale, in the sense that they simplify the studies in this area. However, it is important to note that although the composition of the model is based on an integrated set of characteristics that describe it, the methods for obtaining such models are still largely based on the application of univariate statistical techniques. Therefore, searching for methods based on multivariate statistics is necessary to determine the models. An appropriate response to this problem relies on cluster analysis.

Cluster Analysis to Determine Reference Models
Cluster analysis is an exploratory, non-theoretical, and non-inferential statistical technique encompassing various algorithms whose purpose is to divide a sample of entities into a smaller number of mutually exclusive groups based on their similarities [30]. There are three considerations to be taken to perform a cluster analysis, i.e., (1) the treatment of input data (variables), (2) the selection of the similarity measure, and (3) the selection of the partitioning technique.
Firstly, the variables selected should characterise the aggregated objects and relate specifically to clustering objectives (e.g., in an energy efficiency study, variables that impact the building performance). Furthermore, it is advisable to standardise the variables in order to obtain more homogeneous variances [30]. Before starting the clustering process, it is also important to check that there are no outliers in the sample. Secondly, the similarity (or dissimilarity) measure represents the differences or similarities between two entities through a mathematical value [31]. Finally, partitioning techniques are the procedures used to separate clusters based on the distance between objects (similarity measure). They are classified as hierarchical or non-hierarchical. In hierarchical techniques, only two objects or clusters are joined in each step, which produces a tree-like structure called a dendrogram (Figure 1) that represents the similarity level obtained with each union of two clusters. In non-hierarchical partitioning techniques, there is no tree formation, and the process occurs iteratively: objects are distributed into clusters simultaneously from seed points, and some objects can be relocated so that each one ends up in the cluster it most resembles [30][31][32][33][34].
In the dendrogram (Figure 1), each number would, in the case of this work, represent a house. The vertical axis presents the level of similarity obtained in each new union. The cut line, also called a stop rule, indicates when the partitioning process will be interrupted, determining the number and formation of the clusters.
But how do you obtain a reference model at the end of the whole process? According to [35], one can determine theoretical or real models from the results. The theoretical models are hypothetical models constructed from the average characteristics of each grouping (average of each variable). As a real model, one solution is to identify the case closest to the centre of the cluster (the nearest distance from the object to the centre of the cluster) and adopt it as a model. This latter process is more practical since there is no need to define characteristics beyond those used in the analysis.
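The difference between a theoretical and a real reference model can be illustrated with a minimal sketch. It assumes each object is a list of already standardised numeric variables and uses Euclidean distance to the cluster centre; the function names and the three example objects are hypothetical and are not data from this study.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(cluster):
    """Theoretical model: the average of each variable over the cluster."""
    n = len(cluster)
    return [sum(obj[j] for obj in cluster) / n for j in range(len(cluster[0]))]


def real_reference_model(cluster):
    """Real model: index of the object closest to the cluster centre."""
    centre = centroid(cluster)
    return min(range(len(cluster)), key=lambda i: dist(cluster[i], centre))


# Hypothetical cluster of three objects described by two variables.
houses = [[0.0, 0.0], [1.0, 1.0], [2.0, 5.0]]
theoretical = centroid(houses)             # averages of each variable
real_index = real_reference_model(houses)  # position of the closest object
```

In this illustration the theoretical model is a combination of values that no object in the cluster actually has, whereas the real model is an existing object, which is why no further characteristics need to be defined for it.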
Cluster analysis has been used to investigate the potential for energy savings in buildings and to classify buildings according to their energy efficiency, among other purposes. The most common objectives of these studies are to identify different groups in a population according to some specified criteria or to find a way to represent a large group through data simplification. Some applications of cluster analysis to determine reference models can be found in the literature.
Schaefer and Ghisi [19] determined two reference models for low-income housing in southern Brazil to carry out energy efficiency studies for this group of buildings. Data were collected on 100 dwellings over a year, referring to their geometry, such as internal and external dimensions, spatial distribution of environments, solar orientation, and information about openings such as operation mode, sizes, and shading. The collected data were submitted to cluster analysis, which aimed to find subgroups within the sample with high internal homogeneity and high heterogeneity between the groups. Hypothesis tests were also conducted to verify the statistical independence between the groups found. Two reference models were determined from the buildings closest to the centre of each group. Finally, each building from the sample and the models were submitted to computational simulation to verify that each model represented its group's thermal performance.
Wang et al. [36] used cluster analysis to determine the shape of a typical office building in Shanghai, China. Based on a field survey, data about the morphology of 181 buildings were collected. Three building characteristics were considered: single-floor area, storeys, and floor height. After the analysis, four clusters were determined. The geometric information of the most representative cluster (61 office buildings) was used to assess the effects of urban morphology factors on the energy performance of neighbourhood-scale building stocks.
A cluster analysis was carried out by Mitra et al. [37] to identify typical schedules followed by the population of the United States. American Time Use Survey data were obtained and split into groups based on age and weekday or weekend. Three main patterns were obtained from cluster analysis: day-work, stay-home, and night-work. Furthermore, different patterns based on age and weekday or weekend were observed. Thus, the results of this analysis can be used to estimate the users' occupancy schedule based on their age and characteristics, which significantly impacts the building energy performance.
Liu et al. [38] assessed the correlations between occupancy patterns and socio-demographic characteristics in Chinese households since occupancy patterns significantly impact building energy consumption. For that, cluster analysis was used to determine representative occupancy patterns based on household occupancy data collected through a field survey. Several representative occupancy patterns were observed; each cluster formed showed the probability of users being at home in specific environments and periods.

Method
The method applied in this research follows the method proposed by Schaefer and Ghisi [19]. It aims to answer these four main questions: (1) Which are the materials and construction systems used in low-income houses of Florianópolis? (2) Can we split the sample into clusters with similar features and determine a construction system reference model to represent each cluster? (3) Do the clusters differ based on the thermal performance of their houses? (4) Are the reference models found representative of their cluster?
To answer such questions, this study was developed in three steps: (1) obtaining data, (2) clustering data, and (3) checking the representativeness of clusters and models.
In the first step, data regarding the materials and construction systems which compose the dwellings' envelope were obtained for the sample. These data were summarised and submitted to cluster analysis in the second step. This analysis allowed us to split the sample of houses into clusters with similar features. A reference model was determined for each cluster. In the third and last step, all houses in the sample, including the reference models, were submitted to computer simulation, from which the degree-hour, a thermal performance indicator, was obtained for each house. That allowed us to verify if the clusters differ in their thermal performance and if the reference models represent their cluster.

Obtaining Data
Measurements and interviews were conducted through semi-structured questionnaires in low-income single-family houses in Florianópolis, Brazil. To be considered for the study, a house should have a family income equal to or less than three minimum monthly wages or be located in an urban area intended for low-income housing, as established by the guidelines of Brazil's national housing programmes. Information about the envelope of 106 low-income houses in Florianópolis was obtained. Although there is no generally accepted rule regarding the minimum sample size to perform cluster analysis [39], some authors recommend a sample size of at least ten times the number of variables [40]. This study considered ten variables (representing the composition of the wall, roof, floor, and frames) in cluster analysis, as explained in Section 4.1, so the sample of 106 houses satisfies this recommendation.
Data collected refer to the materials that compose the walls, roof, floor, and opening frames of each house. Although the most convenient and correct way to acquire this type of data would be from the executive design project of the building, such a design was not available for most of the houses. Thus, this information was obtained by interviewing the residents, who, in many cases, had participated in the construction process and could indicate the materials used. When the residents could not provide such information, the researcher visually inspected the envelope to determine its composition.
This inspection identified the combination of materials that make up the construction system. As this is social housing, built with low resources, in some cases there was no painting or plastering on all internal surfaces, so it was possible to identify the different layers of materials through one or more walls of the house. Even in buildings with plaster, the identification could be made considering the thickness of the walls, resistance to touching, or even observing some leftover material in the backyard of the building. There is usually a standard thickness for the different wall enclosure materials (e.g., ceramic brick or concrete block) and only slight variation in the available options. All other systems, such as the floor and roof, could also, in most cases, be identified using the same process. A concrete slab or wooden floor could be observed from the outside of the dwelling when it was not in contact with the ground. The researcher could also easily identify the type of interior finish (wood, ceramic, etc.). In the case of the roof, the existence of the ceiling and the material used could be identified by the researcher inside the building. The tiles, in turn, could be identified from the outside. The existence of a slab on the roof could be identified through a trapdoor or by observing roof beams. The frame materials were identified in the same way, either by the thickness of the frame or through visual observation.
This procedure was adopted to collect data on the layers of materials of the walls, floor, roof, and frames in each room, as described below.
• Walls: thickness and materials that compose the wall (plaster, mortar, ceramic brick or concrete block, wood, etc.);
• Floor: floor covering, structure, and whether or not it is in contact with the ground;
• Roof: type of tile, existence or not of a concrete slab, and ceiling material, when applicable;
• Frames: frame material.
Despite the data collected having a low level of detail, the reliability regarding the compositions of the construction systems was considered acceptable due to the little variability of the existing compositions for this object of study.

Cluster Analysis
In this study, the method proposed by Schaefer and Ghisi [19] was used to split the sample into clusters and to determine the reference models for each group. Firstly, data about the houses' envelope obtained in the field were normalised to Z-scores (Equation (1)) prior to cluster analysis, so that the results would not be affected by differences in the dispersion of the variables involved in the analysis. After the normalisation process, the existence of outliers was assessed using the Mahalanobis $D^2$ measure (Equation (2)). The $D^2$ limit value adopted in this study was 0.001, as suggested by Schaefer and Ghisi [19].

$$Z_{x_i} = \frac{x_i - \bar{x}}{s} \qquad (1)$$

where $Z_{x_i}$ is the standardised value of $x_i$, $\bar{x}$ is the average of the variable values, and $s$ is the standard deviation of the variable values.

$$D^2_{nm} = (x_n - x_m)^{T} \, C^{-1} \, (x_n - x_m) \qquad (2)$$

where $D^2_{nm}$ is the Mahalanobis measure, $C^{-1}$ is the inverse of the covariance matrix, $x_n$ is the vector with the value of "n" for each variable, $x_m$ is the vector with the value of "m" for each variable, "n" represents each object, and "m" represents the multivariate average.
No atypical values were found for the Mahalanobis $D^2$ test, so the 106 dwellings were submitted to cluster analysis.
Cluster analysis started with the hierarchical procedure, which was applied in order to identify the ideal number of clusters (groupings of houses) in the sample. The first step of the hierarchical procedure is defining a similarity measure. The Squared Euclidean distance was adopted for all pairs of objects. With this measure, it is possible to accumulate empirical evidence on similarity levels; it differs from the Euclidean distance by highlighting the differences between more distant objects [32]. Equation (3) was used to obtain the distances between the objects using the Squared Euclidean distance.

$$d_{AB} = \sum_{i} \left( x_{iA} - x_{iB} \right)^2 \qquad (3)$$

where $d_{AB}$ is the Squared Euclidean distance from A to B, $x_{iA}$ is the value of A for each variable, and $x_{iB}$ is the value of B for each variable. In this study, points A and B can be considered houses of the sample, as shown in Figures 2 and 3.
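As an illustration of the standardisation and of the similarity measure, the following minimal sketch computes Z-scores and the Squared Euclidean distance using only the Python standard library. It assumes the sample standard deviation (n − 1 in the denominator), which the text does not specify; the Mahalanobis check is omitted because it requires a matrix inversion. The function names and example values are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardise one variable: (x_i - average) / standard deviation.

    Assumption: stdev() is the sample standard deviation; a statistical
    package may use a different convention.
    """
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def squared_euclidean(a, b):
    """Sum over all variables of the squared differences between A and B."""
    return sum((xa - xb) ** 2 for xa, xb in zip(a, b))


# Hypothetical coded values of one variable for three houses.
standardised = z_scores([1, 2, 3])
# Hypothetical houses A and B described by two standardised variables.
d_ab = squared_euclidean([0.0, 0.0], [3.0, 4.0])
```

Because the differences are squared and not rooted, the pair in this example is at a distance of 25 rather than 5, which shows how the measure gives extra weight to objects that are far apart.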
The Ward Method was selected as the partitioning algorithm, i.e., the set of rules defining which pairs of clusters are combined in each step. At each step, this method selects the combination that minimises the increase in the sum of squares across all the variables in all the clusters [31]. Once the partitioning algorithm and the similarity matrix were defined, the hierarchical procedure was performed. In this procedure, the pairs of objects (dwellings) were combined according to the distance value obtained with the specified similarity measure and the partition rules of the selected algorithm. At each stage, two distinct clusters were combined to form a new cluster. The value of the distance at which each cluster was formed indicates their similarity level.
The formation of the clusters from the hierarchical method can be represented in a dendrogram, a stepped graph in the form of a tree where the levels of similarity obtained at each union are observed. In the early stages, this measure is small, and it grows as different clusters are combined. Figure 2 exemplifies the construction of a dendrogram.
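The merging rule of the hierarchical procedure can be sketched as follows. This is a naive illustration of Ward's criterion, assuming the standard expression for the increase in the within-cluster sum of squares when two clusters are joined (the product of their sizes divided by the sum of their sizes, times the squared distance between their centroids). It is not the routine of the statistical programme used in this study, and the function names and example points are hypothetical.

```python
def centroid(points):
    """Average of each variable over a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2


def ward_merges(points):
    """Merge the cheapest pair at each step; return (cost, clusters left)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        pairs = [(ward_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        cost, i, j = min(pairs)  # cheapest union at this stage
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((cost, len(clusters)))
    return history


# Hypothetical one-variable sample: two close objects and a distant one.
history = ward_merges([[0.0], [1.0], [10.0]])
```

In this example the first union is cheap and the last one is expensive, which is the jump that a dendrogram makes visible.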
The dendrogram helped to determine the preliminary clustering solutions, corresponding to the stage in which there was a high increase in the level of similarity in relation to the previous stages. In addition to the dendrogram, the percentage variations of the heterogeneity obtained in each union were also analysed to obtain a suitable number of clusters. A large increase in heterogeneity means that two considerably distinct clusters were joined. The number of clusters was therefore taken from the solution immediately before a step in which combining clusters produced a percentage increase in the agglomeration coefficient (a measure of heterogeneity for each new cluster, provided by the statistical programme) considerably larger than in the previous steps.
Figure 3. Floor plan drawing of the model used in computational simulations [19].
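The stop rule based on the agglomeration coefficient can be sketched as below. The coefficient values are hypothetical and only illustrate the calculation of the percentage increase between consecutive unions; the relation between step number and number of remaining clusters assumes that each step joins exactly two clusters, as in the hierarchical procedure described above.

```python
def pct_increases(coefficients):
    """Percentage increase of the agglomeration coefficient at each union,
    relative to the previous union."""
    return [100.0 * (b - a) / a
            for a, b in zip(coefficients, coefficients[1:])]


def clusters_after_step(n_objects, step):
    """Each hierarchical step joins two clusters, so one cluster fewer
    remains after every step."""
    return n_objects - step


# Hypothetical agglomeration coefficients for three consecutive unions.
increases = pct_increases([100.0, 110.0, 165.0])
# With 106 objects, stopping after step 103 or step 104 leaves:
three = clusters_after_step(106, 103)
two = clusters_after_step(106, 104)
```

A union whose percentage increase is much larger than the preceding ones signals that two dissimilar clusters were joined, so the solution before that union is a candidate for the number of clusters.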
After determining the optimal number of clusters, the non-hierarchical procedure was applied to obtain the final solution using the k-means algorithm. In this procedure, the objects (dwellings) are distributed simultaneously in clusters according to their proximity to the seed points (reference points where the clustering process starts). These points represent the centre of the cluster (centroid), which changes as new objects are grouped. As centroids keep changing, some objects initially assigned to one cluster may become closer to the centroid of another cluster during the process and therefore are re-assigned. The same procedure is repeated the number of times necessary until the convergence is reached (when no object is assigned to a new cluster due to the change of its centre).
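The non-hierarchical procedure can be sketched with a small k-means routine. This is a batch variant (all objects are assigned, then all centroids are recomputed), which may differ in detail from the implementation in the statistical programme used in this study; the seed points are passed in explicitly, and the function name and example data are hypothetical.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def k_means(points, seeds, max_iter=100):
    """Assign each point to its nearest centroid, recompute the centroids,
    and repeat until no point changes cluster."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # convergence: no re-assignment occurred
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical objects described by two standardised variables.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centroids = k_means(points, seeds=[[0.0, 0.0], [5.0, 5.0]])
```

The number of seed points fixes the number of clusters, which is why the hierarchical procedure is run first to suggest how many clusters to request.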
Based on this process, the clusters were formed, grouping dwellings with similar envelope materials. Because cluster analysis is an exploratory rather than inferential analysis, there is no reference "p value" against which the final result can be compared to verify if the clustering formation was suitable, since the statistical variable is defined by the data involved in the analysis itself [30]. Therefore, hypothesis tests (Chi-square test) were performed with all variables (representing the composition of the wall, roof, floor, and frames) to verify whether there were significant differences between clusters. A significance level of 0.05 was adopted, i.e., the clusters were considered to differ significantly in a variable when the value obtained using the test was less than 0.05. The reference models were taken as the objects (dwellings) with the smallest distance to the centroid of each cluster. They are therefore actual reference models (defined from the characteristics of actual dwellings) rather than theoretical reference models (resulting from the combination of the means obtained for each variable).
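The Chi-square test applied to one variable can be sketched as follows, for a contingency table whose rows are clusters and whose columns are categories of the variable. The counts are hypothetical, and the p value shortcut used here is valid only for one degree of freedom (a 2 × 2 table); for larger tables, a statistical package or a table of critical values would be needed.

```python
from math import erfc, sqrt


def chi_square(table):
    """Pearson Chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_one_dof(stat):
    """Upper-tail probability of the Chi-square distribution with exactly
    one degree of freedom (not valid for other degrees of freedom)."""
    return erfc(sqrt(stat / 2.0))


# Hypothetical counts: rows are two clusters, columns are two wall types.
stat, dof = chi_square([[20, 5], [5, 20]])
p = p_value_one_dof(stat)
differs = p < 0.05  # significance level adopted in the text
```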
At last, the construction systems that comprise the walls, floor, roof, and frame of each model were presented.

Thermal Performance through Computer Simulation
In the last step, to verify the suitability of the clusters and their reference models in terms of thermal performance, the thermal performance of all houses was assessed using EnergyPlus (version 8.1). The climate of Florianópolis (latitude 27° south, longitude 48° west) was considered. Virtual models were configured for each composition of the envelope materials found in the field surveys. The other parameters, such as geometry, operation, and internal gains, were fixed for all files so that the performance of the models reflected only the effects caused by the difference in the composition of the envelope materials. These parameters were adopted as proposed by Schaefer and Ghisi [19] and are presented below.
The geometry was configured using the OpenStudio plug-in within the SketchUp software (version 13.0.3689). It is a three-bedroom house with an independent living room and kitchen, a floor plan area of 76.00 m², and a front façade facing east (Figure 3). Figure 4 shows the rooms' occupancy rate relative to the maximum occupancy of each room. A maximum occupancy of two persons per bedroom was considered, with installed equipment power of 119 W in the living room, 1465 W in the kitchen, 33 W in the main bedroom, and 17 W in the other bedrooms. Figure 5 shows the electrical equipment use of the reference model as a rate of use of this installed power. Table 1 shows the artificial lighting use and the opening operation patterns, i.e., the periods when the artificial lighting system is on and the openings are open. Lighting power was taken as 30 W, 40 W, 60 W, and 20 W in the living room, kitchen, main bedroom, and other bedrooms, respectively.
Figure 5. Hourly electrical equipment use in long-term rooms [19].
From the operative temperatures obtained from the simulations, the heating and cooling degree-hours were calculated for all houses, including the reference model, for the living room, bedrooms, and the whole house (Equations (4) and (5), respectively). Additionally, a third indicator, representing the total degree-hours in the year, was obtained by summing the degree-hours of heating and cooling. In order to obtain the total degree-hours for the whole dwelling, a weighted average (Equation (6)) was calculated according to the area of each room. In the end, nine variables were obtained for each dwelling (cooling, heating, and total degree-hours for the living room, bedroom, and whole house).
where $GH_c$ is the cooling degree-hour (°Ch) for each long-term room, $GH_h$ is the heating degree-hour (°Ch) for each long-term room, $T_o$ is the operative temperature for each long-term room for each hour of the year (°C), $GH_{house}$ is the weighted average of degree-hours per floor plan area of each room (°Ch), $GH$ is the sum of cooling and heating degree-hours of each room (°Ch), $AU$ is the floor plan area of each room (m²), and $AU_{total}$ is the sum of the floor plan areas of all rooms (m²). With these definitions, the weighted average of Equation (6) takes the form

$$GH_{house} = \frac{\sum \left( GH \times AU \right)}{AU_{total}} \qquad (6)$$

Using the degree-hours obtained, two analyses were carried out to verify whether the clusters and their reference models were adequate. First, a hypothesis test was performed to verify whether the mean of each variable was the same or differed between clusters, i.e., whether the clusters had similar or different performances. The significance level was set at 0.05; thus, variables whose p value was lower than 0.05 were considered to differ between clusters.
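The degree-hour indicators can be sketched as follows. The base temperatures are not stated in this section, so they are left as parameters; the values of 26 °C for cooling and 19 °C for heating used in the example are assumptions for illustration only, as are the temperatures, areas, and function names.

```python
def cooling_degree_hours(temps, base):
    """Sum of the hourly excess of operative temperature above the base."""
    return sum(t - base for t in temps if t > base)


def heating_degree_hours(temps, base):
    """Sum of the hourly shortfall of operative temperature below the base."""
    return sum(base - t for t in temps if t < base)


def weighted_by_area(values, areas):
    """Area-weighted average of a per-room indicator."""
    return sum(v * a for v, a in zip(values, areas)) / sum(areas)


# Hypothetical hourly operative temperatures (degrees C) for one room.
temps = [17.0, 20.0, 28.0]
gh_c = cooling_degree_hours(temps, base=26.0)  # assumed cooling base
gh_h = heating_degree_hours(temps, base=19.0)  # assumed heating base
gh_total = gh_c + gh_h

# Hypothetical total degree-hours and floor areas (m2) of two rooms.
gh_house = weighted_by_area([4.0, 8.0], [10.0, 30.0])
```

In a full-year simulation the list of temperatures would hold 8760 hourly values per room, and the weighted average would be taken over all long-term rooms of the dwelling.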
The second analysis regards the sample distribution of each variable within each cluster, examined through a boxplot. A boxplot displays the sample data divided into four parts (quartiles), each containing 25% of sample cases. The horizontal lines indicate the separation of each quartile. The area between the first and third quartiles (the rectangle) contains 50% of the sample data. Above the third quartile line are 25% of cases, and below the first quartile line, the remaining 25%. The centre of the distribution is indicated by the median, which is the line dividing the rectangle into two parts, each containing 25% of the data. The position of the median within the rectangle indicates the asymmetry of the sample. The height of the rectangle indicates the dispersion of the sample; the taller it is, the more dispersed the sample is. Horizontal lines outside the rectangle represent the upper and lower limits. The cases that fall outside the range delimited by the upper and lower limits may represent outliers and should be analysed by the researcher to determine if they represent a particular case or should be withdrawn from the sample.
The performance of each model was compared to the performance of all houses of its cluster using a boxplot to check its suitability. For that, the reference model value should be located between the first and third quartiles of the distribution. If that criterion is satisfied, the model presents a central tendency and can represent its cluster. This step aims to verify the ability of the reference building to represent the sample and not to evaluate its thermal performance. Boxplots were also analysed to verify if the degree-hours of the reference model were close to the median.
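The representativeness criterion can be sketched with the standard library. Quartile conventions differ between statistical packages; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, which may not match the boxplots of this study. The function name and the degree-hour values are hypothetical.

```python
from statistics import quantiles


def is_representative(model_value, cluster_values):
    """True if the reference model lies between the first and third
    quartiles of its cluster (inside the box of the boxplot)."""
    q1, _median, q3 = quantiles(cluster_values, n=4, method="inclusive")
    return q1 <= model_value <= q3


# Hypothetical total degree-hours of the houses in one cluster.
cluster = [10.0, 20.0, 30.0, 40.0, 50.0]
inside = is_representative(35.0, cluster)   # within the box
outside = is_representative(45.0, cluster)  # above the third quartile
```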

Results and Discussion
This section shows the results of the three steps of the method proposed.

Data on Materials and Construction Systems
The compositions of the construction systems found in the field were considerably heterogeneous, requiring simplification. Therefore, the characteristics of the dwellings were summarised in ten variables, representing the composition of the wall, roof, floor, and frames. The wall composition was split into two variables, which described the composition of the walls in the dry areas (such as living room and bedrooms) and the composition of walls in wet areas (such as kitchen and bathroom). The walls were classified into three different types, based on the structural elements of the partitions: wooden walls, ceramic-brick walls, and concrete block walls (Figure 6). The floor was also divided into two variables, considering dry areas and wet areas. The compositions considered were wooden floors, concrete floors without ceramic coating, or concrete floors with ceramic coating. As for the floor, it was also considered whether or not this surface was in contact with the ground (Figure 7). The roofs were classified into three variables: tile type (fibre cement, ceramic, or none), the existence of a concrete slab, and ceiling (wood, PVC, gypsum, or none) (Figure 8). Finally, the frame material was divided into two variables: door frames (wood, aluminium, or PVC) and window frames (wood, aluminium, PVC, or mixed) (Figure 9).
As can be seen in Figures 6-9, most walls are composed of ceramic brick (75% for dry and 95% for wet areas). Ceramic and fibre cement tiles, the absence of concrete slabs (73% of cases, most of them with ceiling), and wooden ceilings were also predominant. As for floors, the composition of concrete slabs with ceramic coating predominates (72%). Wooden door and window frames are the most common, accounting for 97% of cases. Figure 10 shows the percentage of cases found for each roof composition. The most frequent was fibre cement tile, with no slab and wooden ceiling (25%), followed by ceramic tile, without slab and with wooden ceiling (19%).

Clusters
The hierarchical procedure yielded the dendrogram shown in Figure 11. A substantial increase in the level of similarity after joining two clusters was verified at two moments: in the joining of the two final clusters (similarity level of approximately 25), resulting in two clusters (clusters 1 and 2, with 76 and 30 dwellings, respectively), and in the previous step (similarity level of approximately 18), resulting in three clusters (clusters 1, 2 and 3, with 26, 50 and 30 dwellings, respectively).
Table 2 shows the variation of the heterogeneity percentages at each stage of the hierarchical process. It can be seen that there are two suitable solutions for this sample, as also shown in the dendrogram. The first is to stop the clustering at step 103, obtaining three clusters, due to the increase of 20.2% in the agglomeration coefficient compared to the previous union. The second is to stop at step 104 and obtain two clusters, with an increase of 24.2% in the agglomeration coefficient.
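The stopping rule based on the agglomeration coefficient can be illustrated with a short Python sketch. The coefficients below are hypothetical (Table 2 is not reproduced here), and the 15% threshold used to flag a large jump is an assumption made only for this example.

```python
def percent_increases(coefficients):
    """Percentage increase of the agglomeration coefficient between consecutive merges."""
    return [
        100.0 * (current - previous) / previous
        for previous, current in zip(coefficients, coefficients[1:])
    ]


def large_jumps(coefficients, threshold=15.0):
    """Return the positions (in the input list) of merges whose increase exceeds the threshold.

    Assumption: `threshold` is an arbitrary illustrative value; the study
    identifies the jumps by inspecting the table and the dendrogram.
    """
    increases = percent_increases(coefficients)
    return [i + 1 for i, inc in enumerate(increases) if inc > threshold]


# Hypothetical agglomeration coefficients for the last five merges.
coefficients = [100.0, 105.0, 110.0, 132.0, 165.0]

print([round(inc, 1) for inc in percent_increases(coefficients)])
print(large_jumps(coefficients))
```

Stopping the clustering just before a merge flagged by `large_jumps` keeps apart the clusters whose union would sharply increase heterogeneity, which is the reasoning behind the two candidate solutions.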
As the hierarchical procedure yielded two candidate solutions for the number of clusters, the non-hierarchical procedure (k-means) was performed for both alternatives. The final decision to adopt the two- or three-cluster solution was based on two criteria: statistical significance and practical significance. Table 3 shows the significance values obtained using the Chi-square test for each variable. For this comparison, the better solution is the one with the lowest significance values, i.e., those closest to zero. In both cases, the dry-area wall composition, dry and wet floor composition, ground contact, and concrete slab on the roof were significant for determining the clusters. The tile type and ceiling material were significant only for the three-cluster solution, while the window frame material was significant only for the two-cluster solution. The wet-area wall composition and the door frame material were not significant in either case. The three-cluster solution is statistically more significant than the two-cluster solution. In addition to having more significant variables, the variables for which the two-cluster solution is more significant are not as important as those of the three-cluster solution. Also, even though some variables are significant for both solutions, their significance is higher for the three-cluster solution (for example, the existence of a concrete slab). Thus, based on statistical significance, the three-cluster solution is more appropriate.
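To make the Chi-square comparison more concrete, the sketch below computes the Pearson Chi-square statistic for one categorical variable against cluster membership. It is a minimal illustration with hypothetical data and a hypothetical function name; the significance value would then be read from a Chi-square distribution with the returned degrees of freedom, a step that is left out here.

```python
from collections import Counter


def chi_square_statistic(categories, clusters):
    """Pearson Chi-square statistic and degrees of freedom for two categorical lists."""
    n = len(categories)
    observed = Counter(zip(categories, clusters))
    row_totals = Counter(categories)
    col_totals = Counter(clusters)

    statistic = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Expected count under independence of variable and cluster.
            expected = row_total * col_total / n
            statistic += (observed[(row, col)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical tile types for 20 dwellings split into two clusters.
tile_type = (
    ["ceramic"] * 8 + ["fibre cement"] * 2
    + ["ceramic"] * 2 + ["fibre cement"] * 8
)
cluster = [2] * 10 + [3] * 10

statistic, dof = chi_square_statistic(tile_type, cluster)
print(statistic, dof)
```

A larger statistic for the same degrees of freedom corresponds to a lower significance value, that is, to a variable that separates the clusters more clearly.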
Regarding the practical significance, the two-cluster profiles are presented in Table 4. The clusters of the two-cluster solution differed in dry-area wall composition, wet floor composition, ground contact, and tile type. Thus, cluster 1 would be described as a house with ceramic-brick walls and ceramic-coated concrete floor, both in dry and wet areas, a roof composed of ceramic tile and wooden ceiling without a concrete slab, and wooden frames. Cluster 2 would be a house with wooden walls and floor in dry areas and ceramic-brick walls and ceramic-coated concrete floor in wet areas, without ground contact. The roof would have fibre cement tiles and a wooden ceiling with no concrete slab. The opening frames would also be made of wood.
For the three-cluster solution, clusters 1 and 2 were more similar to each other than to cluster 3. As for the composition of the walls, clusters 1 and 2 presented ceramic-brick walls in the dry and wet areas, while cluster 3 had wooden walls in dry areas and ceramic-brick walls in wet areas. Ground contact was observed in clusters 1 and 2, but not in cluster 3. The roof composition was the only difference between clusters 1 and 2, the first consisting of a flat slab, without tile and ceiling, while the second was composed of ceramic tile and wooden ceiling, without concrete slab. In cluster 3, the roof would comprise fibre cement tile and a wooden ceiling without a concrete slab. The material of the frames is wood in all clusters. Cluster 3 was considerably similar to cluster 2 of the two-cluster solution, while clusters 1 and 2 appeared to merge, forming a single cluster. The differences observed between these two clusters (1 and 2) were only due to the roof, whose thermal properties differ significantly (for cluster 1: U = 3.73 W/(m²·K) and CT = 220 kJ/(m²·K); for cluster 2: U = 2.02 W/(m²·K) and CT = 26 kJ/(m²·K)). Thus, as the assessment of statistical significance also pointed to the three-cluster solution as more adequate, three clusters were adopted in this work.
Three dwellings were selected as the reference models of clusters 1, 2, and 3 because they were the closest to the centroid of their cluster (0.620, 0.908, and 1.125, respectively; see Figures 12-14).

The reference model of cluster 1 is a house with ceramic-brick walls, both in dry and wet areas. The walls were assumed to have 2.5 cm of mortar on both sides and external painting. The average wall absorptance found in the sample was adopted. It was not possible to obtain the roof tile absorptance, so a value from the literature was adopted. The wall thickness was the average obtained for the cluster. The floor was composed of a ceramic-coated concrete slab throughout the house, including dry and wet areas. The thickness adopted was 10 cm, based on what is usual, since this characteristic was not obtained in the surveys. The floor is in contact with the ground. The roof is composed of a concrete slab without tile and ceiling. Door and window frames are made of wood. Details on the glass were not used in the analysis because this information could not be collected in the field, and 3 mm thick single-pane clear glass was adopted in all windows.
The reference model of cluster 2 has the same characteristics as reference model 1 for walls, floor, and frames, differing only in the composition of the roof, which consists of ceramic tile and a wooden ceiling.
Cluster 3 is the one that differs most among the three clusters. The floor of dry areas is composed of wooden planks, while the floor of wet areas consists of ceramic-coated concrete slabs. Likewise, the walls of the dry areas are made of wood, while those of the wet areas are made of ceramic brick with mortar. The thickness of the wooden walls was 3.5 cm, corresponding to the average for the cluster. The absorptance was the same as the sample mean. The roof was composed of a wooden ceiling and fibre cement tile, whose absorptance was adopted as found in the literature. The frames have the same characteristics as in the other clusters.
The definition of the reference models based on their characteristics seems to fit well with what was observed in the surveys, highlighting the existing typologies. Based on the concepts of cluster analysis, it was concluded that the formation of clusters was suitable.
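The selection of a reference model by proximity to the cluster centroid, as used above, can be sketched as follows. This is an illustration only: the dwelling identifiers and coordinates are hypothetical, and Euclidean distance on numerically coded variables is an assumption, since the text reports the distances but not the metric.

```python
from math import dist


def centroid(points):
    """Mean coordinate of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def reference_dwelling(cluster):
    """Return (dwelling_id, distance) of the dwelling closest to the cluster centroid.

    `cluster` maps a dwelling identifier to its coded variables.
    Assumption: Euclidean distance on numeric codes.
    """
    centre = centroid(list(cluster.values()))
    return min(
        ((dwelling_id, dist(point, centre)) for dwelling_id, point in cluster.items()),
        key=lambda item: item[1],
    )


# Hypothetical cluster with three coded variables per dwelling.
cluster = {
    "house_a": (1.0, 0.0, 1.0),
    "house_b": (1.0, 1.0, 1.0),
    "house_c": (0.0, 1.0, 1.0),
    "house_d": (1.0, 1.0, 0.0),
}

dwelling_id, distance = reference_dwelling(cluster)
print(dwelling_id, round(distance, 3))
```

Choosing an actual dwelling rather than the centroid itself keeps the reference model physically buildable, since a centroid of coded categorical variables may not correspond to any real construction system.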

Suitability of the Clusters and Their Reference Models Based on Thermal Performance
The results of the comparison of means for the three clusters are presented in Table 5. The degree-hour means and standard errors of each cluster are presented in columns, and the variables in rows. As there are three clusters and the normality of the data could not be demonstrated, the Kruskal-Wallis test was applied to verify whether the means of the clusters differ. The cluster analysis formed clusters with significantly different means for all variables, since the p-value was less than 0.05 in all cases.
Clusters 1 and 2 are more similar to each other than to cluster 3. That is because their wall and floor construction systems are similar, differing mainly in their roof construction system. Yet, a slight difference in their performance due to the roof composition can be observed. The cooling degree-hour means for cluster 2 are lower than those obtained for cluster 1, while the heating degree-hour means are higher for cluster 2. That indicates that cluster 1, composed mainly of houses with concrete slabs and neither tile nor ceiling, performs better in the cold season.
In contrast, cluster 2, composed of houses with only ceramic tile and wood ceilings, performed better over the hot season. These results are related to their thermal properties, i.e., thermal transmittance and thermal capacity, which are higher for cluster 1 roof construction systems. The differences in their performance would be even higher if the absorptances were the same. The absorptance of cluster 1 is higher (68%) than that of cluster 2 (56%), contributing to heat absorption. Cluster 3 showed the worst performance, both for heating and cooling, as expected. It is made of wood (walls and floor) and has only fibre cement tiles on the roof with no concrete slab, which is known for promoting poor thermal performance.
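The hypothesis test applied above can be sketched in plain Python. The code computes the Kruskal-Wallis H statistic with the usual tie correction; the degree-hour values are hypothetical. For exactly three groups, the Chi-square approximation has two degrees of freedom, for which the upper-tail probability is exp(-H/2); this approximation is rough for samples as small as the ones in this example.

```python
from math import exp


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction.

    Note: the correction is undefined when all pooled values are equal.
    """
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign the average rank to each distinct value and collect tie sizes.
    ranks = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ties = j - i
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += ties ** 3 - ties
        i = j

    rank_sum_term = sum(
        sum(ranks[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction


def p_value_three_groups(h):
    """Chi-square upper-tail probability with 2 degrees of freedom (three groups only)."""
    return exp(-h / 2)


# Hypothetical cooling degree-hours for a few houses of each cluster.
cluster_1 = [310, 325, 340]
cluster_2 = [400, 415, 430]
cluster_3 = [600, 640, 700]

h = kruskal_wallis_h(cluster_1, cluster_2, cluster_3)
print(round(h, 2), p_value_three_groups(h) < 0.05)
```

Because the test works on ranks rather than raw values, it does not require normally distributed data, which is why it suits the situation described in the text.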
The results refer to a virtual building model with specific geometry and occupation patterns and might not be valid for another model. These findings aimed to obtain different building reference models based on the envelope materials.
Figure 15 shows the boxplot for all variables (in rows) and rooms (in columns) of each cluster and allows one to check the central tendency of a value in a sample. Each cluster is represented by a different dot colour (blue, orange, and green for clusters 1, 2, and 3, respectively). The big yellow dot in each boxplot represents the cluster reference model, and its position in the distribution is related to its sample representativeness. As expected from the hypothesis tests, for all variables of all rooms, the reference model degree-hour is close to the sample median. That indicates that it takes a central position in the sample, which supports the representativeness of the reference model. Additionally, the clusters differ in the dispersion of their data, which highlights the different behaviour of houses with different construction systems.

Also, cluster 3 is the cluster with the most dispersed sample. This means that the differences in the construction systems of its houses affect their thermal performance more than in the other clusters, and the effect is even greater for the bedrooms in the hot season. Few outliers were found: two belonging to cluster 2 and one to cluster 1. They differ from their sample mainly due to their tile type (in both clusters the outliers have fibre cement tile, whereas ceramic tile was usual in cluster 2 and most cases in cluster 1 had no tile).
Based on the results, it is considered that the reference models found in the cluster analysis are adequate to represent the sample. Thus, the envelope materials from these models can be used in future studies to evaluate the thermal performance of low-income housing in southern Brazil.

Conclusions
This paper aimed to find actual construction system reference models of low-income housing in Florianópolis, southern Brazil, through cluster analysis, to be used in future thermal and energy performance studies. For that, three steps were conducted in the proposed method.
First, a data acquisition process was developed, and information about the envelope of 106 low-income houses in Florianópolis was obtained. Obtaining data from this type of building proved very difficult, as many of the houses were built without a design project or city hall approval. The imprecision and high heterogeneity of the information hindered the creation of a pattern, making it necessary to simplify and even discard much of the collected data to proceed with this study. However, despite the simplifications, it was possible to identify patterns and improve the knowledge about the characteristics of these buildings.
Three clusters were found, each represented by its reference model. The main difference lies in their roof construction systems, which differ in all three clusters. The definition of the reference models based on their characteristics seems to fit well with the reality found, highlighting the existing typologies. Beyond that, the thermal properties of the construction systems of the reference models are quite different, as confirmed by the hypothesis tests, which can provide a good panorama of the thermal performance of that type of building in future studies.
From the computer simulation step, the internal operative temperatures of all long-term rooms were obtained for all houses of the sample. Degree-hours, a thermal performance indicator, were calculated from these output data, and hypothesis tests were performed to check whether the means differed significantly between clusters. A statistically significant difference between the clusters was confirmed for all variables. In addition, it was verified in the boxplot diagram that the degree-hours of the reference models were within the range of the first to the third quartile.
From these results, it is concluded that the clusters are well-represented by the reference models found. In general, cluster 1 performed better over winter, while cluster 2 performed better over summer than the other clusters (see degree-hour means in Table 5). Cluster 3 showed the worst performance, both for cooling and heating degree-hour. This was already expected since it was composed of wood and fibre cement tile houses, materials known for their poor thermal performance in buildings. It is highlighted that these findings are based only on the envelope materials since other parameters such as geometry, operation, and internal gains were fixed in all virtual models in the computer simulations.
This study contributes to the literature by proposing and validating a method to determine building reference models. In addition, representative reference models of low-income housing in a city in southern Brazil were developed, helping to form a database that can be used in future thermal and energy performance studies. Knowing the possible building systems of a given location helps in the development of new housing designs. The design project should reduce the adverse effects of climate, and its performance should be analysed by comparison with usual building systems. In addition, this knowledge assists in establishing public policies to better meet performance criteria. Therefore, reference models are helpful in analyses for improving the thermal performance of a building stock and not only of a specific building. This is particularly important for studies developed as a basis for modifying or developing standards.
Finally, it can be concluded that cluster analysis was a practical and objective method for obtaining reference models. On the one hand, because it is an exploratory rather than inferential data analysis, the researcher must carefully choose the variables that compose the database and be aware of their relation to the objectives of the study. On the other hand, precisely because cluster analysis is exploratory, it is essential to apply other statistical techniques, such as the hypothesis tests applied here, to ensure adequate results.