Statistical Methodology for the Definition of Standard Model for Energy Analysis of Residential Buildings in Korea

Abstract: This study proposes an optimal methodology for deriving a standard model from existing residential buildings. To improve existing residential buildings strategically, it is necessary to identify standard models that can serve as quantitative standards. Six methods were established by applying different algorithms in the dimensionality reduction and clustering stages. Data from a total of 22,342 households were analyzed, and a total of 26 variables were used to perform cluster analysis. The process of method 6 (data preprocessing, principal component analysis, clustering [K-medoids], verification) is proposed as a way to derive the standard model from existing Korean housing. The proposed method can derive a number of standard models while considering all variables (n) in a single analysis. Because the representative buildings derived in this study contain a large amount of building data, they can be used effectively for planning and research related to buildings on a regional and national scale. In addition, the process can be applied to various types of buildings to derive representative buildings.


Introduction
According to the Climate Change 2014 Synthesis Report, recent anthropogenic greenhouse gas emissions are the highest since observations began, and several extreme weather and climate events have been observed since 1950 [1]. Korea joined international efforts to respond to climate change and decided in 2015 to aim to "reduce 37% of greenhouse gas emissions forecast by 2030". Within this target, the building sector aims to reduce its emission forecast by 64.5 million tons by strengthening the energy standards of new buildings, improving the energy performance of existing buildings, improving facility efficiency, expanding the supply of new and renewable energy, building an energy information infrastructure, and other measures [2]. Accordingly, since September 2018 the government has revised and implemented a finer regional classification and stricter thermal transmittance (U-value, W/(m²·K)) requirements for buildings in each region in order to expand the distribution of energy-saving buildings, but these measures are limited to new buildings [3]. To achieve the GHG emission target set for the building sector for 2030, energy efficiency must be improved not only in new construction but also in existing buildings [4,5]. High-efficiency equipment in existing buildings is also important for energy efficiency, but the energy efficiency of the building itself must be improved first so as to minimize the energy demand (kWh/(m²·a)) of the building [6][7][8].
For the strategic improvement of existing buildings, a standard model that can be used as a quantitative standard must be prepared [9][10][11]. This is because an optimal improvement process established for the standard model can be applied efficiently to a large number of buildings [12][13][14].
In order to define a standard model, it is necessary to consider the various variables (features) that affect building energy. Cluster analysis (clustering) is a multivariate analysis method that groups individuals with similar characteristics when there is no external criterion for deciding which group each individual belongs to. The method forms distinct clusters of individuals with similar patterns, and in the process derives a representative point at the center of each cluster. A clustering result consists of two or more clusters with different characteristics, and the center of each cluster can be either a newly created virtual object or one of the existing objects [15].
Previous studies on standard models have used cluster analysis. Schaefer et al. [16] used cluster analysis to find the standard buildings of low-income housing, considering features related to the geometry of the building. Two standard buildings were derived by cluster analysis of 120 houses, and simulations showed that the results obtained by cluster analysis were significant. That paper showed cluster analysis to be a useful technique for obtaining reference buildings. However, the authors emphasize that great care is needed in choosing the variables for the analysis.
Tardioli et al. [17] presented a new methodology for identifying building groups and standard models in urban data sets. This methodology uses a combination of building classification, building clustering, and predictive modeling. The analysis was performed with Geneva's dataset and included building type, construction period, location, and geometric information. Sixty-seven representative buildings [18] were identified in about 13,600 buildings, and five normalizations and GIS linkages were performed. There are some limitations to the approach presented, the most important of which is that clustering requires a complete set of data. In this study, the problem of lack of data and completeness of the data set was partially overcome (achieved an average accuracy of 89.6%) by using a random forest predictive modeling method.
Li et al. [19] presented a methodology for developing residential representative buildings at the district level for the purpose of bottom-up energy modeling. A satellite image of China's Yuzhong district was used to create a 3D building information database for 575 residential buildings and to perform cluster analysis. Simulating the energy consumption of the two representative buildings and of the corresponding district gave a relative error of 1.55%. However, this result covers only the error between the simulated representative buildings and the simulated district; it does not account for the error between the simulation program and actual building energy consumption.
Kim et al. [20] developed a standard model for low-income housing to propose a remodeling optimization plan to improve energy efficiency [21]. The sample was drawn from 2571 low-income households by stratified sampling, applying the Neyman allocation method. The average values of the flat type (living room, kitchen, bathroom, two rooms), building orientation, floor area (44.5 m²), and window area ratio (three-way window) were set as the standard model. Compared with the annual energy consumption in the Energy Census Report, the standard model differed by 5.78%, and by 12.1% when compared with buildings of the same size.
Previous studies have derived standard models and have shown the importance of such standards when assessing the energy use of an individual building or a group of buildings. Cluster analysis [22][23][24][25][26][27][28][29] has been used as a tool for deriving standard models, and its usefulness has been demonstrated. However, these studies were limited by problems of data collection and incompleteness. Geometric characteristics were mainly considered in deriving the standard model, and a separate simulation was performed for verification. The verification method differs from study to study, but energy use was used as the indicator in each case.
In this study, cluster analysis was used as the main analysis technique. The nature of building data requires multivariate analysis, and a technique that finds representative points with different characteristics was judged appropriate for deriving the standard model. To address the incompleteness of data noted in previous studies, we tried to improve accuracy and reliability by varying the detailed methodology. In addition, the various building characteristics used in building energy analysis were considered in deriving the standard models, in order to overcome the limitations of existing research. This study therefore aims to propose an optimal methodology for deriving a standard model that reflects the various characteristics of existing residential buildings.

Methodology
As shown in Figure 1, different methods were used in the preprocessing and clustering steps, and a total of six methods were defined for the analysis. The analysis was conducted on existing housing in Korea that was improved under the energy efficiency improvement project in 2016-2018. The optimal method was identified by evaluating the finally derived standard models. Details of each step are covered in the subsections.

Data for Deriving a Standard Model
This study used part of the database collected through the "Energy Efficiency Improvement Project" from 2016 to 2018. Because the purpose of the study was to propose a methodology, and the verification stage was limited to existing homes that had been improved, the database used to derive the standard model consisted of data on housing improved under the "Energy Efficiency Improvement Project".
The data were collected based on ISO 52016-1:2017 (Energy performance of buildings-Energy needs for heating and cooling, internal temperatures, and sensible and latent heat loads-Part 1: Calculation procedures). A total of 26 items were used for the analysis: 8 categorical items describing the buildings and 18 numerical items related to building heat loss and gain (Table 1).
Table 1 (partially recovered; only the rows that survived extraction are listed, and the variable code of the first row is missing):
Total area of wall (m²)
X05: Averaged wall U-value (W/(m²·K))
X06: Total area of window (m²)
X07: Averaged window U-value (W/(m²·K))
X08: Total area of door (m²)
X09: Averaged door U-value (W/(m²·K))
X10: Total area of roof (m²)
X11: Averaged roof U-value (W/(m²·K))
X12: Total area of floor (m²)
X13: Averaged floor U-value (W/(m²·K))
X14: Solar heat gain (W)
X15: Averaged SC (-), where SC is the shading coefficient
X16: Averaged SHGC (-)
The measured data used to verify the accuracy of the methodology and of the standard model (simulated data) were collected by field measurement. For 50 of the target households (households that had implemented the Energy Efficiency Improvement Project) from which the standard model was derived, measurements were carried out so that actual data could be compiled for the same items as in Table 1. From December 2018 to February 2019, we visited the target households, installed the measurement equipment listed in Table 2, and measured data for one week.

Preprocessing
Before running the clustering algorithm, the data must be processed into a suitable form. The raw data may contain missing values and outliers, and incomplete data hinder good results. In addition, the calculation time increases with the number of objects (d) in the data, the number of variables (x), and the number of clusters (k). High-quality data are therefore needed so that clustering serves its purpose, and key variables should be selected where necessary.

Data Preprocessing
The clustering algorithm finds patterns based on the characteristics of the data. When the scales of the variables differ significantly, the result is dominated by the variable with the larger scale. Therefore, a standardization process is required so that all data enter the analysis on the same scale.
Since the clustering algorithm is sensitive to outliers, the z-score (Equation (1)) is applied in preprocessing to minimize their effect. The z-score does not rescale the data to exactly the same range, but it has the advantage of handling outliers well [16,17].
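The standardization step described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function name `zscore_columns`, the toy values, and the use of the population standard deviation are all assumptions, since the source does not state which estimator was used.

```python
import math

def zscore_columns(rows):
    """Standardize each column of a row-major table: (value - mean) / std."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    # Population standard deviation (assumption: the estimator is not stated
    # in the source). A constant column would give std = 0 and must be
    # dropped before calling this function.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_cols)]
    return [[(r[j] - means[j]) / stds[j] for j in range(n_cols)] for r in rows]

# Hypothetical toy data: column 0 is an area (m2), column 1 a U-value (W/(m2.K)).
toy = [[40.0, 0.5], [60.0, 1.0], [80.0, 1.5]]
standardized = zscore_columns(toy)
```

After standardization both toy columns carry the same values even though their raw scales differ by two orders of magnitude, which is the property the clustering step relies on.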
After standardization, the Mahalanobis distance [30] was used for outlier detection (Equation (2)). The Mahalanobis distance is a distance defined with respect to a probability distribution and is useful for detecting outliers in multivariate data.
$$D^2 = (x - \mu)^{T} C^{-1} (x - \mu) \qquad (2)$$
where $D^2$ is the Mahalanobis distance, $x$ is the vector of data, $\mu$ is the vector of mean values of the independent variables, $T$ denotes the transpose of the vector, and $C^{-1}$ is the inverse covariance matrix of the independent variables. Objects with outliers and missing values were removed to improve the accuracy of the clustering algorithm.
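As a hedged sketch of Mahalanobis-based outlier screening, the example below evaluates (x - mu)^T C^-1 (x - mu) for two variables only, so that the 2 x 2 inverse can be written out by hand. The cutoff is an assumption (the 97.5% chi-square quantile with 2 degrees of freedom); the source does not report the cutoff used in the study, and the names `mahalanobis_sq_2d` and `CUTOFF` are illustrative.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix C (n - 1 in the denominator).
    cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = cxx * cyy - cxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (x - mu)^T C^-1 (x - mu) with the 2x2 inverse expanded explicitly.
        distances.append((cyy * dx * dx - 2.0 * cxy * dx * dy + cxx * dy * dy) / det)
    return distances

# Hypothetical cutoff: for 2 degrees of freedom the chi-square quantile has the
# closed form -2 * ln(1 - p). The paper does not state the cutoff it used.
CUTOFF = -2.0 * math.log(1.0 - 0.975)

# Toy data: a 3 x 5 grid of ordinary points plus one extreme point at (50, 50).
toy = [(float(x), float(y)) for x in range(3) for y in range(5)] + [(50.0, 50.0)]
d2 = mahalanobis_sq_2d(toy)
outliers = [i for i, v in enumerate(d2) if v > CUTOFF]
```

Only the extreme point (index 15) exceeds the cutoff here. With 18 variables, as in the study, the same quantity would be computed with a general matrix inverse rather than the closed form used in this sketch.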

Dimensionality Reduction
As the number of dimensions in the data increases, the amount of data needed to represent the space grows exponentially (the curse of dimensionality), and storage space and processing time increase as well. In addition, if the variables are highly correlated, clustering performance deteriorates or the model becomes unstable [27,31,32]. Therefore, when variables are highly correlated, the correlation must be handled before clustering and the high-dimensional data reduced to a lower dimension. Methods for reducing the dimension of the data are broadly divided into variable selection and variable extraction.
This study considered correlation analysis, which is a method of selecting variables, and principal component analysis, which is a method of extracting variables. Correlation analysis is a method of removing only variables with a high correlation coefficient from existing variables and using only the remaining variables. Principal component analysis is a method of linearly combining existing variables and extracting them as mutually independent principal components.
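Variable selection by correlation analysis can be illustrated with the sketch below. It is a simplified greedy rule under stated assumptions, not the study's procedure: the threshold of 0.8, the column names, and the toy values are hypothetical, and the source only says that variables with larger correlation coefficients were removed.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def drop_correlated(columns, threshold=0.8):
    """Keep a variable only if |r| with every already-kept variable is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns; area_roof is nearly proportional to area_wall.
toy = {
    "area_wall": [10.0, 20.0, 30.0, 40.0, 50.0],
    "area_roof": [21.0, 39.0, 62.0, 80.0, 99.0],
    "u_window": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = drop_correlated(toy)
```

In this toy case the roof area is dropped because it duplicates the information in the wall area, while the weakly correlated window U-value is kept. Principal component analysis, by contrast, would combine all three columns into new components instead of discarding one.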
Equation (3) was applied to determine the number of dimensions to retain. In this study, the first n principal components for which the cumulative proportion of eigenvalues in Equation (3) reaches 0.8 or more were extracted (that is, the reduced data retain at least 80% of the explanatory power of the original data).
Clustering was performed on three datasets constructed according to the preprocessing applied. Data preprocessing was identical for all three: dataset 1 used no dimensionality reduction, dataset 2 used dimensionality reduction by correlation analysis, and dataset 3 used dimensionality reduction by principal component analysis.
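The 0.8 criterion can be checked against the explained-variance ratios that the Results section reports for PC1 to PC5. The sketch below is only an illustration of the selection rule; the function name `components_needed` is an assumption.

```python
def components_needed(explained_ratios, threshold=0.8):
    """Return (n, cumulative ratio) for the smallest n that reaches threshold."""
    cumulative = 0.0
    for n, ratio in enumerate(explained_ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return n, cumulative
    raise ValueError("threshold not reached with the supplied components")

# Explained-variance ratios of PC1..PC5 as reported in the Results section.
reported = [0.3023, 0.2113, 0.1139, 0.1055, 0.0833]
n_components, explained = components_needed(reported)
```

Four components reach only 73.30%, so the rule selects five components with a cumulative ratio of 81.63%, in agreement with the value stated for PC1 to PC5.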

Clustering
In this study, a non-hierarchical cluster analysis method was used because of the large scale of the data. Hierarchical clustering forms clusters by sequentially grouping objects with high similarity, without assumptions about the number or structure of clusters. However, once an object has been assigned to a cluster it cannot move to another cluster, so outliers are not removed. Additionally, as the size of the data increases, the resulting dendrogram (tree diagram) becomes very difficult to present and the calculation becomes much more demanding. Non-hierarchical clustering methods were developed to apply cluster analysis in such cases.
Non-hierarchical cluster analysis forms optimized clusters by examining the possible ways of partitioning the objects into k clusters. It can be applied to various types of data. Because its computational complexity is lower than that of hierarchical analysis, it can be used for large-scale data analysis. However, the number of clusters must be determined before the algorithm can be run [22,23]. The number of clusters k is determined by examining the sum of squared errors (SSE) within the clusters while sequentially increasing the number of clusters. The point at which the decrease in SSE levels off is taken as the number of clusters (elbow method).
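The elbow rule can be sketched as below. The within-cluster SSE follows the usual definition; the stopping rule (stop when a further cluster reduces the SSE by less than 10% of the single-cluster SSE) and the SSE curve are illustrative assumptions, not values or criteria taken from the study.

```python
def within_cluster_sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for members in clusters:
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total

def elbow_k(sse_by_k, min_relative_drop=0.1):
    """sse_by_k[i] is the SSE for k = i + 1 clusters; return the elbow k."""
    baseline = sse_by_k[0]
    for i in range(len(sse_by_k) - 1):
        drop = sse_by_k[i] - sse_by_k[i + 1]
        # Hypothetical rule: stop once one more cluster barely reduces the SSE.
        if drop < min_relative_drop * baseline:
            return i + 1
    return len(sse_by_k)

toy_sse = within_cluster_sse([[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 14.0)]])
# Illustrative SSE curve for k = 1..5 (made-up numbers, not the study's data).
curve = [1000.0, 400.0, 360.0, 340.0, 330.0]
best_k = elbow_k(curve)
```

For this made-up curve the SSE falls sharply from k = 1 to k = 2 and only slowly afterwards, so the rule returns k = 2. In practice the elbow is often read from a plot of SSE against k rather than from a fixed tolerance.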
Clustering was performed with two non-hierarchical algorithms: the k-means algorithm, which derives a virtual center point, and the k-medoids algorithm, which selects the center point from among the objects. The standard model derived by the k-means algorithm is a non-existent building located at the central point of all variables of all objects in the cluster. The standard model derived by the k-medoids algorithm is an existing building, namely the central object among the objects in the cluster [21].
Both algorithms yield the same set of performance information (variables) for the final standard model. With the k-medoids algorithm, the identification number of the central object is also known, so the performance information of that actual building can be retrieved.
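The difference between the two kinds of center can be shown for a single cluster. This sketch only computes the center of a given cluster, not the full iterative k-means or k-medoids procedure; the function names and toy coordinates are assumptions.

```python
import math

def centroid(members):
    """k-means style center: coordinate-wise mean, usually not an actual object."""
    dims = len(members[0])
    return tuple(sum(p[d] for p in members) / len(members) for d in range(dims))

def medoid(members):
    """k-medoids style center: the member with the smallest total distance to all members."""
    return min(members, key=lambda c: sum(math.dist(c, p) for p in members))

# Hypothetical cluster of five objects described by two standardized variables.
cluster = [(1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (3.0, 2.0), (9.0, 9.0)]
virtual_center = centroid(cluster)
real_center = medoid(cluster)
```

Here the centroid is (3.4, 3.0), which matches no object in the cluster, whereas the medoid is the existing object (2.0, 2.0). Because the medoid is a real object, its identifier can be used to look up every other attribute recorded for that building.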

Verification
Because cluster analysis is a form of case-based unsupervised learning, it is difficult to evaluate its accuracy numerically. To maximize the reliability, significance, and accuracy of this study, the RMSE (root mean square error) was used to analyze the error between the measured data and the methodologies and standard models [33]. RMSE is a commonly used measure of the difference between a predicted value and an actual observed value, and represents the overall uncertainty of the variable. RMSE is never negative, and lower values are better.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - M_i\right)^2}$$
where $S$ = simulated data, $M$ = measured data, and $N$ = number of variables.
The RMSE was calculated between the observed values (field measurement data, 50 households) and the predicted values (derived standard model, methodology); the lower the average RMSE, the better the accuracy. When there was no significant difference in the mean value (Kruskal-Wallis H test), the standard deviation was evaluated.
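A minimal sketch of this verification step is given below: RMSE is computed between one representative building (S) and each measured household (M) across the variables, and the per-household values are then summarized by mean and standard deviation. All numbers are hypothetical, and the Kruskal-Wallis H test is not implemented here.

```python
import math
import statistics

def rmse(simulated, measured):
    """Root mean square error between simulated (S) and measured (M) values."""
    n = len(measured)
    return math.sqrt(sum((s - m) ** 2 for s, m in zip(simulated, measured)) / n)

def summarize(rmse_values):
    """Mean and sample standard deviation of RMSE over the measured households."""
    return statistics.mean(rmse_values), statistics.stdev(rmse_values)

# Hypothetical standardized values: one representative building, two households.
representative = [0.5, -1.0, 0.0]
households = [[0.5, -1.0, 0.0], [1.5, 0.0, -1.0]]
per_household = [rmse(representative, h) for h in households]
mean_rmse, sd_rmse = summarize(per_household)
```

Under the evaluation rule described above, the method with the lower mean RMSE is preferred, and the standard deviation serves as the tie-breaker when the means do not differ significantly.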

Data Preparation and Description
In this study, statistically valid data from 22,342 households were collected and analyzed. Of the 26 variables collected, the categorical variables were excluded, and a single database was constructed from the 18 continuous variables that describe the performance of the building. After data standardization, 2443 objects containing outliers or missing values were removed, and the analysis was performed on the remaining 19,899 objects. Table 3 shows the descriptive statistics after data preprocessing. Figure 2 shows the SSE obtained while increasing the number of clusters up to k = 10 in order to determine the number of clusters (k).

In all three datasets, the SSE decreased rapidly up to two clusters and then declined only gradually (elbow point = 2).
The number of clusters was therefore set to two for the analysis, because the results changed little when there were more than three clusters.

Results of Clustering without Dimensionality Reduction
In Methods 1 and 2, dataset ① was formed after data preprocessing without a dimensionality reduction process, and clustering (A, B) was performed. RBs 1 and 2, derived by Method 1, showed above-average differences in the variables X01, X05, X11, X13, and Y01, with the most strongly opposed values in construction year and U-value. RBs 3 and 4, derived by Method 2, showed above-average differences in the variables X01, X05, X06, X11, X13, X14, and Y01, with the most strongly opposed values in construction year, U-value, and solar heat gain. The results are shown as representative buildings (RB) 1 to 4 in Table 4. In addition, RB 01 and RB 03 showed similar patterns, as did RB 02 and RB 04. RB 01 and RB 03 showed an average difference of 4.9%p, and RB 02 and RB 04 showed an average difference of 9.17%p.

Clustering Result after Dimension Reduction (Correlation Analysis)
Methods 3 and 4 used correlation analysis in the dimensionality reduction process to find overlapping variables and excluded variables with high correlation coefficients. The result was configured as dataset ② for clustering.
A correlation analysis of the 17 independent variables, excluding the dependent variable, showed the correlations listed in Table 5. In these methods, 6 variables (X03, X06, X10, X11, X12, X16) that had larger correlation coefficients with other variables were removed, and a dataset with a total of 12 variables was constructed. Applying the clustering algorithms (A, B) to dataset ② yielded representative buildings 5 to 8, as shown in Table 6. For K-means in Method 3, an algorithm that generates the center coordinates was applied, so values are omitted for some variables. Figure 4 shows the variables of the representative buildings derived by Methods 3 and 4. The pair RB 05, 06 and the pair RB 07, 08 have similar values for each variable, while the two buildings within each pair lie in opposite directions, indicating opposite patterns. In addition, RB 05 and RB 07 showed similar patterns, as did RB 06 and RB 08. RB 05 and RB 07 showed an average difference of 7.05%p, and RB 06 and RB 08 showed an average difference of 8.08%p.

Clustering Result after Dimension Reduction (Principal Component Analysis)
Methods 5 and 6 performed clustering on dataset ③, in which variables were extracted by principal component analysis in the dimensionality reduction process after data preprocessing. The results of the principal component analysis are shown in Figure 5. The first principal component (PC1) explains 30.2% of the original variables and PC2 explains 21.1%; up to PC5, 81.63% of the original variables can be explained, so they can be summarized in five independent variables.
PC1 through PC5 were named building envelope U-value (30.23%), solar heat gain (21.13%), heat loss (window) (11.39%), heat loss (door) (10.55%), and heating system efficiency (8.33%). Clustering (A, B) of dataset ③, in which variables were extracted by principal component analysis, yielded representative buildings 9-12, as shown in Table 7. In the case of Method 5, which creates an imaginary center point, values are omitted for the original variables used for principal component extraction. In both methods, the difference is clearly revealed in the variables PC1 and Y01, and the derived representative buildings have particularly opposite values in the building envelope U-value, the first principal component, which contains the most information on the original variables. Figure 6 shows the parameters of the representative buildings derived by Methods 5 and 6.
The pair RB 09, 10 and the pair RB 11, 12 have similar values for each variable, while the two buildings within each pair lie in opposite directions, indicating opposite patterns. In addition, RB 09 and RB 11 have similar patterns, as do RB 10 and RB 12. When only the dependent variable is compared, RB 09 and RB 11 differ by 5.05%p, and RB 10 and RB 12 differ by 22.37%p.

Verification; RMSE
In this section, RMSE (root mean square error) is used to analyze the difference between the values predicted by the methodologies proposed in this study and the values measured in the actual environment. When the observations were compared with the predicted values of each methodology, Method 6 was found to be the most accurate (Table 8).
The difference was first analyzed using RMSE between the 50 households in which field surveys were conducted (actual values, M) and the 8 representative buildings (predicted values, S). When analyzed by representative building, RB 08 of Method 4 was found to be the most accurate (Table 9).
Table 8. Results of root mean square error (RMSE), by methods.

Discussion
The cluster analysis process was varied in its detailed methods, and the analysis was performed with a total of six methods. The variables of the two representative buildings derived by each method show opposite patterns, which reflects the nature of clustering, in which the center points of the clusters are separated from each other as much as possible. This also suggests that the approach meets the purpose of this study, which is to define distinct representative buildings that reflect the performance patterns of the variables as fully as possible.
Among the methodologies, method 6, which performed dimensionality reduction by principal component analysis and applied the K-medoids algorithm, was found to be the best for deriving representative buildings. Among the derived representative buildings, RB 08, obtained with dimensionality reduction by correlation analysis and the K-medoids algorithm, performed best.
The methodology presented in this study yields several models in which various variables take opposite values. It is therefore considered appropriate that the derived models together, rather than any single model, represent the whole. In addition, in the RMSE results for each building in Table 8, the average for RB 12 differs only slightly from that of RB 08 (5.29%p), and in terms of standard deviation RB 11 is superior to RB 08 by 10.47%p. Therefore, method 6, which applies principal component analysis and the K-medoids algorithm, is proposed as the methodology for defining representative buildings among existing residential buildings, as shown in Figure 7.
In this study, two representative buildings were derived from about 20,000 existing residential buildings by applying the clustering technique, and their performance is shown in Table 10. RB 1 is older than RB 2, has a smaller area, and has a higher U-value. The two buildings also show opposite patterns in annual heating energy demand per unit area: about 275 kWh/(m²·a) for RB 1 and about 110 kWh/(m²·a) for RB 2. In addition, since this method uses the K-medoids algorithm, the unique identification number of the object is known, so all the qualitative building data of that building can be checked.

Conclusions
This study examined a methodology for deriving representative buildings that includes as much building information as possible and reflects the characteristics of the buildings. Previous studies demonstrated the usefulness of cluster analysis, but they were limited by incomplete data collection and mainly considered geometric characteristics in deriving the standard model.
In this paper, a methodology for deriving representative buildings based on the multivariate building data used for building energy analysis was proposed. A total of six methods were established by applying different algorithms in the dimensionality reduction and clustering stages. To verify the established methodology, data collected on existing domestic houses were analyzed, covering a total of 22,342 households and 26 building variables. Among the six methods, method 6, which consists of data preprocessing, principal component analysis, clustering (K-medoids), and verification, is presented as the method for deriving representative buildings from existing domestic houses, and two representative buildings of existing houses were derived with it.
The method proposed in this study can derive a number of standard models while considering all variables (n) in a single analysis. In other words, the representative building contains information on the n variables used for analysis and lies at the center of the n-dimensional space. Because the representative buildings derived in this study contain a large amount of building data, they can be used effectively for planning and research related to buildings on a regional and national scale. In addition, the process can be applied to various types of buildings to derive representative buildings. Depending on the data, the most suitable method should be identified by carrying out the process presented here, which requires understanding of and proficiency in the process. If the process is implemented as a program, it is expected to become more accessible. In follow-up work, the derived representative buildings will be used to study a standard improvement strategy for existing buildings.