An Embedding-Based Collaborative Filtering Housing Recommender System for Analyzing Housing Preference

Housing preference is the subjective and relative preference of users toward housing alternatives, and studies in the field have analyzed the housing preferences of groups sharing the same socio-demographic attributes. However, previous studies may not capture the preferences of individuals. In this regard, this study proposes "SeoulHouse2Vec," an embedding-based collaborative filtering housing recommendation system for analyzing the atypical and nonlinear housing preferences of individuals. The model maps users and items into separate dense vector spaces, which are called embedding layers. The model may reflect trade-offs between the alternatives and recommend unexpected housing items, and thus improve rational housing decision-making. The model expanded the search scope of housing alternatives to the entire city of Seoul by utilizing public big data and GIS data. The preferences derived from the results can be used by suppliers, individual investors, and policymakers. For architects in particular, the architectural planning and design process can reflect users' perspectives and preferences, and the results provide quantitative data for the housing decision-making process in urban planning and administrative units.
Because the survey respondents were not evenly distributed and the size of the dataset is limited, the scope of interpretation and application of the research scenarios applies only to the 215 respondents and the 679 housing profiles. Future studies must investigate a broader scope of model application and results analysis. This study created two research scenarios: housing preference analysis and housing decision-making. In the scenario presentation step, a brief theoretical review was conducted to create the scenarios. In the model application step, the acquired dataset and the built model were utilized. In the results interpretation step, the application results of the model were interpreted and visualized.


Introduction
Seoul is the capital of South Korea, with a population of approximately 9.7 million people [1]. Of the 605.24 square kilometers of land in Seoul, 53.8% is designated as residential areas. Furthermore, based on the household (Korea's unit of housing), about 2,866,000 houses have been supplied, a housing supply rate of 96.3% [2]. Housing affects occupants' health, wealth, and lifestyle as it provides the necessary built indoor environment in which they live for an extended period. In addition, housing-related costs including the purchase and lease of housing account for a significant portion of household spending [3]. From a social viewpoint, housing helps form and maintain relationships with occupants' families, friends, and communities, thus impacting their well-being [4][5][6][7].
Housing decision-making is a process in which users explore and evaluate housing choices by considering housing attributes including tenure, type, size, orientation, location, etc. [8,9]. Through this process, users form subjective preferences for certain housing alternatives [10][11][12]. Housing choice is expressed in actual behavior regarding the housing unit, and housing preference is the relative preference for housing alternatives [13,14]. Based on their economic and social contexts, users form housing expectations and preferences in the process of housing decision-making [15]. While housing choice is significantly affected by users' housing preferences, it may differ from these because of [...] dense vector space based on user-item rating information obtained through a survey. For this purpose, a housing preference survey was conducted, the results of which were used to build a dataset consisting of individual respondents' ratings of multiple housing profiles. Recommended housing alternatives are presented to users using geographic information system (GIS) and data visualization technology. While previous housing preference studies calculated the relative importance of variables affecting the housing preferences of groups with the same demographic and sociological attributes, the unit of analysis in this study was the individual, which is a smaller unit. In this way, the study presents one way to support users' search for housing alternatives and their housing decision-making process.

Research Materials, Methods, and Structure of the Paper
The paper is organized as follows. Section 2 presents theoretical considerations regarding housing preference and the embedding-based collaborative filtering recommendation system, and derives the housing preference variables that affect the housing decision-making process in Korea from an analysis of existing research. In Section 3, public data corresponding to these variables are used to create the housing profiles that survey respondents rated. In creating the profiles, "Seoul Metropolitan Government Housing Status (housing type, occupancy type, etc.) Statistics" [35], "Seoul Metropolitan Apartment Information" [36], and GIS location information data were used. Section 4 describes the user-item rating dataset acquired through the survey and builds the "SeoulHouse2Vec" recommendation system using Python 3.6, Google TensorFlow, and the Keras library. In Section 5, the dataset is split into training, validation, and evaluation sets, each of which is used in the corresponding step. The model is trained with a supervised-learning method and then validated, a step in which the model hyperparameters are tuned for better performance. The final performance is measured with precision, recall, and f1_score, which are computed from a confusion matrix, a commonly used tool in algorithm evaluation. The development environment is JetBrains PyCharm Community Edition 2019.1.2. The hardware environment is an "Intel i9-9900k" CPU, 16 GB RAM, and an "NVIDIA GeForce RTX 2060," running Windows 10.
In Section 6, a scenario-based demonstration of the built model is presented to illustrate a possible application of the model to analyzing housing preference and supporting housing decision-making. Figure 1 presents the research overview, which comprises three parts: dataset generation; "SeoulHouse2Vec" model building, training, and evaluation; and model application and visualization.
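The evaluation metrics named above can be read directly off a confusion matrix. The following is a minimal sketch with a hypothetical three-class matrix (the counts are invented for illustration and are not results from this study); it assumes rows hold the true class and columns the predicted class.

```python
import numpy as np

# Hypothetical confusion matrix (assumption: rows = true class, columns = predicted class).
confusion = np.array([
    [50, 5, 5],
    [10, 30, 10],
    [0, 10, 40],
])

true_positive = np.diag(confusion).astype(float)
predicted_total = confusion.sum(axis=0)  # column sums: all items predicted as each class
actual_total = confusion.sum(axis=1)     # row sums: all items that truly belong to each class

precision = true_positive / predicted_total
recall = true_positive / actual_total
f1_score = 2 * precision * recall / (precision + recall)

for cls, (p, r, f) in enumerate(zip(precision, recall, f1_score)):
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Per-class values are shown here; how the study averages them across the five rating classes is not stated in this section.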

Embedding-Based Collaborative Filtering Recommender System
A recommendation system filters information according to users' preferences and supports their information search process. Its value has recently drawn interest because it can help users cope with increasing volumes and types of information [28,37]. Methods for running a recommendation system include the bestseller presentation method, which presents items with a large number of views over a specific period, and the content-based method, which manually extracts and analyzes the attributes of an object. Recently, studies have confirmed the relatively high accuracy of a technology called collaborative filtering [38][39][40].
The premise of collaborative filtering is that groups of users with similar preferences for specific information will have similar responses to other details. Unlike the content-based method, where recommendations are based on the extracted internal attributes of items, the collaborative filtering method utilizes rating information, which is the preference information obtained from multiple users' evaluation of multiple items. The significance of the collaborative filtering method is its "collaboration," as it uses users' ratings to recommend items to a specific user [28,41,42].
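The premise can be illustrated with a toy rating matrix. This sketch uses a simple nearest-neighbor rule, which is not the embedding-based method built in this study; the ratings, the use of 0 for "not rated," and the helper name `cosine_on_shared` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical ratings (rows = users, columns = housing items); 0 marks "not rated".
ratings = np.array([
    [5, 4, 1, 0],   # target user: item 3 not rated yet
    [5, 5, 1, 4],   # tastes close to the target user
    [1, 2, 5, 1],   # tastes far from the target user
], dtype=float)

def cosine_on_shared(u, v):
    """Cosine similarity restricted to the items both users rated."""
    shared = (u > 0) & (v > 0)
    return float(u[shared] @ v[shared] / (np.linalg.norm(u[shared]) * np.linalg.norm(v[shared])))

target = ratings[0]
similarities = [cosine_on_shared(target, other) for other in ratings[1:]]
most_similar = 1 + int(np.argmax(similarities))

# Borrow the most similar user's rating as a guess for the target user's unrated item.
borrowed_rating = ratings[most_similar, 3]
print(most_similar, borrowed_rating)
```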
Among the various methods of implementing a collaborative filtering-based recommender, embedding-based methods have attracted attention for their accuracy and efficiency [33,34,43]. According to Guo and Berkhahn [44], embedding is a technology that represents non-continuous data in a sparse vector format as continuous data in a dense vector format. Embedding technology derives the intrinsic properties of data by providing a suitable continuous representation of them, which supports the learning process of machine learning and deep learning models. Guo and Berkhahn [44] showed that when input data are embedded in the right way, which they term "entity embedding," the training speed of the artificial neural network model increases and overfitting decreases. The data are finally placed in Euclidean space in a way that minimizes errors in the neural network model [45].
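The relationship between the sparse and dense formats can be sketched in a few lines. In this illustration the matrix sizes and values are arbitrary assumptions; the point is only that multiplying a one-hot vector by an embedding matrix is the same as looking up one row of that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, embedding_dim = 5, 3  # arbitrary illustrative sizes
embedding_matrix = rng.normal(size=(num_items, embedding_dim))

item_index = 2
one_hot = np.zeros(num_items)   # sparse format: one position set, the rest zero
one_hot[item_index] = 1.0

dense_via_matmul = one_hot @ embedding_matrix     # sparse vector times embedding matrix
dense_via_lookup = embedding_matrix[item_index]   # equivalent row lookup

print(np.allclose(dense_via_matmul, dense_via_lookup))
```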
Embedding technology has most recently been used in the field of natural language processing, a technology that allows computers to understand natural language, in a process called word embedding. In word embedding, natural language tokens (the minimum units of the data) are expressed as dense vectors consisting of floating-point values [46,47]. As the training iterations of the word embedding and natural language processing model are repeated, input units with more similar semantic meanings are mapped closer together in the embedding space. The semantic relations of the natural language are thus expressed in a format the computer can process, which is called distributed representations of words.
In the embedding-based collaborative filtering recommendation system, the natural language tokens, which are subject to similarity calculation in the word embedding technology, are cast as individual users and items [48]. Prior research on embedding-based collaborative filtering recommenders includes studies [31] on building the embedding-based music recommendation system "ITEM2VEC"; a personalized e-mail advertising system called "prod2vec," which is based on user purchase records [49]; "prefs2vec," which is based on users' item preferences [43]; and a system based on users' visit records ("check-ins"), which recommends places users might like to visit [33].
Various studies suggested that an embedding-based recommender makes it easier for users to intuitively and visually identify similarity than other implementation methods. Furthermore, its simple model construction and higher learning efficiency are highlighted.
This study aims to build a recommendation system using embedding-based methods to provide the housing profiles users might prefer, and to visually suggest similarities between the users and the housing alternatives.

Housing Preference
Housing preference is the users' subjective evaluation toward housing. It refers to the users' requirements, expectations, and emphasis on the characteristics of various housing.
Studies related to housing preference analyze the housing preferences and design requirements of specific groups that share socio-demographic attributes including age, gender, income level, current values, etc. to provide one possibility to improve design quality and increase occupants' satisfaction [24]. These studies have provided a basis to guide architects' future design decision-making and have enabled quantitative comparison and evaluation of housing alternatives when potential users and occupants choose housing.
Opoku and Abdul-Muhmin [26] analyzed the correlation between socio-demographic attributes (gender, marital status, income, family situation, etc.) in Saudi Arabia's low-income class and multiple kinds of housing factors including dwelling type and tenure options. Contending that the preference is heterogeneous, Hoshino [22] created housing profiles by deriving housing attributes and levels, and analyzed user preferences through a conjoint analysis method. Jansen et al. [10] studied a housing selection scenario for couples, presented multiple dwelling profiles on the basis of housing attribute levels (dwelling type, costs, living room size, number of rooms, backyard size, architectural style, and residential environment), and used the multi-attribute utility method to analyze housing preference and calculate the utility value, which can be used to recommend and analyze the choices. They also presented a unit consisting of the preferred factors of a group with specific socio-economic attributes.
Together, previous studies identified factors that affect housing preferences, designed questionnaires, and measured the correlations between respondents' social, economic, and demographic attributes and other factors. However, these studies were limited in analyzing individuals who share common demographic attributes but have different preferences, or conversely those with dissimilar demographic attributes but similar preferences. While they provide one possibility for quantitatively assessing the alternatives by weighting factors, they are limited in offering trade-offs or unexpected alternatives across attributes, even though users' preferences for the options are heterogeneous and nonlinear.
Thus, this study derived housing preference variables from prior studies and used them to quantitatively elicit users' preferences for housing alternatives. The collaborative filtering-based recommendation system was then used to analyze the preferences. Through this process, a recommendation system was created that can present divergent housing alternatives users may prefer.

Important Housing Attributes for Housing Preference in South Korea
Although research on housing preferences is actively conducted in various countries, the scope of this analysis was limited to papers published between 2009 and 2020 in Korea because of differences in housing preference variables by region and age.
The studies by Jeong and Choi [6] and Kim and Seo [50] are considered significant. They studied the housing choices and preferences of the eco-generation and baby boomers, deriving variables specific to these generations. Jeong and Choi [6] identified the local status of housing demand/development potential, educational environment, location factors related to public institutions/facilities, and green areas and rest areas as essential factors the eco-generation considered when choosing homes. Kim and Seo [50] identified the variables of housing preferences as social, local, and personal factors affected by friendliness toward the elderly; physical factors including housing styles and size; and economic factors including housing prices, rent, and housing costs.
Lee and Kim [7] and Lee et al. [51] studied, quantified, and determined the importance of residential environment preferences through a conjoint analysis. However, their studies were limited in that the surveys used hypothetical alternatives built by arbitrarily manipulating residential variables. Thus, they were not able to use actual residential options. As Table 1 shows, Lee and Kim [7] examined the impact of apartment environmental attributes on consumer preferences by employing the variables of apartment prices, house interior factors of scale, investment value of brand awareness, view, and park accessibility. Lee et al. [51] employed the extant literature and the criteria for calculating the initial sale rate of apartments provided by the Korea Housing & Urban Guarantee Corporation to identify the variables impacting housing preference, as shown in Table 1. The variables for housing preference were housing characteristics, including the price per 3.3 square meters and the interior, characteristics of the complex, convenience of transportation, location in the city center, environmentally friendly location, location in a good school district, potential for regional development, and investment value.
To identify housing preference variables according to lifestyle, Son and Lee [52] delineated apartment complexes and indoor requirements by considering the actual living space and experience rather than location of housing from a macro perspective.
To analyze housing satisfaction and preference, Kim et al. [53] examined changes in preference according to the type of housing, type of occupancy, and size of housing in Gyeonggi Province; identified factors to consider in housing policies; and explained differences in housing demand by region. The variables were housing location factors including the convenience of public transportation, neighborhood facilities, cultural performance facilities, accessibility to major facilities, and children's educational environment. The internal factors included size and management expenses, and environment factors included green areas and nearby parks as well as investment value.
Lee and Kim [7], Kim and Seo [50], and Lee et al. [51] quantitatively measured preference by demonstrating the correlation between housing attributes and housing preference, as well as between respondents who share specific socio-demographic characteristics such as age (generation), gender, type of residence, type of housing tenure, and income. The outcomes of these studies were models showing how much a particular group values a specific factor. However, they did not offer real housing alternatives or housing suggestions because they do not reflect atypical preferences.
Thus, the present study used a real housing profile to analyze accumulated housing preference data. Furthermore, it recommends unexpected housing alternatives that reflect atypical preferences by building an embedding-based collaborative filtering recommender to support users' decision-making process. The preferences derived from the results can be used by suppliers, individual investors, and policymakers [54]. Table 1 summarizes the key aspects of the literature review.

Housing Attributes and Housing Profiles Composition
Based on the literature review on housing preferences, discussions with certified architects and housing planning and design experts, and in-depth interviews with occupants of apartments, this study derived the following nine attributes: "time to metro," "accessibility to market," "number of schools," "housing prices," "housing area," "number of rooms," "number of bathrooms," "distance to park," and "investment value." The housing profiles are housing alternatives prepared on the basis of these nine housing preference variables. They were presented to respondents through a survey. Respondents considered all nine variables and evaluated their preference for the profiles on a scale ranging from 1 (least preferred) to 5 (most preferred). Before creating the housing profiles, 30 pilot profiles were designed to modify the scope and definitions of some criteria. Below, the final nine variables are defined and the creation of the profiles is explained.
First, "time to metro" was measured as the walking time (in minutes) to the nearest subway station. It represents accessibility to public transportation. There are two types of public transportation in Seoul: bus and subway. In designing the pilot profiles, the time to the nearest bus stop was less than about five minutes. As this did not differ significantly between the apartments, the criterion was based on the walking time to the nearest subway station.
Second, "accessibility to market" is the distance from the specific profile to the nearest store, which is related to proximity to commercial districts. In preparing the pilot profiles, the distance from a specific apartment to the nearest convenience store was not a discriminating factor in Seoul. Therefore, the measurement used the "large market search" function provided by Naver Maps, which is commonly used in Korea. Because of the size and visiting characteristics of department stores, big-box stores, etc., most people use vehicles rather than walk to get there. The distance traveled, in meters, was used to exclude the effects of travel time depending on traffic conditions.
Third, "number of schools" measured the number of elementary, middle, and high schools located within a 1-km radius of the apartment.
Fourth, "housing price" is the price of the apartment per pyeong (3.3 m²). The prices were referenced and created based on the "Multi-unit Housing Handbook" (2005.1.1-2019.6.1). The unit of this factor is 10,000 Korean Won (KRW). The price used in this research was hypothetical because there were gaps between actual market prices and the official prices given by the government ("Gongsiji-ga") for housing taxes. Moreover, because the market price may vary with time, district, market situation, government policies, and individual cases, this study used a hypothetical price referenced from the statistics [54].
Fifth, "housing area" was also based on the "Multi-unit Housing Handbook" (2005.1.1-2019.6.1). The criteria for area were based on the "jeon-yongmyeonjeog (exclusive area)" of rooms, living rooms, bathrooms, and kitchens used only by the apartment unit. Thus, public areas in the apartments were excluded, such as stairwells, corridors, and community facilities [54].
Sixth, "number of rooms" reflects the number of rooms inside the unit excluding living rooms and kitchens.
Seventh, "number of bathrooms" reflects the number of bathrooms inside the unit.
Eighth, "distance to park" was measured on the map from the house to the nearest park to capture environmental factors. The unit is meters.
Finally, "investment value" indicates whether an apartment was assumed to have investment value. Apartments built by Samsung, Hyundai, Daelim, GS, Daewoo, POSCO, Hyundai ENG, Lotte, HDC Hyunsan, and Hoban Construction, the top 10 domestic construction companies in the "Construction Capability Assessment (2014-2019)" provided by the Ministry of Land, Infrastructure and Transport, were assumed to have investment value. All other apartments were assumed to have none [51,55].
The housing profiles were limited to Seoul. Housing type was limited to apartments with five or more floors, following "Article 3 clause 1 no. 1 of the Enforcement Decree of the Housing Act and Article 3-5 of the Enforcement Decree of the Building Act" [56,57]. According to the "Integrated Apartment Information Center," there are 3368 apartment complexes in Seoul. Thus, to ensure a 90% confidence level, 722 housing profiles were required. However, 679 profiles were created because of missing data and overlapping entries. Table 2 summarizes the nine variables and units derived from the literature review and their criteria. Figure 2 shows a box-and-whisker plot for the 679 housing profiles based on 6 of the 9 variables: time to metro, accessibility to market, number of schools, housing price, housing area, and distance to park.

Survey Design
In this study, respondents aged between 20 and 60 years were surveyed for 5 months between 1 October 2019 and 29 February 2020. The survey was conducted offline and online in a way that did not create differences between the two methods in content or presentation. In total, 100 copies of the questionnaire were distributed offline and 150 online, and 233 were retrieved: 98 offline and 135 online.
The questionnaire consisted of questions relating to respondents' socio-demographic attributes including their age, monthly household income, and housing tenure type, and those measuring their ratings of the profiles. In total, 30 randomly extracted profiles were given to the respondents, and three profiles formed a combination. These were presented to users 10 times. This number was determined in advance through the pilot survey process to ensure the survey secured a reasonable amount of data but did not fatigue respondents.

Dataset of Housing Preferences Ratings Description
Of the 233 questionnaires retrieved, 18 from respondents who gave incomplete responses or had missing information were excluded, resulting in 215 surveys. The resulting dataset consisted of 6450 (215*30) rows with the following fields: "UserId," a unique six-character identification code (three letters + three digits) that ensures the anonymity of survey respondents; "HousingId," which identifies the housing profile subject to preference evaluation; "Rating," which is the rating the user gave the housing profile; and the respondents' socio-demographic attributes. The attributes were not used when training the embedding-based recommendation system model. Ratings of 1 to 5 accounted for 2568 (39.8%), 1883 (29.2%), 920 (14.3%), 748 (11.6%), and 331 (5.1%) rows, respectively. Further research is needed on why relatively unfavorable ratings (1-2) accounted for a large portion of the data. Of the 215 respondents, 88, 27, 21, and 79 were in the age groups 20 to 30, 30 to 40, 40 to 50, and 50 to 60 years, respectively. By monthly household income, 50, 36, 19, and 110 respondents were in the groups of less than 2 million KRW, 2 to 3 million KRW, 3 to 4 million KRW, and 4 million KRW or more, respectively. Type of housing tenure was divided into three categories: self-owned, "Jun-se" (a lease type unique to Korea), and monthly rent, with 125, 50, and 40 respondents in each group, respectively. Figure 3 shows the distribution of respondents by age, monthly household income, and type of housing tenure, together with the distribution of ratings.
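The reported rating shares can be checked with a few lines of arithmetic. The counts below are the ones stated above; the variable names are only illustrative.

```python
# Rating counts as reported for the 6450-row dataset.
rating_counts = {1: 2568, 2: 1883, 3: 920, 4: 748, 5: 331}

total_rows = sum(rating_counts.values())
shares = {rating: round(100 * count / total_rows, 1) for rating, count in rating_counts.items()}

# Combined share of the relatively unfavorable ratings (1-2).
low_share = round(100 * (rating_counts[1] + rating_counts[2]) / total_rows, 1)

print(total_rows, shares, low_share)
```

The counts sum to 215*30 rows, and ratings of 1 and 2 together make up roughly 69% of the data, which is the class imbalance noted above.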

Model Structure
In the data pre-processing phase, label encoding was performed for UserId and HousingId. There were 215 unique UserId values and 679 unique HousingId values; thus, they were expressed as unique index values ranging from 0 to 214 and 0 to 678, respectively. Label encoding used the "LabelEncoder" class provided by the "scikit-learn" API. Through label encoding, string-type data are expressed in numeric format and entered into the model. For "Rating," one-hot encoding was conducted, which represents each of N categories as a sparse N-dimensional vector. This process expressed 1, 2, 3, 4, and 5 as [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], and [0, 0, 0, 0, 1], respectively. One-hot encoded values are frequently used to express the class and the prediction in a classification model. The model has two input layers, which receive UserId and HousingId as input values to calculate the similarity between users and housing items at the subsequent embedding stages.
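The two encoding steps can be sketched as follows. The sample identifiers are hypothetical (the HousingId format in particular is an assumption, since it is not described here), and the one-hot step uses a NumPy identity matrix as one simple option rather than the specific utility used in the study.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows in the (UserId, HousingId, Rating) layout described above.
user_ids = ["abc123", "xyz789", "abc123", "def456"]
housing_ids = ["H010", "H003", "H003", "H250"]   # ID format is an assumption
ratings = np.array([1, 5, 3, 2])

# Label encoding: each distinct string becomes an integer index starting at 0.
user_index = LabelEncoder().fit_transform(user_ids)
housing_index = LabelEncoder().fit_transform(housing_ids)

# One-hot encoding: rating r becomes a 5-dimensional vector with a 1 at position r - 1.
rating_one_hot = np.eye(5, dtype=int)[ratings - 1]

print(user_index, housing_index)
print(rating_one_hot)
```

LabelEncoder assigns indices in sorted order of the distinct values, so the index itself carries no preference information; it only selects a row of the embedding layer.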
Embedding layers are the core layers that make the recommendation system operational. They map the encoded UserId and HousingId data to n-dimensional dense vectors. The embedding dimension is the dimension of a dense vector, for which a larger number represents a higher dimension. As the training repeats, the n-dimensional vector values place individual data in the space in a way that minimizes the error, which is the difference between the model's prediction and the actual rating (Rating). In this process, the users' atypical preference for the profiles is expressed in the dense vector space in the form of computational data.
The dense layer is a fully connected neural network layer. The weights, which represent the degree of connectivity between nodes, are adjusted to reduce the error as the training repeats.
The output layer receives the output value of the dense layer as an input value. "Softmax" was used as the output layer's activation function. The function gives the probability that the data entered in the model belong to each class, forming a probability distribution in which the probabilities over all classes sum to 1. Figure 4 shows the overall structure of the model and the flow of data.
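The flow of data through these layers can be sketched as a single forward pass. This is a NumPy illustration of the layer order described above, not the study's Keras code: the layer sizes, the random weights, the concatenation of the two embeddings, and the ReLU activation in the dense layer are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
num_users, num_items, num_classes = 215, 679, 5
embedding_dim, dense_units = 8, 16  # illustrative sizes, not the tuned values

# Embedding layers: one row of dense values per user and per housing profile.
user_embedding = rng.normal(scale=0.1, size=(num_users, embedding_dim))
item_embedding = rng.normal(scale=0.1, size=(num_items, embedding_dim))
# Dense layer and output layer weights (untrained, random).
w_dense = rng.normal(scale=0.1, size=(2 * embedding_dim, dense_units))
b_dense = np.zeros(dense_units)
w_out = rng.normal(scale=0.1, size=(dense_units, num_classes))
b_out = np.zeros(num_classes)

def softmax(logits):
    """Turn logits into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict_rating_probabilities(user_idx, item_idx):
    # Assumption: user and item embeddings are concatenated before the dense layer.
    features = np.concatenate([user_embedding[user_idx], item_embedding[item_idx]], axis=-1)
    hidden = np.maximum(0.0, features @ w_dense + b_dense)  # assumed ReLU activation
    return softmax(hidden @ w_out + b_out)

probabilities = predict_rating_probabilities(np.array([0, 214]), np.array([678, 10]))
print(probabilities.shape, probabilities.sum(axis=1))
```

Each row of the output is a distribution over the five rating classes for one (user, housing profile) pair.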

"SeoulHouse2Vec" Model Training, Validating and Evaluation with Confusion Matrix
The recommendation system was trained and validated with a supervised-learning method. In the initial phase of the training, the embedding values are essentially random and "trainable." As training proceeds and the error decreases, however, the embedding values of the users and the profiles become "meaningful," which means their vector values may reflect the atypical preferences and intrinsic value of the given data. In this regard, if a specific user prefers a specific housing profile (item), the system recommends items that are close in embedding distance to the item the user liked. From the user's viewpoint, it can be assumed that if two different users are close in embedding distance, then their preferences for a given housing alternative may be similar. Figure 5 shows how a supervised-learning classification problem can be cast as recommendation by using the concept of embedding. In the initial phase of the training, items are mapped to the embedding layer in random order. The training process consists of forward propagation and backpropagation. In the forward propagation step, the mapped values are input to the dense layer, and the dense layer predicts the probability of belonging to a specific class. The difference between this prediction and the label is the error. The error is passed back to the embedding layer, and the model changes the mapping values of individual items in a way that reduces the error. This is referred to as backpropagation. After some iterations of this process, items are mapped in a way that reflects intrinsic, atypical, and abstract characteristics of the data in numeric values.
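The forward and backward steps described above can be sketched on a deliberately tiny problem. This is a hand-written NumPy illustration under simplifying assumptions (six hypothetical items, two classes, a single linear layer followed by softmax, plain gradient descent), not the study's model or optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, embedding_dim, num_classes = 6, 2, 2

# Hypothetical labels: items 0-2 belong to class 0, items 3-5 to class 1.
item_idx = np.arange(num_items)
labels = np.eye(num_classes)[[0, 0, 0, 1, 1, 1]]

embedding = rng.normal(scale=0.1, size=(num_items, embedding_dim))  # random at the start
weights = rng.normal(scale=0.1, size=(embedding_dim, num_classes))

def forward(embedding, weights):
    """Forward propagation: embedding lookup, linear layer, softmax."""
    logits = embedding[item_idx] @ weights
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs):
    return float(-np.mean(np.sum(labels * np.log(probs), axis=1)))

initial_loss = cross_entropy(forward(embedding, weights))
learning_rate = 0.5
for _ in range(500):
    probs = forward(embedding, weights)
    grad_logits = (probs - labels) / num_items        # error signal at the output
    grad_weights = embedding[item_idx].T @ grad_logits
    grad_embedding = grad_logits @ weights.T          # error passed back to the embeddings
    weights -= learning_rate * grad_weights
    embedding -= learning_rate * grad_embedding
final_loss = cross_entropy(forward(embedding, weights))

def distance(a, b):
    """Euclidean distance between two items in the embedding space."""
    return float(np.linalg.norm(embedding[a] - embedding[b]))

print(initial_loss, final_loss)
print(distance(0, 1), distance(0, 3))
```

After training, items that share a label end up closer to each other than to items with a different label, which is the property the recommendation step relies on when it looks for items near one the user liked.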

Model Training and Validating for Tuning Model Hyperparameters
In this section, model training and validation were conducted to set the embedding dimension and the number of units in the dense layer.
In total, 6450 (215 × 30) data units were used, namely the Rating values explicitly given by 215 respondents for 30 profiles each. Using the "train_test_split" function of the scikit-learn API, 1293 of these were split off as evaluation data.
The embedding dimension was varied from 2 to 200, and the number of dense layer units from 5 to 300. Training and validation were conducted for a range of combinations of the two hyperparameters. When both hyperparameters were set beyond these ranges, overfitting occurred, in which only the training data were learned; the search was therefore limited to these ranges. The number of training iterations was set to 300. Among the combinations tested, the highest accuracy was obtained when the embedding dimension and dense layer units were (200, 50), respectively, and the second highest at (100, 50); (2, 10) had the lowest accuracy. This is shown in Figure 6a. The error decreased as training was repeated, as shown in Figure 6b.
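The size of the split can be checked with a short standard-library sketch. It is a stand-in for the scikit-learn call named in the text, not a reproduction of it: the shuffling, the seed, and the function name split_ratings are assumptions.

```python
import random

def split_ratings(n_users=215, n_profiles=30, n_eval=1293, seed=0):
    # One (respondent, rating slot) pair per explicit rating: 215 * 30 = 6450.
    pairs = [(u, p) for u in range(n_users) for p in range(n_profiles)]
    rng = random.Random(seed)   # assumed seed, for repeatability only
    rng.shuffle(pairs)
    # Hold out n_eval pairs for evaluation and keep the rest for training.
    return pairs[n_eval:], pairs[:n_eval]

train_pairs, eval_pairs = split_ratings()
print(len(train_pairs), len(eval_pairs))  # 5157 1293
```

Holding out 1293 of the 6450 ratings leaves 5157 for training and validation, so the evaluation share is roughly 20%.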

The "SeoulHouse2Vec" model is difficult to visualize directly because the individual users and housing profiles are each mapped to a high-dimensional vector space. Therefore, the t-distributed stochastic neighbor embedding (t-SNE) method was used. This approach enables visualization by converting high-dimensional vectors, which are difficult to understand intuitively, into low-dimensional vectors while maintaining the relative similarity and data characteristics between individual items. Twenty of the 679 housing profiles and twenty of the 215 users were randomly sampled. In panel (a), the small circles are the coordinates of the 20 sampled housing profiles after 100 training iterations, and the larger X-shaped markers are their coordinates after 300 iterations. For housing profiles #88 and #625, the distance between them became smaller as training was repeated. In panel (b), the small circles are the coordinates of the sampled users after 100 training iterations, and the larger pentagons are their coordinates after 300 iterations.

Evaluation with Confusion Matrix for Estimating Final Performance of the Model
The model was evaluated using three metrics, precision, recall, and f1_score, for models (200, 50) and (100, 50), which demonstrated the highest accuracy in the preceding training and validation phase. For this, a confusion matrix was created. The confusion matrix, also referred to as the error matrix, is a common technique for evaluating and visualizing algorithm performance in machine learning and deep learning classification problems. Figure 8 shows the concept of the confusion matrix for a binary-classification problem. Precision is the proportion of cases in which the label (the actual classification of the data) is TRUE among the cases in which the model's prediction is TRUE. Recall is the proportion of cases in which the model's prediction is TRUE among the cases in which the actual classification is TRUE. The f1_score is the harmonic mean of precision and recall.
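These definitions can be written down directly for one target class of a confusion matrix. The sketch below is illustrative only: the matrix layout (rows are actual classes, columns are predicted classes), the function name class_metrics, and the toy counts are assumptions, not values from the study.

```python
def class_metrics(cm, k):
    """Precision, recall, and F1 for class k, where cm[i][j] is the number
    of samples whose actual class is i and whose predicted class is j."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted k, actually another class
    fn = sum(cm[k]) - tp                  # actually k, predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy binary confusion matrix (hypothetical counts).
toy_cm = [[8, 2],
          [4, 6]]
precision, recall, f1 = class_metrics(toy_cm, 0)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

The same function applies unchanged to a five-class matrix, treating the chosen class as positive and all other classes as negative.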

The trained model is a multi-class classification model with five label values. The metrics can be measured with respect to a specific label class; here, they were calculated for the case of "Rating = 1" (least preferred), which comprised the largest share among the five classes. From the perspective of housing preference, the model maps the survey respondents and housing profiles to embedding vectors through the preceding training process. In the evaluation process, arbitrary respondents and housing profiles are given as input, and the actual ratings, such as "Rating = 1," are not given to the model. If the model predicts that the respondent's score for the housing profile is "Rating = 1" and the respondent's actual score for the profile is "Rating = 1," this is counted as a "True Positive." For model (100, 50), precision and recall were 0.679 and 0.761, respectively, based on Rating = 1, and the f1_score was 0.718.
For model (200, 50), precision and recall were 0.666 and 0.763, respectively, based on Rating = 1, and the f1_score was 0.711.
Based on the three metrics, the (100, 50) model demonstrated slightly better performance: its precision and f1_score were higher, while its recall was nearly identical. Figure 9 shows the confusion matrices of the two models: (a) (100, 50) and (b) (200, 50).
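As a consistency check, the reported f1_score values can be recomputed from the reported precision and recall using the harmonic mean. Only the numbers quoted in the text are used; the dictionary layout and the helper name harmonic_f1 are illustrative.

```python
def harmonic_f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f1_score) for Rating = 1, as quoted in the text.
reported = {
    "(100, 50)": (0.679, 0.761, 0.718),
    "(200, 50)": (0.666, 0.763, 0.711),
}
recomputed = {name: round(harmonic_f1(p, r), 3) for name, (p, r, _) in reported.items()}
print(recomputed)
```

Both recomputed values agree with the reported ones to three decimal places.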

Scenario-Based Demonstration of "SeoulHouse2Vec" Model
This section presents possible uses and applications of the built model for analyzing housing preference and for supporting housing decision-making through the recommendation of profiles. Because the demographic characteristics of the survey respondents were not evenly distributed and the dataset is limited in size, the interpretation and application of the research scenarios apply only to the 215 respondents and the 679 housing profiles. Future studies must investigate a broader scope of model application and results analysis. This study created two research scenarios: housing preference analysis and housing decision-making. In the scenario presentation step, a brief theoretical review was conducted to create each scenario. In the model application step, the acquired dataset and the built model were utilized. In the results interpretation step, the application results of the model were interpreted and visualized.

Scenario: Multi-Attribute Utility Theory
A common method for quantitatively measuring users' residential preferences is the multi-attribute utility theory (MAUT), which is a compositional model. After weights on a scale of 0 (least important) to 100 (most important) are assigned to each attribute, the utility of and preference between alternatives are quantitatively evaluated by summing, over the attributes, the product of each attribute's weight and the value of that attribute as assessed by the user. This technique makes it possible to identify the relative importance of multiple housing attributes from the user's viewpoint and to measure the preference quantitatively [58][59][60][61][62][63][64][65].
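The weighted additive evaluation can be sketched as follows. The attribute names, the weights, the 0 to 1 value scale, and the normalization by the sum of the weights are assumptions made for illustration; they are not taken from the survey.

```python
def maut_utility(weights, values):
    """Weighted additive utility: sum of weight * value over the attributes,
    normalized by the total weight (the normalization is an assumption)."""
    total_weight = sum(weights.values())
    return sum(weights[a] * values[a] for a in weights) / total_weight

# Hypothetical importance weights on the 0-100 scale.
weights = {"subway": 80, "parks": 60, "investment": 20}
# Hypothetical attribute values of two alternatives, scaled to 0-1.
alt_a = {"subway": 0.9, "parks": 0.5, "investment": 0.2}
alt_b = {"subway": 0.4, "parks": 0.6, "investment": 0.9}

u_a = maut_utility(weights, alt_a)
u_b = maut_utility(weights, alt_b)
print(round(u_a, 4), round(u_b, 4))
```

With these weights, alternative A scores higher because subway access carries the largest weight, even though alternative B is better on the other two attributes.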
Studies on residential preference using MAUT usually analyze the correlation between respondents' socio-demographic attributes and the weights of multiple residential attributes. They present housing alternatives that a group sharing particular socio-demographic attributes might prefer and quantitatively measure these preferences. However, these studies are limited in that they cannot analyze individuals who belong to the group but show different preferences. Therefore, this study performed a survey using MAUT, which can identify the relative importance of multiple attributes, and sought to interpret preference differences within the same group by analyzing the importance that each individual assigns to each housing variable.
To identify different preferences within the same group, a group was selected in which respondents shared specific socio-demographic attributes. The embedding distance between respondents was then measured to indicate the similarity of residential preferences within the group. Figure 10 shows the ranking of the embedding distances between eight individuals in a group with the same socio-demographic attributes: they were aged 50-60 years (age), earned KRW 4 million or more (monthly household income), and had "Jun-se" (housing tenure type). Among the eight users in the group, 27 (48%) of the 56 pairwise rankings were below 120th, indicating different preferences. This points to the need for a more personalized approach, since individuals who belong to a group sharing the same attributes may nonetheless have different preferences.
Thus, the study aimed to conduct the analysis using the SeoulHouse2Vec model, in which the minimum unit of analysis is the individual's preference. Focusing on users who share socio-demographic attributes but show different preferences, and on users who do not share any socio-demographic attributes but have the same preferences, the study aimed to present and implement a scenario for analyzing residential preferences with the MAUT method.
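The within-group ranking can be outlined in a few lines. This is a sketch under stated assumptions: Euclidean distance is assumed as the embedding distance, a rank "below" the cutoff is read as a rank number larger than the cutoff, and the two-dimensional embeddings, user names, and cutoff are hypothetical.

```python
import math

def neighbor_ranks(embeddings, user):
    """Rank every other user by embedding distance from `user` (1 = closest)."""
    others = sorted((o for o in embeddings if o != user),
                    key=lambda o: math.dist(embeddings[user], embeddings[o]))
    return {o: rank for rank, o in enumerate(others, start=1)}

def count_distant_pairs(embeddings, group, cutoff):
    """Count ordered pairs (u, v) within the group where v ranks worse than cutoff."""
    count = 0
    for u in group:
        ranks = neighbor_ranks(embeddings, u)
        count += sum(1 for v in group if v != u and ranks[v] > cutoff)
    return count

# Hypothetical 2-D embeddings; u1, u2, and u3 share socio-demographic attributes.
emb = {"u1": (0.0, 0.0), "u2": (0.1, 0.0), "u3": (5.0, 5.0),
       "u4": (0.2, 0.1), "u5": (4.8, 5.1)}
group = ["u1", "u2", "u3"]
distant = count_distant_pairs(emb, group, cutoff=2)
print(distant, "of", len(group) * (len(group) - 1), "ordered pairs")
```

For a group of eight users the same count runs over 8 × 7 = 56 ordered pairs, which matches the number of rankings reported in the text.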

SeoulHouse2Vec Application with MAUT
Randomly selected respondents were asked about the relative weight of the residential preference variables in the MAUT survey. The six housing preference variables presented to respondents were accessibility to the subway, accessibility to the supermarket, accessibility to educational facilities, residential facilities (house interior factors), accessibility to parks, and investment value. For each variable, the respondent answered on a scale ranging from 0 (least important) to 100 (most important). The interior factors of the dwelling are the area of the unit, price per pyeong, number of rooms, and number of bathrooms, which may not show a linear preference. They were grouped as house interior factors, and their importance was indicated as a whole. Each of the four items was divided into up to three levels. Area was delineated as small (80 m² or less), medium (80-109 m²), and large (109 m² or more). Price per pyeong was divided into KRW 10 million or below, KRW 10 to 15 million, and KRW 15 million or more. Number of rooms was delineated as two or fewer, three, and four or more, and number of bathrooms as one, and two or more. The score for each level was then assessed. The house interior factors were combined in the assessment because individuals' preferences for them were evidently nonlinear. For example, time to metro shows a linear preference: less time means higher utility. Housing area, by contrast, shows a nonlinear preference: a larger size does not necessarily mean higher utility. The preferred size may vary by respondent, since some may prefer smaller houses in view of maintenance costs and others larger houses because of family size.
Of the survey respondents, the randomly selected "USER_A001" was aged 50 to 60 years (age), earned KRW 4 million or more (monthly household income), and was leasing (housing tenure type). The importance values (weights) of this user's residential preference variables were as follows: accessibility to the subway (80), accessibility to the supermarket (50), accessibility to educational facilities (20), house interior factors (80), accessibility to parks (60), and investment value (20).
First, among the users who share all the socio-demographic attributes of USER_A001, "USER_Y031" ranks 152nd in embedding distance from USER_A001. Comparing the importance values of the reference respondent and USER_Y031, accessibility to the subway is similar: 80 for the reference respondent and 90 for the comparison respondent, within the difference threshold of 10. However, the weights the comparison respondent assigned to the other five categories were accessibility to the supermarket (30), accessibility to educational facilities (60), house interior factors (20), accessibility to parks (40), and investment value (20). Four of these differ from the reference respondent's weights by more than the threshold of 10 (only investment value coincides), so the two respondents' preferences for most categories were dissimilar. Therefore, even in groups with matching demographic characteristics, residential preferences may vary depending on the importance each respondent assigns to each variable. These different preferences were expressed as relatively distant embeddings. This is shown in Figure 11.
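The comparison of the two respondents' weights can be reproduced from the values quoted above. The dictionary keys are shorthand labels introduced here, and treating a difference of up to 10 as "similar" follows the way the subway weights are described in the text.

```python
# Importance weights as quoted in the text (0 = least, 100 = most important).
user_a001 = {"subway": 80, "supermarket": 50, "education": 20,
             "interior": 80, "parks": 60, "investment": 20}
user_y031 = {"subway": 90, "supermarket": 30, "education": 60,
             "interior": 20, "parks": 40, "investment": 20}

THRESHOLD = 10  # differences up to this value are treated as similar

diffs = {k: abs(user_a001[k] - user_y031[k]) for k in user_a001}
dissimilar = [k for k, d in diffs.items() if d > THRESHOLD]
print(diffs)
print(dissimilar)
```

Four of the six variables exceed the threshold, with the largest gap (60 points) on house interior factors.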
Sustainability 2020, 12, x FOR PEER REVIEW 18 of 25

Scenario Presentation
According to the Population and Housing Survey conducted by Statistics Korea in 2017, about 9,700,000 people, or 19% of Korea's total population, reside in Seoul [1,2]. Based on population movement in Seoul in 2019, about 1,400,000 people moved in and about 1,400,000 people moved out [66]. This shows that population movement and the housing market in Seoul are relatively active. This study presents and implements a scenario in which a family with a high preference for the apartment "HanhwaGgumAeGreen," located in Jayang-dong, Gwangjin-gu, a district of Seoul, searches for housing throughout Seoul. By doing so, the study visually presents specific ways of utilizing the model and its results.

Model Application
Through the model's training process, the profiles were mapped to the embedding layer. The distances of the mapped profiles were calculated relative to the preferred profile (target). If a particular user prefers this profile, the recommendation system sequentially presents the profiles closest to it.
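A minimal sketch of this nearest-profile lookup is given below. The distance metric is not specified in the text, so Euclidean distance is assumed; the profile names and two-dimensional embeddings are hypothetical.

```python
import math

def recommend(item_embeddings, target, k=5):
    """Return the k profiles closest in embedding distance to the target profile."""
    ref = item_embeddings[target]
    scored = [(math.dist(ref, vec), name)
              for name, vec in item_embeddings.items() if name != target]
    return [name for _, name in sorted(scored)[:k]]

# Hypothetical 2-D profile embeddings; "T" is the preferred (target) profile.
profiles = {"T": (0.0, 0.0), "A": (1.0, 0.0), "B": (0.0, 0.5),
            "C": (3.0, 3.0), "D": (0.2, 0.1)}
top3 = recommend(profiles, "T", k=3)
print(top3)  # ['D', 'B', 'A']
```

The same distances, stored alongside each profile's name and coordinates, would form the kind of dataset that can be plotted on a map.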
The dataset for the demonstration consisted of the embedding distances from the target apartment to the other apartments, the names of the apartments, and their latitude and longitude. This dataset was visualized on a map of Seoul using Tableau, a data visualization program. Individual profiles are represented on the map as circle markers with a black border. A closer embedding distance from the preferred apartment is shown in red, and a farther distance in green. "HanhwaGgumAeGreen," the reference for the calculation, is marked with a blue "X." This is presented in Figure 12: (a) shows the entire Seoul area, and (b) shows the Gwangjin-gu area where the apartment is located.
Table 3 shows the values of the nine attributes for the entire set of profiles and for the preferred apartment: time to metro (ATTR#1), accessibility to market (ATTR#2), number of schools (ATTR#3), housing area (ATTR#5), number of rooms (ATTR#6), number of bathrooms (ATTR#7), distance to park (ATTR#8) and investment value (ATTR#9).

Data Analysis
The table also shows the average values of the attributes for the top 50, top 25, top 10, and top 5 apartments closest in embedding distance to the preferred apartment. First, for ATTR#1, time to metro, a smaller value means better accessibility. The average of the 679 profiles is about 11.23 minutes, compared with 7 minutes for the reference. Moving from the top 50 to the top 25, 10, and 5 apartments, that is, toward closer embedding distances, accessibility to the subway station improves. For ATTR#2, a higher value indicates poorer accessibility, that is, lower worth. Interestingly, this value increases as the distance from the preferred profile decreases. Both ATTR#1 and ATTR#2 are related to accessibility. While a closer embedding distance improves the accessibility for ATTR#1, that for ATTR#2 decreases. This suggests that if this apartment is preferred, access to the subway station plays a more important role than access to the supermarket in forming the preference. This may be interpreted as a trade-off between the two attributes.
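The top-N averages discussed here can be computed as in the following sketch. The row layout, the field names, and the synthetic data are assumptions for illustration; the values are not those of Table 3.

```python
from statistics import mean

def topn_means(rows, attr, ns=(50, 25, 10, 5)):
    """Mean of `attr` over the n profiles closest to the reference, for each n."""
    ordered = sorted(rows, key=lambda r: r["distance"])
    return {n: mean(r[attr] for r in ordered[:n]) for n in ns}

# Synthetic rows (hypothetical): time to metro grows with embedding distance.
rows = [{"distance": d, "time_to_metro": 5 + d} for d in range(1, 61)]
means = topn_means(rows, "time_to_metro")
print(means)
```

A mean that falls steadily from the top 50 to the top 5, as in this synthetic case, is the pattern the text reports for ATTR#1; an attribute with no such trend would show no systematic change across the four groups.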
For ATTR#5, housing area, the reference apartment has a value higher (larger area) than the average of the overall profiles. The top 50 has a higher value than the top 25 and top 10; thus, it is not possible to identify a trend. However, the top 5 housing profiles with the nearest embedding distance have a relatively high value of 96.96. For ATTR#7, the number of bathrooms, the top 50, 25, 10, and 5 all have values higher than the average of the entire Seoul area, but no change in attribute values was found based on the difference in distance.
Regarding ATTR#3, the reference apartment was preferred despite its value of 6 for the number of elementary, middle, and high schools within 1 km, which is less than the average of 7.79 for the entire set of profiles. This suggests that ATTR#3 had a relatively low weight in survey respondents' preferences. For ATTR#6, the number of rooms, the top 5 had a higher value than the overall average in Seoul, but no difference was confirmed based on the embedding distance. ATTR#8 is the distance to the nearest park. The reference apartment had a lower value (closer to the parks) than the entire Seoul area, but no relationship was found between changes in the embedding distance and the value. Table 4 shows the top 5 housing profiles closest in embedding distance to the reference. In sum, in the demonstration, the preference for the reference apartment may be linked to other apartments that are close to the subway station, have a large area, have a large number of bathrooms, and are worth investing in.
Table 4 (top 5 housing profiles by embedding distance to the reference):
1st: Dongdaemun-gu, Jangan-dong, Raemian Jangan 2-Cha
2nd: Gangnam-gu, Apgujeong-dong, Hanyang 3
3rd: Gangseo-gu, Banghwa-dong, Banghwa 3-Danji
4th: Mapo-gu, Yonggang-dong, Mapo Yongang Samsung Raemian
5th: Gangdong-gu, Cheonho-dong, Raemian Gangdong Palace

Conclusions
To build SeoulHouse2Vec, an embedding-based recommendation system, this study created housing profiles, conducted preference surveys, constructed, validated, and evaluated a model, and demonstrated it through two scenarios. The significance and contributions of the study are highlighted below.
• In research on sustainability in architecture, previous studies focused on the use of energy-efficient materials, the design of high-performance building envelopes, the optimization of HVAC operation, and so on. Unlike previous research, this study is meaningful in that it investigates the rational use of limited housing-related goods. Given that the consumption and supply of housing use limited land and spatial resources, both are closely related to sustainability, with long-term personal, social, and environmental impacts. Moreover, it may not be possible to revise or reverse a housing decision. This study suggested the feasibility of using a recommender system to support rational decision-making in both housing consumption and supply.

• Although the housing supply ratio in Seoul is about 95%, housing prices have been rising rapidly of late. To address this through large-scale housing supply, policymakers are discussing lifting the greenbelt zones where development has been restricted over the years. While the steep rise in prices has various causes, the model proposed in this study offers one potential technique for addressing problems known to prevent the housing market from functioning rationally, including information asymmetry between housing consumers and suppliers, hasty housing decisions based on consumers' biased information, and limited exploration of alternatives.

• From the user's viewpoint, existing searches for housing alternatives were limited to the local scope of a dong (neighborhood) or gu (district). The SeoulHouse2Vec model proposed in this study is significant in that it extends the search scope for housing alternatives from a single dong to the entire Seoul area by utilizing public big data and GIS data.

• If the regional scope is expanded beyond Seoul by using data mining and web crawler technology to collect alternatives throughout Gyeonggi-do and South Korea, it will be possible to apply a further expanded model.

• The SeoulHouse2Vec model provides one possibility for assessing the outcome of past housing decision-making. If the level of satisfaction with the current housing is high, alternatives with attributes similar to the current one can be presented. Conversely, if the current housing satisfaction level is low, an alternative with the opposite attributes, one whose embedding distance is far, may be prioritized. This will help support the current housing decision-making process by quantitatively analyzing and reflecting the past decision-making process. This may be particularly useful for users who have little experience and knowledge in searching for housing alternatives.

• SeoulHouse2Vec has the potential to track the user's decision-making process, analyze preferences, and support the architect's planning and initial design stage. It is becoming increasingly important to reflect users' perspectives in architectural planning and design. This is an important factor not only in design quality, but also in determining the market price of buildings. Currently, the architectural planning phase involves analyzing the requirements of prospective users and contractors, and relying on the architect's knowledge, experience, and intuition to generate the information necessary to proceed with the design process. The model proposed here includes user information on age, income, and housing tenure type; housing profile information related to housing attributes; and preference information, which is the relationship between the user and the alternatives. The dataset may provide a quantitative basis in the architectural decision-making process.

• The SeoulHouse2Vec model goes beyond measuring users' housing preferences by demographic attributes. Users with divergent demographic characteristics may have highly similar housing preferences, depending on the importance they assign to each preference variable. Conversely, even in groups with matching demographic characteristics, housing choices may vary depending on how significant respondents consider each variable. The embedding method can reflect this preference tendency.
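The satisfaction-based prioritization mentioned in the conclusions above (similar alternatives when the user is satisfied with the current housing, distant ones when not) can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distance stands in for the embedding distance, satisfaction is reduced to a boolean flag, and the function name `rank_by_satisfaction` and the toy embeddings are hypothetical rather than part of the SeoulHouse2Vec model.

```python
import numpy as np


def rank_by_satisfaction(item_embeddings, current_index, satisfied, k=3):
    """Rank alternatives relative to the user's current housing.

    satisfied=True  -> nearest profiles first (similar attributes).
    satisfied=False -> farthest profiles first (opposite attributes).
    Euclidean distance is an assumption, not the study's stated metric.
    """
    current = item_embeddings[current_index]
    distances = np.linalg.norm(item_embeddings - current, axis=1)
    # Every profile except the user's current housing is a candidate.
    candidates = [i for i in range(len(item_embeddings)) if i != current_index]
    candidates.sort(key=lambda i: distances[i], reverse=not satisfied)
    return candidates[:k]


# Hypothetical 2-D embeddings; profile 0 is the user's current housing.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [3.0, 4.0], [0.0, 0.5]])
similar = rank_by_satisfaction(toy, 0, satisfied=True, k=2)
different = rank_by_satisfaction(toy, 0, satisfied=False, k=2)
print(similar)    # nearest alternatives to the current housing
print(different)  # farthest alternatives from the current housing
```

A graded satisfaction score could replace the boolean flag, for example by blending near and far candidates, but that design choice goes beyond what the study describes.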