A Thematic Network-Based Methodology for the Research Trend Identification in Building Energy Management

Abstract: The rapid increase in the number of online resources and academic articles has created great challenges for researchers and practitioners to efficiently grasp the status quo of building energy-related research. Rather than relying on manual inspections, advanced data analytics (such as text mining) can be used to enhance the efficiency and effectiveness of literature reviews. This article proposes a text mining-based approach for the automatic identification of major research trends in the field of building energy management. In total, 5712 articles (from 1972 to 2019) are analyzed. The word2vec model is used to optimize the latent Dirichlet allocation (LDA) results, and social networks are adopted to visualize the inter-topic relationships. The results are presented using the Gephi visualization platform. Based on inter-topic relevance and topic evolutions, in-depth analysis has been conducted to reveal research trends and hot topics in the field of building energy management. The research results indicate that heating, ventilation, and air conditioning (HVAC) is one of the most essential topics. The thermal environment, indoor illumination, and residential building occupant behaviors are important factors affecting building energy consumption. In addition, building energy-saving renovations, green buildings, and intelligent buildings are research hotspots and potential future directions. The method developed in this article serves as an effective alternative for researchers and practitioners to extract useful insights from massive text data. It provides a prototype for the automatic identification of research trends based on text mining techniques.


Introduction
After the concept of global warming was proposed by Broecker [1] in 1975, energy consumption came to be considered the main cause of the air pollution that has a significant impact on global warming [2]. Building energy consumption accounts for about 40% of total global energy consumption, making it one of the most important sources of energy consumption. One-third of total global greenhouse gas emissions come from construction [2,3]. The Intergovernmental Panel on Climate Change (IPCC) assessment report states that greenhouse gas emissions caused by building energy consumption more than doubled between 1970 and 2010 [4]. Consequently, reducing building energy consumption is critical for countering global warming [5].

Text Data Collection
The text data used in this article are academic articles retrieved from two well-known scientific databases, i.e., Web of Science and JSTOR (Journal Storage). Because articles written in different languages cannot be analyzed together, this study focuses only on articles written in English. The search keywords were set as follows.
TITLE-ABS-KEY ("building" OR "construction" OR "architecture" AND "energy" AND "saving" OR "management")
In total, 52,030 and 14,864 articles were found in JSTOR and Web of Science, respectively. Titles and abstracts were examined manually to remove duplicate and irrelevant articles. As a result, 5712 academic articles, published between 1972 and 2019, were selected for further analysis. The selected articles account for around 10% of all articles retrieved.

Text Data Partitioning
The focus of research may vary over time due to changes in the global and social environment. A large number of review articles have adopted the publication year for data partitioning with the aim of enhancing analysis sensitivity [25][26][27].
To enhance the sensitivity and reliability of text mining results, articles are partitioned into different groups according to influential timestamps. The annual United Nations Climate Change Conference (UNCCC) has a huge impact on building energy management research [4]; therefore, the numbers of articles published in the year of each annual UNCCC are investigated and selected as indicators. As shown in Figure 2, the numbers of articles published at the time of the Tokyo, Copenhagen, and Paris conferences are significantly higher than the others, indicating potentially dramatic changes in social concerns about global sustainability. Therefore, 1992, 1997, 2009, and 2015 are selected as the timestamps for data partitioning. The resulting numbers of articles in each data partition are summarized in Table 1. It should be noted that this is a rather subjective way of partitioning the data. Further research can be conducted to optimize the data partitioning process.
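The partitioning rule can be illustrated with a minimal Python sketch. The four timestamps come from the text; the function names, the sample records, and the choice that a boundary year opens the next phase are assumptions made only for illustration, since the article does not state how boundary years are assigned.

```python
from bisect import bisect_right

# Timestamps named in the text.
BREAKPOINTS = [1992, 1997, 2009, 2015]


def phase_of(year: int) -> int:
    """Return the 1-based phase for a publication year.

    Assumption: a boundary year opens the next phase, so 1992 is in phase 2.
    """
    return bisect_right(BREAKPOINTS, year) + 1


def partition(articles):
    """Group (title, year) pairs into a dict keyed by phase number."""
    groups = {}
    for title, year in articles:
        groups.setdefault(phase_of(year), []).append(title)
    return groups


# Hypothetical records, for illustration only.
sample = [("a", 1972), ("b", 1992), ("c", 2008), ("d", 2015), ("e", 2019)]
phases = partition(sample)
```

Four timestamps yield five phases, which matches the five phases referred to later in the analysis.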

Data Preprocessing
The main task of data preprocessing is to transform unstructured text data into a structured article-word matrix and remove noisy and meaningless words. A two-step approach is used in this article. The first is to remove conventional stop words, which refer to a set of words that are frequently used, yet can bring little value for insight extraction, such as numbers and prepositions. The second is to adopt the term frequency-inverse document frequency (TF-IDF) method to identify the most representative words for each article. Equations for TF-IDF calculation are shown as Equations (1)-(4).
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)$$
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \qquad (2)$$
$$tf\_idf_{i,j} = tf_{i,j} \times idf_i \qquad (3)$$
$$total\_tf\_idf_i = \sum_{j} tf\_idf_{i,j} \qquad (4)$$
In Equation (1), $n_{i,j}$ is the number of occurrences of the i-th word in the j-th article, and $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in the j-th article. The term $tf_{i,j}$ is the term frequency of the i-th word in the j-th article; a higher $tf_{i,j}$ means that the i-th word is used more frequently in the j-th article.
In Equation (2), $|D|$ is the total number of articles, and $|\{j : t_i \in d_j\}|$ is the number of articles containing the i-th word. The term $idf_i$ is the inverse document frequency of the i-th word. A higher $idf_i$ means that the i-th word appears in fewer articles.
In Equation (3), $tf\_idf_{i,j}$ is the TF-IDF indicator. A higher $tf\_idf_{i,j}$ means that the i-th word is frequently used in the j-th article but is less common in the others.
A word may be used in different articles and therefore have different TF-IDF values. These values are summed as shown in Equation (4), where $total\_tf\_idf_i$ is the sum of the TF-IDF values of the i-th word. The words are then sorted by $total\_tf\_idf_i$, and manual inspection is applied to eliminate words that are academic but carry little meaning (e.g., "science", "journal", and "address").
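The four quantities described above can be computed with a short Python sketch. It assumes that each article is already tokenized and stripped of stop words, and that the natural logarithm is used for the inverse document frequency; the function name and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def total_tf_idf(docs):
    """docs: list of token lists. Returns {word: TF-IDF summed over all articles}."""
    n_docs = len(docs)
    # Number of articles containing each word (denominator of the IDF).
    doc_freq = Counter(word for doc in docs for word in set(doc))
    totals = Counter()
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        for word, n in counts.items():
            tf = n / length                           # term frequency
            idf = math.log(n_docs / doc_freq[word])   # natural log is an assumption
            totals[word] += tf * idf                  # running sum over articles
    return dict(totals)


# Hypothetical, already-tokenized articles.
docs = [["hvac", "energy", "hvac"], ["energy", "wall"], ["energy", "journal"]]
scores = total_tf_idf(docs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus "energy" appears in every article, so its inverse document frequency and its summed score are zero, while "hvac" ranks first. Words such as "journal" still receive a positive score, which is why the manual inspection step remains necessary.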

Knowledge Discovery
As shown in Figure 3, three types of data mining techniques, i.e., LDA, word2vec, and community detection, were adopted for knowledge discovery. The LDA model is used to discover hidden topics and keywords by lexical combination. Because there are no semantic connections among the keywords identified, the results can be very difficult to interpret. To overcome this drawback, the word2vec model is used to quantify the semantic relations among keywords. A thematic network model can then be constructed by treating each keyword as a node and the pairwise Euclidean distances between keywords as edges. Afterwards, community detection is performed to identify significant research trends.
The LDA model is able to reveal potential topics from massive text data using keyword probabilities [28]. As shown in Figure 4, each word in the article collection is randomly assigned a topic number at step 1. At step 2, a new topic number is assigned to each word according to a sampling algorithm and probability calculations. Step 2 is repeated until convergence. At step 3, the co-occurrence frequencies of topics and terms in the article collection are calculated and taken as the final LDA model.
In practice, LDA models have two major limitations: (1) Due to the unsupervised learning nature of LDA methods, it is very difficult to avoid generating redundant or invalid topics. Manual inspections are needed for information summarization, and the process is subjective. (2) Each topic is represented as a set of keywords, which is discovered based on word frequency and lexical co-occurrences. Keywords within each topic may have little semantic relationship with one another, making them very difficult for humans to interpret.
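The three steps described above can be sketched as a collapsed Gibbs sampler in plain Python. This is an illustrative sketch only: the article does not name its sampling algorithm or library, so the use of Gibbs sampling, the hyperparameters `alpha` and `beta`, the fixed iteration count standing in for a convergence check, and the sample documents are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Return one {word: proportion} dict per topic for tokenized docs."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    # Step 1: give every word occurrence a random topic number.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    # Step 2: repeatedly resample each topic number from its conditional
    # distribution (a fixed number of sweeps replaces a convergence test).
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Step 3: turn topic-term counts into smoothed proportions.
    return [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + v * beta) for w in vocab}
        for t in range(n_topics)
    ]


# Hypothetical tokenized articles.
sample_docs = [
    ["pump", "chiller", "boiler", "pump"],
    ["wall", "insulation", "concrete", "wall"],
    ["chiller", "boiler", "pump"],
    ["insulation", "wall", "concrete"],
]
topics = lda_gibbs(sample_docs, n_topics=2)
```

Each returned dictionary plays the role of one topic in Figure 8: the value attached to a word is its proportion within that topic. Nothing in the sampler uses word meaning, which is the second limitation noted above.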

Word2Vec Approach
The word2vec model is based on the concept of distributed representation. It is used to transform the identified keywords into numeric vectors, thereby providing a quantitative approach to facilitate human interpretation [29]. In essence, the word2vec model is a neural network model that transforms each word into an N-dimensional numeric vector. As shown in Figure 5, it calculates the probability of occurrence of a word from the word vectors of its surrounding context. Here, $w_t$ denotes the word under study, and $w_{t+c}, \ldots, w_{t+2}, w_{t+1}, w_{t-1}, w_{t-2}, \ldots, w_{t-c}$ denote the context of $w_t$. By analyzing the context of $w_t$, the probability distribution of $w_t$ can be obtained. Therefore, the semantic relationship between any two words can be quantitatively evaluated.
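To make the role of the context window concrete, the following Python sketch builds the (context, target) training examples for a window of c words on each side. Predicting a word from its context corresponds to the continuous bag-of-words variant of word2vec; the article does not state which variant or window size was used, so the function name, the value of c, and the sample sentence are assumptions.

```python
def cbow_examples(tokens, c=2):
    """Return (context_words, target_word) pairs for a symmetric window of size c."""
    examples = []
    for t, target in enumerate(tokens):
        # Up to c words before and up to c words after position t.
        context = tokens[max(0, t - c):t] + tokens[t + 1:t + c + 1]
        if context:
            examples.append((context, target))
    return examples


# Hypothetical tokenized sentence.
sentence = ["indoor", "thermal", "comfort", "control"]
pairs = cbow_examples(sentence, c=1)
```

With c = 1, the word "thermal" is paired with the context ["indoor", "comfort"]. Words that share many contexts end up with nearby vectors after training, which is what allows distances between keyword vectors to be read as semantic relationships.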

Community Detection
Given the results of LDA and word2vec models, a social network can be created. As shown in Figure 6, each node in the social network represents a keyword and the edges between two nodes are the semantic relationships quantified by distance metrics. Afterwards, community detection is performed to automatically identify significant groups in network data using clustering techniques. A community can be treated as a collection of nodes which have similar characteristics and are closely connected.
To evaluate community detection results, Newman et al. proposed the concept of modularity [30,31]. It reflects the relationship between within-community and between-community densities. A higher modularity represents a better community detection result. Fast unfolding is an iterative algorithm used to maximize the modularity of the network [31,32]. It mainly includes three steps, as follows.
At step 1, the algorithm treats each node as its own community, so the number of communities equals the number of nodes.
At step 2, each node is tentatively moved into the community of a neighboring node. If the modularity increases, the move is accepted. Step 2 is repeated until the modularity is stable.
At step 3, each community discovered in step 2 is treated as a single node for further grouping. Similarly, the process is repeated until the community structure is stable. Data visualization techniques are then used to visualize the text mining results for human interpretation.
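The acceptance test in step 2 rests on comparing modularity values. The sketch below computes Newman's modularity for an undirected weighted graph without self-loops and compares the step 1 partition (one community per node) with a grouped partition. It illustrates the criterion only and is not an implementation of the full fast unfolding algorithm; the function name and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity of a partition.

    edges: list of (u, v, weight) for an undirected graph without self-loops.
    community: dict mapping each node to its community label.
    """
    m = sum(w for _, _, w in edges)   # total edge weight
    internal = {}                     # edge weight inside each community
    degree = {}                       # summed node degrees per community
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(
        internal.get(c, 0.0) / m - (d / (2 * m)) ** 2
        for c, d in degree.items()
    )


# Hypothetical graph: two triangles joined by a single edge.
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
singletons = {n: n for n in "abcdef"}  # step 1: every node is a community
grouped = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_single = modularity(edges, singletons)
q_grouped = modularity(edges, grouped)
```

Here the grouped partition has a modularity of 5/14 (about 0.357), whereas the singleton partition has a negative value, so moves that assemble the two triangles would be accepted.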

LDA Model Development
The LDA model was developed using Python. One of the major challenges in topic modeling is to determine the appropriate number of topics, which has a huge impact on model analysis. Topic coherence was used as an objective indicator to determine the optimal topic numbers [33]. The resulting coherence scores in each phase for different topic numbers are shown in Figure 7. The optimal topic numbers in each phase are shown in Table 2.
To better depict the process of LDA analysis, Phase 5 is taken as an example to show the details of the data analysis. Further information about the LDA analysis results can be found in Appendix A. Figure 8 depicts part of the topic information deduced by the LDA model from articles published in Phase 5. The integer in the upper left represents the topic number. The decimal before each word is the proportion of the word in the corresponding topic. Words with higher proportions have more representation power for that topic.
The analysis results for Phase 5 are shown in Table 3, where the first column gives the topic ID. The proportion under each topic ID represents the importance of each subtopic. The second column presents the most significant keywords, which are used to deduce the corresponding subtopic. A parent topic is a collection of subtopics, and its proportion is calculated and corrected after meaningless topics are removed.
(1) Heating, ventilation, and air conditioning (HVAC) system (17.7%): Studies have shown that HVAC systems account for 50% of all building energy consumption [34]. This topic contains three subtopics with similar proportions, each reflecting a different research direction for HVAC systems. Terms like "Winter", "Climate", "Summer", "Weather", and "Season" in topic 3 reflect seasonal differences in energy consumption research. Terms like "Pump", "Chiller", "Exchanger", "Boiler", and "Compressor" in topic 8 focus on equipment and component research in HVAC systems. "Air", "Ventilation", "Airflow", "Humidity", and "Wind" in topic 16 refer to ventilation system research.
(2) Enclosure structure (15%): Improving the building envelope structure is an effective means of achieving building energy management. Envelope structure research in Table 3 is concentrated in three directions. Terms like "Wall", "Insulation", "Thermal", and "Concrete" in topic 18 emphasize research on wall materials and properties; thermal insulation and electrical conductivity are the wall properties of greatest concern. Topic 20 also emphasizes wall research, and "Material", "Phase", "Change", "Wallboard", and "Storage" refer to phase change material research. "Air", "Ventilation", "Airflow", "Humidity", and "Window" within topic 23 emphasize research on window materials and their properties, especially the characteristics related to the utilization of and protection from solar energy.
(3) Solar energy utilization (6.3%): The research on solar energy utilization focuses on two directions. "Transfer", "Surface", "Radiation", and "Solar" within topic 4 refer to research on solar radiant heat utilization. The terms "Passive", "Solar", "Orientation", "Bioclimate", "Summer", and "Winter" in topic 15 emphasize passive solar building design, especially taking into consideration building orientation, envelope structure, and seasonal climate.
(4) Green buildings (10.4%): The research on green buildings mainly focuses on two aspects. "Sustainable", "Green", "Environment", "Criterion", and "Assessment" within topic 2 refer to green building design, especially green building design standards and evaluation. The terms "Roof", "Green", "Vegetation", "Albedo", "Island", and "Climate" within topic 22 refer to green roof design and the urban heat island effect, and emphasize the connection between the two.
(5) Chinese building energy management (5.2%): Topic 5 emphasizes China-related research in building energy management. "Policy", "Government", "Public", "Standard", and "Economic" reflect that China attaches great importance to research on building energy management policies and the economy. On the other hand, "Rural", "Urban", and "Region" indicate that building energy management in China has strong regional characteristics. Moreover, other terms like "Consumption", "Emission", "Carbon", and "Coal" stress that coal is an important energy source supporting building energy consumption in China [35], and therefore causes serious carbon emission problems.
(6) Power system (5.5%): Topic 9 emphasizes power system research. Terms like "Solar", "Collector", "Storage", "Fuel", "Renewable", and "Photovoltaic" indicate that fuel-fired power generation, renewable energy storage, and renewable power generation are current hotspots, especially solar photovoltaic power generation.
(7) Energy-saving (6.8%): "Retrofit", "European", "Thermal", and "Envelope" within topic 10 emphasize research on the energy-saving renovation of existing buildings, especially envelope structure renovation. They also indicate that Europe pays more attention to research on building energy renovation.
(8) Economic impact (6.5%): "Cost", "Investment", "Price", and "Economic" within topic 12 indicate the economic impact of building energy management, but the vocabulary under this topic consists of general economic terms and does not point to specific influencing factors.
(10) Resident behavior (6.1%): "Household", "Behavior", and "Occupant" within topic 14 indicate that resident behavior is an important part of building energy management. "Survey", "Interview", and "Attitude" reflect that researchers often use questionnaires and interviews to explore occupants' attitudes and behavior towards energy consumption.
(11) Environmental impact (5.2%): "Emission", "Environment", "Carbon", and "Greenhouse" within topic 17 emphasize the environmental influence of building energy consumption through carbon emissions and greenhouse gases. Meanwhile, "Waste", "Construction", "Recycling", and "Demolition" refer to the construction waste disposal problem that has received much attention in recent years, emphasizing the realization of building energy savings through the recycling of waste.
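The parent-topic percentages listed above aggregate several subtopics. A minimal Python sketch of one way to do this follows. It assumes that "corrected" means rescaling the retained subtopic shares so that they sum to one after meaningless topics are dropped, which the article does not spell out. The topic IDs follow Table 3, but the share values and the function name are hypothetical.

```python
def parent_proportions(subtopic_share, parent_of):
    """Sum subtopic shares per parent topic and rescale the retained shares.

    subtopic_share: {topic_id: share of the corpus}.
    parent_of: {topic_id: parent topic name}. Topic IDs missing from this
    mapping are treated as meaningless and removed (an assumption).
    """
    kept = {t: s for t, s in subtopic_share.items() if t in parent_of}
    total = sum(kept.values())
    result = {}
    for topic_id, share in kept.items():
        parent = parent_of[topic_id]
        result[parent] = result.get(parent, 0.0) + share / total
    return result


# Hypothetical shares; topic 1 stands in for a discarded, meaningless topic.
shares = {1: 0.25, 3: 0.05, 8: 0.05, 16: 0.05, 18: 0.10}
parents = {3: "HVAC system", 8: "HVAC system", 16: "HVAC system",
           18: "Enclosure structure"}
corrected = parent_proportions(shares, parents)
```

In this toy case the three HVAC subtopics together hold 60% of the retained share and the envelope subtopic holds 40%, even though their raw shares summed to only 25% of the corpus.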

Analysis of the LDA Model
The procedure of LDA analysis is depicted in Section 3.1. The detailed LDA analysis results in different phases are shown in Appendix A, Tables A1-A5, and visualized in Figures A1-A5. According to the LDA model, the evolution path of building energy management from 1972 to 2019 can be summarized as follows:
(1) The HVAC system is an important research topic in every phase. In the earlier phases, the equipment and components of HVAC systems received the most attention. The proportions of seasonal differences [37][38][39], floor heating systems [40,41], and ventilation systems [42] have increased since phase 3.
(2) Increasing attention is being paid to building envelope research [43]. The envelope material [43,44] and its insulation performance [43] are hotspots, especially in research that takes the wall as the research object.
(3) The indoor thermal environment forms a critical part of building energy management, and its proportion remains stable. Related research focuses on indoor temperature regulation and thermal comfort [45,46].
(4) Research topics concerning solar energy utilization, lighting systems, power systems, economic impact, and environmental impact are generally declining.
Solar energy utilization has always been dominated by solar thermal utilization [47] and passive solar energy construction [48,49]. Lighting system research has always focused on lighting control [50,51] and natural lighting utilization [50]. After phase 4, intelligent lighting control [52] became a new hotspot in lighting systems. Power system research has evolved from circuit component research to power consumption analysis [53,54], and then to renewable energy [55]. The proportion of economic impact is relatively high before phase 3, when it focuses more on energy cost [56]. However, research on economic impact has been decreasing since phase 4, and it is difficult to draw insights from the keywords provided by LDA. Environmental impact research has mainly concentrated on building energy emissions [57] and the heat island effect [58,59]. Although the proportion of environmental impact is decreasing, related research such as green buildings [60,61] and sustainable development [60,61] is on an upward trend. Therefore, environmental impact research in the broader sense is generally on the rise.
(5) The proportions of green building, occupant behavior, energy-saving renovation, and intelligent building research are generally increasing. Green building research mainly focuses on sustainable development and green roof design [62,63]. Although the roof is only one part of the building envelope structure, it is always assigned to topics containing terms like "green" and "vegetation", so it is classified as green building research. Questionnaires [64][65][66][67] and interviews [67][68][69] are the conventional methods used in occupant behavior research. A large number of studies have been conducted in Europe, focusing on building envelope structures and HVAC systems.
Although these results reflect the research emphasis and development trends in building energy management, the relationships among topics are rarely depicted, and it is therefore difficult to describe a complete thematic network map of building energy management. Moreover, the semantic relationships between keywords and topics cannot be extracted from the LDA model. To draw further insights from the academic articles, a comprehensive thematic network analysis is conducted based on the LDA model.

Development of Thematic Network Model
As described in Section 2.5, the word2vec model was used to transform each keyword into a numerical vector. The Euclidean distance was used to quantify the semantic relationships between keywords. A thematic network was developed by using keywords and their interrelations. Community detection was then performed to identify significant communities. The node size in the thematic network diagram is proportional to the keyword weightings. Keywords with higher weightings can better represent the research topic. The edge data includes source data, target data, and the degree of relationship. The comprehensive thematic network graphs in different phases are shown in Figure 9. The holistic network can be found in Appendix B as Figures A6-A10.
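The edge data described above (source, target, and degree of relationship) can be sketched in Python as follows. The keyword vectors are hypothetical two-dimensional stand-ins for word2vec output, and the `max_distance` cutoff is an assumption introduced for illustration, since the article does not state how edges were filtered. The `Source`, `Target`, and `Weight` column names follow the edge-table convention used by Gephi's spreadsheet import.

```python
import csv
import io
import math
from itertools import combinations


def edge_rows(vectors, max_distance):
    """Return (source, target, distance) for keyword pairs within max_distance."""
    rows = []
    for a, b in combinations(sorted(vectors), 2):
        d = math.dist(vectors[a], vectors[b])  # Euclidean distance
        if d <= max_distance:                  # cutoff is an assumption
            rows.append((a, b, d))
    return rows


def to_edge_csv(rows):
    """Serialize edges as CSV text with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    # Note: the weight here is a distance, so smaller means more related.
    writer.writerows((a, b, f"{d:.3f}") for a, b, d in rows)
    return buf.getvalue()


# Hypothetical 2-D keyword vectors (real word2vec vectors are N-dimensional).
vectors = {"heating": (0.0, 0.0), "boiler": (3.0, 4.0), "window": (10.0, 0.0)}
rows = edge_rows(vectors, max_distance=6.0)
edge_csv = to_edge_csv(rows)
```

Because a smaller Euclidean distance indicates a closer semantic relationship, the distance would typically need to be converted into a similarity score before being used as an edge weight for community detection; that conversion is left out of this sketch.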

Analysis of Comprehensive Thematic Network Model at Different Phases
The numbers of nodes, edges, and communities and the modularity values in the different phases are summarized in Table 4. The modularity is close to 0.8, indicating that the network graphs have stable clustering performance. Different topics are drawn in different colors. Keywords with low proportions are omitted for easier interpretation. Results in Phase 1 are described in detail, while the analysis of the other phases focuses on their differences.

Based on the analysis results from the LDA model and the thematic network model in the different phases, research in building energy management can be categorized into 16 sub-topics, as shown in Table 5.

Table 5. Topics in the field of building energy management * (columns: No., Topic, Period).

As shown in Figure 10, the relationships among these topics can be summarized as follows.
(1) The HVAC system and the indoor thermal environment are the two most closely related research topics
The results show that the HVAC system, and especially the heating system, is always the most important research topic. Meanwhile, the indoor thermal environment is the most important factor directly affecting the HVAC system [70,71]. Therefore, these two research topics are closely connected in all phases. However, the subtopics within the HVAC system and the indoor thermal environment have changed significantly over time.
In Phase 1, HVAC research pays more attention to the plumbing system [72–74] and to the energy consumed by the heating system, mainly in winter [75]. Energy-saving research on the HVAC system only considers the efficiency of fossil energy combustion [76,77]. In Phase 2, the amount of electrical equipment such as air conditioners increases so significantly that HVAC power consumption receives increasing attention [78–80]. Moreover, in Phases 1 and 2, ventilation, as part of the HVAC system, tends to rely on ventilation through the building enclosure [81,82]. Since Phase 3, the research trend in the HVAC system has stabilized, focusing on three aspects: HVAC energy storage, HVAC equipment and component performance improvement, and HVAC system power consumption.
(2) Building enclosure structures play an important role in energy saving
Appropriate building enclosure structure design [83,84] and enclosure materials [85,86] can effectively regulate the indoor environment. The insulation performance of enclosure structures is the main research focus. Since Phase 4, renewable materials and 'greening' buildings have been integrated into building enclosure structure research. The lighting system is closely linked to the building enclosure. Appropriate material selection and building enclosure orientation design help to utilize sunlight efficiently, thus reducing lighting system energy consumption [84].
(3) Research related to lighting systems
Related research mainly focuses on enhancing the energy efficiency of lighting equipment, optimizing lighting system control, and utilizing natural daylighting. In Phase 1, studies focus more on developing technologies to enhance the energy efficiency of lighting equipment. In the following phases, the optimization of lighting controls [87] and the utilization of daylighting [50] have received increasing attention, especially since Phase 4, when intelligent lighting controls [10,52,88] have become a hot topic.
(4) Renewable energy-related research mainly focuses on solar energy utilization
Solar energy utilization has been the focus of building energy-related research. Passive solar buildings [2,48,89] and solar photovoltaic utilization [47] have been identified as the most important research directions throughout the entire period of 1972-2019.
(5) Occupant behavior has become an important factor in building energy management
Since Phase 4, studies related to occupant behavior have become one of the main research topics in building energy management. Some researchers believe that occupants' thermal adaptation behavior within the indoor thermal environment is the most essential factor affecting building energy consumption [90,91]. In addition, the social environment and occupants' energy consumption habits and attitudes all contribute to building energy consumption patterns [91–93]. Studies have shown that different occupant habits can lead to a huge difference in building energy consumption [94,95].
(6) Building renovation technologies have become a major research field
Since Phase 4, worldwide urbanization has slowed down. It has become increasingly important to develop building renovation technologies that enhance the energy efficiency of existing buildings. The results show that most building renovation technologies currently stem from European countries.
(7) Green buildings and intelligent buildings
Green building-related research emerges in Phase 3 and has been a major research field since Phase 4. Meanwhile, with the development of information technologies, intelligent buildings, which utilize smart sensors and advanced control strategies [96,97], have become prevalent in the building energy research field.
(8) China plays a critical role in building energy saving
China has a significant impact on global sustainability. On the one hand, China covers a large land area with significant climate differences. As a result, various types of studies have been performed accordingly, such as studies on climatic variations [98,99]. On the other hand, buildings may present significantly different energy use behaviors in rural and urban areas [100–102]. The rapid urbanization in China has encouraged relevant studies to flourish [103–105].
(9) Potential trends in building energy research
On the periphery of the thematic network graph of Phase 5, there are topics such as construction waste, sponge cities, and bioclimatic buildings. Judging from the results of previous phases, such currently peripheral topics may become main research topics in the future. They are very likely to become major trends, considering the ever-increasing concerns over global sustainability and urban renewal.
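As a rough illustration of how peripheral topics might be flagged programmatically, the sketch below marks keywords with very few connections as peripheral. This is an assumption made for illustration only: in this study the periphery was read from the network graph layout, and node degree is merely a simple proxy for it. The edge list, the keyword names, and the `max_degree` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical, unweighted Phase 5 edge list (source, target).
phase5_edges = [
    ("hvac", "thermal comfort"),
    ("hvac", "energy consumption"),
    ("hvac", "envelope"),
    ("thermal comfort", "energy consumption"),
    ("thermal comfort", "envelope"),
    ("envelope", "energy consumption"),
    ("construction waste", "energy consumption"),
    ("sponge city", "envelope"),
    ("bioclimatic building", "thermal comfort"),
]


def degrees(edges):
    """Count how many edges touch each node."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def peripheral_nodes(edges, max_degree=1):
    """Return nodes whose degree does not exceed max_degree (assumed proxy for periphery)."""
    return sorted(node for node, d in degrees(edges).items() if d <= max_degree)


peripheral = peripheral_nodes(phase5_edges)
print(peripheral)
```

Tracking whether such low-degree keywords gain connections in later phases is one possible way to check whether a peripheral topic is moving toward the core of the network.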

Conclusions
The rapid increase in academic articles has imposed great challenges for researchers and practitioners to efficiently grasp the status quo of building energy-related research. This study proposes a novel text mining-based methodology to identify major research topics and trends in the field of building energy research. LDA, word2vec models, and social network analysis have been integrated to ensure the reliability and interpretability of the analysis results. The methodology has been applied to analyze 5712 articles published from 1972 to 2019, and it reveals the research hotspots in the field of building energy management in different phases. The conclusions are drawn as follows:
(1) Influencing factors of building energy consumption and energy-saving methods
The indoor thermal environment is the most important factor affecting building energy management. The requirement for human thermal comfort has prompted people to spend a large amount of energy on the HVAC system. Moreover, with the emergence of new buildings and people's pursuit of living comfort, the demand for HVAC systems is increasing. HVAC research mainly focuses on HVAC energy storage, HVAC equipment and component improvement, and the power and energy consumption of HVAC systems.
Improving the thermal insulation of buildings is the most effective way to reduce the energy consumed by the HVAC system. Appropriate building envelope design and the use of suitable envelope materials can effectively control the indoor temperature, thereby reducing the energy consumed by the HVAC system and mitigating the urban heat island effect. To obtain better insulation performance, building envelope research focuses on solar energy utilization, green design, and enclosure materials.
Indoor illumination is another important factor affecting building energy consumption. Related research focuses on lighting control, natural lighting, and the improvement of lighting system equipment. However, the specific implementation methods change over time.
Occupant behavior is also an important factor affecting building energy consumption, since both the HVAC system and the lighting system are controlled based on the thermal comfort of the human body. Occupants' thermal adaptation behavior and energy use attitudes directly determine building energy consumption. However, occupant behavior-related research did not receive much attention until the last decade.
(2) Promoting factors of building energy management
Economic impact, environmental impact, and government policies are three important factors driving building energy efficiency research.
Academic researchers highlight the importance of building energy management by clarifying the seriousness of the energy crisis, the heat island effect, greenhouse gases, global warming, and carbon emissions, whereas governments promote building energy management by formulating policies and regulations and encouraging the development of energy-saving techniques.
(3) Development trends in building energy saving
Building energy-saving renovation, green buildings, and intelligent buildings are the research focuses and development trends in this field. With the slowdown of urbanization, the building stock tends toward saturation, and a large number of existing buildings have huge energy-saving renovation potential. Green buildings, which are designed to be energy efficient, are therefore the future development trend in the field of building construction.
Intelligent buildings are the product of integrating intelligent detection with building energy consumption control. With the development of information technology, there will be an increasing amount of research on building energy management. In addition, based on the analysis results, it can be speculated that future developments may center on research into construction waste, sponge cities, and bioclimatic construction.
The method developed in this article serves as an effective alternative for researchers and practitioners to extract useful insights from massive text data. It provides a prototype for the automatic identification of research trends based on text mining techniques. However, because follow-up research will be conducted based on this code, the code is not provided to the reader. This study is limited by the knowledge of the authors and by an underdeveloped data partition method. Manual inspection was involved during data preprocessing. Some academic terms could not be identified by the authors and were therefore mistakenly removed. Additionally, the analysis results in some phases are similar, which may be caused by data overlap. Further studies can be conducted by interviewing experts to establish a dictionary of academic terms in building energy management. Such a collection of academic terms can reduce subjectivity during data preprocessing. Moreover, future studies can focus on enhancing the overall text mining efficiency by automatically identifying redundant information obtained during the knowledge discovery process.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript