Bert-Based Latent Semantic Analysis (Bert-LSA): A Case Study on Geospatial Data Technology and Application Trend Analysis

Abstract: Geospatial data is an indispensable data resource for research and applications in many fields. The technologies and applications related to geospatial data are constantly advancing and updating, so identifying the technologies and applications among them will help foster and fund further innovation. Through topic analysis, new research hotspots can be discovered by understanding the whole development process of a topic. At present, the main methods to determine topics are peer review and bibliometrics; however, they merely review relevant literature or perform simple frequency analysis. This paper proposes a new topic discovery method, which combines a word embedding method based on a pre-trained model, Bert, with a spherical k-means clustering algorithm, and applies the similarity between literature and topics to assign literature to different topics. The proposed method was applied to 266 pieces of literature related to geospatial data over the past five years. First, according to the number of publications, a trend analysis of technologies and applications related to geospatial data in several leading countries was conducted. Then, the consistency of the proposed method and the existing method PLSA (Probabilistic Latent Semantic Analysis) was evaluated by using two similar consistency evaluation indicators (i.e., U-Mass and NPMI). The results show that the method proposed in this paper can effectively reveal text content, determine development trends, and produce more coherent topics, and that the overall performance of Bert-LSA is better than that of PLSA under both NPMI and U-Mass. This method is not limited to trend analysis using the data in this paper; it can also be used for the topic analysis of other types of texts.


Introduction
Geographical data describes a location and its spatial characteristics and attributes. With the rapid development of information technology, geospatial data has become an indispensable data resource for research and application in many fields, such as natural resource management, disaster emergency management, climate change, and precision agriculture [1]. The technologies and applications related to geospatial data are also constantly advancing and upgrading, making new ways of thinking possible, so identifying technologies and applications is helpful to foster and fund further innovation. Through topic analysis, we can identify new research hotspots, acquire knowledge transfer processes [2], and quickly analyze the entire development process of research areas, thus benefiting researchers who are interested in a topic. In addition, it can also provide signals for paradigm shifts in discipline development [3]. For individuals, the results of topic analysis provide an overview of the evolution of the research field and are helpful in grasping research trends, keeping up-to-date with the latest research in the field, and seeking scientific collaborators [4].
The extensive scientific literature provides researchers with a wealth of information, which is also an important data resource for analyzing development trends. However, the time and cost of understanding and analyzing the complex dynamics of current technical approaches related to geospatial data are increasing [5]. Therefore, researchers try to save time and reduce costs by seeking automated analysis methods, which allow them to quickly find the most important information in order to make critical decisions without consulting voluminous literature [6]. At present, there are two main methods for identifying topics in texts. One is a qualitative appraisal method used by academia, known as expert overview. The other is a scientometrics-based approach. Expert overview is a comprehensive and effective method for topic identification, but it is highly dependent on expert opinion, which is time- and energy-consuming. In addition, expert overview is becoming an increasingly inefficient means, due to the explosive growth of the scientific literature. In comparison, the bibliometric approach uses related papers for statistical frequency analysis and simply captures information such as citation statistics to identify topics. An article with a high citation count is considered a high-value one [7]. The structural and geospatial developments of industrial symbiosis as subfields of industrial ecology have been explored using bibliometrics [8]. A statistical approach to bibliometric data from U.S. institutions has also been used to identify institutional hotspots on a map where many high-impact papers are published. The bibliometrics-based approach plays a role in identifying the development of trends, but it lacks consideration of the content of the literature texts themselves.
Previous studies are mainly based on traditional methods, which merely review relevant literature or conduct simple frequency analysis without providing insights beyond revealing information about the contents of literature texts [9]. Therefore, it is urgent to conduct comprehensive and in-depth trend analysis of literature texts. Recently, a popular approach involves text analysis techniques to identify the main viewpoints and trends of the research [10], for example, by using textual data such as user comments, papers, and patents to analyze keywords or social networks [11][12][13][14][15][16]. In particular, topic modeling has recently attracted the attention of trend analysis researchers, since the main purpose of trend analysis based on textual data is to detect the upward and downward trends in the frequency of each topic in the target documents [17]. Topic modeling originates from early latent semantic analysis (LSA), which aims to discover meaningful semantic structures in the corpus [18], with a focus on keyword extraction. Representative approaches include TF-IDF, based on statistical features [19,20]; TextRank, based on word graph models [21,22]; and Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), based on topic models [23]. PLSA and LDA are the most widely used probabilistic techniques in topic modeling [24]. PLSA is a latent variable model based on co-occurrence item-document matrices, also known as the ASPECT model [25]. The superiority of PLSA is demonstrated by its comparison with k-means and LSA [26]. As a variant or extension of PLSA, LDA uses Bayesian methods for parameter estimation to compensate for the incompleteness of PLSA in terms of topic probability distribution. However, it is difficult to explain LDA without prior knowledge of the underlying topics and hyperparameters.
All the approaches mentioned above ignore the most important semantic features of words and the semantic associations between words. Although pre-trained word embeddings are widely used in classification tasks, their application in topic modeling mainly focuses on probabilistic techniques, such as LDA [27][28][29][30][31], and there is also preliminary work using these embeddings to evaluate the consistency of topic models [32,33].
The probability-based statistical topic modeling methods mentioned above are unable to capture the whole context of a document, as they usually consider only a single representation of each word [1]. Alternatively, an n-gram representation that considers multiple words simultaneously can be used, but the efficiency of the model decreases rapidly due to the curse of dimensionality [34]. Bert, by contrast, quantifies each word as a vector that takes the context into account and locates similar words in a similar space, addressing the limitations of these representations. Although representations based on pre-trained models are widely used and their performance has been validated in recent text analysis, few attempts have yet been made to develop new topic models based on Bert. At present, only a few studies have adopted semantic embeddings in topic analysis. A recent study uses Bert to generate text semantics as the input for topic classification [35]. Since literature abstracts are a core corpus that reveals the distribution of research topics [36,37], topic terms can be extracted from classic scientific journals to analyze trends in the research fields [38,39].
In conclusion, both the existing types of studies on geospatial data trend analysis have their limitations. Studies based on screening reviews require a lot of time and energy to screen and summarize all the literature. Methods based on bibliometric analysis are not suitable for discovering potential patterns in fields related to geospatial data. In addition, topic models, which are often used for trend analysis in other fields, are usually based on single-word vector representations that are non-contextual and sparse. In order to overcome these problems, this paper proposes a new topic modeling approach, which applies a new word embedding method in the field of computer linguistics to topic models and can help extract textual topics, namely the Bert-based Latent Semantic Analysis (Bert-LSA) topic modeling approach. It utilizes the Bert contextual word embedding algorithm and spherical k-means clustering to combine context embedding and clustering in a coordinated way, and finally assigns topics to documents. The proposed method is used to conduct a specific trend analysis of the technologies related to geospatial data, which can serve as an advanced and useful alternative method to extract meaningful topics involved in the current trend of geospatial data.
The structure of this paper is as follows. In Section 2, the textual data sources and data pre-processing are introduced. A new topic modeling approach, proposed to compensate for the limitations of existing technologies, is discussed in detail in Section 3. Section 4 presents the results of the trend analysis. Section 5 evaluates the proposed method in contrast with existing methods in terms of topic consistency. Section 6 discusses the results and conclusions of this study.

Data Collection and Pre-Processing
Figure 1 shows the process of data collection and pre-processing used to conduct topic modeling about geospatial data. For the analysis of geospatial data technologies and application trends, abstracts of papers related to geospatial data were collected from two paper databases, namely, ScienceDirect and Scopus. A total of 609 abstracts, published from 2016 to 2020 and containing terms such as "geospatial data", were collected. In the ScienceDirect database, the query statement was "TITLE-ABS-KEY (geospatial AND data)", and in the Scopus database, the query was based on the keywords (geospatial data). From the query results, only papers connected with the two words "geospatial" and "data" were selected. In order to ensure that each abstract contained rich information, only abstracts with more than 180 characters were retained, leaving 266 abstracts for the final analysis. Each collected abstract was then used as an input to the Bert model to obtain the corresponding word vectors.

Overall Framework
In this paper, the Bert-LSA topic model is proposed, which combines Bert and spherical k-means clustering. The model is distinguished by its ability to fully take into account the context of documents and to overcome the shortcomings of existing statistical models. Figure 3 depicts the whole process of document topic generation, which is mainly divided into the four steps below.


• Step 1: All documents are taken as the corpus, and the m-dimensional word vector corresponding to each word in the documents is obtained with the Bert model, denoted as v_i ∈ V_m, where v_i is the word vector and V_m is the m-dimensional vector space. Note that the word vectors are obtained by processing the documents as inputs to the Bert model, rather than being read directly from the pre-trained model.
• Step 2: All vectorized words undergo spherical k-means clustering, which first initializes the centroids according to the value of K and then calculates the spherical distance from each word vector v_i to each centroid. According to the distance value, v_i is assigned to a category, and the process iterates until convergence. Finally, K clusters are obtained, each of which is called a topic.
• Step 3: The generation of each document vector d_j, j = 1, 2, ..., D, is illustrated in Figure 4; it is obtained by multiplying the matrix of m-dimensional vectors v_i of all words in the corpus by the Num × D term-document matrix, where Num is the number of words in the corpus and D is the number of documents. See Section 3.4 for details.
• Step 4: Figure 5 depicts the process of document topic generation. The cosine distance between each document and the word vectors contained in each topic from Step 2 is calculated in turn, and each document is assigned to a topic by a topic assignment method. See Section 3.5 for details.



Word Vector Generation Based on Bert
Bert (Bidirectional Encoder Representations from Transformers) [40] is a pre-trained language model released by Google that has achieved state-of-the-art results on 11 tasks in the NLP (natural language processing) field. It is based on a multi-layer bidirectional Transformer [41], and the framework consists of two steps: pre-training and fine-tuning. In the pre-training stage, it is trained in advance on existing unlabeled text and released as a general language model. In the fine-tuning stage, it can be fine-tuned with learning data, according to the task to be performed [42,43].
In this paper, the pre-trained Bert model called "Bert-Base, Uncased" [44] is used to generate word vectors and represent the semantics of words. This model was chosen because the Bert-Base model is smaller than the Bert-Large model, and the language used in this research is only English and does not need to be case-sensitive. By entering the sentences of each document, we obtain the word vector corresponding to each word in the sentence, which can accurately represent the semantic meaning of the word in its context. All word vector generation in this paper is performed in Python using the API released by [45], an open-source Bert service that allows users to invoke the Bert model by calling the service, without attending to the details of the Bert implementation. The important parameters max_seq_len and pooling_strategy are set to 512 and NONE, respectively.
Bert takes as input a sequence of no more than 512 tokens and outputs the representation of the sequence, which has one or two segments. The first token of the sequence is always [CLS], which contains the special classification embedding, and the other special token, [SEP], is used for separating segments. Bert takes the final hidden state, h, of the first token [CLS] as the representation of the whole sequence. WordPiece embedding [46] is used, and split word pieces are denoted with ##. Thus, the statistics of the lengths of the documents in the datasets are based on word pieces [47].
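To make the ## convention concrete, the following is a minimal greedy longest-match-first tokenizer in the WordPiece style (the vocabulary here is a toy example, not Bert's actual 30k-entry vocabulary):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split; continuation pieces get '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a continuation piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no vocabulary piece matched
        pieces.append(cur)
        start = end
    return pieces

vocab = {"geo", "##spatial", "data"}
print(wordpiece_tokenize("geospatial", vocab))  # ['geo', '##spatial']
print(wordpiece_tokenize("data", vocab))        # ['data']
```

Counting "geospatial" as two word pieces rather than one word is exactly why document-length statistics are reported over word pieces.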

Spherical k-Means Clustering
The spherical k-means method is introduced for clustering sets of sparse text data. This method is based on a vector space model, whose basic principle is to describe the degree to which two vectors point in the same direction by their similarity, rather than their length [48]. For example, in the vector space model V m , for each word vector w i ∈ V m , i = 1, 2, . . . , N, the inner product (Formula (1)) of two vectors is used to express the semantic similarity, where the column vectors are normalized (Formula (2)) to the unit length of the Euclidean norm, with the aim of assigning equal weights to each of the n points in the data set. Of course, we obtained these vectors after entering the text into the Bert model, rather than directly from the Bert model.
cos θ(x, y) := x^T y                              (1)
cos θ(x, y) = x^T y / (‖x‖_2 ‖y‖_2)               (2)
Formula (1) describes the similarity after x and y have been normalized to unit length, while Formula (2) is the general definition in terms of the standard inner product and the Euclidean norms.
Finding the clustering centers is also very important. For each word vector w_i with cluster assignment v(i) ∈ {1, 2, ..., K}, clustering seeks the centroids c_v, v = 1, 2, ..., K, that minimize the cosine distance (i.e., maximize the cosine similarity) between each w_i and its assigned centroid c_v(i) [49]. To find the number of clusters in the dataset used in the experiments, we ran the spherical clustering algorithm for a range of values and compared the results obtained for each value.
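A minimal pure-Python sketch of the spherical k-means loop described above (a simplified illustration, not the implementation used in the paper): vectors are normalized to the unit sphere, each vector is assigned to the centroid with the highest cosine similarity, and centroids are recomputed as renormalized mean directions.

```python
import math
import random

def unit(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cos(a, b):
    # For unit vectors the cosine similarity is just the inner product.
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, k, iters=50, seed=0):
    data = [unit(v) for v in vectors]
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(data, k)]
    assign = [-1] * len(data)
    for _ in range(iters):
        new_assign = [max(range(k), key=lambda j: cos(v, centroids[j]))
                      for v in data]
        if new_assign == assign:
            break  # assignments stable: converged
        assign = new_assign
        for j in range(k):
            members = [v for v, a in zip(data, assign) if a == j]
            if members:  # mean direction, renormalized back onto the sphere
                centroids[j] = unit([sum(col) for col in zip(*members)])
    return assign, centroids

# Two obvious directions: near the x-axis and near the y-axis.
points = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
labels, _ = spherical_kmeans(points, k=2)
print(labels)
```

Because only direction matters, a long document vector and a short one pointing the same way fall into the same cluster, which is the motivation for using cosine rather than Euclidean distance on text vectors.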

Example of Document Vector Generation
The document vector is generated by multiplying the word vector matrix (A) and the term-document matrix (B) in the figure below, i.e., A × B. The word vector matrix (A) is composed of the m-dimensional word vectors obtained, following the method in Section 3.2, from all the words contained in the documents. The term-document matrix (B) is obtained from the word frequencies of the words contained in each single document, provided that the order of the words (i.e., the rows in B) is the same as the position of the corresponding word vectors in matrix A (i.e., the columns in A); if a word contained in matrix A does not appear in a given document (e.g., DOC 1), the value at that position is set to 0.
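The construction reduces to a plain matrix product. In this sketch (toy values, not Bert outputs), A holds three 2-dimensional word vectors as its columns and B holds the per-document word counts:

```python
def matmul(A, B):
    """Multiply an m x Num matrix A by a Num x D matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# A: columns are the word vectors w1=[1,0], w2=[0,1], w3=[1,1] (m=2, Num=3).
A = [[1, 0, 1],
     [0, 1, 1]]
# B: term-document matrix; row order matches A's column order (Num=3, D=2).
# Doc1 contains w1 twice and w3 once; Doc2 contains w2 three times.
B = [[2, 0],
     [0, 3],
     [1, 0]]
docs = matmul(A, B)  # each column of the result is one document vector
print(docs)  # [[3, 0], [1, 3]] -> Doc1 = [3, 1], Doc2 = [0, 3]
```

Each document vector is thus the frequency-weighted sum of the vectors of the words it contains, so documents sharing vocabulary end up pointing in similar directions.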

Method of Document Topic Determination
Section 3.3 describes how we obtained multiple topics after clustering all documents, including Topic 1, Topic 2, Topic 3, and Topic 4 in Figure 5. Section 3.4 describes how we obtained the vector of each document, Doc N, shown in Figure 5. The method of assigning documents to topics is shown in Figure 5. Taking the first document as an example, first, the average of the five largest cosine similarity values between the document and the words of the first topic (the boldface words in Topic 1) was computed and taken as the similarity between the document and Topic 1. By analogy, the similarity values between the document and the other topics were then calculated. Finally, the similarity values between the document and all topics were compared, and the document was assigned to the topic with the maximum similarity value. The other documents were processed in the same way, and finally, all documents were assigned to different topics.
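A sketch of this assignment rule (helper names are illustrative; the vectors are toy 2-D examples, and top_n is lowered to 2 because the toy topics contain only three words each):

```python
import math

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def assign_topic(doc_vec, topics, top_n=5):
    """Similarity to a topic = mean of the top_n largest word cosine similarities;
    the document goes to the topic with the maximum such similarity."""
    best, best_sim = None, -2.0
    for name, word_vecs in topics.items():
        scores = sorted((cos(doc_vec, w) for w in word_vecs), reverse=True)
        n = min(top_n, len(scores))
        sim = sum(scores[:n]) / n
        if sim > best_sim:
            best, best_sim = name, sim
    return best

topics = {
    "topic1": [[1.0, 0.0], [0.9, 0.2], [0.8, 0.1]],
    "topic2": [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
}
print(assign_topic([1.0, 0.1], topics, top_n=2))  # topic1
```

Averaging only the top few words makes the score robust to the long tail of weakly related words inside each cluster.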


Topic Selection
In the trend analysis based on Bert-LSA, Bert is first used to obtain the vectors of the words contained in the documents; the acquisition method and parameter settings are detailed in Section 3.2. Then, the spherical k-means clustering algorithm is applied for clustering, and the optimal number of clusters k is determined by the elbow method. The core index of the elbow method is the Sum of Squared Errors (SSE), defined as

SSE = Σ_{i=1}^{k} Σ_{p∈C_i} |p − m_i|²

where C_i is the i-th cluster, p is a sample point in C_i, m_i is the centroid of C_i, and SSE is the clustering error over all samples, which represents the clustering effect. However, when the elbow is not obvious, the elbow method is combined with the Silhouette Coefficient method to jointly determine the number of clusters. The Silhouette Coefficient is an index that evaluates the degree of density and dispersion of the clusters. It is calculated as follows, and its value ranges over [−1, 1]; the larger the value is, the more reasonable the clustering is [50].
s(i) = (b(i) − a(i)) / max{a(i), b(i)}

where a(i) represents the average dissimilarity of the i-th vector to the other points within the same cluster, and b(i) represents the minimum, over the other clusters, of the average dissimilarity of the i-th vector to the points of that cluster.
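The SSE used by the elbow method can be computed directly from points, cluster assignments, and centroids; a minimal sketch (toy data, not the paper's experiments):

```python
def sse(points, labels, centroids):
    """Sum over all samples of the squared Euclidean distance to the
    centroid of the cluster each sample is assigned to."""
    total = 0.0
    for p, lab in zip(points, labels):
        m = centroids[lab]
        total += sum((x - y) ** 2 for x, y in zip(p, m))
    return total

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
labels = [0, 0, 1]
centroids = [[0.0, 1.0], [10.0, 10.0]]
print(sse(points, labels, centroids))  # 2.0 (1 + 1 + 0)
```

In the elbow method, this value is plotted against candidate values of k, and the k at which the curve's decrease flattens sharply is chosen.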

The Result of Trend Analysis
According to the topic number selection method in Section 4.1, the cluster numbers of the USA, China, India, Germany, the UK, the Russian Federation, Italy, and Others were finally set to 6, 5, 6, 5, 7, 6, 7, and 4, respectively. Table 1 shows the results of topic modeling using Bert-LSA. The results of the trend analysis show that the focus of each country's concern is different. The USA, for example, focuses on the environment, buildings, and fires. China pays attention to livability, information extraction, and crops, while words like government and policy also appear. India focuses on information technology, such as cloud computing, SDI, and WPS, as well as on disaster events such as floods. Italy and the Others group also pay attention to disaster-related content. Germany, the UK, and the Russian Federation all focus on content related to climate change. In general, countries pay more attention to disaster events (e.g., fires and floods), and related information technologies, such as cloud computing, Hadoop, and GEE, have also received higher attention. This is a good indication that our proposed method can successfully identify the current technologies and application trends related to the use of geospatial data, and can quickly provide research hotspots for relevant researchers, especially those who are not specialized in GIS. In this way, it considerably saves the time needed to read a large amount of literature, which is of practical significance.

Evaluation Method
The methods of evaluating topic models mainly include perplexity and topic consistency. Perplexity has its merits, as it evaluates probability-based topic models well; however, for non-probability-based topic models, it does not capture the semantic consistency between words [51]. Topic consistency can be used to measure whether the words within a topic are coherent, i.e., if a group of terms are consistent with each other, then these terms are coherent. For a specific topic, the semantic similarity between the words in the topic determines its degree of coherence, so topic consistency can be measured by the semantic similarity between the words in the topic [52][53][54][55]. The greater the consistency value of the topic is, the more coherent the words of each topic will be. To evaluate the non-probabilistic topic model proposed in this paper, the following two consistency measures were used: (1) University of Massachusetts (U_Mass) [56]; (2) Normalized Pointwise Mutual Information (NPMI) [57].
U_Mass is defined as:

U_Mass = Σ_{i=2}^{T} Σ_{j=1}^{i−1} log ((P(w_i, w_j) + ε) / P(w_j))

where P(w_i, w_j) is the joint probability of the two words w_i and w_j, and T is the number of words per topic. A small value for ε is chosen to avoid calculating the logarithm of 0. NPMI is defined as:

NPMI = (1/K) Σ_{k=1}^{K} Σ_{i=2}^{T} Σ_{j=1}^{i−1} [log (p(w_i, w_j) / (p(w_i) p(w_j))) / (−log p(w_i, w_j))]

where K is the number of topics, each topic consists of its T most relevant words, p(w_i, w_j) is the probability that the word pair (w_i, w_j) co-occurs in a document, and p(w_i) is the probability that the word w_i appears in a document.
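Both measures can be computed from document-level co-occurrence counts; a sketch over a toy corpus (eps is the smoothing constant mentioned above; the scoring here is per word pair of a single topic):

```python
import math

def doc_probs(docs, words):
    """Document-frequency estimates of p(w) and joint p(w_i, w_j)."""
    n = len(docs)
    p = {w: sum(w in d for d in docs) / n for w in words}
    pj = {}
    for i, wi in enumerate(words):
        for wj in words[:i]:
            pj[(wi, wj)] = sum(wi in d and wj in d for d in docs) / n
    return p, pj

def u_mass(docs, topic_words, eps=1e-12):
    p, pj = doc_probs(docs, topic_words)
    return sum(math.log((pj[(wi, wj)] + eps) / p[wj])
               for i, wi in enumerate(topic_words)
               for wj in topic_words[:i])

def npmi(docs, topic_words, eps=1e-12):
    p, pj = doc_probs(docs, topic_words)
    total, pairs = 0.0, 0
    for i, wi in enumerate(topic_words):
        for wj in topic_words[:i]:
            pij = pj[(wi, wj)] + eps
            total += math.log(pij / (p[wi] * p[wj])) / -math.log(pij)
            pairs += 1
    return total / pairs

docs = [{"geospatial", "data", "cloud"},
        {"geospatial", "data"},
        {"cloud", "computing"}]
# Words that co-occur often score higher than words that never co-occur.
print(u_mass(docs, ["geospatial", "data"]) > u_mass(docs, ["geospatial", "computing"]))  # True
```

Higher values mean the topic's top words tend to appear in the same documents, which is the intuition behind both coherence measures.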

Evaluation Result
The method proposed in this paper was compared with PLSA in terms of topic consistency, where the PLSA implementation uses open-source code (https://github.com/yedivanseven/PLSA) (accessed on 7 November 2021). Figures 6 and 7 show the average topic consistency calculated by using PLSA and Bert-LSA, respectively, where the abscissa is the number of words N selected in each topic, with values of N ranging from 3 to 13, and the ordinate is the topic consistency value. Here, the value of topic consistency is the average of the corresponding topic consistency values for all countries when different numbers of topics are selected. When evaluated with the U-Mass method, the topic consistency of the PLSA model remains almost constant as N increases, and its value is generally low. The topic consistency of the Bert-LSA model gradually decreases as N increases, but its value is generally higher than that of the PLSA model, which means that the Bert-LSA model performs better than the PLSA model.

When evaluated with the NPMI method, the topic consistency of the PLSA model decreases as N goes from three to five, increases from five to seven, and then remains basically unchanged thereafter. For the Bert-LSA model, the topic consistency keeps decreasing as N goes from three to seven, increases from seven to nine, and then keeps decreasing. On the whole, the Bert-LSA model still outperforms the PLSA model.


Conclusions and Discussion
In this paper, a new method of topic identification has been proposed. First, a word embedding algorithm based on a pre-trained model was adopted, which generates word representations that capture the context of a document. Then, a spherical k-means clustering algorithm was used to construct topic clusters. Finally, a topic assignment method, based on calculating the similarity between documents and topics, was used to assign documents to different topics.
The method proposed in this paper was applied to literature abstracts related to geospatial data. First, it reveals the characteristics of geospatial data technology and application development trends in related research in several leading countries. Second, the topic coherence of this method was evaluated using U-Mass and NPMI, and its performance was compared with that of the existing method, PLSA. The results show that the proposed method can produce highly coherent topics. The research in this paper provides new ideas for the trend analysis of technologies and applications related to geospatial data, and helps professionals engaged in geospatial data research to identify their future research directions at any time. In addition, this method captures the development trends of related technical fields through text, which can serve as an information tool for anyone responsible for strategic decision-making in sectors related to geospatial data, to determine the prospects and markets of the field. This paper also has some shortcomings; for example, the method may fail to identify topics successfully when the number of texts is extremely large. In the future, we will work on topic modeling for very large text collections.