Latent Dirichlet Allocation and t-Distributed Stochastic Neighbor Embedding Enhance Scientific Reading Comprehension of Articles Related to Enterprise Architecture

Abstract: As the amount of scientific information increases steadily, it is crucial to improve fast-reading comprehension. To grasp many scientific articles in a short period, artificial intelligence becomes essential. This paper aims to apply artificial intelligence methodologies to examine broad topics such as enterprise architecture in scientific articles. Analyzing abstracts with latent Dirichlet allocation or inverse document frequency appears to be more beneficial than exploring full texts. Furthermore, we demonstrate that t-distributed stochastic neighbor embedding is well suited to explore the degree of connectivity to neighboring topics, such as complexity theory. Artificial intelligence produces results that are similar to those obtained by manual reading. Our full-text study confirms enterprise architecture trends such as sustainability and modeling languages.


Introduction
Comprehending a scientific article (reading comprehension) is a sophisticated cognitive process that depends on numerous extrinsic and intrinsic factors [1]. First of all, precise keywords have to be grasped. The contexts in which these keywords occur are crucial. Additional proximal terms increase the dimensionality of the data acquisition and complicate the analysis with artificial intelligence (AI). However, this may be mitigated by lowering the dimensionality. One method to cope with this process is "neighbor embedding". It allows topics that are not directly related to the core area to be captured, giving insight beyond the core area. It can also predict whether a trend is connected to an external field. Therefore, the question arises as to whether it makes more sense to capture the whole text or only the summaries (abstracts) to explore this interconnectedness. In this paper, we examine both and demonstrate that, for this purpose, it makes more sense to focus on the abstracts.
The number of publications in the enterprise architecture (EA) research area is continuously growing [2]. In his much-quoted article, Zachman, who is regarded as one of the most significant researchers in modern EA, describes the need for a logical architecture to structure companies' systems as early as 1987 [3]. While early approaches were heavily information-technology focused, the focus has shifted towards business-related contexts such as business processes and organizational goals, as described by Winter and Fischer in 2006 [4].
Today, various methods and approaches support the management of existing architectures as well as their further development and transformation into a future state-of-the-art EA [5].
After three decades of scientific research, a broad range of topics has emerged within the EA research area, and it grows every year [6]. A holistic analysis of current trends and topics within EA is therefore difficult to perform by conventional means. However, a solution-oriented approach can be found in the field of artificial intelligence. Modern methods, such as topic modeling, make it possible to carry out full-text analyses systematically [7]. Thus, a large number of publications in the field of EA can be investigated simultaneously.
This study aims to provide an overarching view of current developments in the research area EA and to complement existing research carried out in the past. The identified topics and trends are discussed and examined for practical relevance.
To conduct the trend analysis based on artificial intelligence practices, design science research is used, defined in more detail by Hevner et al. [8] and Winter [9]. Using topic modeling as an algorithm for the automatic evaluation of scientific publications is not new. Buchkremer et al. [7] also used the methodology in their studies.
This article is structured as follows: In the next section, theoretical basics are explained to create a deeper understanding of the technology in use and serve as a theoretical introduction to the overall topic. In Section 3, the data preparation for the topic modeling analysis is described in detail. The data preparation and the number of topics to be examined will be determined and discussed. Furthermore, the selection of the training algorithm is explained. Next, the application is described with particular attention to the parameterization of the analysis procedure. Subsequently, in Section 4, the determined topics are presented, described individually, and discussed. Finally, in Section 5, the procedure is critically reflected, and limitations of the analysis are pointed out. The elaboration comes to an end with a conclusion, including an outlook on possible further research approaches.

Enterprise Architecture
As Saint-Louis et al. [10] confirmed in a recent study, heterogeneous definitions and descriptions of the term 'Enterprise Architecture' can be found in the literature. In terms of this study, we rely on the definition pronounced by the international standard ISO/IEC/IEEE 42010:2011 [11], which describes EA as a methodology for managing and developing enterprise architecture and defines the term as follows: 'Enterprise Architecture is a discipline that manages the fundamental organization of an enterprise, which is embodied in its components, their relationships to one another and the environment, and the principles that govern its design and evolution.'

State-of-the-Art Reviews on Enterprise Architecture
Since our previous study, conducted in 2016, roughly four years have passed, and the body of knowledge of EA has continued to grow. More than 300 peer-reviewed scientific contributions concerning EA have been published each year [6]. Among these contributions, there are also several recent state-of-the-art reviews: Kitsios and Kamariotou [12] examine how existing EA modeling frameworks cover business strategy optimization and provide insight into a special subarea of EA. Zhang et al. [13] analyze the subject of EA with its links to the closely related subject of business-IT alignment. Ansyori et al. [14] consider specifically the critical success factors for implementing EA. Dumitriu and Popescu [15] cover the design of EA frameworks in their review.
However, none of the reviews conducted since 2016 come from an overarching point of view that considers the subject of EA as a whole. We want to address this gap with our work. As shown previously, natural language processing (NLP) is well suited to achieve this objective, so we have chosen it again for our research. Moreover, this work presents an opportunity to validate the predictions made in 2016 and indicate whether an NLP-supported systematic review can provide an accurate prognosis on developing a scientific research field.

Topic Modeling as a Part of Natural Language Processing
NLP is generally referred to as a collective term for the machine processing of natural language. It combines computational linguistics, information technology, artificial intelligence, and the cognitive sciences [16]. The combination of computer science and linguistics creates possibilities to process natural language by machine, for example, through stochastic algorithms and program logic [17]. In doing so, mainly recognized language areas, syntactic features, and semantics from linguistics are used [18]. By using intelligent and self-learning algorithms of artificial intelligence, machines can cognitively interpret and process natural language [19].
Machine learning (ML), an essential part of artificial intelligence research, was defined by Samuel [20]. Employing ML, computers can learn and continuously improve independently and without explicit programming [20]. In general, ML can be realized by different algorithms. In essence, however, a distinction is made between supervised and unsupervised learning [20].
A combination of both is used with topic modeling, which identifies various topics for given documents by unsupervised data clustering [21]. As Sun et al. [22] explain, topic modeling consists of finding topics T that best describe the text's content. It is assumed that each document is a mixture of topics, where T is given, and a multinomial distribution of words is described for each topic. Hong and Davison [23] describe topic modeling as an automatism for extracting dominant topics from a text corpus. Often, the Latent Dirichlet Allocation (LDA) algorithm presented by Blei et al. [24] is considered the basis of topic modeling. Anwar et al. [25] describe the algorithm as a flexible, generative, probabilistic topic model for collections of discrete data, which treats each document as a mixture over a selection of topics. Each topic is represented as a list of words with probabilities of belonging. Haidar and Kurimo [26] also understand LDA as a generative, probabilistic topic model and define it as a three-level Bayesian model. In a recent paper, Hussein et al. explain that in addition to classical deep learning methods, transformer technologies such as BERT (Bidirectional Encoder Representations from Transformers) can be applied to explore trends in texts. Similar to BERT, LDA and IDF also target the frequency of words in a text; t-SNE, similar to transformers, examines the proximity of words to each other [27]. Hao et al. tackle cross-domain sentiment alignment by applying stochastic word embedding [28].

Topic Modeling Methodology for Literature Reviews
To carry out the analysis of the given publications without disturbances, text preparation steps must be carried out in advance [7]. Since the research contributions are initially available as PDF files and are therefore not directly machine-readable as plain text, they must be converted into a processable text format. As Welbers et al. [29] show, the R-framework 'pdftools', developed by Ooms [30], offers simple possibilities for converting the text. The texts are vectorized for more efficient processing and then cleaned up using various text mining methods [31,32]. As the original wording of the publications is to be used for evaluation, punctuation and stop words such as 'and' or 'the' are removed from the texts, as these do not offer any significant added value for the analysis. Yaram [33] uses the R-framework 'tm' for text cleansing, which was introduced by Feinerer [34] and will also be used in this work. To perform topic modeling using LDA afterward, it is necessary to transform the already vectorized texts into a document-term matrix (DTM) [35]. The DTM includes the number of occurrences of the respective words per document and represents them numerically. In the following figure (see Figure 1), d denotes the documents and w the weighting or occurrence of the words t per document [36]. Accordingly, for example, w11 represents the weighting or occurrence of word 1 in document 1; m and n are used as counters for the words and documents considered. After the transformation into a DTM, a list of the words with the highest frequency is generated and analyzed. This list shows, for example, that words like 'IEEE', 'vol', or 'city' occur in the corpus at high frequency, which can lead to a distortion of the results of the topic modeling [37,38]. These very words can be classified as stop words and should be cleaned up. For their exclusion, a separate function is implemented in R to exclude a range of freely selectable words from the text.
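The preprocessing steps above (tokenization, stop-word removal, and DTM construction) can be sketched in a few lines. The following is a minimal Python illustration with an invented toy corpus and stop-word list; it is not the R pipeline ('tm', 'pdftools') used in this study.

```python
from collections import Counter

# Illustrative stop-word list, including corpus-specific noise terms.
STOP_WORDS = {"and", "the", "of", "ieee", "vol", "city"}

def tokenize(text):
    # Lowercase, strip punctuation, split on whitespace, drop stop words.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return [t for t in cleaned.split() if t not in STOP_WORDS]

def build_dtm(documents):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in doc d."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocab = sorted({term for c in counts for term in c})
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Toy corpus standing in for the converted publication texts.
docs = ["The enterprise architecture and the cloud.",
        "Cloud computing shapes enterprise architecture, IEEE vol 2."]
vocab, dtm = build_dtm(docs)
```

Stop words and high-frequency noise terms such as 'IEEE' or 'vol' never reach the vocabulary, so they cannot distort the subsequent topic modeling.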
Modeling is mainly influenced by the number of topics to be identified [39]. To determine the ideal number of topics, empirical studies often rely on log-likelihood measurement [40,41].
To validate the results determined by the LDA algorithm, cross-validation is applied [42]. As part of this procedure, the training text corpus is divided into two parts. The first and usually more extensive part of the text is used for training and the second for testing. The test corpus w consists of documents that have not been seen during training. The trained model is described as a topic matrix with Φ as the topic-word distribution. The parameter Θ is not considered because it represents the training set's document-topic distribution and is therefore unsuitable for evaluation. The evaluation thus considers the log-likelihood L(w) = log p(w | Φ, α) of a set of unseen documents w_i given the topic-word distribution Φ [43]. In particular, the perplexity of the trained model on the test corpus is used as a measure of the transferability of the model [44]. The perplexity is defined as perplexity(w) = exp(−L(w)/N), where N is the total number of words in the test corpus [43]. It is a decreasing function of the log-likelihood L(w). Therefore, it generally holds that the lower the calculated perplexity, the better the performance of the trained topic model [45,46].
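The perplexity measure above reduces to a one-line transformation of the held-out log-likelihood. A minimal Python sketch, with illustrative numbers only:

```python
import math

def perplexity(log_likelihood, n_words):
    """perplexity(w) = exp(-L(w) / N), N = number of words in the test corpus."""
    return math.exp(-log_likelihood / n_words)

# A model assigning each of 1000 held-out words an average log-probability
# of -6.0 yields exp(6.0), roughly 403.
pp = perplexity(-6000.0, 1000)
```

Because perplexity decreases as the log-likelihood increases, a model that explains the held-out documents better always scores lower.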
For the analyses of this paper, we used the full text of 231 scientific publications in the subject area of EA from various scientific libraries. The process of data retrieval is explained in more detail in Section 3.4.
In this paper, the evaluation of the optimal number of topics is implemented in the programming language R. A separate function that performs multiple k-fold cross-validations is developed. For this purpose, a model is trained in each case, then transferred to a test corpus to measure the perplexity. A value range between 2 and 300 topics is analyzed to determine the optimal number of topics. Since the evaluation is computationally intensive, several computing clusters have to be formed to parallelize the calculations. The following graphic in Figure 2 visualizes the results of the analysis.
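The structure of such a cross-validation over candidate topic counts can be outlined as follows. This is a hypothetical Python sketch: `train_lda` and `compute_perplexity` are placeholder stubs standing in for the actual R model fitting and perplexity measurement.

```python
import random

def train_lda(train_docs, n_topics):
    # Placeholder: a real implementation would fit an LDA model here.
    return {"n_topics": n_topics, "n_train": len(train_docs)}

def compute_perplexity(model, test_docs):
    # Placeholder: a real implementation would score held-out documents.
    return float(len(test_docs))

def kfold_indices(n_docs, k, seed=2020):
    """Shuffle document indices reproducibly and split them into k folds."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(docs, topic_counts, k=5):
    """Average held-out perplexity for each candidate number of topics."""
    folds = kfold_indices(len(docs), k)
    results = {}
    for n_topics in topic_counts:
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train = [d for j, d in enumerate(docs) if j not in held_out]
            test = [docs[j] for j in test_idx]
            model = train_lda(train, n_topics)
            scores.append(compute_perplexity(model, test))
        results[n_topics] = sum(scores) / len(scores)
    return results
```

The outer loop over `topic_counts` is embarrassingly parallel, which is why the evaluation in this study is distributed over several computing clusters.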
The optimum number of topics for the application lies between 40 and 100 (see Figure 2). Another possibility for evaluating the best possible number of topics is the R-framework 'ldatuning', developed by Murzintcev [47]. The framework uses several metrics for measurement, three of which we use for our study. Griffiths 2004 [48] represents an approach in which the number of topics is optimal when the log-likelihood of the data becomes maximal. The CaoJuan 2009 [49] metric measures the stability of the topic structure using the average cosine distance between every pair of topics. Arun 2010 [50] finds the optimal number of topics by applying a symmetric Kullback-Leibler divergence to the distributions generated from the topic-word and document-topic matrices, viewing a topic model as matrix factorization [51].
All three metrics (see Figure 3) suggest that the optimal number of topics is between 40 and 60. A low value is preferred in the upper graph, while in the lower graph, a high value is favored. This confirms the perplexity analysis carried out previously. Thus, for this study, 50 topics are to be determined.
The R-framework 'topicmodels' presented by Grün and Hornik [52] is used for the investigation intended in this paper. This framework includes an implementation of the topic modeling algorithm LDA and other basic methods required for the analysis, for example, perplexity. The framework is based on the R-framework 'tm', which is also used in this study [53]. In general, a distinction can be made between two approaches to estimate the topic model, VEM and Gibbs sampling [52]. In this study, the Gibbs sampling algorithm implemented by Griffiths and Steyvers [48] is used, as it has proven itself in various studies [37,40,54]. The algorithm performs topic modeling as follows: Given are the vector of all words w⃗ and the vector of their topic assignments z⃗ over the data collection W. The topic assignment depends on the assignments at all other word positions; namely, the topic assignment of a word t is sampled from the following multinomial distribution [55]:

p(z_i = k | z⃗_¬i, w⃗) ∝ (n^(t)_{k,¬i} + β) / (∑_t n^(t)_{k,¬i} + Vβ) · (n^(k)_{m,¬i} + α) / (∑^K_{k=1} n^(k)_{m,¬i} + Kα),

where V denotes the vocabulary size and K the number of topics.
In the formula, n^(t)_{k,¬i} denotes the number of assignments of the word t to the topic k, and ∑_t n^(t)_{k,¬i} represents the total number of words assigned to the topic k. Furthermore, n^(k)_{m,¬i} denotes the number of words in the document m assigned to topic k. All counts described above exclude the current assignment i. The sum ∑^K_{k=1} n^(k)_{m,¬i} stands for the total number of words in the document m except for the word t that is currently being dealt with. α and β denote the Dirichlet parameters, which are symmetric [55].
In addition to selecting the training algorithm, an optimal parameterization of the methodology according to the requirements is necessary.

t-Distributed Stochastic Neighbor Embedding for Topic Model Visualization
The tSNE algorithm, introduced by van der Maaten and Hinton [56], is used to overcome the challenge of representing high-dimensional data, and it has a wide range of applications, e.g., in the life sciences and in the analysis of deep learning networks [57–59].
In general, tSNE reduces the dimensionality of data and produces 2D or 3D embeddings, preserving local structures of the high-dimensional data. Typical tasks performed by users of tSNE are based on identifying relationships between data points and their origin. The tasks often include identifying visual clusters and verifying them, e.g., using parallel coordinate plots [60].
The algorithm tSNE calculates two joint probability distributions: P, representing the pairwise similarities between data points in the high-dimensional space, and Q, describing the similarities in the low-dimensional space. The goal of this logic is to achieve a faithful representation of P in the low-dimensional space by Q. This is achieved by minimizing the cost function C given by the Kullback-Leibler divergence between the joint probability distributions P and Q, thereby optimizing the positions of the points in the low-dimensional space [60]. The cost function is defined as [61]

C(ε) = KL(P || Q) = ∑_{i≠j} p_ij log(p_ij / q_ij),

where ε denotes an s-dimensional embedding, p_ij defines the joint probabilities that measure the pairwise similarity between two high-dimensional input data points, and q_ij the embedding similarity between the corresponding two low-dimensional points [61]. Maaten describes that the objective function focuses on modeling high values of p_ij (similar objects) by high values of q_ij (nearby points in the embedding space) [61]; this is because the Kullback-Leibler divergence is asymmetric. With respect to the embedding ε, the objective function is non-convex and is therefore typically minimized using gradient descent, where the gradient is given by [61]

∂C/∂y_i = 4 ∑_{j≠i} (p_ij − q_ij) q_ij Z (y_i − y_j),

where y_i and y_j denote the low-dimensional points, and Z is defined as the normalization term [61]

Z = ∑_{k≠l} (1 + ||y_k − y_l||²)^(−1).

We refer to van der Maaten and Hinton [56] and Maaten [61] for more details on the algorithm.
In this paper, we use the R-framework 'Rtsne' introduced by Krijthe and van der Maaten [62], which implements the Barnes-Hut tSNE algorithm to reduce the computational complexity [61].
While the original tSNE uses a brute-force approach with O(n²) computational and memory complexity, the Barnes-Hut variant uses a quadtree, reducing the computational complexity to O(n log n) and the memory complexity to O(n) per iteration [60].
For more details, we refer to van der Maaten [61].
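The gradient of the cost function described in this section can be evaluated directly for small examples. The following pure-Python sketch assumes the joint probabilities P are already given (in a real run they are computed from the high-dimensional affinities) and evaluates the exact gradient, without the Barnes-Hut quadtree approximation:

```python
def tsne_gradient(P, Y):
    """Gradient dC/dy_i = 4 * sum_j (p_ij - q_ij) * q_ij * Z * (y_i - y_j)."""
    n = len(Y)
    # Unnormalized Student-t affinities w and the normalization term Z.
    w = [[0.0] * n for _ in range(n)]
    Z = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                w[i][j] = 1.0 / (1.0 + d2)
                Z += w[i][j]
    grad = []
    for i in range(n):
        g = [0.0] * len(Y[i])
        for j in range(n):
            if i != j:
                q = w[i][j] / Z
                # Note q_ij * Z = w[i][j], so the update uses w directly.
                coef = 4.0 * (P[i][j] - q) * w[i][j]
                for d in range(len(g)):
                    g[d] += coef * (Y[i][d] - Y[j][d])
        grad.append(g)
    return grad
```

When P already matches Q (e.g., two points with p_ij = 0.5 at unit distance), the gradient vanishes; when p_ij exceeds q_ij, the gradient pulls similar points together, exactly as the asymmetric Kullback-Leibler objective prescribes.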

Comparison to the Methodology of Previous Studies
Gampfer et al. [6] ran an EA trend analysis supported by NLP in 2016, published in 2018. Table 1 shows a comparison of the methods used in both papers. While the overall approach and the subject are similar, there are also crucial differences that will be addressed subsequently. To better judge the comparability of the methods used, we applied the methodology presented in this study to Buchkremer and coworkers [6].
The application shows that the topics cloud, agile/adapt, smart, big data, sustainable, entrepreneurial, complexity theory, and IoT could also be determined by the new methodology. Table 2 shows the characteristic terms of the EA publications and the trends and topics determined based on these terms. The direct comparison shows that the same trends and topics could be determined with both algorithms. Hence, from a high-level point of view, both methods produce similar results. However, a significant difference becomes evident when looking at how individual documents are mapped to trends. Gampfer et al. [6] use an n-to-n mapping between documents and trends, meaning a document can belong to multiple trends. This work uses an n-to-1 mapping, meaning a document can belong to one trend only.

Information Retrieval: Publication Search and Selection Process
For a careful consideration of current research topics within EA, publications from the years 2019 and 2020 were extracted from multiple relevant databases. In the selection of scientific publications, the focus is on peer-reviewed journals and conference proceedings. Furthermore, potential duplicates were cleaned up based on a manual comparison of the publication titles. The following combined search string is used to identify as many publications with a topic-specific focus as possible: '"Enterprise Architecture" OR "Enterprise Architecture Management"'. The individual search results were limited to the publication period between 2019 and 2020 to ensure that the publications are up to date. The search results of the following databases are looked at: Within the scope of data acquisition, 271 documents on the subject area EA were retrieved from 30 December 2019 to 10 January 2020. After the manual review and the cleaning of potential duplicates, 231 documents were identified and examined within the study.

Application of Algorithm and Parametrization
The parameter seed is often used in programming languages to make results reproducible [63,64]. The hyperparameter marks a starting point for the generation of a sequence of random numbers. If the random number generator is identical, reproducibility of the results can be achieved with the same configuration. For this study, the hyperparameter is set to seed = 2020; this selection was made arbitrarily. In their studies, Grün and Hornik specify a repetition rate between 1000 and 2000 [52]. Séaghdha uses a repetition rate of 1000 in his studies [65]. Our tests have shown that the best results are achieved with 2000 repetitions. Accordingly, the algorithm is parameterized with iteration = 2000. In principle, the hyperparameter best can be used to decide whether all training runs should report a result (for this purpose, best = FALSE is set) or only those which, in retrospect, show the best logarithmic probability [52]. In this study, only the results with the best possible probabilities should be used; thus, the parameter best = TRUE is defined. The control value k indicates the number of topics to be determined and is set to k = 50 for this investigation, as described in the previous section. The estimation parameter α is set to 50/k, based on the recommendation of Griffiths and Steyvers [48]. Furthermore, the DTM, as the basis of the analysis, is passed to the function.
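The role of the seed hyperparameter can be illustrated with a short Python sketch: identical seeds produce identical random sequences, and hence reproducible training runs, while a different seed yields a different sequence.

```python
import random

def sample_run(seed, n=5):
    """Draw n pseudo-random numbers from a generator initialized with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

run_a = sample_run(2020)  # same seed as this study's configuration
run_b = sample_run(2020)  # identical sequence: reproducible
run_c = sample_run(2021)  # different seed: different sequence
```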

Current Enterprise Architecture Research Trends
To identify the current EA trend topics, we analyzed full-text publications and applied an approach that combines unsupervised and supervised techniques. First, we ran a fully unsupervised algorithm to obtain clusters of terms that occur together in the documents. We identified current trends based on the clusters and validated them by checking the assigned documents in a second step. We then used the results of both steps to analyze the topics in more detail. In the analysis, we compared the trends we had identified with other studies and with the Gartner Hype Cycle for Enterprise Architecture [66] to examine the practical relevance of the topics.

Identifying and Measuring Current EA Trends
For the topic identification, we used the LDA algorithm as an unsupervised method for clustering terms. The algorithm defines the clusters based on the occurrence of terms that belong to a topic, or rather the probability of words belonging to a topic; for details on the method, see Section 3. The application of the method resulted in a list of terms that was manually reviewed to identify relevant subjects in terms of content. Terms that could not be mapped to a topic were excluded from the analysis. Table 3 shows the result of the review and the topic assignment. For each document, we tagged the topic with the highest fit as per the probability calculated by the algorithm. According to the results shown in Figure 4, Sustainability has the most significant relevance, while the Internet of Things showed the smallest number of assigned documents.
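The n-to-1 tagging step, assigning each document to its single most probable topic, can be sketched as follows; the topic names and document-topic distributions are illustrative only.

```python
def tag_documents(doc_topic, topic_names):
    """Assign each document the name of its highest-probability topic."""
    tags = []
    for dist in doc_topic:
        best = max(range(len(dist)), key=lambda k: dist[k])
        tags.append(topic_names[best])
    return tags

# Hypothetical document-topic distributions for two documents over three trends.
names = ["Sustainability", "Cloud", "IoT"]
doc_topic = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
tags = tag_documents(doc_topic, names)
```

Counting the resulting tags per topic yields the relevance ranking reported in Figure 4.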

The distribution of EA trends identified by this work mostly confirms the predictions of Buchkremer and coworkers [6] made in 2016. Sustainability is clearly on the rise, while Agile methodology has lost researchers' attention in the field. One striking deviation from the prediction concerns the field of complexity theory. While the forecast of 2016 indicated that this topic would play a niche role, current results show an increasing interest in the subject. In Section 4 of this paper, we take a more detailed look at the individual trends.

Significance of Full-Text Mapping and the Deployment of t-SNE in Analyzing EA Trends
Compared to the 2018 study by Buchkremer and coworkers, it can be seen that full-text analysis yields results similar to those of abstract analysis. To identify trends for a topic with many publications, we therefore recommend examining abstracts instead of full texts: the results are comparable, and the analysis is more cost-effective and less computationally intensive. t-SNE is helpful in identifying the degree of interconnectedness. Thus, trends are surfaced that the average expert might not have recognized as directly related to the topic area, such as complexity theory. t-SNE also shows that the discipline of EA is becoming more interconnected and multidisciplinary overall, and an increasing number of trends can be identified that are not part of the core EA topic (see Figure 5).
The following figure shows that many documents have more than one topic assigned. This is an indicator that topics are related to each other.
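One way to quantify this overlap is to count documents whose topic distribution has more than one topic above a probability threshold. The matrix and threshold below are illustrative assumptions, not the study's actual values.

```python
# Count documents with more than one topic above a probability threshold.
# Rows are documents, columns are topic probabilities (illustrative values).
doc_topic = [
    [0.70, 0.25, 0.05],
    [0.45, 0.40, 0.15],
    [0.10, 0.10, 0.80],
]
THRESHOLD = 0.20  # hypothetical cut-off for "topic is assigned"

multi_topic = [
    i for i, row in enumerate(doc_topic)
    if sum(p > THRESHOLD for p in row) > 1
]
# Documents 0 and 1 exceed the threshold for two topics each.
```

A high share of such documents is exactly the indicator of topic relatedness described above.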
Figure 5. Document grouping based on their similarity using t-SNE.

Cloud Computing and EA
The investigation shows that cloud computing [67,68], as an overarching theme, currently forms a trend within EA. Gampfer et al. [6] forecast a continuing trend for the topic, which can also be seen in this analysis, as the focus of modern IT architectures in particular has moved away from stationary system landscapes to cloud deployment in recent years [69]. In practice, however, the relevance of cloud computing seems to be declining. While the topic was already listed as 'sliding into the trough' in Gartner's 2017 'Hype Cycle' [70], it is no longer identified as a trend in the 2019 'Hype Cycle' [66].

Sustainability and EA
As the study by Gampfer et al. [6] shows, there is growing interest in the topic of sustainability in the EA sector. In general, sustainability is considered one of the significant challenges of contemporary society [71]. Likewise, more and more companies adopt a sustainability strategy, which underlines the topic's relevance to the economy [72]. Thus, sustainability is also in demand in the EA context, especially regarding how developments can be made long-term and sustainable [73]. The topic does not appear in Gartner's 'Hype Cycle' for EA [66].

Digital Transformation and EA
Digital transformation is the umbrella term for the digitization of today's society [74]. According to Zimmermann et al. [75], the term covers technological megatrends such as big data, artificial intelligence, and cloud computing. From an economic perspective, digital transformation enables new technologies to achieve competitive advantages [76]. EA's task is to support the digital transformation process by continuously evaluating and reconfiguring a company's value creation mechanisms. The resulting change interacts with all information systems and affects the existing system architectures [77]. The other trends identified can be assigned to this topic. Gartner's current 'Hype Cycle' covers a sub-discipline of digital transformation and focuses on the change within companies through the hype 'Digital Business Transformation' [66].

Pattern Recognition and EA
A current trend in pattern recognition, or machine learning, in EA can be identified based on the topic analysis. Gartner's 'Hype Cycle' for the EA area also shows a current trend in this topic [66]. Machine learning is currently being taken up in practice and research and influences the digital transformation [78]. In particular, the opportunity to automate processes and interactions as far as possible, based on continuous automatic learning, is prompting companies to implement machine learning [79].

Complexity Theory and EA
Although Gampfer et al. [6] only forecast a continuation of the complexity-theory trend until 2017, this work's investigation shows an ongoing trend. In principle, complexity theory can be used in EA to understand architectures and measure their complexity [80]. The Gartner hype cycle does not attach any relevance to the topic of complexity theory [66].

Modeling Languages and EA
EA modeling languages center heavily around ArchiMate. Fritscher and Pigneur [81] describe ArchiMate as a support language for modeling structures within EA. Landthaler et al. [82] describe ArchiMate as the standard for modeling modern EA. ArchiMate does not appear in the Gartner hype cycle for EA [66], although Perez-Castillo et al. [83] show its high relevance in various modeling applications in the field of EA. In empirical studies, ArchiMate can be identified as a trend within EA.

Big Data and EA
As Lu and Liu [84] point out, there is a steady increase in Big Data technology publications from 2011 onwards. In practice, Big Data can be beneficial in EA decision making and strategy development [85]. The topic modeling conducted in this study confirms a continuation of the trend in EA. The current Gartner hype cycle for EA does not identify Big Data as a trend [66].

Microservices and EA
According to the analysis results of this study, microservices can be classified as a trend within EA. Zimmermann et al. [75] describe microservices as the core area of digitization. In principle, microservices can be defined as individual applications with independent functionalities and the opposite of large monolithic systems [86]. Although not included in the Gartner hype cycle [66], microservice architectures are gaining more importance in practice due to their flexibility and independence [87].

Security and EA
Another EA trend can be found in the area of security, or cybersecurity. This trend is understandable, since the increasing number of IT-supported processes and the ongoing digitalization of companies entail a correspondingly large exposure to cyber-threats [88]. As Halawi et al. [5] describe, empirically, this trend is not new but firmly connected with IT and EA. In the current Gartner hype cycle, the topic of security architecture is at a peak and can therefore be considered relevant in practice [66].

Internet of Things and EA
Based on the topic modeling results, the topic area Internet of Things (IoT) [89,90] can be classified as relevant. It is necessary to integrate IoT into EA as an essential part of 'Industry 4.0' [91]. Zimmermann et al. [75] describe IoT as a core aspect of digitization and a megatrend in digital architecture. Gampfer et al. [6] forecast continuous growth of the technology in the following years and note a high impact on EA. Although IoT was identified as a trend by Gartner in the hype cycle of 2017 [70], the topic is no longer considered as such in the current edition [66].

Agile Methodology and EA
According to Gartner's hype cycle [66], Agile methodology, or Agile Architecture, is a trend of great significance. Although it is seen as a challenge to combine agile methods and EA [92], companies trust agile approaches to deliver efficient software management and greater flexibility in software maintenance [93]. The present analysis shows that the topic of agile methodology is also a current trend in EA science.

Continuous Planning and EA
According to the word groups identified by the topic modeling, Continuous Planning is another current trend within EA. Continuous Planning allows organizations to meet the challenge of constantly emerging new requirements [94]. The planning methodology, which is significantly based on release management, uses agile methods and attempts to create a close integration between the planning and implementation of changes [95], along with more frequent delivery of new functionality in the software area. The Gartner hype cycle shows that this topic is currently becoming a trend in practice [66].

General Discussion and Conclusions
In this paper, an analysis of topics and trends of the years 2019-2020 in EA was conducted using machine learning. To this end, a method was developed that makes it possible to perform arbitrary topic analyses in empirical studies and practice. The implementation was based on the automatic evaluation of scientific papers published by Buchkremer et al. [7]. To enable a high-performance implementation in the programming language R, various existing functionalities were extended with parallel processing, which proved helpful. Furthermore, relevant findings on using the R programming language in machine learning were collected, which will be helpful in future analyses.
As a result of the investigation, the trends in the area of EA predicted by Gampfer et al. [6] could be verified and confirmed. Some new topics and research trends were also identified. Finally, the topics of cloud computing, sustainability, capability management, digital transformation, pattern recognition, complexity theory, modeling languages, big data, microservices, security, the internet of things, agile methodology, and continuous planning were identified as relevant. To establish a link to practice, all identified topics were compared with the current Gartner hype cycle for EA. In this way, discrepancies between empirical research and practice were evaluated.
The automated analysis of full texts has proven beneficial for uncovering additional insights, especially when there is a limited number of documents to be analyzed. The analysis of 231 full texts yielded results similar to those of 3799 abstracts. However, in terms of the methodology, selecting only abstracts reveals the most relevant content. Therefore, as a general rule of thumb for any future study applying a methodology similar to ours, we recommend analyzing not full texts but only abstracts, given that the corpus has sufficient size. Still, we also want to emphasize that the context of the state-of-the-art analysis has a significant impact, which is why we suggest iteratively refining the approach in a way that is both goal-oriented and fit for purpose.
In summary, this paper shows that the use of state-of-the-art methods such as machine learning is beneficial for topic and trend analysis, as it allows for an overarching view of the literature, in contrast to a classical systematic literature analysis.
What is new is that different methods lead to nearly the same result, and, notably, the investigation of full texts does not provide significant added value in the trend analysis. It should be noted, however, that we have studied this phenomenon only for EA. With LDA/t-SNE, a known combination of methods for capturing the complexity of texts, it can be quickly determined how far a trend has already differentiated itself from other trends. To our knowledge, this is the first study of EA conducted with text-analysis tools, and predictions about EA studied in this way have been confirmed by text analysis for the first time.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to license agreements and copyright.

Conflicts of Interest:
The authors declare no conflict of interest.