Frequency of Neuroendocrine Tumor Studies: Using Latent Dirichlet Allocation and HJ-Biplot Statistical Methods

Background: Neuroendocrine tumors (NETs) are severe and relatively rare and may affect any organ of the human body. The prevalence of NETs has increased in recent years; however, more data seem to be available for particular types, and, despite the efforts of different guidelines, there is no consensus on how to identify the different types of NETs. In this review, we investigated the countries that published the most articles about NETs, the most frequent organs affected, and the most common related topics. Methods: This work used the Latent Dirichlet Allocation (LDA) method to identify and interpret scientific information in relation to the categories in a set of documents. The HJ-Biplot method was also used to determine the relationship between the analyzed topics by taking into consideration the years under study. Results: In this study, a literature review was conducted, from which a total of 7658 abstracts of scientific articles published between 1981 and 2020 were extracted. The United States, Germany, United Kingdom, France, and Italy published the majority of studies on NETs, of which pancreatic tumors were the most studied. The five most frequent topics were t_21 (clinical benefit), t_11 (pancreatic neuroendocrine tumors), t_13 (patients one year after treatment), t_17 (prognosis of survival before and after resection), and t_3 (markers for carcinomas). Finally, the results were put through a two-way multivariate analysis (HJ-Biplot), which generated a new interpretation: we grouped topics by year and discovered which NETs were the most relevant for which years. Conclusions: The most frequent topics found in our review highlighted the severity of NETs: patients have a poor prognosis of survival and a high probability of tumor recurrence.


Introduction
Neuroendocrine tumors (NETs) are so called because the organs they affect present cells with neural and endocrine properties [1]. NETs are malignant epithelial neoplasms with a predominant neuroendocrine differentiation [2].
NET severity may be graded from 1 to 3, but some well-differentiated NETs have a distinctive feature related to Ki67 index marker counts higher than 20% or a mitotic count higher than 20 per 2 mm² [3]. In 2010, the World Health Organization (WHO) created a three-tier grading scheme and pointed out that the grade 3 group can indeed be heterogeneous, comprising both poorly differentiated and well-differentiated forms [4]. Recently, a new guideline proposed a change in the grade 1-2 cutoff, which was implemented (grade 1: Ki67 ≤ 3%). Early diagnosis of NETs is essential to increasing patient survival [5]. NETs may appear in any organ of the body, but they are more frequently the source of metastasis in the small intestine followed by the pancreas, which causes metastases in the lungs and colon [6].
The incidence of NETs has increased in different countries over the years, probably because of advances in diagnostic techniques, such as endoscopy and computed tomography, higher awareness among clinicians, and an aging population [7]. Several studies over the years have shown an increase in the prevalence and incidence of NETs [7,8], but the majority were conducted in the United States and Europe [8], so little is known about NET incidence and prevalence in developing countries or about which organs are the most affected there. The aim of this review was to analyze which countries contributed the most scientific evidence about NETs. In addition, we investigated which related topics were the most studied between 1981 and 2020.
In this work, we used a Latent Dirichlet Allocation (LDA) model to identify and interpret NET topics and research trends in our set of documents [9]. LDA is a Bayesian model, applied to a collection of documents called the "corpus," that is regarded as a probabilistic extension of latent semantic analysis [9,10]. It places a sparse Dirichlet prior on the distribution of topics in the documents and, fitted here with Gibbs sampling [11], estimates topic probabilities for documents that combine various topics in different proportions. In addition, the HJ-biplot method [12] was used to obtain a more precise data evaluation by revealing relationships between the analyzed data.

Materials and Methods
First, the basic concepts of the Latent Dirichlet Allocation (LDA) are introduced. Then, the procedure is used to identify latent topics and research trends in neuroendocrine tumors. Next, the quantitative indices are defined so the results can be explained. Finally, the HJ-biplot provides a graphic multivariate representation of rows and columns of a data matrix in a low-dimensional subspace where the results' relative positions are interpretable.

Latent Dirichlet Allocation (LDA)
LDA is an unsupervised text-mining method that identifies the themes or topics of documents within a larger collection, the corpus. It is based on Bayesian models and is considered to be an extension of probabilistic latent semantic analysis [10,13]. LDA places a sparse Dirichlet prior on the distribution of topics within a document, and a Gibbs sampler [11] is used to assign topic probabilities to each term. The documents are then grouped into the topics to which they belong, assuming that each document exhibits a mixture of topics in different proportions. LDA allows the inference or estimation of latent variables; that is, it calculates their conditional distribution given the documents. Equation (1) shows the statistical assumptions behind the generative process of the LDA.
Here, K, M, and N denote, respectively, the number of topics, the number of articles, and the number of words in a given document; α and η are the Dirichlet hyperparameters of the prior distributions over θ and β, respectively; θ_m is the distribution of topics for article m (a real vector of length K); z_{m,n} is the topic for the nth word in the mth article; w_{m,n} is the nth word of the mth document; and β_k is the distribution of words for topic k. Because the words within the documents are the only observable variable, the hidden structure must be inferred from them with statistical inference methods. The conditional probability of the hidden structure given the words, also known as the posterior probability, is expressed by Equation (2).
Although the posterior probability cannot be computed exactly because of the term in the denominator [13], it can be approximated with statistical posterior inference methods. Two main types of inference techniques can be discerned: variational [13] and sampling-based algorithms [14]. An example of the latter is the Gibbs sampler [9]. Both types of algorithm provide similarly accurate results [15].
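For reference, the two expressions referred to in this section can be sketched in the standard form found in the LDA literature. This is a reconstruction under the assumption that Equations (1) and (2) follow that standard form; capital letters are used for the counts (M articles, N_m words in article m) to keep them apart from the indices m and n.

```latex
% Equation (1), standard form: joint distribution of hidden and observed variables
p(\beta_{1:K}, \theta_{1:M}, z, w \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{m=1}^{M} \left( p(\theta_m \mid \alpha)
    \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)\,
                      p(w_{m,n} \mid z_{m,n}, \beta_{1:K}) \right)

% Equation (2), standard form: posterior of the hidden structure given the words
p(\beta_{1:K}, \theta_{1:M}, z \mid w, \alpha, \eta)
  = \frac{p(\beta_{1:K}, \theta_{1:M}, z, w \mid \alpha, \eta)}
         {p(w \mid \alpha, \eta)}
```

The denominator p(w | α, η) is the marginal probability of the observed words, which is the term that makes exact computation of the posterior infeasible.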

Identifying Research Topics
To identify the research topics in NETs using the LDA, the study was divided into four stages: Literature Search, Preprocessing, Selection of the number of LDA topics and model construction, and Labeling the topics (Figure 1). Four variables were used: author, title, year of publication, and author affiliation. Books, reviews, book chapters, gray literature, and reports were not included to avoid noise in the results. After executing the search query, the preliminary database returned 8216 documents, which were filtered to remove those that were repeated, misclassified, or contained no abstract. Based on the inclusion criteria, 7658 articles published between 1981 and 2020 were included.

Preprocessing
After reviewing the full texts of the eligible studies, LDA was used. LDA is a "bag of words" model in which each document is represented as an unordered collection of individual words. Each document was tokenized, a necessary process for obtaining individual words (also known as unigrams) from sentences. The text was converted to lowercase, and punctuation marks, hyphens, square brackets, blank spaces, and other characters were eliminated. In addition, a standard list of words called "stop words" was removed, because such words carry little topical information. As a result of this preprocessing, a document-term matrix was obtained in which each article was represented as a V-dimensional vector of word counts. Data processing in this part of the study was carried out using LDAShiny [16], an open-source package for the R programming language (R Development Core Team 2019), which provides a web-based graphical user interface to perform a review of the scientific literature under the Bayesian approach of Latent Dirichlet Allocation (LDA) and machine learning algorithms.
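The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the LDAShiny implementation used in the study; the stop-word list, the function names, and the two example abstracts are hypothetical and serve only to show how tokenization and stop-word removal lead to a document-term matrix.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (a real analysis would use a standard list).
STOP_WORDS = {"the", "of", "and", "a", "in", "with", "for", "to", "is", "are", "were", "was"}

def tokenize(text):
    """Lowercase the text, keep alphabetic unigrams only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix

# Hypothetical example abstracts, not taken from the corpus.
abstracts = [
    "Pancreatic neuroendocrine tumors were resected; survival improved.",
    "Chromogranin A is a marker for neuroendocrine tumors.",
]
vocab, dtm = document_term_matrix(abstracts)
```

Each row of `dtm` is the V-dimensional count vector of one abstract, with V equal to the length of `vocab`.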

Selection of the Number of LDA Topics and Construction of the Model
Topics are latent variables that capture correlations between words and latent semantic themes in a collection of documents [17]. The model assumes that the expected number of topics k (i.e., latent variables) is established a priori. Simulations were carried out varying k from 4 to 30 in steps of one, using Gibbs sampling [11] with 500 iterations as the inference algorithm. The quality of each LDA model was assessed with a topic coherence measure [18], which evaluates a topic model from the perspective of human interpretability and is considered more adequate than a computational metric such as perplexity [19].
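To make the idea of coherence-based model selection concrete, the following Python sketch computes one simple coherence score, the average of P(b | a) − P(b) over pairs of a topic's top words, and then picks the k with the highest score. The exact measure used in the study [18] may differ; the function names, toy documents, and coherence values are hypothetical.

```python
from itertools import combinations

def probabilistic_coherence(top_words, docs):
    """Average of P(b | a) - P(b) over pairs (a, b) of a topic's top words.

    top_words: words sorted from most to least probable in the topic.
    docs: list of token lists, one per document.
    """
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)
    scores = []
    for a, b in combinations(top_words, 2):  # a is ranked above b
        n_a = sum(1 for d in doc_sets if a in d)
        n_b = sum(1 for d in doc_sets if b in d)
        n_ab = sum(1 for d in doc_sets if a in d and b in d)
        p_b_given_a = n_ab / n_a if n_a else 0.0
        scores.append(p_b_given_a - n_b / n_docs)
    return sum(scores) / len(scores) if scores else 0.0

def best_k(coherence_by_k):
    """Return the number of topics whose model has the highest coherence."""
    return max(coherence_by_k, key=coherence_by_k.get)

# Hypothetical tokenized documents.
docs = [
    ["pancreatic", "tumor", "resection"],
    ["pancreatic", "tumor"],
    ["merkel", "carcinoma"],
    ["merkel", "carcinoma", "skin"],
]
coherent = probabilistic_coherence(["pancreatic", "tumor"], docs)
mixed = probabilistic_coherence(["pancreatic", "merkel"], docs)

# Hypothetical coherence values for three candidate models (not the study's results).
chosen_k = best_k({4: 0.1, 5: 0.3, 6: 0.2})
```

Words that tend to appear in the same documents yield a positive score, whereas words that never co-occur yield a negative one, which is why the score tracks human interpretability better than a purely likelihood-based metric.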

Labeling of Topics
A naive labeling algorithm based on probable bigrams provided by the package textmineR was used [20]. However, given that algorithms have a very limited ability to understand the latent meaning of human language [21], manual labeling, which is considered standard in topic modeling, was also used, drawing on three sources of information: the list of the 20 most probable words, a sample of the titles, and the three articles that loaded most heavily on each topic.
In addition, to improve the labeling, we visualized the topics in a two-dimensional area by computing the distance between them [22]. In this visualization, the area of each topic represented its prevalence, indicating how widespread the topic was within all documents. The analysis also displayed the similarity between topics with respect to their probability distributions over words [23].

HJ-Biplot
The HJ-biplot [12], an extension of the classical biplots introduced by Gabriel [24], is an exploratory data analysis method that looks for hidden patterns in a data matrix, and then graphically represents the information contained in the rows (years) and columns (topics) [12]. This multivariate statistical technique was chosen since it offers a more precise data evaluation in which the relationships between the parts, years, and topics are highlighted. For this analysis, Multbiplot software [25] was used because it provides a fast and easy way to incorporate our tables from an Excel format.
The data representation consisted of visualizing a multivariate data matrix X (n × p) using markers (points) g_1, g_2, ..., g_n for the rows and markers (vectors) h_1, h_2, ..., h_p for the columns. Each row represented a subject, and each column a variable, such that both marker sets could be superimposed onto the same reference system with the maximum quality of representation. If the rows of matrix A are the markers g_1, g_2, ..., g_n and the rows of matrix B are the markers h_1, h_2, ..., h_p, then X = AB^T.
The markers were obtained from the usual singular value decomposition (SVD) of the data matrix. The SVD of matrix X was defined by X = UDV^T, where U is the matrix whose columns are the eigenvectors of XX^T, V is the matrix whose columns are the eigenvectors of X^T X, and D is the diagonal matrix of the singular values λ_i of X. Let A and B be the matrices of the first two columns of UD and VD, respectively (Figure 2).
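The computation of the markers from the SVD can be sketched in Python with NumPy. This is an illustration rather than the Multbiplot implementation used in the study; the column centering, the function name, and the toy years-by-topics matrix are assumptions introduced here.

```python
import numpy as np

def hj_biplot_markers(X, n_axes=2):
    """Return row markers A = UD and column markers B = VD on the first n_axes axes."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # column centering: an assumption, not stated in the text
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :n_axes] * s[:n_axes]   # row (year) markers
    B = Vt[:n_axes].T * s[:n_axes]   # column (topic) markers
    explained = float((s[:n_axes] ** 2).sum() / (s ** 2).sum())
    return A, B, explained

# Hypothetical 4 years x 3 topics matrix of topic proportions.
years_by_topics = np.array([
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])
A, B, explained = hj_biplot_markers(years_by_topics)
```

Because both marker sets are scaled by the same singular values, rows and columns have the same spread along each axis, which is what allows years and topics to be read on a common reference system; `explained` is the share of variability retained by the plotted axes.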

Quantitative Indices Used to Analyze the Trend of Topic
The number of documents and words found was quite large, so it was not possible to understand the topic trend intuitively. Therefore, we used some quantitative indices [26] to analyze the main outputs of the topic modeling algorithm: the collection of terms with associated frequencies that characterized a topic, and the percentage composition for each document. The topic distribution over time, θ_k^y, was θ_k^y = (Σ_{m∈y} θ_mk) / n_y, where m ∈ y represents the articles published in a given year y, θ_mk is the proportion of the kth topic in article m, and n_y is the total number of articles published in that year.
To facilitate the characterization of the topics by their tendency, simple regression slopes were calculated for each topic, with the year as the independent variable and the proportion of the topic in the corresponding year as the response variable [9]. Topics whose slopes were positive or negative at a statistical significance level of 0.01 were classified as having positive or negative trends, respectively (Equation (3)).
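The two indices described in this section, the yearly average of each topic's proportion and the slope of its regression on the year, can be sketched in plain Python as follows. The function names and the toy data are hypothetical, and the significance test at the 0.01 level is not included in this sketch.

```python
from collections import defaultdict

def yearly_topic_means(theta, years):
    """Average each topic's proportion over the articles published in each year."""
    sums, counts = {}, defaultdict(int)
    for row, year in zip(theta, years):
        acc = sums.setdefault(year, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[year] += 1
    return {y: [v / counts[y] for v in sums[y]] for y in sorted(sums)}

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical document-topic proportions (5 articles, 2 topics) and publication years.
theta = [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7], [0.2, 0.8]]
years = [1981, 1981, 1982, 1983, 1983]

means = yearly_topic_means(theta, years)
year_axis = list(means)
slopes = [ols_slope(year_axis, [means[y][k] for y in year_axis]) for k in range(2)]
```

In this toy example the first topic has a negative slope and the second a positive one; in a full analysis, only slopes that are statistically significant would be labeled as decreasing or increasing trends.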

Results
Reference lists of the articles were reviewed to identify additional studies. The detailed search strategy is presented in Table 1.

Table 1. Search strategy.
Database | Search | Filters/Limits
PubMed | Tumors neuroendocrine | Humans; English
The full search strategy for each database (MeSH: Medical Subject Headings) and the results are available from the authors.
The initial search yielded 8216 articles. After applying the inclusion and exclusion criteria and searching the reference lists of included studies, we identified a total of 7658 studies (Figure 3).
The coherence scores for all LDA models are shown in Figure 5. The results suggested that the LDA model with the optimum coherence score contained 25 topics (k = 25). The per-document distributions θ_m were averaged over all the articles published in a given year to calculate the average probability θ_k^y and to identify trends (Figure 6). We found that the probability of some topics increased gradually over time (red): t_2 (cost and effect on quality of life), t_3 (markers for carcinomas), t_6 (GEP-NETs), t_10 (solid-pseudopapillary tumor of the pancreas), t_11 (pancreatic neuroendocrine tumors), t_13 (patients one year after treatment), t_15 (radionuclide therapy), t_17 (prognosis of survival before and after resection), t_20 (pheochromocytoma and paraganglioma), t_21 (clinical benefit), t_22 (AJCC staging system), and t_24 (long term NF-pNETs).
The topics that showed decreasing probability (blue) were t_1 (gene expression in cell line normal tissue), t_7 (Merkel cell carcinoma), t_14 (large-cell, small-cell, carcinoma), and t_25 (chromogranin A levels); topics where there was no observed trend (black) were t_4 (ectopic ACTH syndrome), t_5 (liver transplantation for hepatic metastases), t_8 (primary tumor well-differentiated), t_9 (somatostatin receptor subtypes), t_12 (lymph node metastasis), t_16 (fine needle aspiration), t_18 (grade neoplasm), t_19 (multiple endocrine neoplasia), and t_23 (bile duct: a case report).
The heat map (Figure 7) shows how the topics were distributed by year. Each pixel represents the probability that a topic was found in a specific year. A cluster analysis was performed on the years and on the topics, based on the Euclidean distance between each pair of years and between each pair of topics. In the dendrogram, a greater distance between years means that they differed more from each other, whereas smaller distances imply that the years were similar. The cluster analysis produced a hierarchy, so it was possible to divide the components into several groups. Through bootstrap analysis, our study found five groups that had higher probability proportions for certain years: topic 7 (Merkel cell carcinoma) of group 5 (1981-1992); topics 23 (bile duct: a case report) and 8 (primary tumor well-differentiated) of group 4 (1982); and topic 25 (chromogranin A levels) of group 3 (1983).
Table 2 shows the topic names generated from the words with the highest number of repetitions, ranked by relevance. After searches within the articles according to each topic, the five words with the highest number of repetitions generated the prevalence rankings.
Our study also used the LDA method to examine relationships between topics. The existence of a possible relationship was obtained by the statistical analysis of pancreatic cancer according to the matrix generated in the LDA model (Figure 8). The grouping of the topics and years was also analyzed using multivariate HJ-Biplot analysis with the theta matrix from the results. The matrix was composed of values between 0 and 1 according to the topics and years under study. We found coherent groupings, as shown in Figure 9, which accounted for 79.9% of the data variability.
The topics that had greater relevance for 1981-1985 were t_23 (bile duct: a case report), t_8 (primary tumor well-differentiated), t_25 (chromogranin A levels), t_14 (large-cell, small-cell, carcinoma), and t_7 (Merkel cell carcinoma).

Discussion
NETs are relatively rare, with fewer than 10 cases per 100,000 inhabitants annually [27]. However, the prevalence of NETs has increased in recent years due to the development and wide application of modern imaging and endoscopic technology [28]. In the U.S., the 20-year limited-duration prevalence increased dramatically from 0.006% in 1993 to 0.048% in 2012, with NETs more frequently observed in the rectum, followed by the lungs and small intestine [7]. According to the SEER database in the U.S., it went from 3.9 in 1995 to 6.61 in 2012 [7]. Because this topic is particularly important, and because there is a lack of data on NETs from developing countries [8], this work investigated the countries that produced the most scientific studies on NETs between 1981 and 2020: the United States, Germany, the United Kingdom, France, and Italy. To the best of our knowledge, this is the first study to assess which countries are the most frequent publishers of scientific works about NETs, as well as which related topics are the most frequently studied.
Because most NETs are misdiagnosed, proper diagnosis and treatment after the first symptoms can take years [29]. In fact, our statistics showed that from 1981 to 2020, the most frequently published topic was "markers for carcinomas." Classical neuroendocrine markers include the expression of synaptophysin, considered more sensitive, and chromogranin A, considered more specific [30]. Serum chromogranin A also may be used as an immunohistochemical biomarker to assess NETs in addition to serving as a treatment monitor [31].
NETs may be classified into two subtypes due to clinical and genetic differences. The well-differentiated form is defined as a NET, and the poorly differentiated form is defined as a neuroendocrine carcinoma (NEC) [1]. Following the WHO, the main factor when characterizing NETs is proliferation fraction grading, measured by either mitotic count or the Ki67-positive percentage [32]. Nevertheless, diagnosing NETs appears to be very difficult because of the heterogeneous nature of the tumors and the differing symptoms of patients [7,29]. Indeed, this fact explains our results, at least in part, because the most frequently studied topic for NETs was "markers for carcinomas." In line with these facts, our work showed that the second- and third-most frequently studied topics were "clinical benefit" and "carcinoma treatment, tumor response, and survival," respectively, because physicians seek to increase patients' survival and improve their quality of life [33]. Currently, NET treatments include therapies in isolation or in combination with others [34]. Common therapies are the multi-targeted receptor tyrosine kinase inhibitor sunitinib [35], the radiolabeled somatostatin analogue lutetium-177 (177Lu)-dotatate [36], the mechanistic target of rapamycin inhibitor everolimus [36], and the vascular endothelial growth factor antibody bevacizumab [37]. In response to pharmacological advances, new diagnostic techniques, and early-stage diagnosis, the survival of patients with NETs has improved over time [7]. Those with distant gastrointestinal or pancreatic NETs reported higher indices of survival [7]. Unsurprisingly, our study found pancreatic NETs to be the most widely studied between 1981 and 2020.
A pancreatic NET is a rare malignancy with relatively indolent biologic behavior compared with a pancreatic adenocarcinoma [38]. A median survival of around 27 months was reported for patients with advanced pancreatic NET [39]. However, pancreatic tumors account for 10% of NETs and only 1% of all cancer cases [39-41]. Approximately 64% of patients with pancreatic NETs present with metastatic disease and are diagnosed at an advanced stage; consequently, they have a poor prognosis [39]. Pancreatic NET commonly affects people between 40 and 69, but a significant number of patients are diagnosed before the age of 35 [40,42]. An estimated 40-91% of pancreatic NETs are nonfunctional, having no clinically evident hormonal symptoms [2,43]. Thus, therapeutic management of pancreatic NETs depends on the degree to which the tumor is well or poorly differentiated. Symptoms caused by hypersecretion of hormones and the stage of the disease at diagnosis must also be considered [44]. Based on an analysis of the SEER database from 1973 to 2000, the annual incidence of pancreatic NETs was 1.8 and 2.6 per million for women and men, respectively [2,43]. Given the severity and prevalence of this tumor, our results showed the highest number of pancreatic NET studies compared with those for other organs.
Furthermore, we found "prognosis of survival before and after resection" and "patients one year after treatment" to be the fourth- and fifth-most frequent topics. In clinical practice, common, low-grade tumors are managed with surgical resection [45]; in the case of pancreatic NETs, American and European guidelines recommend resection for NETs > 2 cm [46,47]. However, there is controversy over resecting pancreatic NETs ≤ 2 cm. Some physicians prefer observation and surveillance [48-50], even though several lines of evidence recommend resection based on the potential for malignant differentiation, lymph node metastasis, and distant metastasis, even among patients with small, non-functional pancreatic NETs [51-54]. A study of 487 patients showed that the recurrence of pancreatic NET is rare and depends on tumor size, nodal metastasis, grading, and vascular invasion. Thus, patients with a grade 1 pancreatic NET without nodal metastasis and vascular invasion may be cured by surgery [55].
Improving the prognosis of survival, preventing tumor recurrence, and promoting a better quality of life for patients are the basic objectives of NET research, and the more frequently studied topics in our review reflect these concerns. The guidelines are specific to each NET-affected organ. For example, in lung NETs, a large cell carcinoma represents approximately 3% of all cancers [56], and surgical procedures are not effective since most patients die from recurrence [57]. The prognosis for patients improves if surgery is performed when the tumor is <3 cm [58-60]. Furthermore, the National Comprehensive Cancer Network (NCCN) guidelines for large-cell neuroendocrine stage-specific carcinoma and non-small-cell lung carcinoma highlight the need for adjuvant chemotherapy or chemoradiotherapy after resection in some cases, including those where the tumor is >4 cm. Morphologically identical to small cell carcinoma of the lung, breast NETs have poorly differentiated cells [61] and represent less than 1% of breast cancers [62]. Endocrine therapy and radiotherapy are able to increase survival [63,64], but no consensus has been reached on prognosis; most studies suggest a poor outcome [61].
On the other hand, in mid-gut NETs, the guideline proposes primary tumor resection alone in the setting of unresectable metastatic disease because surgery can completely remove the tumor [49]. Nevertheless, studies of patients with gastrointestinal NETs have shown increased survival after resection in the setting of metastatic disease [65]. In the case of neuroendocrine liver metastases, resection represents the major potential for a cure [66-68]. Unfortunately, the curative role of surgery seems to be feasible in only 10-25% of patients after resection of liver metastases [69]. The recurrence of a NET represents the major cause of death in most patients after resection of metastatic tumors [67]. In our review, the topics "prognosis of survival before and after resection" and "patients one year after treatment" are among the five most-studied topics due to their importance for the survival and quality of life of patients with NETs. Indeed, data on overall NET survival are the most frequently reported in the literature [7]. It is worth mentioning that data on recurrence may inform patients about the probability of treatment success and the risk of recurrence following a surgical intervention [70].

Conclusions
In summary, our review found that the countries with the most research about NETs were the United States, Germany, the United Kingdom, France, and Italy, and that pancreatic NETs were the most studied between 1981 and 2020. Furthermore, our results confirmed that "clinical benefit," "patients one year after treatment," "prognosis of survival before and after resection," and "markers for carcinomas" were the most frequent topics in the scientific literature during the last 39 years.
Finally, the LDA method employed in this review grouped each document into a category based on the words with the highest probability. Thus, the method proved effective in identifying the more common topics studied in NETs. In turn, the HJ-Biplot method was integral for grouping topics by year and finding which NETs were the most relevant and for which years.