Topic Modeling as a Tool to Identify Research Diversity: A Study Across Dental Disciplines

Colangelo, Maria Teresa; Guizzardi, Stefano; Galli, Carlo

doi:10.3390/metrics1010003


by Maria Teresa Colangelo, Stefano Guizzardi and Carlo Galli *

Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, Via Volturno 39, 43126 Parma, Italy

* Author to whom correspondence should be addressed.

Metrics 2024, 1(1), 3; https://doi.org/10.3390/metrics1010003

Submission received: 21 August 2024 / Revised: 8 October 2024 / Accepted: 10 October 2024 / Published: 13 October 2024


Abstract

This study investigates the diversity and evolution of research topics within the dental sciences from 1994 to 2023, using Topic modeling and Shannon’s entropy as a measure of research diversity. We analyzed a dataset of 412,036 scientific articles across six dental disciplines: Orthodontics, Prosthodontics, Periodontics, Implant Dentistry, Oral Surgery, and Restorative Dentistry. This research relies on BERTopic to identify distinct topics within each field. The study revealed significant shifts in research focus over time, with some disciplines, such as Periodontics and Prosthodontics, exhibiting robust growth in article numbers. However, despite the overall increase in publications, the number of topics per discipline varied, with Restorative Dentistry increasing at a faster rate and exceeding 50 topics over the last 15 years. We observed an increasing diversification of research efforts in disciplines such as Restorative Dentistry, with entropy levels consistently above 2 and progressively increasing. In contrast, fields such as Prosthodontics, despite high publication output, maintained a more specialized research focus, reflected in entropy levels remaining below 1.5. Oral Surgery showed a steep increase in research diversification until 2000, after which it stabilized. Taken together, our findings describe the dynamic nature of dental research and highlight the shifting balance of research focus across several key areas of Dentistry.

Keywords:

research diversity; Shannon’s entropy; Topic modeling

1. Introduction

As research progresses and scholars explore new research lines and publish their most recent results, the range of topics in a science field typically broadens. New discoveries open up new questions that can be explored, often generating novel research directions. Understanding the dynamics and diversity of research topics in a discipline is potentially a relevant indicator of how active the area is [1]. Although diversity is a polysemic word, which, in the context of research, has been variously interpreted as gender/race diversity [2], interdisciplinarity [3], and journal diversity [4], an important area of research is dedicated to thematic diversity, sometimes referred to as scientodiversity [5]. In the absence of precise definitions and metrics for scientodiversity, parallels with biodiversity have been proposed [6]: just as biodiversity is the richness of living species in an ecosystem, research diversity is the richness of different research topics in a science area. Thus, a lower degree of diversity in a scientific field may indicate the presence of one or more prominent topics on which most publications focus; this, in turn, could be driven by various factors, either intrinsic to the field (such as a compelling scientific conundrum that attracts the attention of many scholars of the community) or external (e.g., socioeconomic incentives) [7,8]. Understanding research diversity entails understanding its epistemic composition; it requires analyzing the different scientific areas that it consists of. Different approaches have been proposed in the past, and the first bibliometric approaches date back to 1990 [9], when entropy was proposed as a way to gauge diversity. Shannon’s entropy, a concept derived from information theory and a measure of uncertainty or randomness in a dataset [10], was identified as a potential way to measure diversity. It was originally formulated by Claude Shannon in 1948, and it has since been adapted for various domains. 
A field where there is, e.g., one predominant topic that absorbs 80% of the global research efforts alongside n-1 minor niche research areas is in a very different situation than a field where there are n equally relevant research areas, and this is exactly what entropy measures. Higher entropy values indicate a more even distribution of topics, reflecting greater scientodiversity, while lower entropy values suggest that research is concentrated around a few dominant topics. Alternative measures, such as the Simpson Diversity Index or the Rao-Stirling Index, have also been proposed [11]. Simpson’s index has been used to measure diversity in health professions education [12], and we decided to use it to confirm the results obtained with Shannon’s entropy.

However, regardless of the measurement used to assess diversity, theme extraction has long been a lengthy and cumbersome procedure because it mostly included manual screening and allocation of scientific articles to topics [5] or bibliometric techniques such as latent theme extraction via singular value decomposition [11] or network analysis [13].

As an alternative, Topic modeling can be an easier and faster resource to unpack the main lines along which research evolves and is conducted [14]. Topic modeling is a computational approach that has the purpose of retrieving the topic, the theme, and the “aboutness” from an unstructured document or corpus of documents. Topic-modeling algorithms are, therefore, potentially convenient tools to get a sense of the array of topics that are present in the scientific literature of a field [15]. These algorithms, such as LDA or, more recently, BERTopic, are capable of scanning large datasets and segmenting a whole corpus of documents according to their semantic content. For the present work, we used BERTopic, an algorithm centered on BERT (Bidirectional Encoder Representations from Transformers), a language model developed by Google, which is based on a mechanism known as attention. BERT relies on embeddings, i.e., numerical representations of the semantics of a piece of text, be it a word, a sentence, or even a whole document [16]. Transformer-based embeddings capture the nuances of meaning in a text much better than earlier embeddings, and the performance of BERTopic has surpassed that of earlier algorithms in various tasks [17]. One of the key features of BERTopic is that it can rely on HDBSCAN, a powerful clustering algorithm, which does not require users to pre-determine the number of topics in a dataset but determines it automatically, making it possible to study topic diversity.

Although entropy measurements have been applied to investigate the performance of Topic modeling algorithms [18,19,20], we propose to use Shannon’s entropy to quantify the research diversity by assessing the distribution of research topics as determined by BERTopic. Despite its potential, the use of Shannon’s entropy to measure research diversity remains underexplored in many disciplines, including dental sciences. Dental research includes a wide array of specialties, each with its own unique focus and themes, which have undergone tremendous growth in the last decades of the 20th century [21]. The introduction of groundbreaking techniques (such as Implant Dentistry) has also opened up new therapy opportunities, challenges, and questions [22]. Understanding how research topics evolve and diversify within these specialties could help guide future research efforts and policy decisions.

In this study, we applied a solid and tested mathematical method, Shannon’s entropy, to analyze the diversity of research topics across multiple dental disciplines over a thirty-year period (1994–2023) after automatic identification by Topic modeling. Using BERTopic, we extracted and categorized research topics from a comprehensive dataset of scientific articles from MEDLINE. By calculating the entropy of these topics over time, we believe we could provide a quantitative measure of research diversity within each discipline in a simpler way than with previously proposed methods and offer insights into the evolution and distribution of research themes in the field of dental research.

2. Materials and Methods

2.1. Data Collection

Data were collected and analyzed with Google Colab Pro notebooks powered by Python 3.10.12 [23] and running on T4 GPUs [24]. The corpora we used for the investigation were generated with the Biopython library [25] through a query-driven exploration of MEDLINE facilitated by the Entrez.esearch function. The disciplines included were Orthodontics, Prosthodontics, Periodontics, Implant Dentistry, Oral Surgery, and Restorative Dentistry, based on the authors’ domain knowledge. For each discipline, relevant scientific articles were retrieved from PubMed using a series of discipline-specific search queries. The search terms were designed to capture a broad range of publications within each field, utilizing both Medical Subject Headings (MeSH) and title/abstract keywords to ensure comprehensive coverage, as shown in Table 1.

The searches were limited to articles published between 1994 and 2023 to try and capture the big scientific developments that occurred in the field at the end of the 20th century and have enough articles for extensive analysis.
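The retrieval step can be sketched as follows. This is only an illustration under stated assumptions: the helper names `build_query` and `search_pmids` and the example search string are hypothetical (the actual queries are those listed in Table 1), and the date restriction is expressed here with PubMed's `[dp]` field tag, which may differ from how the searches were actually limited.

```python
def build_query(terms: str, start_year: int = 1994, end_year: int = 2023) -> str:
    """Wrap a discipline query and restrict it to a publication-year range."""
    return f"({terms}) AND ({start_year}:{end_year}[dp])"


def search_pmids(term: str, email: str, retmax: int = 10000) -> list:
    """Run the query against PubMed and return the matching PMIDs.

    Biopython is imported lazily so that build_query can be used
    (and tested) on a machine without Biopython or network access.
    """
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])


# Hypothetical placeholder query, NOT one of the queries from Table 1.
example_term = build_query('"Orthodontics"[MeSH] OR orthodont*[tiab]')
```

Because the number of identifiers returned by a single search is capped on the server side, a corpus of this size would probably have to be retrieved in slices, for example one publication year at a time.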

Figure 1 provides an overview of the process.

2.2. Data Cleaning and Preprocessing

The data retrieved from MEDLINE were formatted into pandas dataframes [26], which contained the relevant bibliographic details for the analysis, i.e., author names, article titles, publication years, abstracts, publication type, and journal names. The datasets underwent several cleaning steps:

Deduplication: Duplicate entries were identified and removed based on article titles.
Missing Data: Articles without titles were excluded.
Standardization: The publication years were standardized to four-digit integers. Articles published before 1994 or after 2023 were excluded from further analysis.
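The three cleaning steps above can be sketched as below. The sketch works on plain dictionaries rather than the pandas dataframes used in the study, only to stay dependency-free; the function name `clean_records`, the case-insensitive title comparison, and the rule of reading the year from the first four characters are assumptions made for illustration.

```python
def clean_records(records, first_year=1994, last_year=2023):
    """Drop untitled, out-of-range and duplicate-title records."""
    seen_titles = set()
    cleaned = []
    for rec in records:
        # Missing data: skip records without a usable title.
        title = (rec.get("title") or "").strip()
        if not title:
            continue
        # Standardization: coerce the year to a four-digit integer
        # (assumption: the year is the first four characters).
        try:
            year = int(str(rec.get("year", "")).strip()[:4])
        except ValueError:
            continue
        if not first_year <= year <= last_year:
            continue
        # Deduplication on titles (assumption: case-insensitive match).
        key = title.casefold()
        if key in seen_titles:
            continue
        seen_titles.add(key)
        cleaned.append({**rec, "title": title, "year": year})
    return cleaned


raw = [
    {"title": "Implant stability", "year": "2001"},
    {"title": "implant stability ", "year": 2001},   # duplicate title
    {"title": "", "year": 2005},                     # missing title
    {"title": "Old paper", "year": "1990"},          # before 1994
    {"title": "Caries risk", "year": "2010 Jan"},    # messy year string
]
kept = clean_records(raw)
```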

2.3. Topic Modeling

2.3.1. Sentence Embedding and Topic Modeling

To identify and analyze research topics across disciplines, we employed BERTopic, a state-of-the-art Topic modeling technique [27]. BERTopic leverages sentence embeddings [28] to encode text into a numerical representation, UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction [29], and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) for clustering [30]. Briefly, once the text has been converted into embeddings, these are compared and clustered based on their similarity. To do that, they are first reduced by UMAP and then clustered by HDBSCAN. Each cluster is considered a topic in the corpus. The algorithm then proceeds to extract significant keywords from each cluster of documents and these form the default topic representation of BERTopic. Since several sentence transformer models are currently available, with varying sizes and processing speeds, we decided to use the “all-MiniLM-L6-v2” model from SentenceTransformers, which balances speed and accuracy, making it suitable for generating sentence embeddings in large-scale text analysis [31].

2.3.2. Topic Extraction

To characterize the topics in the dataset, we chose to use only titles in accordance with our previous experience [32]. Titles can be conceived as very short summaries of the content of an article and are therefore suited to be used to characterize the topic of a document. Abstracts could be used as an alternative, but they are considerably longer, which may affect processing time in large datasets. The titles from the articles of each discipline’s dataset were processed as described previously. The UMAP model reduced the dimensionality of the embeddings for better clustering, and HDBSCAN was used to cluster these reduced embeddings into distinct topics. After repeated trial runs, we arbitrarily decided to use the following parameters:

- UMAP metric: cosine distance;
- Size of the neighborhood: 50;
- Number of components: 5;
- HDBSCAN clustering metric: Euclidean;
- HDBSCAN cluster selection epsilon: 0.5;
- Minimum cluster size: 50.
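A minimal sketch of how the parameters listed above might be passed to BERTopic is given below. The values come from the list; the mapping onto the keyword arguments of `UMAP`, `HDBSCAN` and `BERTopic`, and the helper name `build_topic_model`, are assumptions, and any setting not listed (for instance a random seed) is left at the library default.

```python
# Parameter values taken from the list above.
UMAP_PARAMS = {"n_neighbors": 50, "n_components": 5, "metric": "cosine"}
HDBSCAN_PARAMS = {
    "min_cluster_size": 50,
    "metric": "euclidean",
    "cluster_selection_epsilon": 0.5,
}
EMBEDDING_MODEL = "all-MiniLM-L6-v2"


def build_topic_model():
    """Assemble a BERTopic model from the settings above.

    The heavy libraries are imported lazily so the configuration can be
    inspected without installing them.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(EMBEDDING_MODEL),
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
    )
```

With the libraries installed, a call such as `topics, probs = build_topic_model().fit_transform(titles)` would then assign each title in a list `titles` to a topic, with -1 marking noise.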

Articles that did not fit well into any specific topic were assigned to the noise topic (Topic -1) and were excluded from further analysis. Although BERTopic allows users to pre-define the number of topics in a document set, we chose not to do so in order to accurately evaluate the natural changes in topic numbers across disciplines and over time. BERTopic’s standard topic representation output is just a series of keywords, which may not be convenient for analysis. Stopwords were removed after topic creation using the scikit-learn CountVectorizer class [33]. BERTopic allows the integration of additional representation models, including LLMs. To get a better representation of the topics, we then relied on OpenAI’s GPT 3.5 turbo to generate labels for the topics from the keywords [34]. To function, LLMs need a prompt from users, which serves as the starting point for the model to generate relevant text based on the information provided. We set the following prompt:

I have a topic that contains the following documents:

[DOCUMENTS]

The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:

topic: <topic label>

GPT 3.5 turbo generated a short phrase, which we used as a label for each topic. We assessed BERTopic’s performance and the quality of the GPT-generated labels through a mix of qualitative analysis and manual review. The coherence of keywords within each cluster was evaluated to confirm their representation of a cohesive and meaningful topic. A subset of articles from each topic group underwent manual review to determine their relevance to the corresponding topic. For the GPT-generated labels, we cross-checked them against the keywords to ensure they effectively captured the essence of the topics. Furthermore, multiple analysis iterations were performed to confirm the consistency of topic clusters.
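To make the role of the two placeholders concrete, the sketch below fills the prompt for one topic and extracts the label from a reply in the requested format. BERTopic performs the substitution internally when an LLM representation model is attached; the helper names `fill_prompt` and `parse_label`, the bullet formatting of the documents, and the example reply are hypothetical.

```python
PROMPT_TEMPLATE = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>"""


def fill_prompt(documents, keywords):
    """Insert representative titles and topic keywords into the template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    prompt = PROMPT_TEMPLATE.replace("[DOCUMENTS]", doc_block)
    return prompt.replace("[KEYWORDS]", ", ".join(keywords))


def parse_label(reply):
    """Strip the 'topic:' prefix the prompt asks the model to use."""
    text = reply.strip()
    if text.lower().startswith("topic:"):
        text = text[len("topic:"):]
    return text.strip()


prompt = fill_prompt(
    ["Survival of short implants in the posterior maxilla"],
    ["implant", "survival", "bone"],
)
label = parse_label("topic: Dental Implant Survival")
```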

2.4. Entropy Analysis

Shannon’s entropy H(X) was used as a measure of research diversity over time within each dental discipline [10], and it was calculated using the following formula [35]:

H(X) = -\sum_{i=1}^{n} p(x_i) \log_{2} p(x_i)

where p(x_i) is the proportion of articles assigned to topic i in a particular year. Higher entropy values indicate a more diverse distribution of topics, while lower values suggest that the research focus was concentrated on fewer topics. The calculated entropy values were plotted over time to visualize how the diversity of research topics evolved within each discipline using the matplotlib [36] and seaborn libraries [37].
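The formula can be sketched in a few lines of Python. The function name `shannon_entropy` and the convention of returning 0 for an empty year are assumptions; the input is the list of article counts per topic for one discipline and year, with the noise topic already removed.

```python
import math


def shannon_entropy(topic_counts):
    """Shannon entropy (in bits) of a list of per-topic article counts."""
    total = sum(topic_counts)
    if total == 0:
        return 0.0  # assumption: an empty year has no diversity
    entropy = 0.0
    for count in topic_counts:
        if count > 0:  # topics without articles contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


even = shannon_entropy([20, 20, 20, 20, 20])    # five equally sized topics
skewed = shannon_entropy([80, 5, 5, 5, 5])      # one topic holds 80%
```

With five topics, the even split gives log2(5), about 2.32 bits, whereas the split dominated by one topic gives about 1.12 bits, even though the number of topics is the same in both cases.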

To confirm Shannon’s results, we resorted to Simpson’s diversity index [11]:

D = 1 - \frac{\sum n (n - 1)}{N (N - 1)}

where n is the number of publications assigned to a specific topic and N is the total number of publications in that year or discipline.
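A corresponding sketch for Simpson’s index follows. The name `simpson_index` and the choice of returning 0 when fewer than two publications are available (where the denominator would vanish) are assumptions.

```python
def simpson_index(topic_counts):
    """Simpson's diversity index D = 1 - sum(n(n-1)) / (N(N-1))."""
    total = sum(topic_counts)
    if total < 2:
        return 0.0  # assumption: undefined case mapped to zero diversity
    same_topic_pairs = sum(n * (n - 1) for n in topic_counts)
    return 1.0 - same_topic_pairs / (total * (total - 1))


one_topic = simpson_index([4])       # every article in the same topic
two_topics = simpson_index([2, 2])   # two equally sized topics
```

D can be read as the probability that two publications drawn at random without replacement belong to different topics, so it is 0 when a single topic absorbs everything and approaches 1 as articles spread over many topics.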

3. Results

3.1. General Characteristics of the Dataset

The dataset generated for the present study comprises a total of 412,036 scientific articles published between 1994 and 2023 across six dental disciplines. The distribution of articles across the six disciplines is summarized in Table 2. The number of articles per discipline varies significantly, reflecting differing levels of research activity within each field.

Taken together, these fields have, as expected, a vast research output, which reflects their central importance in dental practice and research, although with noticeable differences between them. Prosthodontics emerged as the most prolific discipline, with 98,852 articles, followed closely by Periodontics, with 93,510. The remaining datasets were smaller, with Restorative Dentistry and Oral Surgery a little above 60,000 articles each.

This disparity may suggest that these fields are more specialized sub-fields of other disciplines (such as might be the case with Implant Dentistry, which can often be conceived as a specific application of Oral Surgery) or that they have seen less research activity relative to the broader dental sciences, such as Orthodontics, because of their niche character. It must be remembered that we generated the datasets independently for each discipline and did not exclude articles that were present in more than one discipline. It can be assumed that certain topics can be ascribed to more than one discipline, and to investigate this assumption, we measured the overlap between datasets (Figure 2).

In most cases, the datasets displayed little overlap, with the noticeable exception of Implant Dentistry and Prosthodontics (87%), in which case the Implant Dentistry dataset can be considered a specialized sub-set of Prosthodontics, and a 54% overlap between Implant Dentistry and Oral Surgery, which again is not completely surprising considering the nature of the clinical interventions that are involved with implant insertion.
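One way to measure such an overlap is sketched below, assuming each dataset is represented by its set of PubMed identifiers. The text does not state how the percentages were normalized, so the directional definition used here (the share of the first dataset that also appears in the second) is an assumption, as are the function names and the toy identifiers.

```python
def overlap_share(ids_a, ids_b):
    """Fraction of dataset A whose articles also appear in dataset B."""
    a, b = set(ids_a), set(ids_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def overlap_matrix(datasets):
    """Directional overlap for every ordered pair of disciplines."""
    return {
        (x, y): overlap_share(datasets[x], datasets[y])
        for x in datasets
        for y in datasets
        if x != y
    }


# Toy identifiers, not real PMIDs from the study.
toy = {
    "Implant Dentistry": ["1", "2", "3", "4"],
    "Prosthodontics": ["3", "4", "5", "6", "7", "8"],
}
shares = overlap_matrix(toy)
```

Under this definition the measure is asymmetric: in the toy data half of the smaller dataset lies inside the larger one, but only a third of the larger dataset lies inside the smaller one.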

3.2. Distribution of Articles over Time

The diachronic distribution of articles may provide further insights into the evolution of research focus within dental sciences. Figure 3 illustrates the number of articles published per year within each discipline from 1994 to 2023.

From the data, it can be observed that there has been a general increase in the number of publications in every discipline over the study period and that the number of new articles per year has increased for most disciplines, reflecting the growing body of research and the expanding scope of the dental sciences, a phenomenon that is widely seen across various disciplines, not only dental-related ones [38]. The growth, however, differed visibly across disciplines.

Prosthodontics and Periodontics grew almost exponentially, at least until about 2015, when they both plateaued at about 4000 new articles/year, only to start growing again after a few years and recently exceed 5000 new publications per year (Figure 3). This suggests an increasing research focus on these areas, possibly driven by advancements in materials science, technology, or clinical techniques.

Other disciplines did not experience such a fast increase in the number of publications; in particular, Oral Surgery appears to have peaked around the mid-2010s, with about 3000 new articles/year, and the number of new publications has remained consistent or has even slightly dropped since (Figure 3).

3.3. Identification of Research Topics

We then identified distinct research topics within each of the six dental disciplines with BERTopic. This topic-modeling process analyzed the titles of a little more than 400,000 articles. Although abstracts could have been used to get an even more accurate overview of the themes of the different disciplines, we preferred to limit the analysis to titles. Our choice was supported by previous experience that showed us that titles could be sufficiently accurate in portraying the main area of a manuscript [31], and as they are considerably shorter than abstracts, this proved very advantageous to process hundreds of thousands of documents, as this operation can be very resource-consuming in such large datasets. BERTopic revealed a wide range of themes within each field, reflecting diverse research areas in these dental disciplines.

The number of distinct topics identified varied across the disciplines (Table 3), with Restorative Dentistry having the highest number of topics (n = 58) and Orthodontics and Prosthodontics both trailing behind with just 32 topics. The smallest dataset, Implant Dentistry, also had, not unexpectedly, the smallest number of topics, with just 22 topics identified. This could possibly reflect differences in the complexity and breadth of research within each field, but it must be remembered that the absolute number of topics in a dataset of documents heavily depends on the structure of the data and on the parameters used for Topic modeling, such as the minimum cluster size used in the HDBSCAN algorithm and the number of neighbors used by the UMAP dimensionality reduction process, i.e., how granular our analyses were intended to be, or additional parameters such as the cluster selection epsilon, which merges topics that are similar beyond a set threshold. We tested a wide range of parameters through extensive grid searches and arbitrarily chose the algorithm’s parameters that avoided hyperinflating the number of topics. Minimum cluster size = 50 and number of neighbors = 25 were chosen because they did not fragment the dataset across an excessive number of topics, and these parameters were used for all our analyses.

The investigation furthermore identified several key topics that dominated specific disciplines; these can be found in Table 4. For instance, in Restorative Dentistry, the topic with the highest number of articles (indicated by BERTopic as topic 0), i.e., the main theme, so to speak, is “Childhood Caries Prevention Study”, while dental implants are a relevant topic in Implant Dentistry but also, consistently with Figure 2, in Prosthodontics and Oral Surgery.

To better understand, however, how these research fields evolved in time, we needed a diachronic analysis of research topics. Topics are not, in fact, consistently represented across the years; some topics emerge later in time or subside as the interest of the research community shifts elsewhere, e.g., because of changes in clinical priorities or the resolution of key research questions [14].

3.4. Diachronic Analysis of Topics

The number of topics across the disciplines we included in our dataset tended to increase over time in the 30-year period, although not homogeneously (Figure 4).

Generally speaking, the more numerous the articles in a dataset, the more numerous the topics (Figure S1), and thus, Implant Dentistry always possessed fewer topics than the remaining disciplines and maintained a fairly consistent number of them over the years (Figure 4). However, Restorative Dentistry and Periodontics (whose dataset is about 50% larger than that of Restorative Dentistry) had similar numbers of topics, around 35, until the early 2000s, when the number of topics in Restorative Dentistry started to grow at a faster rate and exceeded the number of topics in the Periodontics dataset, creating a gap between the two that has remained stable. Topics in the remaining disciplines increased at a slower pace, from about 20 topics in the 90s to the current levels of 30 topics (Figure 4). The number of topics can be partially considered a measure of diversity in a research field, but it does not reveal much about how actively researched these topics are. To quantitatively assess how the attention of researchers is partitioned across topics within each discipline over time, we calculated Shannon’s entropy based on the distribution of the topics identified by the BERTopic model. As previously mentioned, Shannon’s entropy is a measure of uncertainty or diversity; in this context, it quantifies how evenly research efforts are distributed across different topics within a discipline. The analysis of entropy over time, as shown in Figure 5, highlights the changes in topic diversity within each discipline from 1994 to 2023.
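The diachronic measures described above can be sketched as a single pass over (publication year, topic) pairs. The sketch assumes that a topic counts as present in a year when at least one article of that year is assigned to it, and that noise articles (topic -1) are skipped; the function name `yearly_diversity` and the toy assignments are hypothetical.

```python
import math
from collections import Counter, defaultdict


def yearly_diversity(assignments, noise_topic=-1):
    """Map each year to (number of topics present, Shannon entropy in bits)."""
    counts_by_year = defaultdict(Counter)
    for year, topic in assignments:
        if topic != noise_topic:  # noise articles are excluded
            counts_by_year[year][topic] += 1

    result = {}
    for year in sorted(counts_by_year):
        counts = counts_by_year[year]
        total = sum(counts.values())
        entropy = sum(
            -(n / total) * math.log2(n / total) for n in counts.values()
        )
        result[year] = (len(counts), entropy)
    return result


# Toy (year, topic) pairs for one discipline.
toy_assignments = [
    (1994, 0), (1994, 0), (1994, 1), (1994, 1), (1994, -1),
    (1995, 0),
]
per_year = yearly_diversity(toy_assignments)
```

In the toy data, 1994 has two equally sized topics (entropy of 1 bit) and 1995 has a single topic (entropy of 0), which illustrates why topic counts and entropy are reported separately.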

This figure reveals that for most disciplines, entropy has remained fairly consistent over the study period, especially for disciplines such as Periodontics and Prosthodontics (Figure 5). This is consistent with a stable number of topics in Prosthodontics; however, vis-à-vis the increase in topics in Periodontics, this observation can be interpreted as a sign of the dominance of a few major topics that have attracted most of the new publications. Moreover, it should also be noted that Periodontics, which has a very high number of topics, also has the lowest level of entropy, together with Prosthodontics, confirming the idea that Periodontics revolves around a main core of topics. One main exception to the overall trend in entropy can be noted: a robust increase in entropy for Oral Surgery at least until the early 2000s, which could be interpreted as an increase in diversification of research by actively pursuing a wide array of topics.

These results are consistent with alternative measures of diversity, such as Simpson’s index (Figure 6), which highlights a similar trend for the investigated disciplines.

4. Discussion

The results of this study provide interesting insights into the diversity and evolution of research within dental sciences over the past three decades. By employing Topic modeling techniques and Shannon’s entropy, we quantified and compared the diversity of research topics across six distinct dental disciplines, and we showed that this goal can be reached quite easily using Topic modeling.

The dataset was arbitrarily generated through PubMed searches to collect a sizable corpus of publications in six core areas of dental practice and research without constraining them for publication number to allow for discipline-based differences to emerge. The queries were generic enough to embrace a large portion of publications in each discipline without getting into details of specific clinical or scientific issues within each discipline. Although other areas could have been—and they actually were—devised and searched, e.g., TMJ disorders or endodontics, some preliminary attempts did not yield a comparable number of publications, and we refrained from expanding our corpus to maintain a certain degree of comparability between disciplines. For the same reason, we abstained from expanding our corpus to earlier dates than 1994, although by doing so, we likely missed some pivotal events in the development of dentistry, such as the introduction of bonding in Restorative Dentistry [39] or the development of root-form osseointegrated fixtures in Implant Dentistry [22], which presumably greatly impacted the trends of research in those areas at the time.

Our results, somewhat unsurprisingly, confirmed that a growing number of articles is published each year, i.e., that the rate of publication is accelerating, as has been previously reported [40], although different dental disciplines understandably move at different speeds. In particular, Periodontics and Prosthodontics have been experiencing faster growth than the other disciplines we considered. The growth of some disciplines, such as Oral Surgery, might even be starting to slow down, as the number of new publications per year has remained constant in the last 10 years. Several factors may combine to determine the number of publications in a discipline, including new discoveries (or new challenges), new needs that arise and must be addressed, or even changes in the way a certain issue is culturally conceived, interpreted, or categorized. Determining why these disciplines are behaving in a specific way is outside the scope of the present report, which aimed at measuring one aspect of the research infoscape in dentistry, i.e., its diversity. Diversity may be interpreted in different ways, including the currently prevalent view in social sciences, which emphasizes the inclusivity of people in events, activities, jobs, etc., beyond racial, gender, or religious divides [41,42]. However, this word has a rich history in the natural sciences, where it has long been used to refer to a wide range of different life forms within ecosystems [43], i.e., biodiversity. The present work focused on scientodiversity, or thematic diversity in research, assessed through the main product of the scientific activity, i.e., its dissemination through published papers [44]. In this context, diversity can be conceived as the presence, within a certain scientific field, of multiple areas of investigation, which are actively pursued by the scientific community. 
Dentistry was chosen, out of all the life science fields, because of the specific domain knowledge of the authors and because it has quite defined boundaries, as compared to, e.g., medicine or biology, so it was easier to identify some major areas of investigation, which correspond to the main clinical branches of dental practice.

To understand more of what kind of research is conducted in these fields, however, we resorted to Topic modeling. Topic modeling is a field interested in processing unlabeled texts and automatically understanding their theme, their topic (their “about”) using a wide range of techniques [45,46]. Recently, neural networks have allowed the creation of high-performing algorithms to cluster documents based on their semantics, and BERTopic, in particular, has repeatedly exceeded the benchmark performance of previous algorithms on a wide range of tasks [27,47]. Clearly, BERTopic is far from being omniscient and still requires human input to work properly. At its core, there is a transformer architecture that takes sentences, translates them into numerical sequences, known as embeddings, then clusters them based on their similarity and finds appropriate labels to characterize these clusters; to do that, it relies on a series of representation models, which range from keyword descriptors to small sentences, as BERTopic can also accommodate LLMs such as GPT [34], which we used to quickly characterize the main topics in Table 3.

It could be argued that the algorithm we proposed might not work equally well in other science fields, and thus, the methodology might not be easily translated. However, the core step in BERTopic is the conversion of documents into a numerical representation (the embedding), which can then be clustered based on similarities to the other embeddings of the dataset. The more reliable the encoding into embedding, the more reliable the clustering is. We chose the “all-MiniLM-L6-v2” model, which was not trained specifically on dental publications, so it is reasonable to expect that its performance is not inferior when encoding publications from other medical areas.

While abstracts could have been used for this purpose instead of titles, our previous experience has shown that using abstracts significantly increases processing time compared to titles, which are much shorter. Titles typically provide a concise summary of the article’s main theme, making them efficient for Topic modeling and clustering. Although titles may not capture the full details of each publication, they have been demonstrated to be effective for identifying overarching research topics and trends, making them suitable for this kind of analysis [31,32].

Depending on how its parameters are set, BERTopic can identify fewer and larger topics or smaller and very numerous topics, splitting up the dataset into sometimes tiny clusters. Two critical parameters to this effect are the number of neighbors that the UMAP dimensionality reduction algorithm uses to process embeddings prior to clustering and the minimum size of clusters that the HDBSCAN algorithm uses. By changing their values, the number of topics in a discipline could go down to 2–3 or could skyrocket to several thousand for a large dataset such as Prosthodontics or Periodontics. The choice of the correct set of parameters is mostly arbitrary, according to the needs of the investigators, and there are no safe and tested recipes to get the optimal (not to mention “true”) number of topics in a corpus. We picked the settings that yielded a sizable number of topics, around fifty or fewer, while keeping them manageable and avoiding micro-topics of just a few articles and dubious meaning. Interestingly, although fields with more articles also (unsurprisingly) had more topics, the diachronic change in topic number did not directly reflect the number of publications. So, although the Prosthodontics and Periodontics datasets included about 50% more articles than Restorative Dentistry, over the period we investigated the number of topics of the latter progressively became higher than that of Periodontics and almost twice as large as that of Prosthodontics. This suggests that while the literature in Periodontics and Prosthodontics has expanded more rapidly over time compared to Restorative Dentistry, these fields have maintained a similar or even smaller number of distinct topics. This may reflect a concentrated research focus within these disciplines, where growth is driven by deeper exploration of established themes rather than diversification into new areas. 
We could envisage a situation in Restorative Dentistry where the research front expands more homogeneously along a number of different trajectories. This would correspond to higher levels of entropy, which reflect a more even distribution of articles across topics. In contrast, our data suggest that Periodontics and Prosthodontics attract a substantial number of new publications annually, but that these tend to be more narrowly focused: these disciplines either contain fewer topics (as in the case of Prosthodontics) or are dominated by a few main research directions that contain most of the publications. This pattern may indicate a level of maturity in these disciplines, with researchers concentrating on well-established themes rather than branching out into unexplored areas. On the other hand, disciplines like Oral Surgery have shown an increase in entropy, at least over a sustained period of time, reflecting a diversification of research topics. This may correspond to an expansion of this field, driven by innovations in implant materials (which are in part contained in this dataset), techniques (such as the use of lasers in Oral Surgery [48]), and therapeutic strategies [49].
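The link between entropy and the evenness of the topic distribution can be illustrated with a short sketch. The article counts below are invented for illustration and are not data from this study; the function name shannon_entropy is hypothetical, and the natural logarithm is an assumption, since the base chosen only rescales the values.

```python
import math


def shannon_entropy(topic_counts, base=math.e):
    """Shannon entropy H = -sum(p_i * log(p_i)) of a topic-size distribution."""
    total = sum(topic_counts)
    if total == 0:
        return 0.0
    # Empty topics contribute nothing, so they are skipped to avoid log(0).
    probs = [n / total for n in topic_counts if n > 0]
    return sum(-p * math.log(p, base) for p in probs)


# Invented example: 100 articles spread over four topics in two different ways.
even_field = [25, 25, 25, 25]    # articles evenly distributed across topics
focused_field = [85, 5, 5, 5]    # one dominant topic holds most of the articles

print(round(shannon_entropy(even_field), 2))     # 1.39
print(round(shannon_entropy(focused_field), 2))  # 0.59
```

With the same number of articles and the same number of topics, the field dominated by a single topic has a markedly lower entropy, which is the pattern described above for disciplines with a few prominent research directions.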

The divergence in entropy trends across disciplines raises important questions about the direction of future research. Fields like Oral Surgery and, to a lesser extent, Restorative Dentistry and Orthodontics exhibit increasing entropy, suggesting that they are in an exploratory phase in which topics expand isometrically, so to speak, and new research questions are continually being investigated, even though the number of new papers per year is not increasing, i.e., the literature grows linearly. This could indicate a dynamic, evolving field driven by a specialized, ‘niche’ research community that is not growing as rapidly as other specialties, such as Periodontics. However, although a field like Periodontics is growing in both the number of publications and the number of topics, a few main research axes (e.g., Periodontal Disease Treatment Studies, Smoking and periodontal disease, Periodontitis and preterm birth relation, to mention a few of its largest topics) still robustly lead the research landscape [50].

Our findings indicate that dental disciplines are evolving along different trajectories, underscoring the importance of maintaining a balanced research agenda that fosters both depth and breadth in scientific inquiry. While specialization is necessary for advancing knowledge in specific areas, it is equally important to encourage exploratory research that can lead to the discovery of new topics and subfields. Policymakers, funding agencies, and academic institutions should consider adopting similar data-driven approaches when setting research priorities and funding strategies, ensuring comprehensive development across both established and novel areas of study [51].

5. Outlook

While this study offers valuable insights, several limitations must be acknowledged. We relied on broad bibliographic data from PubMed, using generic keywords. As a result, the findings are not exhaustive for any specific field and may not fully reflect the entire scope of research within the included disciplines [51]. Additionally, the selection of disciplines and the exclusion of others, such as TMJ disorders or Endodontics, were based on the availability of a comparable volume of publications; richer datasets might allow us to extend this analysis further. Future studies could broaden the analysis to include these and other related fields to provide a more comprehensive overview of dental research, or explore the connectivity between topics in different areas and thereby characterize the network density of the identified topics.

Incorporating non-English language publications and expanding the dataset to include research from underrepresented regions would also provide a more global perspective on research diversity [52]. This might help capture innovations and trends that are less visible in English-dominated databases like PubMed. The present focus on English-language literature may inadvertently overlook important innovations and local trends in non-English-speaking regions, where unique challenges, resources, and population needs might drive different research priorities [53,54].

6. Conclusions

Dentistry is a dynamic and multifaceted field that encompasses several research areas that have been actively investigated and have generated a vast body of literature over the decades. Not all these areas have attracted the same amount of attention from the scientific community, and some are potentially under-investigated. The mere number of publications in an area may not be a sufficient measure because it fails to capture how diverse the research output of a field is. This study explores the use of Shannon’s entropy to measure the diversity of research within six major dental disciplines over nearly three decades. We used BERTopic for Topic modeling, i.e., to cluster the literature according to research topics, and applied Shannon’s entropy to gauge research diversity. We thus uncovered clear trends in how topics are distributed. For example, while fields like Periodontics and Prosthodontics saw a marked increase in article numbers, others displayed a broader diversification of research topics. These findings illustrate the varied paths of development within dental research, with some areas becoming more specialized while others branched out into a wider range of topics. This approach could help identify where innovation and research efforts should be focused to tackle the growing challenges in dental science. Ultimately, understanding these shifts is crucial for shaping future research priorities and policy decisions in the field.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/metrics1010003/s1, Figure S1: Scatterplot showing the relation between number of articles and number of topics in the dataset per discipline.

Author Contributions

Conceptualization, C.G. and S.G.; methodology, C.G. and M.T.C.; software, C.G.; formal analysis, C.G. and M.T.C.; writing—original draft preparation, C.G. and M.T.C.; writing—review and editing, C.G. and S.G.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, X.; Guo, J.; Gu, D.; Yang, Y.; Yang, X.; Zhu, K. Tracking Knowledge Evolution, Hotspots and Future Directions of Emerging Technologies in Cancers Research: A Bibliometrics Review. J. Cancer 2019, 10, 2643.
2. Reich, S.M.; Reich, J.A. Cultural Competence in Interdisciplinary Collaborations: A Method for Respecting Diversity in Research Partnerships. Am. J. Community Psychol. 2006, 38, 51–62.
3. van Leeuwen, T.; Tijssen, R. Interdisciplinary Dynamics of Modern Science: Analysis of Cross-Disciplinary Citation Flows. Res. Eval. 2000, 9, 183–187.
4. Goyanes, M.; Demeter, M.; Cheng, Z.; de Zúñiga, H.G. Measuring Publication Diversity among the Most Productive Scholars: How Research Trajectories Differ in Communication, Psychology, and Political Science. Scientometrics 2022, 127, 3661–3682.
5. Shimada, Y.; Suzuki, J. Promoting Scientodiversity Inspired by Biodiversity. Scientometrics 2017, 113, 1463–1479.
6. Schmidt, M.; Glaser, J.; Havemann, F.; Heinz, M. A Methodological Study for Measuring the Diversity of Science. In Proceedings of the International Workshop on Webometrics, Informetrics and Scientometrics & Seventh COLLNET Meeting, Nancy, France, 10–12 May 2006.
7. Mantikayan, J.; Abdulgani, M. Factors Affecting Faculty Research Productivity: Conclusions from a Critical Review of the Literature. JPAIR Multidiscip. Res. 2018, 31, 1–21.
8. Schulman, K.A.; Rubenstein, L.E.; Chesley, F.D.; Eisenberg, J.M. The Roles of Race and Socioeconomic Factors in Health Services Research. Health Serv. Res. 1995, 30, 179–195.
9. Grupp, H. The Concept of Entropy in Scientometrics and Innovation Research. Scientometrics 1990, 18, 219–239.
10. Godden, J.W.; Bajorath, J. Analysis of Chemical Information Content Using Shannon Entropy. Rev. Comput. Chem. 2007, 23, 263–289.
11. Mitesser, O.; Heinz, M.; Havemann, F.; Glaser, J.; Gläser, J. Measuring Diversity of Research by Extracting Latent Themes from Bipartite Networks of Papers and References. In Proceedings of the Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting, Berlin, Germany, 29 July–1 August 2008.
12. McLaughlin, J.E.; McLaughlin, G.W.; McLaughlin, J.S.; White, C.Y. Using Simpson’s Diversity Index to Examine Multidimensional Models of Diversity in Health Professions Education. Int. J. Med. Educ. 2016, 7, 1–5.
13. Havemann, F.; Gläser, J.; Heinz, M.; Struck, A. Identifying Overlapping and Hierarchical Thematic Structures in Networks of Scholarly Papers: A Comparison of Three Approaches. PLoS ONE 2012, 7, e33255.
14. Guizzardi, S.; Colangelo, M.T.; Mirandola, P.; Galli, C. Modeling New Trends in Bone Regeneration, Using the BERTopic Approach. Regen. Med. 2023, 18, 719–734.
15. Kherwa, P.; Bansal, P. Topic Modeling: A Comprehensive Review. ICST Trans. Scalable Inf. Syst. 2018, 7, 159623.
16. Grootendorst, M. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv 2022, arXiv:2203.05794.
17. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
18. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic Modeling Algorithms and Applications: A Survey. Inf. Syst. 2023, 112, 102131.
19. Koltcov, S.; Ignatenko, V.; Koltsova, O. Estimating Topic Modeling Performance with Sharma–Mittal Entropy. Entropy 2019, 21, 660.
20. Chen, L.; Zhang, H.; Jose, J.M.; Yu, H.; Moshfeghi, Y.; Triantafillou, P. Topic Detection and Tracking on Heterogeneous Information. J. Intell. Inf. Syst. 2018, 51, 115–137.
21. Pulgar, R.; Jiménez-Fernández, I.; Jiménez-Contreras, E.; Torres-Salinas, D.; Lucena-Martín, C. Trends in World Dental Research: An Overview of the Last Three Decades Using the Web of Science. Clin. Oral Investig. 2013, 17, 1773–1783.
22. Buser, D.; Sennerby, L.; De Bruyn, H. Modern Implant Dentistry Based on Osseointegration: 50 Years of Progress, Current Trends and Open Questions. Periodontol. 2000 2017, 73, 7–21.
23. Bassi, S. A Primer on Python for Life Science Researchers. PLoS Comput. Biol. 2007, 3, e199.
24. Jia, Z.; Maggioni, M.; Smith, J.; Scarpazza, D.P. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. arXiv 2019, arXiv:1903.07486.
25. Cock, P.J.A.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B. Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics. Bioinformatics 2009, 25, 1422–1423.
26. McKinney, W. Data structures for statistical computing in Python. SciPy 2010, 445, 51–56.
27. Wang, Z.; Chen, J.; Chen, J.; Chen, H. Identifying Interdisciplinary Topics and Their Evolution Based on BERTopic. Scientometrics 2023, 1–26.
28. Reimers, N.; Gurevych, I. Sentence-Bert: Sentence Embeddings Using Siamese Bert-Networks. arXiv 2019, arXiv:1908.10084.
29. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426.
30. McInnes, L.; Healy, J.; Astels, S. Hdbscan: Hierarchical Density Based Clustering. J. Open Source Softw. 2017, 2, 205.
31. Galli, C.; Donos, N.; Calciolari, E. Performance of 4 Pre-Trained Sentence Transformer Models in the Semantic Query of a Systematic Review Dataset on Peri-Implantitis. Information 2024, 15, 68.
32. Galli, C.; Cusano, C.; Meleti, M.; Donos, N. Topic Modeling for Faster Literature Screening Using Transformer-Based Embeddings. Metrics 2024, 1, 2.
33. Akre, P.; Malu, R.; Jha, A.; Tekade, Y.; Bisen, W. Sentiment Analysis Using Opinion Mining on Customer Review. Int. J. Eng. Manag. Res. 2023, 13, 41–44.
34. Gue, C.C.Y.; Rahim, N.D.A.; Rojas-Carabali, W.; Agrawal, R.; Rk, P.; Abisheganaden, J.; Yip, W.F. Evaluating the OpenAI’s GPT-3.5 Turbo’s Performance in Extracting Information from Scientific Articles on Diabetic Retinopathy. Syst. Rev. 2024, 13, 135.
35. Vajapeyam, S. Understanding Shannon’s Entropy Metric for Information. arXiv 2014, arXiv:1405.2061.
36. Hunter, J.D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95.
37. Waskom, M. Seaborn: Statistical Data Visualization. J. Open Source Softw. 2021, 6, 3021.
38. Landhuis, E. Scientific Literature: Information Overload. Nature 2016, 535, 457–458.
39. Singh, H.; Kaur, M.; Dhillon, J.S.; Mann, J.S.; Kumar, A. Evolution of Restorative Dentistry from Past to Present. Indian J. Dent. Sci. 2017, 9, 38–43.
40. Rawat, S.; Meena, S. Publish or Perish: Where Are We Heading? J. Res. Med. Sci. 2014, 19, 87–89.
41. Dinesen, P.T.; Schaeffer, M.; Sønderskov, K.M. Ethnic Diversity and Social Trust: A Narrative and Meta-Analytical Review. Annu. Rev. Political Sci. 2020, 23, 441–465.
42. Budescu, D.V.; Budescu, M. How to Measure Diversity When You Must. Psychol. Methods 2012, 17, 215.
43. Peet, R.K. The Measurement of Species Diversity. Annu. Rev. Ecol. Syst. 1974, 5, 285–307.
44. Hyland, K. Academic Publishing: Issues and Challenges in the Construction of Knowledge; Oxford University Press: Oxford, UK, 2016.
45. Churchill, R.; Singh, L. The Evolution of Topic Modeling. ACM Comput. Surv. 2022, 54, 1–35.
46. Vayansky, I.; Kumar, S.A.P. A Review of Topic Modeling Methods. Inf. Syst. 2020, 94, 101582.
47. Gan, L.; Yang, T.; Huang, Y.; Yang, B.; Luo, Y.Y.; Richard, L.W.C.; Guo, D. Experimental Comparison of Three Topic Modeling Methods with LDA, Top2Vec and BERTopic. In Proceedings of the International Symposium on Artificial Intelligence and Robotics, Beijing, China, 21–23 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 376–391.
48. Noba, C.; Mello-Moura, A.C.V.; Gimenez, T.; Tedesco, T.K.; Moura-Netto, C. Laser for Bone Healing after Oral Surgery: Systematic Review. Lasers Med. Sci. 2018, 33, 667–674.
49. Lee, K.C.; Chuang, S.-K. History of Innovations in Oral and Maxillofacial Surgery. Front. Oral Maxillofac. Med. 2022, 4, 6.
50. Alqahtani, H.M.; Haq, I.U.; Alrubayan, M.; Alammari, F.; Alotaibi, F.; Al Khammash, A. A Bibliometric Analysis of the Top 100 Cited Articles in Regenerative Periodontics Surgery: Insights and Trends. J. Int. Soc. Prev. Community Dent. 2024, 14, 167–179.
51. Khare, R.; Leaman, R.; Lu, Z. Accessing Biomedical Literature in the Current Information Landscape. In Biomedical Literature Mining; Humana Press: New York, NY, USA, 2014; pp. 11–31.
52. Meneghini, R.; Packer, A.L. Is There Science beyond English? EMBO Rep. 2007, 8, 112–116.
53. Hartling, L.; Featherstone, R.; Nuspl, M.; Shave, K.; Dryden, D.M.; Vandermeer, B. Grey Literature in Systematic Reviews: A Cross-Sectional Study of the Contribution of Non-English Reports, Unpublished Studies and Dissertations to the Results of Meta-Analyses in Child-Relevant Reviews. BMC Med. Res. Methodol. 2017, 17, 64.
54. Walpole, S.C. Including Papers in Languages Other than English in Systematic Reviews: Important, Feasible, yet Often Omitted. J. Clin. Epidemiol. 2019, 111, 127–134.

Figure 1. Flowchart of the proposed step sequence for the analysis of the current dataset.

Figure 2. This heatmap visualizes the proportional overlap of articles between various dental disciplines.

Figure 3. Distribution of articles over time by discipline (1994–2023).

Figure 4. Line plots showing the number of topics identified over time for each discipline.

Figure 5. Line plots showing Shannon’s entropy levels identified over time for each discipline.

Figure 6. Line plots showing Simpson’s index identified over time for each discipline.

Table 1. Search strategies for the creation of the database.

Discipline	Search Strategy
Implant Dentistry	“Dental Implants” [MeSH] OR “Dental Implantation” [MeSH] OR “Dental Implant” [Title/Abstract] OR “Implant Dentistry” [Title/Abstract] OR “Implantology” [Title/Abstract]
Oral Surgery	“Oral Surgical Procedures” [MeSH] OR “Oral Surgery” [Title/Abstract] OR “Oral Surgeons” [Title/Abstract] OR “Maxillofacial Surgery” [Title/Abstract] OR “Oral and Maxillofacial Surgery” [Title/Abstract]
Orthodontics	(“Orthodontics” [MeSH] OR “Orthodontics” [Title/Abstract] OR “Orthodontic Treatment” [Title/Abstract] OR “Orthodontic Appliances” [MeSH] OR “Orthodontic Brackets” [MeSH]) OR (“Malocclusion” [MeSH] OR “Teeth Misalignment” [Title/Abstract])
Periodontics	(“Periodontics” [MeSH] OR “Periodontal” [Title/Abstract] OR “Periodontics” [Title/Abstract] OR “Periodontology” [Title/Abstract]) OR (“Periodontal Diseases” [MeSH] OR “Periodontitis” [MeSH] OR “Gingivitis” [MeSH] OR “Periodontal Pocket” [MeSH] OR “Gum Disease” [Title/Abstract])
Prosthodontics	“Dental Prosthesis” [MeSH] OR “dental prostheses” [Title/Abstract] OR “dental prosthesis” [Title/Abstract] OR “Prosthodontics” [MeSH] OR “Prosthodontics” [Title/Abstract]
Restorative Dentistry	“Restorative Dentistry” [Title/Abstract] OR “Tooth Filling” [Title/Abstract] OR “Dental Restoration” [Title/Abstract] OR “Restorative Treatments” [Title/Abstract] OR “Dental Caries” [MeSH] OR “Tooth Cavity” [Title/Abstract] OR “Dental Cavities” [Title/Abstract] OR “Tooth Decay” [Title/Abstract]

Table 2. Number of articles per discipline (1994–2023).

Discipline	Number of Articles
Prosthodontics	98,852
Periodontics	93,510
Restorative Dentistry	63,257
Oral Surgery	61,719
Orthodontics	51,872
Implant Dentistry	42,826
Total	412,036

Table 3. Number of topics per discipline.

Discipline	Number of Topics
Restorative Dentistry	58
Periodontics	49
Oral Surgery	34
Prosthodontics	32
Orthodontics	32
Implant Dentistry	22

Table 4. Prominent topics per discipline.

Discipline	Main Topic
Orthodontics	Orthodontic Treatment Evaluation Study
Prosthodontics	Dental Implant Clinical Study
Periodontics	Periodontal Disease Treatment Studies
Implant Dentistry	Dental Implant Management Insights
Oral Surgery	Dental Implant Surgical Evaluation
Restorative Dentistry	Childhood Caries Prevention Study

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
