1. Introduction
Star scientists are individuals who have made exceptional contributions to their respective fields, characterized by high publication and citation rates, groundbreaking research contributions, and influential collaborations [1,2]. Identifying and understanding the characteristics of star scientists is crucial for advancing research and applications in various fields, as well as recognizing and supporting exceptional researchers [3]. Star scientists are vital as they drive innovation, push the boundaries of knowledge, and often serve as catalysts for breakthrough advancements. They often have the expertise and influence to bring together researchers from different fields and foster interdisciplinary collaborations, leading to new insights and breakthroughs in research [4,5].
Artificial intelligence (AI) is a rapidly evolving field characterized by constant innovation and transformative technologies [6]. AI has a wide range of applications across industries, from healthcare [7,8] to finance [9], transportation [10,11], and construction [12], impacting society at large. Given AI’s complex and constantly evolving nature, it demands a broad spectrum of expertise from various scientific domains. The effective dissemination of knowledge can therefore promote the exchange of insights and skills among AI scientists, aiding them in overcoming prevalent challenges when tackling intricate real-world problems [13].
Predicting star scientists in the field of AI is important since it allows for early identification and support of exceptionally talented individuals, aiding in recruitment, collaboration, and research funding decisions. This can accelerate groundbreaking research and innovation in AI, contributing to its continued growth and impact on society. Motivated by the importance of early recognition of talent and the special characteristics of AI as a rapidly evolving field, in this work we present a machine learning (ML) model that predicts which individuals will become star scientists in the field of AI. The model leverages a comprehensive set of features of different types, including network measures, diversity metrics, and research output, to enhance the accuracy of the prediction task. Additionally, the study seeks to discern the key differentiators between rising star scientists and their peers by comparing early-career characteristics, thereby contributing to a deeper understanding of the factors that drive success in AI research.
The majority of existing approaches for identifying rising star scientists focus on individual researchers, yet scientific success often hinges on collaboration. Therefore, it is essential to develop methodologies that assess not only individual contributions but also collaborative efforts and interactions within scientific networks. This study significantly contributes to the AI domain by using machine learning techniques to identify star scientists. It provides a predictive model that recognizes influential researchers early in their careers and highlights the unique patterns distinguishing rising stars from their peers. Additionally, the research underscores the impact of various factors, such as gender and ethnic diversity, on collaboration and career trajectories, emphasizing their role in researcher development and success. Identifying key features instrumental in predicting star scientists offers valuable insights for researchers, practitioners, and funding agencies aiming to support promising talents, thereby advancing AI and fostering innovation and knowledge transfer to industry.
The paper is structured as follows. It begins by reviewing the related work, outlining the significance of predicting star scientists. The “Data and Methodology” section describes the data, the predictive model, and its feature set. The “Results” section presents the findings, emphasizing the distinctions between rising star scientists and their counterparts and identifying key factors contributing to their success. The paper concludes by discussing the broader implications of these findings for researchers, practitioners, and funding agencies interested in nurturing talent in the field. Finally, the limitations of the study and directions for future research are presented in the “Limitations and Future Work” section.
2. Related Work
2.1. Definition of Star Scientists
The literature presents a wide range of definitions for star scientists, highlighting the diverse nature of scientific excellence. These definitions encompass various criteria, such as citations, funding, patenting activity, and membership in prestigious scientific organizations, reflecting the multifaceted ways in which researchers achieve star status. Lowe and Gonzalez-Brambila [14] considered stars as those highly productive scholars who become entrepreneurs. In another study, faculty members who founded new technology ventures were considered star scientists [15]. Patents have also been used to identify star researchers. In a study conducted by Niosi and Queenton [16], researchers holding over five patents and producing at least one major publication per year were counted as stars. Schiffauerova and Beaudry [17] focused on the field of biotechnology and identified stars as those with at least 20 patents. Moretti and Wilson [18] also focused on patents and identified stars as patent assignees whose patent count over the past decade ranks within the top 5% of patent assignees on a national scale. In a more comprehensive definition, Azoulay et al. [19] defined star researchers as those satisfying at least one of the following criteria: (1) researchers with substantial funding, (2) extensively cited scientists, (3) leading patent holders, and (4) individuals elected as members of the National Academy of Sciences. Hess and Rothaermel [20] used quantitative criteria and defined star scientists as researchers whose publication and citation rates exceeded the mean by three standard deviations. Another set of studies established star scientists by considering citation metrics. For example, Hoser [21] considered academics with the highest citation counts. Tartari et al. [22] defined stars as academics within the top 1% for citations in their field and within the top 25% for grants received from the Engineering and Physical Sciences Research Council (EPSRC) in the United Kingdom. In a more recent study in Japan, star scientists were characterized as researchers featured in the Highly Cited Researchers (HCR) list published by Clarivate Analytics [23].
2.2. Importance of Star Scientists
Star scientists are crucial players in the advancement of research and applications in various fields [24]. They can drive innovation and interdisciplinary collaborations, often leading to new insights and breakthroughs in research [25]. Recognizing and supporting exceptional researchers is therefore critical for the progress of science and technology [26]. As described in the previous section, star scientists are often characterized by high publication and citation rates, groundbreaking research contributions, and influential collaborations [4]. They assume leadership roles in research domains, offering direction and inspiration to their peers. Moreover, they are often the recipients of prestigious awards and honours, which can have a significant impact on their professional journeys and research trajectories [27].
The impact of star scientists extends beyond their individual achievements. Research indicates that they make disproportionate contributions across diverse contexts [28] and engage in collaborations with a broader spectrum of scientists [29]. They can also play a gatekeeper role, facilitating the exchange of knowledge across research groups and impacting neighbouring researchers in terms of productivity and recognition [19,30]. Stars not only affect academia but also have a significant influence on businesses, transferring advanced knowledge to emerging technology firms through different channels such as co-founding or advisory roles [31]. Their work frequently leads to the development of innovative technologies and applications, carrying substantial economic and societal benefits. In addition, they actively engage in policy formulation and public outreach, contributing to the formation of public discourse and opinion on important matters [32]. Together, these observations highlight the vital role of star scientists in the scientific ecosystem and society.
2.3. Predicting Star Scientists
Throughout their career, researchers may transition through various phases, such as declining, emerging as a rising star, maintaining stability, or achieving well-established status, based on their performance [33]. Certain researchers maintain consistent success throughout their professional journeys, while others exhibit fluctuating patterns. Productive scientists often receive increased recognition, which, in turn, fuels their future productivity. Consequently, star scientists are expected to have a dominant profile in the early stages of their career, capitalizing on the principle of accumulative advantage to ascend to star status [19]. Given the importance of star scientists, predicting rising stars in academia has emerged as an active area of research in recent years, leading to the development of various approaches and methodologies.
Utilizing bibliographic data, various techniques such as social network analysis (SNA) and machine learning (ML) have been employed for the identification of star scientists. Early studies applied SNA to the dynamics of author rankings, compiling importance scores for each author over time to find star scientists [34,35]. Other studies used ML techniques to identify potential future star scientists. For instance, Daud et al. [36] built ML classifiers, considering increases in citation counts along with three kinds of characteristics, i.e., author, venue, and co-authorship, to forecast future rising stars, underscoring the significance of publication venues in the prediction process. Regression models have also been deployed, accounting for both temporal and content variables [37]; that work revealed that temporal factors, as opposed to venue features, serve as the most influential predictors of star scientists. These studies rely on citation counts as the assessment metric for identifying star scientists, which may not provide a reliable signal of a researcher’s overall success [38]. Nie et al. [38] proposed a new approach to overcome this limitation. They considered multiple factors, including the quantity and impact of articles, citation counts, and the domain cited factor, to generate a composite score for each scientist. Next, they utilized changes in this composite score across two consecutive five-year intervals to label star scientists. Their methodology demonstrated superior performance compared to prior studies relying on citation growth. Additionally, their research highlighted the importance of venue characteristics in the identification of star scientists.
Despite the numerous approaches proposed for identifying rising stars in the literature, there are still unaddressed research gaps. One notable gap is the absence of a universally reliable and comprehensive method for pinpointing star scientists. A universal scientific performance evaluation indicator is foundational to crafting an all-encompassing approach for identifying star scientists. While [38] introduced a composite score that draws from various indicators to overcome the limitations of single-aspect indicators, e.g., citation counts, they acknowledged that their approach requires a wide range of metadata encompassing citations, coauthors, and venues, which can add complexity to the evaluation of scientific performance. In addition, several studies have examined the correlation between current research impact and different attributes. For example, using SNA, significant positive associations were found between degree centrality and the h-index [39]. Furthermore, studies have explored diversity metrics, underscoring a robust connection between diversity and scientific performance in general [40]. In another line of research, previous studies (e.g., [36,38]) have compared the predictive outcomes between different categories of features, identifying the significance and predictive power of those features. However, a more holistic approach could involve the consideration of feature combinations from diverse categories, as opposed to evaluating each category in isolation.
In summary, identifying star scientists in science is a complex task that requires considering a wide range of factors. Although existing approaches have made significant progress, several research areas remain unaddressed, and there is a need for more accurate predictive models that can handle large-scale data and incorporate diverse features. This study aims to address some of these gaps by developing a machine learning prediction model, trained on large-scale data, to predict star scientists using a combination of features of different types. We also investigate early-career features and compare rising and non-rising star scientists, examining how current characteristics of academics can influence their future success.
3. Data and Methodology
Figure 1 illustrates the high-level conceptual flow of the analyses. AI-related publications were collected from the Scopus database, while data about publishers and journals were sourced from SCImago. After merging, the data underwent several pre-processing steps. Relevant metadata, including research themes, author gender and ethnicity, network measures, and diversity indicators, were extracted. Rising star and non-rising star AI scientists were identified based on their scientific impact over two consecutive periods. Several ML classifiers were trained and tested. Feature importance was examined to propose a combination of features from different categories crucial for predicting future star scientists. The pipeline was coded in Python 3.13. Detailed explanations of all the modules are provided in the following sections.
3.1. Data
In this research, we utilized the same dataset that was used in our previous work [41]. The dataset encompassed articles related to AI published between 2000 and 2019, sourced from Elsevier’s Scopus database. The extraction process used the following query: (“artificial intelligence” OR “machine learning” OR “deep learning”). The data was filtered to include research articles, conference papers, book chapters, and books. We opted for Scopus as our data source given its high retrieval rate of computer science articles and its extensive indexing of unique content within the field [42]. Furthermore, we enriched the database with journal ranking data obtained from SCImago, incorporating historical metadata to classify publications according to the SCImago Journal Rank (SJR) of their journal at the time of publication. The SJR indicator assesses the academic eminence of scholarly journals, factoring in both the volume of citations received by a journal and the prestige of the journals from which these citations originate. Publications were stratified into three tiers: journals with an SJR exceeding three standard deviations above the mean were categorized as “A”, those with an SJR between one and three standard deviations above the mean were designated as “B”, and journals with an SJR below one standard deviation above the mean were denoted as “C”. The dataset contains 46,341 publications and 162,561 authors.
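To illustrate the tiering rule, the following is a minimal sketch under illustrative assumptions (a pandas dataframe with an "sjr" column; the thresholds follow the definition above):

```python
import pandas as pd

def sjr_tiers(journals: pd.DataFrame, sjr_col: str = "sjr") -> pd.Series:
    """Assign each journal an A/B/C tier from its SJR at publication time."""
    mu, sigma = journals[sjr_col].mean(), journals[sjr_col].std()
    def tier(sjr: float) -> str:
        if sjr > mu + 3 * sigma:
            return "A"   # more than three standard deviations above the mean
        if sjr > mu + sigma:
            return "B"   # between one and three standard deviations above the mean
        return "C"       # at most one standard deviation above the mean
    return journals[sjr_col].apply(tier)

journals = pd.DataFrame({"sjr": [0.4, 0.9, 1.2, 2.5, 9.0]})  # toy SJR values
print(sjr_tiers(journals).tolist())
```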
3.2. Methodology
3.2.1. Textual Data Processing
The paper’s title and abstract serve as concise representations of key research elements, encapsulating specialized keywords and core research concepts in a publication [43]. As the first step, we merged the titles and abstracts of the publications in the dataset and conducted a series of preprocessing steps on the combined textual content. These steps encompassed converting text to lowercase, filtering out short words (those with fewer than 3 characters), removing custom stop words, eliminating non-alphabetic characters, lemmatization, and tokenization. Unigrams and bigrams were also extracted, and a document-term frequency matrix was generated. The processed data were employed later to extract researchers’ disciplinary profiles, as described in Section 3.2.2.
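As an illustration of this preprocessing pipeline, the following is a minimal sketch assuming NLTK’s WordNet lemmatizer and scikit-learn’s CountVectorizer; the custom stop-word list and sample document are placeholders:

```python
import re
from nltk.stem import WordNetLemmatizer          # requires nltk.download("wordnet")
from sklearn.feature_extraction.text import CountVectorizer

CUSTOM_STOPWORDS = {"paper", "propose", "approach"}   # illustrative stop words only
lemmatizer = WordNetLemmatizer()

def preprocess(title: str, abstract: str) -> str:
    text = f"{title} {abstract}".lower()              # merge title and abstract, lowercase
    text = re.sub(r"[^a-z\s]", " ", text)             # drop non-alphabetic characters
    tokens = (lemmatizer.lemmatize(t) for t in text.split())
    kept = [t for t in tokens if len(t) >= 3 and t not in CUSTOM_STOPWORDS]
    return " ".join(kept)

docs = [preprocess("Deep learning for imaging", "We propose a novel network model.")]
vectorizer = CountVectorizer(ngram_range=(1, 2))      # unigrams and bigrams
dtm = vectorizer.fit_transform(docs)                  # document-term frequency matrix
```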
3.2.2. Feature Engineering
Exploring individual attributes of authors is essential for unravelling patterns within the academic landscape. Notably, gender disparities persist in academia, an issue that extends to the community of star scientists as well [44]. Additionally, star scientists exhibit a heightened propensity for international collaborations in contrast to their counterparts [29]. Moreover, various diversity metrics, including ethnic diversity, exhibit significant associations with scholars’ academic accomplishments [40]. Evaluating these factors requires extracting additional information not present in the original dataset. We engineered various features of different types to enhance the performance of the ML models, allowing for a more comprehensive data representation. The feature engineering process and the feature set are described in detail in this section.
Gender
The gender of the authors in the dataset was identified and added to the original data using the approach proposed in our previous work [41], where we combined machine learning and natural language processing techniques to build an automated gender assignment model. This model was trained on an extensive dataset annotated with gender labels, enabling the inference of gender based on a set of core features encompassing full names, affiliations, and countries of origin.
Ethnicity
We used the “ethnicolr” package in the Python programming language to determine the ethnicity of the authors in the dataset. The package employs ML techniques to categorize any given name into one of the 13 ethnic groups listed below:
Asian, Greater East-Asian, East-Asian
Asian, Greater East-Asian, Japanese
Asian, Indian Sub-Continent
Greater African, Africans
Greater African, Muslim
Greater European, British
Greater European, East-European
Greater European, Jewish
Greater European, West-European, French
Greater European, West-European, Germanic
Greater European, West-European, Hispanic
Greater European, West-European, Italian
Greater European, West-European, Nordic
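As an illustration, the snippet below sketches how such a prediction could be obtained with the package’s Wikipedia full-name model; the input column names and example records are our own assumptions, not part of the study’s data:

```python
import pandas as pd
from ethnicolr import pred_wiki_name   # pip install ethnicolr

# Hypothetical author records; the column names here are illustrative choices.
authors = pd.DataFrame({"first_name": ["Ada", "Kenji"],
                        "last_name":  ["Lovelace", "Tanaka"]})

# The Wikipedia full-name model assigns one of the 13 categories listed above
# (e.g., "GreaterEuropean,British") in a predicted "race" column.
authors = pred_wiki_name(authors, "last_name", "first_name")
print(authors[["first_name", "last_name", "race"]])
```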
Diversity
Diversity and its implications have been extensively studied (e.g., [45,46]). Academic collaborations, which often involve researchers from varied geographical locations, fields of expertise, and backgrounds, provide insights into the structure of scholarly cooperation [47,48]. The growing interest in analyzing collaboration networks has driven efforts to understand the underlying factors influencing academic achievement [49]. Several of these factors, such as academic discipline [13], gender [41], career age [50], ethnicity [51], and institutional affiliation [52], have been investigated, and their impact on research performance has been documented. In this work, we investigated five classes of diversity as follows:
Ethnicity diversity: It takes the ethnic backgrounds of researchers into account.
Age diversity: The term “age” pertains to the academic age of researchers in the dataset, calculated based on the date of their first publication. We considered the following four sub-categories for this feature:
- Group 1: 0–5 years of experience.
- Group 2: 5–10 years of experience.
- Group 3: 10–15 years of experience.
- Group 4: 15–20+ years of experience.
Gender diversity: This category considers the gender of researchers in the collaboration networks.
Disciplinary diversity: It was measured both at the individual and group levels. At the group level, it involved examining the diversity of co-authors’ areas of expertise, which corresponded to the most probable topics in each author’s disciplinary profile. At the individual level, this diversity was defined by considering the variety of fields represented in an author’s publications.
Affiliation diversity: This category accounts for the countries listed in the affiliations of co-authors in a paper and aims to assess the extent of international collaboration diversity. We used Shannon entropy [53] as the diversity metric (a computational sketch follows this list).
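To make the metric concrete, the following minimal sketch computes Shannon entropy over the countries of a paper’s co-authors; the function and toy inputs are illustrative rather than the study’s implementation:

```python
import math
from collections import Counter

def shannon_entropy(labels: list[str]) -> float:
    """H = -sum(p_i * ln p_i) over the distribution of labels in one paper."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Affiliation diversity of a paper from its co-authors' countries (toy input).
print(round(shannon_entropy(["CA", "CA", "IR", "DE"]), 3))  # mixed countries -> 1.04
print(shannon_entropy(["CA", "CA", "CA"]))                  # one country -> -0.0 (zero)
```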
Disciplinary Profiles
To extract researchers’ disciplinary profiles quantitatively, we applied a topic modelling technique to the document-term frequency matrix, outlined in Section 3.2.1. Specifically, we built a Latent Dirichlet Allocation (LDA) topic model [54] to extract the main topics in the publications. To find the optimal number of topics, we built a series of LDA baseline models, each employing a varying number of topics, ranging from 3 to 20. These models were subsequently evaluated using metrics such as perplexity and log-likelihood [55]. Inter-topic distance mappings were also assessed. In addition to quantitative metrics, a qualitative evaluation was conducted by examining keywords and document-topic distributions within the models. Through a combined quantitative and qualitative analysis, the optimal number of topics was determined to be 8. Subsequently, leveraging the LDA model, each publication was associated with multiple topics. The authors’ research areas were established based on the average topic distribution across their publications, facilitated by the document-topic probability matrix. Each author was finally represented by a topic distribution vector of length 8, with each component reflecting the average thematic distribution of the author’s papers within that topic. These disciplinary profiles were added to the original dataset.
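A condensed sketch of this procedure is shown below using scikit-learn’s LDA implementation (the study does not specify a library, so this is an assumption); the toy corpus and author-to-paper mapping are placeholders:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["deep learning image model", "market prediction learning finance",
        "robot control reinforcement learning", "image segmentation network model"]
dtm = CountVectorizer().fit_transform(docs)          # toy document-term matrix

# Model selection: fit candidate models and compare quantitative metrics.
for k in range(2, 5):                                # the study swept k = 3..20
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(dtm)
    print(k, lda.score(dtm), lda.perplexity(dtm))    # log-likelihood, perplexity

# Final model (the study settled on 8 topics) and author disciplinary profiles.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(dtm)                  # document-topic probabilities
papers_of = {"author_a": [0, 3], "author_b": [1, 2]} # assumed author-paper mapping
profiles = {a: doc_topics[rows].mean(axis=0) for a, rows in papers_of.items()}
print(profiles)
```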
Network Structural Features
Collaboration is a fundamental aspect of scientific activities, serving as a conduit for researchers to share their expertise, resources, and insights to address intricate challenges and achieve significant breakthroughs [56]. The position of researchers within collaborative networks can influence their overall performance [57,58]. Although quantifying and representing scientific collaboration is a complex task, co-authorship is the widely accepted standard for measuring collaboration [59,60,61], as a robust indicator of mutual scientific engagement. In co-authorship networks, each node represents an individual researcher, while the presence of a link between two nodes indicates that these two researchers have collaborated on at least one shared publication. We used Pajek software, v5.1 [62] to create the co-authorship networks of researchers for each year within the study period. Subsequently, we calculated the following network structural variables at the individual level of researchers:
Betweenness centrality: Betweenness centrality serves as a metric for evaluating a researcher’s potential influence on network communication [39]. Researchers possessing high betweenness centrality can act as conduits between disparate research communities, regulate the dissemination of information, and wield more influence over their peers within the network. This influence extends to activities such as determining project priorities and facilitating the spread of knowledge [63]. The measure is computed, for every pair of nodes, as the fraction of the shortest paths between them that pass through a given node, summed over all node pairs [64], and is normalized to range between 0 and 1. The most central nodes, often called “gatekeepers”, exhibit the highest betweenness centrality values.
Degree centrality: The degree of a node is the count of connections the node has [65]. Correspondingly, the degree centrality of a node is established according to its degree, and the resulting values are normalized to fall within the range of 0 to 1. Within co-authorship networks, researchers with high degree centrality are indicative of higher activity, as they possess a greater number of direct connections [66]. Having a larger number of direct collaborators could facilitate the researcher’s access to diverse sources of skills and complementary expertise [63].
Weighted degree: The link weight between two nodes signifies the strength of their collaborative connection, representing the frequency of their collaborations. We calculated this measure for each researcher as the sum of the weights associated with a node’s direct connections, divided by the total number of distinct co-authors. Researchers with strong ties, often collaborating with the same partner, are considered to be loyal connections [39].
Clustering coefficient: The clustering coefficient assesses the inclination of nodes to assemble into clusters and quantifies the number of triangles, thereby indicating the extent of clustering [63]. Essentially, it gauges the probability that two neighbours of a node are interconnected with each other [67]. Researchers exhibiting a high clustering coefficient create closely knit clusters, potentially enhancing their capacity to generate superior research by leveraging the strong interconnections within their groups and facilitating internal referring among team members [63]. The clustering coefficient is calculated by dividing the actual number of edges between neighbours by the maximum possible number for the network under consideration.
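The study computed these measures in Pajek; for illustration, the same quantities can be obtained in Python with networkx, as in the following sketch of a toy weighted co-authorship graph:

```python
import networkx as nx

# Toy weighted co-authorship network: edge weight = number of joint papers.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 3), ("a", "c", 1), ("b", "c", 1), ("c", "d", 2)])

betweenness = nx.betweenness_centrality(G)        # normalized to [0, 1]
degree_cent = nx.degree_centrality(G)             # degree / (n - 1)
clustering = nx.clustering(G)                     # triangle-based, per node
# Weighted degree per researcher, divided by the number of distinct co-authors.
weighted_degree = {n: G.degree(n, weight="weight") / G.degree(n) for n in G}
print(betweenness, degree_cent, clustering, weighted_degree, sep="\n")
```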
3.2.3. Data Analytics
Feature Set and Label
As described in Section 3.2.2., we generated several features of different types and added them to the original dataset. Features that were used in the analyses are as follows: (1) Number of publications, (2) Number of publications based on journal ranking, (3) Citation count, (4) h-index, (5) Individual disciplinary diversity, (6) Group disciplinary diversity, (7) Ethnic diversity, (8) Gender diversity, (9) Affiliation diversity, (10) Age diversity, (11) Degree centrality, (12) Weighted degree centrality, (13) Clustering coefficient, and (14) Betweenness centrality.
To investigate the progression of junior researchers into star scientists, this research focuses on authors who published their first publication between 2006 and 2010, marking the initial decade of their professional journeys, and had at least one collaboration (n = 9391 authors). The selection of a ~10-year timeframe aligns with common practice in the analysis of star scientists (e.g., [36,38]), facilitating a more comprehensive exploration of their developmental trajectory during the early-career phase. By concentrating on researchers who commenced their careers within a similar timeframe, this approach enables a meaningful comparison of their relative success while controlling for potential confounding variables, such as historical context and technological advancements that could have influenced the career trajectories of authors who began their professional journeys at different periods. To generate the label, we considered star scientists as researchers in the dataset whose h-index experienced a notably substantial increase, i.e., exceeding the average by at least three standard deviations, a statistically rigorous criterion that aligns with established conventions for identifying extreme outliers [71]. This threshold allowed us to effectively differentiate researchers with outstanding h-index increases, ensuring a balance between inclusivity and specificity in classifying rising stars. This growth rate, HGR, is calculated by contrasting an author’s h-index during the initial and subsequent five-year periods of their career, as stated in Equation (1):

$$\mathrm{HGR} = \frac{h_{t_2} - h_{t_1}}{t_2 - t_1} \quad (1)$$

where $h_{t_1}$ and $h_{t_2}$ represent the h-index during the first five years and the first ten years, and $t_1$ and $t_2$ denote the time of the first and second periods, respectively. Having generated the label, we built ML models to predict star scientists and analyze their early-career characteristics, such as research productivity, diversity indicators, and network structural features during the first five years of their career.
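The labelling step can be summarized with the following sketch of Equation (1) and the three-sigma rule; the table layout and values are illustrative:

```python
import pandas as pd

# Hypothetical per-author table: h-index after years 1-5 (h_t1) and 1-10 (h_t2).
df = pd.DataFrame({"h_t1": [2, 1, 6], "h_t2": [4, 2, 19]})
t1, t2 = 5, 10                      # ends of the first and second periods (career years)

df["hgr"] = (df["h_t2"] - df["h_t1"]) / (t2 - t1)        # Equation (1)
cutoff = df["hgr"].mean() + 3 * df["hgr"].std()          # three-sigma outlier rule
df["rising_star"] = (df["hgr"] > cutoff).astype(int)     # 1 = rising star label
print(df)  # with realistic data, only extreme growth exceeds mean + 3*std
```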
Classification
We defined the star scientist prediction problem as a supervised learning task in which the ML model learns from labelled training data. More specifically, the problem was defined as a binary classification task, where the label was either 0 (non-rising scientist) or 1 (rising star scientist). Incorporating a binary classification framework allows us to effectively differentiate between rising stars and non-rising researchers, providing clear insights into the distinct patterns and traits that characterize rising star potential. The dataset consists of authors who had their first publication between 2006 and 2009 (n = 7311) for the training set and authors with their first paper in 2010 (n = 2313) for the unseen test set. To address the class imbalance in the distribution of rising star scientists and non-rising star scientists within the training set, we utilized the Synthetic Minority Over-sampling Technique (SMOTE) [72], using the “imblearn” package. SMOTE aids in rebalancing the classification by augmenting the number of samples in the minority class while preserving the inherent data distribution characteristics.
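A minimal sketch of this oversampling step with imblearn, using synthetic data in place of the study’s training set:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the training set (features X, labels y).
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)  # training data only
print("after:", Counter(y_res))
```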
To ensure the models’ generalizability and reduce the risk of overfitting, we used an expanding window cross-validation approach for hyperparameter tuning and training the models. The expanding window technique derives its name from the fact that the training set grows progressively as the validation set shifts forward in time: training starts on a small initial window that is gradually enlarged, providing the model with a progressively larger dataset for learning. In particular, we divided the data into five equally sized (1-year) windows, with the fifth window set aside as the unseen test set. The validation procedure commenced with the first window as the training set and the second window as the validation set. Subsequently, the training set was extended to encompass the second window (i.e., the validation set in the previous iteration), and the third window was taken as the validation set. This process was repeated until the entire training data had been employed for validation. This approach allowed us to evaluate the performance of the model over time, ensuring the generalizability and high performance of the models.
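The split logic can be sketched as follows, grouping authors into cohorts by first-publication year (an illustrative outline rather than the full pipeline):

```python
# Expanding-window splits over the four training cohorts (2006-2009);
# the 2010 cohort is held out as the unseen test set, as described above.
years = [2006, 2007, 2008, 2009]
for i in range(1, len(years)):
    train_years = years[:i]          # expanding training window
    val_year = years[i]              # next 1-year window for validation
    print(f"train on {train_years}, validate on {val_year}")
# train on [2006], validate on 2007
# train on [2006, 2007], validate on 2008
# train on [2006, 2007, 2008], validate on 2009
```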
In the feature selection process, we employed Recursive Feature Elimination (RFE) to identify the most pertinent features. The objective of RFE is to reduce data dimensionality by iteratively excluding the least significant features while preserving the most critical ones. RFE proves particularly advantageous when handling high-dimensional datasets, enhancing classifier performance by eliminating extraneous or redundant features and guarding against overfitting, a situation in which a classifier becomes overly intricate and underperforms on new, unseen data. We applied RFE to each of the classifiers to enhance their performance and mitigate the risk of overfitting. We trained and built four classification models, under the defined validation and feature selection settings, as follows: (1) Logistic Regression (LR) [73], (2) Support Vector Machine (SVM) [74], (3) Gaussian Naive Bayes (NB) [75], and (4) Random Forest (RF) [76]. We used grid search for tuning the hyperparameters of the models. The search was conducted over a defined hyperparameter space, guided by a cross-validation strategy, and the models were evaluated using the F1 score. Once the best combination of hyperparameters was identified, the corresponding model was refit on the training data to optimize the final model’s performance. The performance of the final classifiers was evaluated using the F1 score and area under the ROC curve (AUC). The F1 score, a metric that strikes a balance between precision and recall, offers a holistic measure of classifier performance, taking into account the count of both false positives and false negatives in predictions [77]. The AUC score assesses the model’s capacity to differentiate between positive and negative classes at various thresholds, considering both true positive and false positive rates. It offers a unified measure that gauges the overall predictive quality of the model, providing a more comprehensive evaluation of its performance, which was particularly valuable in our case due to the imbalanced nature of the data.
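The following sketch combines RFE, grid search scored by F1, and final evaluation for one classifier; the hyperparameter grid and synthetic data are illustrative, and plain k-fold cross-validation stands in for the expanding-window scheme described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=14, weights=[0.9, 0.1],
                           random_state=0)   # 14 features, mirroring our feature set

pipe = Pipeline([
    ("rfe", RFE(RandomForestClassifier(random_state=0))),  # drop weakest features
    ("clf", RandomForestClassifier(random_state=0)),
])
grid = GridSearchCV(pipe,
                    param_grid={"rfe__n_features_to_select": [6, 8, 10],
                                "clf__n_estimators": [100, 300]},
                    scoring="f1", cv=3)       # refits the best model by default
grid.fit(X, y)

# Scored on the training data here for brevity; the study evaluated F1 and AUC
# on the held-out 2010 cohort.
proba = grid.predict_proba(X)[:, 1]
print(grid.best_params_, f1_score(y, grid.predict(X)), roc_auc_score(y, proba))
```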
4. Results
4.1. Distribution of Publications and Authors
Figure 2 shows the distribution of papers, authors, and the number of authors per paper in the dataset. As seen in Figure 2a,b, the data reveal a remarkable growth in both the number of publications and the number of authors in the field of AI from 2000 to 2019. Specifically, the number of papers has surged from 91 in 2000 to over 16,000 in 2019, indicating an exponential increase in research output. Correspondingly, the number of authors has also expanded dramatically, from 194 authors in 2000 to nearly 54,000 authors in 2019. This trend may reflect the burgeoning interest and investment in AI research, as well as the increasing collaboration and interdisciplinary efforts within the scientific community. The sharp rise in publications and authors suggests a vibrant and rapidly evolving field, with a growing pool of researchers contributing to the advancement of AI. This expansion underscores the importance of developing robust methods to identify and support rising star scientists amidst the increasing volume of research activity.
The increase in the average number of authors per paper, from around 2 in 2000 to approximately 4 in the later years (Figure 2c), further highlights the growing trend of collaborative research in the field of AI. As seen in the Figure, from 2000 to 2005, there was a noticeable rise in the average number of authors per paper, followed by a steady trend until 2012, and then another phase of increasing collaboration. The initial rise and subsequent steady trend might reflect the early stages of increased interdisciplinary collaboration and team-based research becoming more common in AI. The renewed upward trend post-2012 may suggest a further intensification of collaborative efforts, possibly driven by the increasing complexity of research problems that require diverse expertise and larger research teams. This shift towards more collaborative research practices underscores the importance of fostering teamwork and interdisciplinary approaches in scientific endeavours, which can be crucial for making significant advancements in AI.
4.2. Correlation Analysis
Before predicting star scientists, we investigated the correlations between the feature set and the label, i.e., the growth rate of the h-index. As illustrated in Figure 3, a positive correlation exists between the h-index growth rate and the number of articles. Additionally, there is a positive correlation between the h-index, the number of level-B articles, and weighted degree centrality. Moreover, correlations were observed between diversity indicators and network structural metrics. For instance, gender diversity demonstrated positive associations with degree centrality, weighted degree centrality, and clustering coefficient, with the highest positive correlation between these two feature groups seen between ethnicity diversity and clustering coefficient. This may imply that individuals from diverse ethnic backgrounds tend to form close-knit relationships and establish dense connections, resulting in a high clustering coefficient early in the authors’ careers. Similarly, it may suggest that authors collaborating with individuals of various genders, as opposed to exclusively those of the same gender, tend to have larger networks of connections, underscoring the significance of gender diversity in scientific collaboration and its potential impact on authors’ career growth. Lastly, as anticipated, the number of level-A articles was positively correlated with citation count, indicating that publishing in prestigious venues garners more recognition, which can result in a higher number of citations.
Using pairwise hypothesis tests, we further investigated whether rising star scientists possessed different early-career characteristics compared to the general population of scientists. Applying a two-sample t-test to each feature, we tested the null hypothesis of no significant difference in the means of the two groups. Our analysis revealed significant differences in early-career characteristics between rising star scientists and the general population of scientists for every feature except ethnic diversity. This may indicate that rising star scientists follow unique trajectories in their early careers, which can be predictive of their future success. This finding may further underscore the importance of recognizing and nurturing specific early-career attributes that contribute to scientific excellence while also highlighting that ethnic diversity remains consistent across different levels of scientific achievement.
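For illustration, such a comparison for a single feature can be run as follows with SciPy; the sampled values are placeholders, and Welch’s variant is used here since the two groups differ in size and variance:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Stand-in samples for one feature (e.g., early-career citation counts) per group.
rising = rng.normal(loc=30, scale=10, size=200)
others = rng.normal(loc=20, scale=10, size=2000)

# Welch's two-sample t-test (no equal-variance assumption).
t_stat, p_value = ttest_ind(rising, others, equal_var=False)
print(t_stat, p_value)  # p < 0.05 -> reject the null of equal means
```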
4.3. Network Exploratory Analysis
By constructing a network for each year of the study, we explored the scientific collaboration networks among AI researchers, which provided a detailed understanding of the network’s structure and dynamics. Figure 4 focuses on the year 2014 as an example, visualizing the network of the most central authors, those in the top 1% in terms of betweenness centrality. This visualization reveals insights into the network’s structure and identifies key players who are most influential in spreading information or influence within the network. The examination of these central nodes highlights the pathways or bridges that connect various clusters, thereby enhancing the understanding of the network’s dynamics, structure, and influential members.
The analysis further quantifies the network with specific metrics: a total of 99,479 connections, including 5919 repeated collaborations, and a network density of 0.0002, indicating that only a small fraction of the possible edges is present. This low density is typical for large networks, where the number of connections each author can maintain is limited relative to the number of possible connections as the network grows. An average degree of 5.98 suggests that each author is involved in nearly 6 collaborations on average. The betweenness centralization score is 0.0005, and the network’s clustering coefficient is 0.85, indicating a high tendency for nodes to form tightly knit groups with a high density of connections. This high transitivity suggests the network is highly clustered, reflecting the collaborative nature and structure of the academic research environment among AI researchers.
4.4. Rising Star vs. Non-Rising Star Researchers
We performed a comparative analysis of the features between rising star scientists and non-rising star researchers in the dataset to gain initial insights into the factors influencing their success. As seen in Table 1, rising star scientists exhibit a higher average weighted degree compared to their average degree, whereas non-rising star scientists demonstrate nearly equivalent average values. Moreover, rising star researchers exhibit higher averages in weighted and unweighted degrees in comparison to non-rising star researchers. This may suggest that rising star scientists engage more frequently in repeated collaborations with the same researchers, as indicated by higher weighted degree values. This may reflect the formation of strong and productive collaborations, potentially playing a pivotal role in their success [25]. Additionally, rising star scientists exhibit a tendency to publish more frequently and attract citations from a broader spectrum of publications in their early careers when contrasted with non-rising star scientists. This implies their enhanced ability to garner recognition for their contributions, with a higher average number of high-impact articles published. These initial comparisons underscore the significance of collaboration and the impact of publications in shaping a prosperous research career. Balancing the production of substantial work with a focus on its quality, recognition from diverse sources, and effective, productive collaborations emerges as crucial.
4.5. Predicting Star Scientists
As detailed in Section 3.2.3., we built four classifiers to predict rising star scientists and evaluated their performance using F1 and AUC metrics. As shown in Figure 5, the random forest classifier outperformed the others, achieving an AUC of 0.75. While the performance difference between the random forest and the SVM model was notable, it was not overwhelmingly substantial. However, a key advantage of the random forest model over SVM is its reduced susceptibility to overfitting, especially when confronted with high-dimensional datasets. The random forest model constructs numerous decision trees on random subsets of the training data and averages their outcomes, which helps to mitigate overfitting and enhances the model’s generalization to unseen data. For classification, the features selected using the RFE method included the number of articles, citation count, individual discipline diversity, ethnic diversity, gender diversity, weighted degree centrality, clustering coefficient, and betweenness centrality. These features provided a foundation for distinguishing rising star scientists from their peers.
Figure 6 depicts feature importance from the random forest model. As shown, the number of articles, betweenness centrality, h-index, citation count, and weighted degree centrality are the most influential features. These features suggest that prolific publication output in terms of both count and impact, acting as a gatekeeper in the scientific community, and sustained direct collaborations, represented by weighted degree centrality, may contribute to an individual’s recognition and success as a star scientist. The emphasis on these features underscores the importance of both individual productivity and collaborative measures in achieving prominence in the field.
5. Conclusions
AI is capturing global interest due to its unique capabilities. It offers the promise of automating and enhancing tasks that were once considered exclusive to human intelligence, spanning from natural language processing and computer vision to problem-solving and decision-making. The availability of massive digital data and increased computational power has driven advancements in AI algorithms, leading to more accurate and complex models. Consequently, AI technologies are being applied across various domains such as healthcare [78,79], finance [9], and even entertainment [80], making it a versatile and interdisciplinary field. The ability to create systems that continuously learn and improve from data, known as machine learning, is a core aspect of AI, offering adaptability and scalability in addressing diverse challenges. This potential is attracting immense interest and investment in the field.
Predicting rising star scientists is crucial for several reasons. Firstly, identifying rising star researchers can help research institutions and funding agencies allocate their resources more efficiently by investing in individuals who are likely to make significant contributions to the field. Secondly, these predictions benefit junior researchers themselves by offering insights into the factors that drive success, allowing them to tailor their career paths. Additionally, the scientific community at large benefits because star scientists often lead groundbreaking research, contributing to technological advancements with profound societal impacts. Similarly, predicting AI star scientists is crucial because it allows organizations to identify and support emerging talent, fostering innovation and ensuring a competitive edge in the rapidly evolving AI landscape. Additionally, it helps allocate resources effectively, driving breakthroughs and advancements in AI research and applications.
This research extends the literature on the utilization of ML algorithms for forecasting the trajectories of early-career scientists, underscoring the potential of such techniques in advancing our comprehension of the intricate scientific landscape. By analyzing features of various types, we developed a predictive model that distinguishes rising star scientists from their non-rising counterparts. We focused on the field of AI and harnessed diverse datasets, encompassing publication and citation data, and applied various techniques, including NLP and social network analysis, to engineer various features of different types, such as gender, field of expertise, ethnicity, and network structural measures. It was found that rising star scientists exhibit notable distinctions across various attributes in comparison to the broader research population. The pivotal features for effective predictions encompassed article quantity, citation count, individual discipline diversity, ethnic diversity, gender diversity, weighted degree centrality, clustering coefficient, and betweenness centrality. Our findings also indicated that certain metrics, such as the number of published articles, group discipline diversity, and weighted degree centrality, are significant indicators of future success. These insights may provide a foundation for more informed recruitment, collaboration, and funding decisions, which are critical for fostering innovation and scientific advancement. Additionally, our research highlighted the importance of diversity in the scientific community. Gender and ethnic diversity were found to play crucial roles in collaboration dynamics and career development. The inclusion of diverse perspectives not only enriches the research environment but also enhances the potential for groundbreaking discoveries. While our findings indicate that diversity might influence collaboration dynamics, further exploration into policies and initiatives promoting these values within scientific institutions and funding bodies could potentially contribute to a more equitable and productive scientific landscape.
While our study focuses on the AI field, the features identified may hold potential relevance across other disciplines. Adapting our model to other fields would require incorporating domain-specific features and recalibrating features to align with the unique characteristics of each discipline. However, the proposed methodology (and pipeline) could be extended to other fields. Overall, the proposed approach can serve as a valuable tool for researchers, practitioners, and funding agencies aiming to identify and support emerging talent in AI. By recognizing and nurturing star scientists early in their careers, the scientific community can accelerate the transfer of knowledge and technology to industry, driving progress and innovation. Future research should continue to refine predictive models and explore additional features that contribute to a researcher’s success, further enhancing our ability to support and advance the frontiers of science.
6. Limitations and Future Work
Several constraints should be acknowledged when interpreting the findings of this study. The dataset utilized in this research offered limited metadata, which could impact the precision of the research performance metric due to the absence of comprehensive citation data. Another related limitation was the use of a dataset with a cutoff in 2019. While this provides a comprehensive view up to that point, it may not fully capture the most recent developments and emerging trends in the rapidly evolving field of AI research. Additionally, the research performance metric could be enhanced by integrating alternative indicators, offering a more comprehensive evaluation of the influence and dissemination of a researcher’s work. These limitations may introduce biases related to the temporal and contextual representation of researchers’ work. Future investigations might extend the data collection period and incorporate supplementary metadata from diverse sources to mitigate these potential biases. Future research could also explore the integration of open-source contributions, such as GitHub repositories, into the analysis framework to provide additional insights into collaborations, researcher activity, and innovative contributions in AI research.
Additionally, the prediction methodology itself may introduce biases, particularly through feature selection and model training processes. For instance, reliance on specific quantitative features such as publication count and network structure features might overlook other qualitative aspects of research contributions. Future research may integrate alternative indicators and explore diverse machine learning and deep learning techniques to improve the evaluation of researchers’ influence and dissemination. The inclusion of demographic features such as gender and ethnicity in our predictive model may raise ethical concerns about reinforcing systemic biases. While these features can enhance model accuracy, they may inadvertently perpetuate existing inequalities. To address these concerns, we propose implementing regular audits of the model’s predictions and incorporating fairness constraints to mitigate bias and ensure equitable outcomes.
Another potential avenue for future research involves exploring the influence of mentorship and collaboration on the success of junior researchers. While this study considered collaboration diversity in the success of early-career researchers, there remains an opportunity to delve deeper into the specific forms of collaborations and mentorship that contribute to academic success. This could entail conducting surveys or interviews with accomplished scholars and their mentors to gain insights into the types of collaborations and mentorship that prove most advantageous. Furthermore, forthcoming research might assess the impact of the academic environment on the accomplishments of early-career scientists, encompassing factors such as institutional resources available to researchers and the prevailing research community culture.
The scarcity of rising scientists in the dataset presents a significant challenge, limiting the model’s ability to perform more effectively. Additionally, the rapidly changing landscape of the AI field further complicates the prediction task, necessitating ongoing model adaptation to capture evolving trends and patterns. A potential future direction is to explore augmenting the dataset with additional relevant data sources and continuously updating the model. Lastly, it could be of great interest to replicate this research in various academic disciplines, aiming to explore the determinants of research achievement within diverse fields of study. Such an approach may unveil shared elements and distinctions in the factors influencing research success, thereby offering valuable insights into the most effective strategies for nurturing and promoting the growth of rising star scientists across academia.