Text Mining arXiv: A Look Through Quantitative Finance Papers

Bianchi, Michele Leonardo

doi:10.3390/math13091375

by

Michele Leonardo Bianchi

Financial Stability Directorate, Banca d’Italia, 00184 Rome, Italy

Mathematics 2025, 13(9), 1375; https://doi.org/10.3390/math13091375

Submission received: 24 February 2025 / Revised: 14 April 2025 / Accepted: 18 April 2025 / Published: 23 April 2025

(This article belongs to the Special Issue Advances in Mathematical Behavioural Finance and Decision Analysis)

Abstract

This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Employing text mining techniques and natural language processing methods, I examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022. I extract and analyze, without relying on ad hoc software or proprietary databases, crucial information from the entire documents, including the references, to understand the topic trends over time and to identify the most cited researchers and journals in this domain. Additionally, I compare numerous algorithms for performing topic modeling, including state-of-the-art approaches.

Keywords:

quantitative finance; text mining; natural language processing; unsupervised clustering; topic modeling; research trends; named entity recognition

MSC:

94A16; 91G99; 68T50

1. Introduction

Quantitative finance is a field of finance that studies mathematical and statistical models and applies them to financial markets and investments, for pricing, risk management, and portfolio allocation. These models are needed to analyze financial data, find the price of financial instruments, and measure their risk (see [1,2]). Readers are referred to [3] for an insightful exploration of the role of models in finance and to [4] for some philosophical remarks on mathematics and finance.

The world of finance is always moving forward, even in times of crisis. Innovations in finance come from the development of new financial services, products, or technologies. Research trends in quantitative finance are driven not only by innovations but also by structural changes in financial markets or by changes in regulation (see [5,6]). When a structural change occurs, some models are no longer able to explain the phenomena observed in the market; consequently, quants and researchers start working on new models. Examples of such changes are the appearance of the implied volatility smile in 1987 (see [7]) and the emergence of the Euribor-OIS spread in 2007 (see [8]). Research activities driven by new products include, for instance, the development of pricing models for interest rate and equity derivatives that started in the 1990s, the structuring of credit products in the early 2000s, and the recent research trend on cryptocurrencies. New technologies applied to finance include the increasing role of big data and the advent of machine learning techniques. Regulation has an impact on the development of new quantitative tools for measuring, managing, and monitoring financial risks (e.g., the Basel Accords).

In this paper, I explore the arXiv preprint server, the dominant open-access preprint repository for scholarly papers in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. The articles in arXiv are not peer reviewed, but there are advantages in submitting to this repository, mainly the ability to disseminate a paper without waiting for the peer review and publishing process, which can be slow (see [9]). The arXiv collection provides a unique source of data for conducting various studies, including bibliometric, trend, and citation network analyses (see [10]). It is a valuable resource for advancing scientific knowledge and conducting research on research, often referred to as meta-research. For example, trend detection on computer science papers stored in arXiv is performed in [11,12], the authors of [13] conduct a case study of computer science preprints submitted to arXiv from 2008 to 2017 to quantify how many of these preprints were eventually published in peer-reviewed venues, the authors of [14] explore the images of around 1.5 million papers held in the repository, Okamura [15] investigates the citations of more than 1.5 million preprints on arXiv to study the evolution of collective attention on scientific knowledge, the authors of [16] train a state-of-the-art classification approach, and the authors of [17] design an algorithm to help researchers perform systematic literature reviews.

In this study, I analyze all papers on quantitative finance, a small portion of the entire arXiv, which contains more than two million works at the time of writing. The choice is also motivated by my experience in this domain and scientific curiosity.

The code is run in a standard desktop environment, without using a large cluster. Scaling to a larger number of papers may not be trivial. Dealing with a large amount of data requires significant computing resources, including processing power and memory, to manipulate and analyze the data efficiently. It is not simple, and perhaps even impossible, to explore more than two million papers with a standard desktop environment like mine.

Studies of papers on finance topics are not new in the literature. For example, the authors of [18] review the history of a well-known journal in this field and highlight its growth in terms of productivity and impact. The authors present a bibliometric analysis and identify key contributors, themes, and co-authorship patterns, and suggest future research directions. A systematic literature review and a bibliometric analysis of around 3000 articles on asset pricing sourced from the top 50 finance and economics journals, spanning a 47-year period from 1973 to 2020, is conducted in [19]. As observed by the authors, the exclusion of certain publications may potentially offer an alternative perspective on the landscape of existing asset pricing research. By using bibliometric and network analysis techniques, including the Bibliometrix Tool of [20,21], researchers investigate more than 4000 papers on option pricing that appeared from 1973 to 2019. They follow the procedure suggested by [22]. Their study aims to pinpoint high-quality research publications, discern trends in research, evaluate the contributions of prominent researchers, assess contributions from different geographic regions and institutions, and, ultimately, examine the interconnectedness among these aspects. The works of [18,19,21] are focused on asset pricing or on a specific journal; their corpus is obtained by searching in the Scopus database using specific keywords, and the bibliometric analysis relies on VOSviewer (see [23]) and Gephi (see [24]). A bibliometric analysis of financial risk has also been performed by [25], who analyze a sample of publications obtained from the Web of Science database using CiteSpace. Finally, the authors of [26] conduct a bibliometric analysis of asset–liability management (ALM) literature, identifying key journals, authors, and articles, while highlighting research trends and gaps in the field.

My study explores all papers on quantitative finance collected in arXiv up to the end of 2022 (around 16,000) and it considers text mining techniques implemented in Python 3.11 to extract information directly from the portable document format (pdf) files containing the full text of the papers, excluding images, without relying on ad hoc software or proprietary databases. As observed by the authors of [27], examining the full text of documents significantly improves text mining compared to studies that only explore information collected from abstracts (as a crosscheck, I conducted the analysis on both abstracts and full texts; the analysis using the full text data showed better results). Their finding highlights the importance of using complete textual content for more comprehensive and accurate text mining and analysis.

The main objectives of my work are twofold. First, I explore the topics of the quantitative finance papers collected in arXiv in order to describe the evolution of topics over time. After having evaluated the performance of various clustering algorithms, I investigate the themes on which researchers focused their attention in the period from 1997 to 2022. Second, I try to identify the most prominent authors and journals in this field. Both analyses are performed with data mining techniques and without actually reading the papers.

As shown in the following, text mining techniques can help to reveal specific topics, such as novel applications of sentiment analysis to stock price prediction, that are closely aligned with the behavioral finance literature.

The remainder of the paper is organized as follows. First, I provide a brief description of the data analyzed in this work (Section 2). Then, in Section 3, the preprocessing phase is discussed by offering further insights on the papers analyzed in my work. In Section 4, I compare various clustering algorithms and, after having selected the best performer, I explore, by splitting my corpus into 30 clusters, the evolution of topics over time. Finally, in Section 5, I describe an entity extraction process to investigate authors and journals with the largest number of occurrences in the corpus considered in this work. Section 6 concludes.

2. Data Description

In this section, I provide a description of the papers analyzed in this work. As observed above, there are various domains in arXiv (i.e., physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics) and each domain has its own categories. The categories within the quantitative finance domain are the following:

Computational finance (q-fin.CP) includes Monte Carlo, PDE, lattice and other numerical methods with applications to financial modeling;
Economics (q-fin.EC) is an alias for econ.GN and it analyses micro and macro economics, international economics, theory of the firm, labor economics, and other economic topics outside finance;
General finance (q-fin.GN) is focused on the development of general quantitative methodologies with applications in finance;
Mathematical finance (q-fin.MF) examines mathematical and analytical methods of finance, including stochastic, probabilistic and functional analysis, algebraic, geometric, and other methods;
Portfolio management (q-fin.PM) deals with security selection and optimization, capital allocation, investment strategies, and performance measurement;
Pricing of securities (q-fin.PR) discusses valuation and hedging of financial securities, their derivatives, and structured products;
Risk management (q-fin.RM) is about risk measurement and management of financial risks in trading, banking, insurance, corporate, and other applications;
Statistical finance (q-fin.ST) includes statistical, econometric and econophysics analyses with applications to financial markets and economic data;
Trading and market microstructure (q-fin.TR) studies market microstructure, liquidity, exchange and auction design, automated trading, agent-based modeling, and market-making.

These categories are assigned by the authors when they submit their papers. Even though it is possible to select multiple domain-category pairs belonging to more than one domain, I select as the reference category only the first category within the quantitative finance domain. Figure 1 shows the number of papers on quantitative finance submitted to arXiv between 1997 and 2022. The increase in the last three years is mainly due to the q-fin.EC category.

As observed in Section 1, the code is implemented in Python and it is run under Ubuntu 22.04 on a desktop with an AMD Ryzen 5 5600G processor and 32 GB of RAM. As described in the following, numerous packages are used.

As far as the collection process is concerned, I retrieve data from arXiv by selecting all categories within quantitative finance (i.e., q-fin). I collect article metadata and pdf files for all articles from 1997 to 2022 for a total of around 16,000 articles (18 GB of data).

While the metadata are obtained through urllib.request and feedparser, the pdf files are downloaded by means of the arxiv package. The metadata can be collected by following the suggestions provided in the arXiv web pages. They are a fundamental input of the analysis and include the link to the paper's main web page, from which it is possible to extract the paper identification code (id, e.g., 2005.06390). The metadata contain information such as author names, paper title, primary category, submission and last update dates, abstract, and publication data when available (e.g., digital object identifier, DOI). Subsequent updates of the papers can be stored in the repository and, for this reason, there is a version number at the end of the paper id.
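As an illustration of the last point, the following is a minimal sketch of how a paper id and its version number could be separated from an abstract-page link. The function name `split_arxiv_link` and the assumption that the link contains `/abs/` followed by the identifier are mine and are not taken from the paper's code.

```python
import re
from typing import Optional, Tuple

# An identifier optionally followed by a version suffix such as "v2".
_ID_VERSION_RE = re.compile(r"^(?P<id>.+?)(?:v(?P<version>\d+))?$")


def split_arxiv_link(link: str) -> Tuple[str, Optional[int]]:
    """Return (paper_id, version) from an arXiv abstract-page link.

    Hypothetical helper: assumes the identifier follows "/abs/" in the link.
    """
    # Keep only the part after "/abs/" (or the whole string if it is absent).
    tail = link.split("/abs/", 1)[-1].strip("/")
    match = _ID_VERSION_RE.match(tail)
    version = match.group("version")
    return match.group("id"), int(version) if version else None


if __name__ == "__main__":
    print(split_arxiv_link("http://arxiv.org/abs/2005.06390v3"))
```

The lazy match keeps old-style identifiers (for example, those containing a slash) intact and only strips a trailing `v` followed by digits.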

Since a paper could be assigned to multiple categories, a web-scraping tool written in Python allows me to retrieve from the paper's main web page the list of all categories of each paper. I select from this list the subset of categories within the quantitative finance domain. Thus, I assign as the reference category of a paper the first category appearing in this subset. Starting from this list, I am able to filter and analyze all papers in the nine categories within q-fin.
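The selection rule described above can be sketched as follows. This is only an illustration under the assumption that the scraped categories are available as an ordered list of strings such as `"q-fin.MF"`; the function name `reference_category` is hypothetical.

```python
from typing import List, Optional


def reference_category(categories: List[str], domain: str = "q-fin") -> Optional[str]:
    """Return the first category of the given domain, or None if there is none.

    `categories` is assumed to preserve the order shown on the paper's page.
    """
    prefix = domain + "."
    in_domain = [c for c in categories if c.startswith(prefix)]
    return in_domain[0] if in_domain else None


if __name__ == "__main__":
    # A paper cross-listed in mathematics and in two q-fin categories.
    print(reference_category(["math.PR", "q-fin.MF", "q-fin.PR"]))
```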

The pdftotext package allows me to extract text from pdf files. Each paper becomes a single (long) string. As discussed in Section 3, the lengths of these strings vary across papers, also because some documents are not papers (e.g., there are also theses and books). As a first assessment of the corpus, for each document, I estimate the readability of the papers through textstat. As shown in Figure 2, the Flesch reading ease score (see [28]) is on average equal to 65.7 (plain English), the lower and upper quartiles are 59.91 (fairly difficult to read, but not far from plain English) and 71.95 (fairly easy to read), respectively, and 99% of the papers are in the range from 40.28 (difficult to read) to 88.20 (easy to read). There is only one paper with a negative value, but this is caused by the text contained in the figures. All other papers are above 17.17, that is, above the extremely difficult to read level.
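For reference, the standard Flesch reading ease formula can be written as a short function. This sketch takes the word, sentence, and syllable counts as inputs; how textstat actually counts sentences and syllables is not reproduced here, so scores computed this way may differ from those reported in the paper.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch reading ease: higher values indicate easier text."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # Illustrative counts: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
```

Longer sentences and longer words both lower the score, which is consistent with formula-heavy or figure-heavy extracted text producing unusually low values.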

3. Text Preprocessing

This section describes the text processing steps. The preprocessing phase is performed with nltk: (1) I split the text into tokens; (2) I extract the numbers representing years in the text (I assume these numbers have 4 digits; I do not explore these data in the empirical analysis); (3) I identify all strings containing alphabet letters, and I refer to them as words even when they do not belong to the English vocabulary; this step also removes some symbols that are not recognized as letters in text analysis; (4) I remove all stopwords and all words shorter than 3 characters; I also check whether there are words with more than 25 characters (quite uncommon in English); (5) I conduct a lemmatization by means of a part-of-speech tagger considering nouns, verbs, adjectives, and adverbs; and (6) I check whether the paper is written in English by means of the langdetect package and I discard all non-English papers. Both the extraction phase and the preliminary text analysis are parallelized by means of the multiprocessing package. I refer to the output of this first preprocessing phase as lemmatized data.
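Steps (1) to (4) can be sketched with the standard library alone. This is a simplified illustration and not the nltk-based pipeline used in the paper: the regular-expression tokenizer, the tiny stopword set, and the choice to drop words longer than 25 characters are all assumptions of mine, and lemmatization and language detection (steps 5 and 6) are omitted.

```python
import re
from typing import List, Set, Tuple

# Tiny illustrative stopword set (assumption); nltk provides a much longer list.
STOPWORDS: Set[str] = {"the", "and", "for", "with", "that", "this", "are", "was"}

# A token is either a run of ASCII letters or a run of digits (assumption).
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def preprocess(text: str, min_len: int = 3, max_len: int = 25) -> Tuple[List[str], List[int]]:
    """Return (kept_words, years) for one document."""
    tokens = TOKEN_RE.findall(text)                      # step 1: tokenize
    years = [int(t) for t in tokens if t.isdigit() and len(t) == 4]  # step 2
    words = [t.lower() for t in tokens if t.isalpha()]   # step 3: letters only
    kept = [w for w in words                             # step 4: filter
            if w not in STOPWORDS and min_len <= len(w) <= max_len]
    return kept, years


if __name__ == "__main__":
    print(preprocess("The Black-Scholes model of 1973 was extended in 2007."))
```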

Then, I analyze the frequency of the words across the whole corpus and I remove all words appearing fewer than 25 times. I also discard some words frequently used in writing papers on quantitative finance that do not help in understanding the topic of the paper. The list of these words includes “proof” and “theorem”, verbs commonly used in mathematical sentences (e.g., “assume”, “satisfy”, and “define”), mathematical functions (e.g., “min” and “log”), and adverbs. The complete list is available upon request. In Figure 3, the list of the top 100 most frequent words obtained after this cleaning phase and their percentages of appearance are shown (see also Figure 4). The word “model” is extremely frequent (one in every 100 words). The word “http” is also quite common, indicating that the papers’ full texts contain numerous internet links (this word could be discarded, since it carries no meaning).
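A minimal sketch of this frequency-based cleaning, assuming each document is already a list of lemmatized words. The default `generic` set contains only the example words quoted above, since the complete list is not given in the text; the function name is hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List

# Only the examples named in the text; the full list is not reproduced here.
GENERIC_WORDS: FrozenSet[str] = frozenset(
    {"proof", "theorem", "assume", "satisfy", "define", "min", "log"}
)


def filter_rare_and_generic(
    docs: List[List[str]],
    min_count: int = 25,
    generic: FrozenSet[str] = GENERIC_WORDS,
) -> List[List[str]]:
    """Drop words seen fewer than `min_count` times in the corpus, and generic words."""
    counts = Counter(word for doc in docs for word in doc)
    return [
        [w for w in doc if counts[w] >= min_count and w not in generic]
        for doc in docs
    ]


if __name__ == "__main__":
    toy = [["model", "proof", "option"], ["model", "option"], ["model"]]
    # A small threshold is used here only because the toy corpus is tiny.
    print(filter_rare_and_generic(toy, min_count=2))
```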

After a first preprocessing phase, I conduct an n-gram analysis by considering the Phrases model of the gensim package. I ignore all words and bigrams with a total collected count over the entire corpus lower than 250 and set the score threshold equal to 10. I find the bigrams and then, to find trigrams and fourgrams, I re-apply the same model to the transformed corpus including bigrams. This approach allows me to have a better ex post understanding of the corpus, which is full of n-grams (e.g., Monte Carlo simulation, Eisenberg and Noe, or bank balance sheet). The word clouds of frequent bigrams and of frequent words including bigrams and trigrams (fourgrams) are depicted in Figure 4.
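To make the roles of the minimum count and the score threshold concrete, the following sketch re-implements a simple collocation score in plain Python. It is not the gensim code: the scoring formula (bigram count minus the minimum count, times the vocabulary size, divided by the product of the two word counts) is my reading of the commonly used default, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def find_bigrams(docs: List[List[str]], min_count: int = 250,
                 threshold: float = 10.0) -> Dict[Bigram, float]:
    """Score adjacent word pairs and keep those above the threshold."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    kept: Dict[Bigram, float] = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too rare to be considered a phrase
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept


def merge_bigrams(doc: List[str], bigrams: Dict[Bigram, float],
                  sep: str = "_") -> List[str]:
    """Replace each detected pair by a single joined token, left to right."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            out.append(doc[i] + sep + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out
```

Running `find_bigrams` again on documents transformed by `merge_bigrams` lets joined tokens pair with their neighbours, which is how trigrams and fourgrams emerge from a second pass.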

It should be noted that some topic modeling algorithms analyzed in Section 4 do not need text preprocessing. In those cases, the input is just a single list containing the whole paper text. While I refer to this latter input as raw data, I define the output of the preprocessing as cleaned data.

In Figure 5, I show how the length of the papers varies over the three data preprocessing phases. Starting from the raw data, containing all words and symbols, after a preliminary cleaning step I obtain the lemmatized data and then, after the last cleaning steps, the cleaned data. The number of words varies from a median value of 8824 words for the raw data to around 2518 words for the cleaned data.

4. Topics Trend

I am now in a position to perform a topic trend analysis. I employ a topic modeling approach to identify the subjects discussed in the documents examined in this study, and then I observe how these topics change over time. Topic modeling refers to a class of statistical methods used to determine which subjects are prevalent in a given corpus. The topic modeling algorithms considered in my study are briefly described in Section 4.1, where some details of my implementation are also provided. Then, in Section 4.2, I select the best-performing model among some selected approaches presented in the related literature. I evaluate these approaches by assessing their ability to accurately match the nine q-fin categories that researchers assign to their work when submitting it to arXiv. Finally, in Section 4.3, after having split the papers into 30 clusters, each one representing a specific topic, I discuss the evolution of research trends over time.

4.1. Topic Modeling Algorithms

Topic modeling algorithms are widely used in natural language processing and text mining to uncover latent thematic structures in a collection of documents. Different algorithms have been developed, each with its own strengths and limitations (see [29]). The choice of a topic modeling algorithm depends on specific factors, such as the desired level of topic granularity and computational constraints. Some algorithms may require substantial computational resources and large amounts of training data. The techniques selected in this study are widely used in the literature, and robust implementations are available. I consider the following models.

k-means. I perform a clustering analysis by considering the k-means algorithm implemented in scikit-learn. This algorithm groups data points into k clusters by minimizing the distance between data points and their cluster center. The document word matrix is created through the CountVectorizer function, which converts the corpus to a matrix of token counts. I ignore terms that have a document frequency strictly higher than 75%.
LDA. By considering the same document word matrix analyzed with the k-means algorithm, I perform topic modeling with the latent Dirichlet allocation (LDA). LDA is a well-known unsupervised learning algorithm. As observed in the seminal work of [30], the basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. I study two different implementations of LDA (i.e., scikit-learn and gensim).
Word2Vec. I train a word embedding model (i.e., Word2Vec) and then I perform a clustering analysis by considering again the k-means approach. An embedding is a low-dimensional space into which high-dimensional vectors are projected. Machine learning on large inputs like sparse vectors representing words is easier if embeddings are considered. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. The Word2Vec neural network introduces distributed word representations that capture syntactic and semantic word relationships (see [31]). In more detail, I generate document vectors using the trained Word2Vec model, that is, I obtain numerical vectors for each word in a document, and then the document vector is the weighted average of the vectors. Thus, the k-means algorithm is applied to the matrix representing the corpus. I consider the Word2Vec model implemented in gensim.
Doc2Vec. I create a vectorized representation of each document through the Doc2Vec model and then I perform a clustering analysis by considering the k-means approach. The Doc2Vec extends Word2Vec and it can learn distributed representations of varying lengths of text, from sentences to documents (see [32]). I consider the Doc2Vec model implemented in gensim.
Top2Vec. I study the Top2Vec model, an unsupervised learning algorithm that finds topic vectors in a space of jointly embedded document and word vectors (see [33]). This algorithm directly detects topics by performing the following steps. First, embedding vectors for documents and words are generated. Second, a dimensionality reduction on the vectors is implemented. Third, the vectors are clustered and topics are assigned. This algorithm is implemented in an ad hoc library named Top2Vec and it automatically provides information on the number of topics, topic size, and words representing the topics.
BERTopic. I study a BERTopic model, which is similar to Top2Vec in terms of algorithmic structure and uses BERT as an embedder. As described in the seminal work of [34], from the clusters of documents, topic representations are extracted using a custom class-based variation of term frequency-inverse document frequency (TF-IDF). This is the main difference with respect to Top2Vec. The algorithm is implemented in an ad hoc library named BERTopic. The main downside of working with large documents, as in my case, is that information is ignored if the documents are too long: the model accepts a limited number of tokens and discards any additional input. Since I am dealing with large documents, to work around this issue, I first split each document into chunks of 300 tokens and then fit the model on these chunks. BERTopic does not allow one to directly select the number of topics; for this reason, in the first step, I obtain a number of topics much larger than the desired one. Since I obtain for each chunk the corresponding topic, I have for each document a list of possibly different topics, and the length of these lists varies across documents (i.e., the length of a single list depends on the length of the corresponding document). To perform clustering on this list of lists of topics, I consider each integer representing a topic as a word. Thus, I use the Word2Vec algorithm described above to find similarities between these lists of topics. Each topic label, that is, the number representing the topic, is treated as a string, and Word2Vec transforms it into a numerical vector. I then apply k-means clustering to group these lists based on their similarity in the vector space. The resulting clusters reveal relationships and patterns among these lists and allow me to select the number of clusters I need for my purposes.
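The chunking workaround described for BERTopic can be sketched as follows. The example assumes that a document is already available as a list of tokens (the tokenization used for the 300-token chunks is not specified in the text) and shows how per-chunk topic ids could be turned into string "words" before an embedding model is trained on them. Both function names are hypothetical.

```python
from typing import List


def split_into_chunks(tokens: List[str], chunk_size: int = 300) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def topics_as_words(chunk_topics: List[int]) -> List[str]:
    """Turn the topic id of each chunk into a string, so that a document
    becomes a 'sentence' of topic labels."""
    return [str(topic) for topic in chunk_topics]


if __name__ == "__main__":
    tokens = ["w%d" % i for i in range(650)]
    chunks = split_into_chunks(tokens)
    print([len(c) for c in chunks])        # the last chunk is shorter
    print(topics_as_words([12, 12, -1]))   # illustrative per-chunk topic ids
```

Longer documents produce longer lists of topic labels, which is why the subsequent embedding and k-means steps operate on lists of varying length.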

In theory, both BERTopic and Top2Vec should use raw data since these algorithms rely on an embedding approach, and keeping the original structure of the text is of paramount importance (see [35]). However, raw data extracted from quantitative finance papers have a considerable number of formulas, symbols, and numbers that may affect the algorithm’s performance. For this reason, I consider both raw and cleaned data. Additionally, for these state-of-the-art algorithms, the number of extracted topics tends to be large; however, the algorithms offer the possibility to reduce the number of topics and of outliers, which can be larger than expected. The parameters of a BERTopic model have to be carefully chosen to avoid memory issues. Alternatively, it is possible to perform topic modeling online, that is, the model is trained incrementally from mini-batches of instances. This results in a less resource-demanding approach in terms of memory and CPU usage; however, it also generates less rich and less comprehensive outputs and, for these reasons, I do not consider this incremental approach here.

4.2. Algorithm Performance on Full Texts

Since it is not simple to assess the performance of different topic modeling algorithms (see [36]), I start by comparing the clusters assigned by each algorithm on the entire corpus to the nine clusters defined by the q-fin categories described in Section 2, that is, the categories that researchers assign to their work during the submission process to arXiv. By exploiting a Bayesian optimization strategy, the authors of [37] present a framework for training, analyzing, and comparing topic models where the competitor models are trained by searching for their optimal hyperparameter configuration for a given metric and dataset. Here, I consider a simpler approach in which I compare the models by looking at some standard performance measures. It is important to note that different algorithms can yield varying results across these metrics.

In Table 1, I report the following similarity measures between true and predicted cluster labels: (1) the rand score (RS) is defined as the ratio between the number of agreeing pairs and the total number of pairs, and it ranges between 0 and 1, where 1 stands for a perfect match; (2) the adjusted rand score (ARS), that is, the rand score adjusted for chance, has a value close to 0 for random labeling, independently of the number of clusters and samples, and exactly 1 when the clusterings are identical (up to a permutation); however, it is bounded below by -0.5 for especially discordant clusterings; (3) the mutual info score (MI) is independent of the absolute values of the labels (i.e., a permutation of the cluster labels does not change the value of the score); (4) the normalized mutual information (NMI) is a normalization of the MI to scale the results between 0 (no mutual information) and 1 (perfect correlation); (5) cluster accuracy (CA) is based on the Hungarian algorithm to find the optimal matching between true and predicted cluster labels; and (6) to compute the purity score (PS), each cluster is assigned to the class that is most frequent in the cluster, and the similarity measure is obtained by counting the number of correctly assigned papers and dividing by the number of observations. This latter score increases as the number of clusters increases and, for this reason, it cannot be used as a trade-off between the number of clusters and clustering quality, that is, to find the optimal number of clusters.
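Two of these measures, the rand score and the purity score, are simple enough to be written out directly from their definitions. The sketch below is a plain-Python illustration and not the implementation used for Table 1; the pairwise loop is quadratic in the number of documents, so library implementations (for instance, those in scikit-learn's metrics module for RS, ARS, MI, and NMI) are preferable on a real corpus.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence


def rand_score(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Share of document pairs on which the two labelings agree, i.e.,
    pairs grouped together in both or separated in both."""
    n = len(true)
    agree = sum(
        (true[i] == true[j]) == (pred[i] == pred[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)


def purity_score(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Assign each predicted cluster to its most frequent true class and
    return the share of documents that fall in that class."""
    by_cluster: Dict[Hashable, Counter] = {}
    for t, p in zip(true, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(true)


if __name__ == "__main__":
    true = [0, 0, 1, 1]
    print(rand_score(true, [1, 1, 0, 0]), purity_score(true, [1, 1, 0, 0]))
    print(rand_score(true, [0, 1, 0, 1]), purity_score(true, [0, 1, 0, 1]))
```

The first call illustrates that both scores are invariant to a relabeling of the clusters; the purity of a labeling with one cluster per document is always 1, which is why PS grows with the number of clusters.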

The measures presented in Table 1 demonstrate that the Doc2Vec approach, when coupled with k-means clustering on cleaned data, outperforms the other models. This is evident from the higher MI and PS measures that it achieves compared to its competitors. Moreover, Doc2Vec exhibits practical advantages, as it is straightforward to implement and significantly reduces computing time when compared to more advanced techniques like BERTopic. Interpreting the results of the Doc2Vec approach is simple, as it allows the identification of the most representative documents by retrieving the centroid vectors of each cluster. The Word2Vec approach, when coupled with k-means clustering on cleaned data, also shows good performance.

It is worth noting that there are no significant differences in performance measures when applying either the Top2Vec or BERTopic methods to raw or cleaned data. This may be due to the fact that raw data contain mathematical formulas that do not contribute substantial additional information, even if, as shown in Figure 5, raw data have a larger number of words. This apparent equivalence between raw and cleaned data may also be due to the relatively simple structure commonly found in quantitative finance papers. It is important to note that these findings may not generalize to papers or books with more intricate and complex text structures and without formulas.

LDA implemented in scikit-learn has better performance than the LDA implementation in gensim. The plain k-means does not show satisfactory results, even if the algorithm can be implemented without any great effort.

Finally, as an overall assessment, it is important to highlight that the performance metrics reported in Table 1 do not demonstrate particularly impressive results. This could partly stem from the wide-ranging nature of each q-fin category, encompassing numerous subtopics and arguments. Conversely, some papers can be classified under multiple categories, as it is not always obvious how to select a single definitive category for a given work.

4.3. Empirical Study

As shown in Section 4.2, the best performing model is Doc2Vec with k-means clustering applied to cleaned data. This model is used to gain a better understanding of the topics discussed in the quantitative finance papers analyzed in this work. To obtain the desired number of topics, I perform a k-means clustering analysis again. To extract the most representative documents, I retrieve the centroid vectors of each cluster. These centroids represent the average position of all document embeddings assigned to a particular cluster. For each cluster centroid, I find the nearest neighbors among the original Doc2Vec embeddings. These nearest neighbors are the documents that are closest to the centroid in the embedding space and can be considered as the main documents of that cluster. Thus, I select for each cluster the 20 most representative documents and I find a label for the topic on the basis of the document titles. Note that the underlying meanings of the topics are subject to human interpretation. However, this phase is also automated by asking ChatGPT (GPT-3.5) to name the topic after having provided the list of 20 titles (see [38]).
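The selection of representative documents can be illustrated with a small sketch. It assumes that the embeddings of the documents assigned to one cluster are available as lists of floats and that closeness is measured with the Euclidean distance; the distance used in the paper is not stated, so this choice, like the function names, is an assumption.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def most_representative(vectors: Sequence[Sequence[float]], n_top: int = 20) -> List[int]:
    """Indices of the `n_top` vectors closest to the cluster centroid."""
    center = centroid(vectors)
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))
    return order[:n_top]


if __name__ == "__main__":
    cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
    print(centroid(cluster))
    print(most_representative(cluster, n_top=2))
```

The titles of the documents at the returned indices would then form the list from which a topic label is derived.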

Given the size of my sample, a reduction down to 30 topics can be considered a good compromise for a topic analysis. The selected number of topics strikes a good balance between ensuring a sufficient quantity of documents for each topic and maintaining the desired level of granularity. This approach allows me to extract meaningful insights from the data while avoiding an excessive division of content that could affect the ability to identify overarching trends and patterns. As shown in Figure 6, most of the topics have an increasing trend (topic 28 seems to be the only exception). This is also reflected in the growth in the number of papers shown in Figure 1. For each topic, the list of 20 titles is the input for the question I ask the ChatGPT chatbot. The topic labels (i.e., the ChatGPT replies to the question) and the title of the most representative paper are reported in Table 2. ChatGPT was employed due to its ability to generate coherent topic labels. To ensure reliability, I validated the outputs and found a high degree of consistency between the topic labels generated by ChatGPT and the most representative papers.
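The per-topic time series behind a trend plot of this kind can be obtained by counting papers per topic and year. The sketch below assumes two parallel lists, one with the submission year of each paper and one with its assigned topic id; the function name `topic_counts_by_year` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def topic_counts_by_year(years: List[int], topics: List[int]) -> Dict[int, Dict[int, int]]:
    """Return {topic: {year: number of papers}} with years in ascending order."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for year, topic in zip(years, topics):
        counts[topic][year] += 1
    return {topic: dict(sorted(c.items())) for topic, c in counts.items()}


if __name__ == "__main__":
    years = [2020, 2020, 2021, 2021, 2021]
    topics = [2, 2, 2, 7, 2]
    print(topic_counts_by_year(years, topics))
```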

It is interesting to observe that topics related to decentralized finance and blockchain technology (2) and stock price prediction with deep learning and news sentiment analysis (20) show a remarkable increase. It is important to note that the latter topic is connected to behavioral finance. Health, policy, and social impact studies, represented by topic 16, and diverse perspectives in education, innovation, and economic development (0) also show an increase. Both topics are oriented towards economics. These topic trends are also affected by the introduction in 2014 of the q-fin.EC category within the arXiv quantitative finance papers. Classical quantitative finance subjects like portfolio optimization techniques and strategies (6), stochastic volatility modeling and option pricing (7), game theory and strategic decision-making (19), and high-order numerical methods for option pricing in finance (23), as well as new themes that have appeared in the literature in recent years, like deep reinforcement learning in stock trading and portfolio management (4) and environmental and economic impacts of mobility technologies (25), attracted the interest of researchers in the analyzed period. The representativeness of topic 28 is limited, mainly because the number of papers in this cluster is low, and one could merge it with another cluster (i.e., 8).
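The yearly topic counts behind a trend plot of this kind can be obtained with a simple tally. The sketch below is illustrative only: the function name `topic_trend` and the toy (year, topic) pairs are hypothetical, and each paper is assumed to carry its submission year and the topic number assigned by the clustering step.

```python
from collections import Counter, defaultdict


def topic_trend(papers):
    """Count papers per topic and year.

    `papers` is an iterable of (year, topic) pairs; the result maps each
    topic to a Counter of yearly paper counts."""
    trend = defaultdict(Counter)
    for year, topic in papers:
        trend[topic][year] += 1
    return trend


# Toy data: (year, topic) pairs, not taken from the real corpus.
toy_papers = [(2019, 2), (2020, 2), (2020, 2), (2020, 20), (2021, 20)]
toy_trend = topic_trend(toy_papers)
```

Plotting, for each topic, the counts against the year gives one line per topic, and an increasing line corresponds to a growing topic.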

It is clear that some topics are more related to economics than to finance. This also depends on the presence of the q-fin.EC category. For articles in this category, there is not always a flawless alignment with quantitative finance.

To visualize the clusters, in Figure 7, the document vectors are projected onto 2-dimensional space through the standard t-distributed stochastic neighbor embedding (t-SNE) algorithm. The larger the distance between topics, the more distinct the papers in those topics are in the original high-dimensional space. Conversely, the projection also gives a better view of how close topics are to each other (e.g., 8 and 28). It appears that the most specialized or narrowly focused topics tend to occupy peripheral positions, while themes that are more aligned with economics are positioned closer to the center.

5. Extracting Authors and Journals

In this section, I extract the author surnames by means of the spaCy package. In more detail, starting from the raw data, I perform named entity recognition. Since this approach extracts both first names and surnames, I remove all first names by checking whether these names are included in a list of about 67,000 first names. It should be noted that the number of occurrences of an author's surname in a paper strongly depends on the citation style. Surnames are always reported in the references, but they do not necessarily appear in the main text of a paper. Additionally, even if the author–date style is widely used (i.e., the citation in the text consists of the authors' names and the year of publication), the surname of the first author appears more frequently (e.g., Bianchi is more probable than Tassinari, even if Bianchi and Tassinari are coauthors of the same papers, together with other coauthors).
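The first-name filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the entity strings are assumed to have been extracted beforehand (in spaCy, they would be the spans in `doc.ents` whose `label_` is `"PERSON"`), the function name `count_surnames` is hypothetical, and the two-element first-name set stands in for the list of about 67,000 first names.

```python
from collections import Counter


def count_surnames(person_entities, first_names):
    """Count surname occurrences in a list of PERSON entity strings,
    discarding tokens that appear in a list of known first names."""
    known_first_names = {name.lower() for name in first_names}
    counts = Counter()
    for entity in person_entities:
        for token in entity.split():
            token = token.strip(".,;:()[]")
            # Skip initials (single letters) and known first names.
            if len(token) > 1 and token.lower() not in known_first_names:
                counts[token] += 1
    return counts


# Toy example: entity strings as they might come out of the NER step.
toy_entities = ["Peter Carr", "Carr", "Dilip B. Madan", "Madan", "Carr"]
toy_first_names = {"Peter", "Dilip"}
surname_counts = count_surnames(toy_entities, toy_first_names)
```

Calling `surname_counts.most_common(100)` on the full corpus would give a ranking of the most frequent surnames. As noted above, such counts depend on the citation style and cannot distinguish authors who share a surname.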

The algorithm is able to find the names and surnames occurring in the text. These are included in the PERSON entity types. The top 100 authors by number of occurrences in the corpus are selected. In order to have additional information on these authors, I obtain topics, number of citations, h-index, and i10-index from Google Scholar (see Table 3). It should be noted that not all authors are registered in Google Scholar, even if they made a significant contribution to the field (e.g., Markowitz), and that there are authors with the same surname who belong to the same research field (see also Figure 3 in [21]). This is the case for some researchers I find in my corpus (e.g., Zhou, Bayraktar, and Chakrabarti). For these last-named authors, it is not simple to find a perfect match in Google Scholar even if their number of occurrences is generally high (for the reasons described above, I exclude from Table 3 the following researcher names: Zhou, Markowitz, Peng, Jacod, Merton, Guo, Lo, Follmer, Yor, Almgren, Embrechts, Bayraktar, Artzner, Weber, Jarrow, Feng, Samuelson, Tang, Chakrabarti, Glasserman, Tsallis, Leung, Sato, Zariphopoulou, Kramkov, Karoui, Cizeau, Cao, and Christensen).

The algorithm is also able to find the most cited journals, included in the ORG entity types: Journal of Finance (4490 occurrences)*, Mathematical Finance (3785), Journal of Financial Economics (3325)*, Physica A (3137)*, Quantitative Finance (3044), Econometrica (2473)*, Journal of Econometrics (1971), American Economic Review (1878), Insurance Mathematics and Economics (1667), Review of Financial Studies (2636)*, Journal of Banking and Finance (1542)*, Physical Review (1538), Journal of Economic Dynamics (1289), Energy (1267), Operations Research (1242), The Quarterly Journal of Economics (1238)*, Journal of the American Statistical Association (1204), Management Science (1160), European Journal of Operational Research (1066), Quantum (1043), IEEE Transactions (996), Journal of Political Economy (990), Journal of Economic Theory (977), Energy Economics (946), International Journal of Theoretical and Applied Finance (946), SIAM Journal on Financial Mathematics (888), Science (865), Expert Systems with Applications (845), Applied Mathematical Finance (743), Finance and Stochastics (736), Mathematics of Operations Research (652), PLoS (602), The Annals of Applied Probability (593), Stochastic Processes and their Applications (525), Energy Policy (520), International Journal of Forecasting (520), The European Physical Journal (514), Journal of Empirical Finance (510), and Journal of Risk (509). I consider only journals with more than 500 occurrences and I exclude publishing houses. The journals with the asterisk symbol are identified with more than one name. It should be noted that some well-known journals are slightly below 500 occurrences. Furthermore, it is worth noting that papers with a strong mathematical focus tend to receive significantly fewer citations compared to papers that lean more towards economics or finance.
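The journal ranking can be illustrated with the sketch below, which merges alternative names of the same journal and keeps only journals above a minimum number of occurrences. It is an illustration under stated assumptions: the function name `count_journals`, the alias table, and the toy entity list are hypothetical, and the ORG entity strings are assumed to have been extracted beforehand by the named entity recognition step.

```python
from collections import Counter


def count_journals(org_entities, aliases, min_occurrences=500):
    """Count journal occurrences, mapping alternative names to a canonical
    name, and keep journals with more than min_occurrences occurrences."""
    counts = Counter(aliases.get(name, name) for name in org_entities)
    return [
        (journal, n)
        for journal, n in counts.most_common()
        if n > min_occurrences
    ]


# Hypothetical alias table: journals identified with more than one name.
toy_aliases = {
    "The Journal of Finance": "Journal of Finance",
    "J. Finance": "Journal of Finance",
}
toy_orgs = [
    "Journal of Finance", "J. Finance", "The Journal of Finance",
    "Mathematical Finance", "Mathematical Finance", "Energy",
]
# A threshold of 1 is used only because the toy list is tiny.
toy_ranking = count_journals(toy_orgs, toy_aliases, min_occurrences=1)
```

With the default threshold of 500, only journals with more than 500 occurrences would be kept, as in the list above. Publishing houses would still have to be excluded separately.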

It is important to acknowledge that while the arXiv repository serves as a valuable resource for scholarly papers, it may not encompass the entirety of the quantitative finance research landscape. While the repository strives to be comprehensive, there may be variations in the representation of scholars from different countries. Some scholars may have a relatively higher presence due to their active participation in submitting their research to arXiv. The platform's content relies on authors voluntarily submitting their work, which introduces some bias. As a result, some authors and their contributions may not be represented. Therefore, my analysis and conclusions should be interpreted within the context of the available arXiv data, recognizing that there may be additional research and authors in the field of quantitative finance who have chosen alternative avenues for publishing their work. The same observation is true for the findings described in Section 4.3.

It is possible that influential scholars may not be as consistently represented or that, for various reasons, they have not regularly submitted their work to arXiv (see also [39]). In the study [21], focused on option pricing, some well-known authors are cited but they do not appear among the first 100 authors in my analysis. This discrepancy could be influenced by factors such as publication preferences, possible copyright issues, or institutional practices that may vary across different academic communities.

As a final remark, in my view, researchers in quantitative finance should consider submitting their work to arXiv due to the potential benefits it offers (see also [40]). The delay between arXiv posting and journal publication, which can sometimes be more than a year, underscores the importance of submitting preprints to the repository. By doing so, researchers can help the community to understand research trends in their field more promptly, while also accelerating the dissemination of their own findings. This approach aligns with the findings of [41], who show that rapid and open dissemination through preprints helps scholarly and scientific communication, enhancing the reception of findings in the field. In the meantime, as a possible alternative, to gain a comprehensive understanding of the entire landscape, future studies may consider incorporating other reputable academic databases and journals to ensure a more holistic exploration of quantitative finance research and its authors. Network approaches would also help to identify cliques and highly connected groups of authors. Insights from network clusters could further improve the understanding of the textual data investigated in this work.

6. Conclusions

In this study, I explore the field of quantitative finance through an analysis of papers in the arXiv repository. My objectives are twofold: first, I investigate the evolution of topics over time, and second, I identify prominent authors and journals in this domain. By employing data mining techniques, I achieve these goals without reading the papers individually.

The preprocessing phase, when needed, ensures the suitability of the data for subsequent analyses. Topic modeling helps in gaining insights and understanding the main themes and trends within my large dataset. By applying topic modeling algorithms, I identify the best performer and examine the temporal evolution of quantitative finance topics. This analysis reveals the changing research trends and highlights the emergence and decline of various topics over time.

Furthermore, I conduct an entity extraction process to identify influential authors and journals in the field. Through quantifying author and journal occurrences, I shed light on the researchers who have made notable contributions to quantitative finance.

My study demonstrates the power of data mining techniques in uncovering insights from a large-scale preprint repository and highlights the continued growth and dynamism of quantitative finance as a discipline. The techniques explored in this work can assist researchers in exploring and identifying novel research topics, discovering connections between different research areas, and staying up-to-date with the latest developments in the field. Furthermore, my methodology may serve as a roadmap for future studies on broader datasets or in other scientific domains utilizing text mining techniques. Although scaling to a larger number of papers may pose challenges, my approach provides valuable insights.

Finally, I believe that quantitative finance researchers should consider sharing their work on arXiv to potentially accelerate the dissemination and impact of their findings and to enhance the community's understanding of research trends.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study were collected from https://arxiv.org/ (accessed on 21 February 2025).

Acknowledgments

The author thanks arXiv, ChatGPT, and Google Scholar for use of their open-access interoperability, as well as Sabina Marchetti and two anonymous referees for their comments and suggestions. This publication should not be reported as representing the views of the Bank of Italy. The views expressed are those of the author and do not necessarily reflect those of the Bank of Italy.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Bianchi, M.L.; Tassinari, G.L.; Fabozzi, F.J. Fat and heavy tails in asset management. J. Portf. Manag. 2023, 49, 236–263.
2. Vogl, M. Quantitative modelling frontiers: A literature review on the evolution in financial and risk modelling after the financial crisis (2008–2019). SN Bus. Econ. 2022, 2, 183.
3. Derman, E. Models Behaving Badly: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life; Wiley: Hoboken, NJ, USA, 2011.
4. Ippoliti, E. Mathematics and finance: Some philosophical remarks. Topoi 2021, 40, 771–781.
5. Carmona, R. The influence of economic research on financial mathematics: Evidence from the last 25 years. Financ. Stoch. 2022, 26, 85–101.
6. Cesa, M. A brief history of quantitative finance. Probab. Uncertain. Quant. Risk 2017, 2, 1–16.
7. Derman, E.; Miller, M.B. The Volatility Smile; Wiley: Hoboken, NJ, USA, 2016.
8. Bianchetti, M.; Carlicchi, M. Interest Rates After the Credit Crunch: Multiple-Curve Vanilla Derivatives and SABR. Available online: https://arxiv.org/abs/1103.2567 (accessed on 21 February 2025).
9. Huisman, J.; Smits, J. Duration and quality of the peer review process: The author’s perspective. Scientometrics 2017, 113, 633–650.
10. Clement, C.B.; Bierbaum, M.; O’Keeffe, K.P.; Alemi, A.A. On the use of arXiv as a dataset. Available online: https://arxiv.org/abs/1905.00075 (accessed on 21 February 2025).
11. Eger, S.; Li, C.; Netzer, F.; Gurevych, I. Predicting Research Trends from arXiv. Available online: https://arxiv.org/abs/1903.02831 (accessed on 21 February 2025).
12. Viet, N.T.; Kravets, A.G. Analyzing recent research trends of computer science from academic open-access digital library. In Proceedings of the 8th International Conference on System Modeling and Advancement in Research Trends (SMART), Moradabad, India, 22–23 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 31–36.
13. Lin, J.; Yu, Y.; Zhou, Y.; Zhou, Z.; Shi, X. How many preprints have actually been printed and why: A case study of computer science preprints on arXiv. Scientometrics 2020, 124, 555–574.
14. Tan, K.; Munster, A.; Mackenzie, A. Images of the arXiv: Reconfiguring large scientific image datasets. J. Cult. Anal. 2021, 3, 1–41.
15. Okamura, K. Scientometric engineering: Exploring citation dynamics via arXiv eprints. Quant. Sci. Stud. 2022, 3, 122–146.
16. Bohara, K.; Shakya, A.; Debb Pande, B. Fine-tuning of RoBERTa for document classification of arXiv dataset. In Mobile Computing and Sustainable Informatics; Shakya, G., Papakostas, S., Kamel, K.A., Eds.; Springer Nature: Singapore, 2023; pp. 243–255.
17. Fatima, R.; Yasin, A.; Liu, L.; Wang, J.; Afzal, W. Retrieving arXiv, SocArXiv, and SSRN metadata for initial review screening. Inf. Softw. Technol. 2023, 161, 107251.
18. Burton, B.; Kumar, S.; Pandey, N. Twenty-five years of The European Journal of Finance (EJF): A retrospective analysis. Eur. J. Financ. 2020, 26, 1817–1841.
19. Ali, A.; Bashir, H.A. Bibliometric study on asset pricing. Qual. Res. Financ. Mark. 2022, 14, 433–460.
20. Aria, M.; Cuccurullo, C. bibliometrix: An R-tool for comprehensive science mapping analysis. J. Informetr. 2017, 11, 959–975.
21. Sharma, P.; Sharma, D.K.; Gupta, P. Review of research on option pricing: A bibliometric analysis. Qual. Res. Financ. Mark. 2024, 16, 159–182.
22. Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to conduct a bibliometric analysis: An overview and guidelines. J. Bus. Res. 2021, 133, 285–296.
23. Van Eck, N.; Waltman, L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 2010, 84, 523–538.
24. Bastian, M.; Heymann, S.; Jacomy, M. Gephi: An open source software for exploring and manipulating networks. In Proceedings of the International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 17–20 May 2009; pp. 361–362.
25. Liu, J.; Liu, Y.; Ren, L.; Li, X.; Wang, S. Trends and Trajectories: A Bibliometric Analysis of Financial Risk in Corporate Finance and Finance (2020–2024). Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4960436 (accessed on 21 February 2025).
26. Joaqui-Barandica, O.; Manotas-Duque, D.F. Assets liability management: A bibliometric analysis and topic modeling. Entramado 2022, 18, 1–23.
27. Westergaard, D.; Stærfeldt, H.H.; Tønsberg, C.; Jensen, L.J.; Brunak, S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput. Biol. 2018, 14, e1005962.
28. DuBay, W.H. The principles of readability. ERIC 2004. Available online: https://eric.ed.gov/?id=ed490073 (accessed on 21 February 2025).
29. Sethia, K.; Saxena, M.; Goyal, M.; Yadav, R.K. Framework for topic modeling using BERT, LDA and K-means. In Proceedings of the 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, India, 28–29 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2204–2208.
30. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
31. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
32. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning; PMLR: Brookline, MA, USA, 2014; pp. 1188–1196.
33. Angelov, D. Top2vec: Distributed Representations of Topics. arXiv 2020, arXiv:2008.09470.
34. Grootendorst, M. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv 2022, arXiv:2203.05794.
35. Egger, R.; Yu, J. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front. Sociol. 2022, 7, 886498.
36. Rüdiger, M.; Antons, D.; Joshi, A.M.; Salge, T.O. Topic modeling revisited: New evidence on algorithm performance and quality metrics. PLoS ONE 2022, 17, e0266325.
37. Terragni, S.; Fersini, E.; Galuzzi, B.G.; Tropeano, P.; Candelieri, A. OCTIS: Comparing and optimizing topic models is simple. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Kyiv, Ukraine, 19–23 April 2021; pp. 263–270.
38. Ebinezer, S. Transform Your Topic Modeling with ChatGPT: Cutting-Edge NLP. Available online: https://medium.com/ (accessed on 21 February 2025).
39. Metelko, Z.; Maver, J. Exploring arXiv usage habits among Slovenian scientists. J. Doc. 2023, 79, 72–94.
40. Mishkin, D.; Tabb, A.; Matas, J. ArXiving Before Submission Helps Everyone. arXiv 2020, arXiv:2010.05365.
41. Wang, Z.; Chen, Y.; Glänzel, W. Preprints as accelerators of scholarly communication: An empirical analysis in mathematics. J. Informetr. 2020, 14, 101097.

Figure 1. Categories by year. The y-axis indicates the total number of papers.

Figure 2. Flesch reading ease score. The vertical line represents the median, the dashed line is the quantile of level 0.25 (0.75), and the dotted line is the quantile of level 0.005 (0.995). The y-axis indicates the total number of papers.

Figure 3. Frequent words of the corpus and their percentages of appearance.

Figure 4. Frequency-based word clouds of bigrams and of all words including bigrams and tri(four)grams. I consider n-grams with parameters min count equal to 250 and threshold equal to 10.

Figure 5. Paper length, in terms of number of words, for raw, lemmatized and cleaned data. The x-axis values are in thousands. The vertical line represents the median, the dashed line is the quantile of level 0.25 (0.75), and the dotted line is the quantile of level 0.01 (0.99). The y-axis indicates the total number of papers.

Figure 6. Topics trend by year across the sample of around 16,000 papers in the q-fin categories. The y-axis indicates the total number of papers.

Figure 7. Topic clusters projected onto 2-dimensional space through the standard t-distributed stochastic neighbor embedding (t-SNE) algorithm. The numbers represent the topics reported in Table 2.

Table 1. Algorithm performance.

Algorithm	RS	ARS	MI	NMI	CA	PS
k-means	0.570	0.029	0.232	0.136	0.271	0.297
LDA scikit-learn	0.823	0.194	0.608	0.284	0.376	0.460
LDA gensim	0.788	0.085	0.276	0.131	0.275	0.314
Word2Vec	0.832	0.200	0.613	0.283	0.371	0.427
Doc2Vec	0.831	0.220	0.699	0.325	0.388	0.490
Top2Vec raw	0.810	0.195	0.501	0.239	0.365	0.404
Top2Vec cleaned	0.811	0.206	0.530	0.387	0.253	0.416
BERTopic raw	0.826	0.238	0.608	0.289	0.436	0.458
BERTopic cleaned	0.821	0.239	0.574	0.276	0.398	0.429

I report the Rand score (RS), the adjusted Rand score (ARS), the mutual info score (MI), the normalized mutual info score (NMI), the cluster accuracy (CA), and the purity score (PS).
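To make two of these measures concrete, the sketch below computes the Rand score and the purity score from their standard textbook definitions. It is an illustration only and not necessarily the implementation used to produce Table 1: the function names `rand_index` and `purity_score` and the toy labels are hypothetical. In this setting, the reference labels would be the arXiv categories and the predicted labels the cluster assignments.

```python
from collections import Counter
from itertools import combinations


def rand_index(true_labels, cluster_labels):
    """Fraction of document pairs on which the two partitions agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j])
        == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def purity_score(true_labels, cluster_labels):
    """Share of documents that belong to the majority true class
    of the cluster they were assigned to."""
    classes_by_cluster = {}
    for true, cluster in zip(true_labels, cluster_labels):
        classes_by_cluster.setdefault(cluster, Counter())[true] += 1
    majority_total = sum(
        max(counter.values()) for counter in classes_by_cluster.values()
    )
    return majority_total / len(true_labels)


# Toy example: six documents, two true classes, two clusters.
toy_true = ["a", "a", "a", "b", "b", "b"]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_rand = rand_index(toy_true, toy_pred)
toy_purity = purity_score(toy_true, toy_pred)
```

In the toy example, 10 of the 15 document pairs are treated consistently by the two partitions, and 5 of the 6 documents fall in the majority class of their cluster. Both scores lie between 0 and 1, with 1 indicating perfect agreement with the reference labels.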

Table 2. The label extracted from ChatGPT and the title of the most representative paper for each topic.

Number	Label	Title of the Most Representative Paper
0	diverse perspectives in education, innovation, and economic development	Perspectives in public and university sector co-operation in the change in the higher education model in Hungary, in light of China’s experience
1	modeling financial market dynamics	Comment on: Thermal model for adaptive competition in a market
2	decentralized finance and blockchain technology	Understanding the maker protocol
3	correlation analysis in financial markets and networks	Random matrix theory and cross-correlations in global financial indices and local stock market indices
4	deep reinforcement learning in stock trading and portfolio management	Practical deep reinforcement learning approach for stock trading optimal market making by reinforcement learning
5	optimal trading and portfolio liquidation strategies in financial markets	An FBSDE approach to market impact games with stochastic parameters
6	portfolio optimization techniques and strategies	Seven sins in portfolio optimization
7	stochastic volatility modeling and option pricing	On the uniqueness of classical solutions of Cauchy problems
8	asset pricing, investment, and arbitrage in financial markets	Characterization of arbitrage-free markets
9	network analysis of financial contagion and systemic risk	Clearing algorithms and network centrality
10	counterparty risk and valuation adjustments in financial derivatives	Collateral margining in arbitrage-free counterparty valuation adjustment including re-hypotecation and netting
11	quantum models in finance and option pricing	Sornette–Ide model for markets: Trader expectations as imaginary part
12	valuation and risk management in annuity and insurance products	A policyholder’s utility indifference valuation model for the guaranteed annuity option
13	optimal dividend strategies in stochastic control and risk management	Optimal dividends problem with a terminal value for spectrally positive Levy processes
14	risk measures and utility maximization under model uncertainty	On the C-property and $w^{*}$ -representations of risk measures
15	economic complexity, networks, and trade patterns	Economic complexity and growth: Can value-added exports better explain the link?
16	health, policy, and social impact studies	Ramadan and infant health outcomes
17	statistical analysis of financial markets and volatility	Volatility distribution in the S&P500 Stock Index
18	renewable energy economics and electricity market dynamics	On wholesale electricity prices and market values in a carbon-neutral energy system
19	game theory and strategic decision-making	Simultaneous auctions for complementary goods
20	stock price prediction with deep learning and news sentiment analysis	Stock Prediction: a method based on extraction of news features and recurrent neural networks
21	kinetic wealth exchange models in economics	Gibbs versus non-Gibbs distributions in money dynamics
22	market order flow and price impact	Order flow and price formation
23	high-order numerical methods for option pricing in finance	High-order compact finite difference scheme for option pricing in stochastic volatility with contemporaneous jump models
24	optimal investment and consumption in financial models with constraints	Recursive utility optimization with concave coefficients
25	environmental and economic impacts of mobility technologies	A review on energy, environmental, and sustainability implications of connected and automated vehicles
26	advanced risk measures in financial modeling	Generating unfavorable VaR scenarios with patchwork copulas
27	Bayesian models for financial tail risk forecasting	A semi-parametric realized joint value-at-risk and expected shortfall regression framework
28	pricing and modeling options in stochastic volatility models with jumps	Semi-analytical pricing of barrier options in the time-dependent Heston model
29	economic growth and market dynamics	Uncovering volatility dynamics in daily REIT returns

Table 3. Number of author occurrences and Google Scholar metrics as of May 2023. The symbol [] indicates that the information is not available.

Author_ID	Occurrences	Name	Topics	Citations	h-Index	i10-Index
mG07_6k	4505	Ioannis Karatzas	[‘stochastic analysis’, ‘stochastic control’, ‘mathematical finance’]	35,724	60	127
58amEmw	4441	Jean-Philippe Bouchaud	[‘statistical mechanics’, ‘disordered systems’, ‘random matrices’, ‘quantitative finance’, ‘agent based models’]	49,794	105	351
ahLm1v0	3256	Walter Schachermayer	[]	15,891	56	124
HGsSmMA	3135	Didier Sornette	[‘cooperation’, ‘organization’, ‘patterns’, ‘prediction’]	53,371	112	587
CqFCQVE	2794	Alexander Schied	[‘probability theory’, ‘stochastic processes’, ‘mathematical finance’]	16,780	35	71
2QOp9_M	2457	Peter Carr	[‘financial engineering’, ‘quantitative finance’, ‘mathematical finance’, ‘derivatives’, ‘volatility’]	24,060	62	117
mVF1X_U	2371	Freddy Delbaen	[‘mathematik’, ‘ökonomie’]	25,557	50	90
ElAtiUs	2258	Darrell Duffie	[‘finance’, ‘central banking’]	58,442	88	142
fq7BQos	2042	Dilip B. Madan	[‘mathematical finance’, ‘general equilibrium theory’]	25,131	55	156
8abFiFM	1774	Robert Engle	[‘finance and econometrics’]	189,678	118	229
GU9HgNA	1743	Quanquan Gu	[‘statistical machine learning’, ‘nonconvex optimization’, ‘deep learning theory’, ‘reinforcement learning’, ‘ai for science’]	14,915	56	161
Q7N-rCk	1672	Rosario Nunzio Mantegna	[‘econophysics’, ‘statistical physics’, ‘complex systems’, ‘financial markets’, ‘information filtering’]	28,269	67	139
QsYYhSE	1629	Søren Johansen	[‘matamatical statistics’, ‘econometrics’]	97,728	65	132
vZA2pjw	1627	Benoît B. Mandelbrot	[‘mathematics’, ‘fractals’, ‘economics’, ‘information theory’, ‘fluid dynamics’]	142,895	96	319
Lf1kf1Q	1505	Mario Coccia	[‘evolution of technology’, ‘scientific change’, ‘social dynamics’, ‘complex adaptive systems’, ‘environment & COVID-19’]	19,376	106	228
9HXRjPk	1444	Damir Filipovic	[‘quantitative finance’, ‘quantitative risk management’]	6910	39	73
6_INHZI	1389	Fabrizio Lillo	[‘quantitative finance’, ‘statistical mechanics’, ‘data science’]	10,900	51	121
mGpnlA8	1218	Touzi Nizar	[‘stochastic control’, ‘mathematical finance’, ‘monte carlo methods’]	11,831	56	120
rp-3Yoo	1187	Barry Williams	[‘banks and banking’, ‘bank risk’, ‘multinational’, ‘banking’]	2342	17	23
MZNxzRY	1161	Huyên Pham	[‘mathematical finance’, ‘stochastic control’, ‘numerical probabilities’]	9967	54	135
3HhvEUc	1147	Yuri Kabanov	[‘mathematical finance’, ‘mathematics’]	6297	38	77
-YEPo1E	1143	Wing-Keung Wong	[‘financial economics’, ‘econometrics’, ‘investment theory’, ‘risk management’, ‘operational research’]	14,440	65	274
GyPrRgc	1138	Swarn Chatterjee	[‘financial planning’, ‘wealth management’, ‘financial literacy’, ‘household finance’, ‘behavioral finance’]	2719	28	54
RZid9X8	1075	Guido Caldarelli	[‘network theory’, ‘network science’, ‘statistical physics’, ‘complex systems’]	24,165	71	191
ImhakoA	1075	Daniel Kahneman	[]	519,507	158	369
zO_tShM	1050	Marek Rutkowski	[‘mathematical finance’, ‘stochastic processes’]	7559	30	67
7NJ7Ax8	1039	Patrick Cheridito	[]	5400	34	59
nyfza90	1019	Volker Schmidt	[‘virtual materials testing’, ‘statistical learning’, ‘image analysis’, ‘spatial stochastic modeling’, ‘monte carlo simulation’]	11,896	52	230
x4vtSxI	1017	Rene Carmona	[‘stochastic analysis’, ‘financial mathematics’, ‘financial engineering’]	17,474	59	139
kukA0Lc	999	Yoshua Bengio	[‘machine learning’, ‘deep learning’, ‘artificial intelligence’]	656,874	222	763
vQ0_nz8	989	Emmanuel Bacry	[‘self-similarity’, ‘multifractal’, ‘stochastic modeling’, ‘statistical finance’, ‘financial time-series modelization’]	11,937	47	69
1XwLUrc	980	Jim Gatheral	[‘volatility modeling’, ‘market microstructure’, ‘algorithmic trading’]	6126	30	42
3HwRbiQ	955	Jerome Friedman	[]	283,058	95	197
e2Xowj0	900	Neil Shephard	[‘econometrics’, ‘economics’, ‘statistics’, ‘financial econometrics’, ‘finance’]	42,035	69	140
pEnxwCM	887	Victor M. Yakovenko	[‘condensed matter theory’, ‘econophysics’]	8891	44	101
a11vssU	845	Constantinos Kardaras	[‘stochastic analysis’, ‘probability’, ‘mathematical finance’]	1557	20	31
79htA7g	838	Bent Flyvbjerg	[‘project management’, ‘management’, ‘infrastructure’, ‘planning’, ‘cities’]	73,264	70	152
zH1qBSo	834	Albert Shiryaev	[‘probability theory’]	35,521	59	163
QVb4LGI	815	Andrey Itkin	[‘mathematical finance’, ‘computational finance’, ‘derivatives’, ‘quantitative finance’, ‘machine learning’]	709	14	19
Zuhod6s	813	Yong Deng	[‘uncertainty’, ‘deng entropy’, ‘information volume’, ‘random permutation set’, ‘chaos and fractal’]	23,189	81	335
bWlZ3-Y	810	Eric Jacquier	[]	4458	19	25
GKthQJQ	804	Peter K. Friz	[‘rough path theory’, ‘stochastic analysis’, ‘pdes’, ‘finance’]	5678	39	82
2qTa_4U	794	Francis Diebold	[‘economics’, ‘econometrics’, ‘time series’, ‘statistics’]	76,159	97	175
utY1nTo	794	Matteo Marsili	[‘statistical mechanics’, ‘stochastic processes’, ‘collective phenomena in socio-economic systems’, ‘networks’, ‘complex systems’]	10,093	50	139
ZpG_cJw	783	Robert Tibshirani	[‘statistics’, ‘data science’, ‘machine learning’]	460,493	172	525
65wdZxA	780	Damiano Brigo	[‘probability’, ‘mathematical finance’, ‘stochastic analysis’, ‘signal processing’, ‘differential geometry and statistics’]	9663	42	114
bxJe87s	780	Marco Frittelli	[‘financial mathematics’, ‘mathematical finance’, ‘probability’]	3795	24	33
aVju7cI	771	Monique Jeanblanc	[‘mathématiques financières’]	10,256	51	110
-iOn6uI	769	Aurélien Alfonsi	[]	2731	22	30
P_LECrk	750	Tomasz R. Bielecki	[‘mathematical finance’, ‘stochastic processes’, ‘stochastic control’, ‘stochastic analysis’, ‘probability’]	6901	39	89
5sQ0Fag	729	Ajit Singh	[]	21,592	60	360
6quAJUE	706	Josef Teichmann	[‘mathematical finance’, ‘machine learning in finance’, ‘rough analysis’]	3019	30	67
58amEmw	705	Jean-Philippe Bouchaud	[‘statistical mechanics’, ‘disordered systems’, ‘random matrices’, ‘quantitative finance’, ‘agent based models’]	49,794	105	351
JicYPdA	691	Geoffrey Hinton	[‘machine learning’, ‘psychology’, ‘artificial intelligence’, ‘cognitive science’, ‘computer science’]	687,453	180	436
i2MC67A	679	J.F. Muzy	[‘multifractal analysis’, ‘econophysics’, ‘turbulence’]	13,417	55	83
K9yGky8	678	Andreas Kyprianou	[‘probability theory’, ‘applied mathematics’]	7946	44	99
aCSds20	670	Xavier Gabaix	[‘economics’, ‘finance’]	30,926	56	75
G-WPCrM	667	Diego Garlaschelli	[‘network theory’, ‘econophysics’, ‘sociophysics’, ‘statistical physics’]	7394	41	73
YTCnA4E	664	Eduardo Schwartz	[‘finance’]	44,652	81	141
A0ISJPU	664	Steven Shreve	[‘probability’, ‘financial mathematics’]	35,605	42	70
fFFOHec	660	Alexander McNeil	[]	24,473	41	60
dYwbc9s	659	Guido Imbens	[‘causal inference’, ‘econometrics’]	90,765	95	169
OQK4DDY	657	Peter Forsyth	[‘scientific computing’, ‘computational finance’, ‘numerical solution of pdes’]	10,617	58	143
c1wQ9_k	655	Daojian Zeng	[‘natural language processing’]	4877	13	17
Vs7kOf4	645	Marcel Nutz	[‘optimal transport’, ‘mathematical finance’, ‘game theory’]	2339	30	43
vjc1kF0	640	Francesca Biagini	[‘financial and insurance mathematics’, ‘stochastic calculus’, ‘probability’]	2598	20	42
zGJKZpk	629	Marianne Bertrand	[]	64,693	66	119
nEfnJZM	628	Vadim Linetsky	[‘mathematical finance’, ‘financial economics’]	5260	39	63
Bekg2Qo	621	Joel Shapiro	[‘financial intermediation’, ‘regulation of financial institutions’, ‘corporate governance’, ‘industrial organization’]	3449	15	16
KDhGvNQ	611	Johanna Ziegel	[‘statistical forecasting’, ‘risk measures’, ‘postitive definite functions’, ‘stereology’, ‘copulas’]	2088	19	31
r5PHkCs	610	Thomas Guhr	[‘theoretical physics’]	8395	39	104

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bianchi, M.L. Text Mining arXiv: A Look Through Quantitative Finance Papers. Mathematics 2025, 13, 1375. https://doi.org/10.3390/math13091375
