Text Mining arXiv: A Look Through Quantitative Finance Papers
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript investigates research trends in quantitative finance using text mining techniques applied to arXiv preprints between 1997 and 2022. The study aims to identify dominant topics, influential authors, and emerging topics in the field by leveraging machine learning models for topic modeling and entity recognition.
This question is relevant as it provides a structured overview of how research in quantitative finance has evolved over time.
The manuscript presents an original application of natural language processing (NLP) and topic modeling to explore trends in financial research. The study is relevant to the field because:
It provides a systematic analysis of the evolution of research in quantitative finance.
It uses state-of-the-art text mining methods, including LDA, BERTopic, and Doc2Vec.
It highlights emerging topics, such as blockchain finance, decentralized finance (DeFi), and machine learning applications.
Multiple topic modeling approaches were considered, which improves robustness. The dataset is sufficiently large (arXiv from 1997 to 2022) and covers a broad scope of research.
The conclusions summarize the key findings well. The study correctly identifies dominant research areas and emerging topics.
The graphs are well designed and convey the information effectively.
------
Minor revisions
The introduction is clear and contextualizes the research well, but could be better linked to the behavioral finance approach mentioned in the special issue.
Methodology: The description could be more concise and should better justify the methodological choices, for example the use of ChatGPT for topic classification.
The study uses ChatGPT to label the topics. While innovative, the accuracy or reliability of these labels is not discussed.
Have human annotators validated the topic classifications?
The authors employ multiple topic modeling techniques (LDA, BERTopic, Doc2Vec, Top2Vec), but it is not clear why these models were chosen.
How do the selected models compare in terms of performance (coherence scores, interpretability)?
Are there publication biases (e.g. preference for theoretical papers over empirical financial studies)?
The study could benefit from referencing more bibliometric studies in finance to contextualize its contribution.
https://doi.org/10.18041/1900-3803/entramado.1.8242
Author Response
The introduction is clear and contextualizes the research well, but could be better linked to the behavioral finance approach mentioned in the special issue.
Reply: I added a sentence in the introduction.
Methodology: The description could be more concise and should better justify the methodological choices, for example the use of ChatGPT for topic classification.
The study uses ChatGPT to label the topics. While innovative, the accuracy or reliability of these labels is not discussed.
Have human annotators validated the topic classifications?
Reply: I added a brief explanation.
The authors employ multiple topic modeling techniques (LDA, BERTopic, Doc2Vec, Top2Vec), but it is not clear why these models were chosen.
Reply: I employed multiple topic modeling techniques—LDA, BERTopic, Doc2Vec, and Top2Vec—based on their widespread application in the related literature.
How do the selected models compare in terms of performance (coherence scores, interpretability)?
Reply: The algorithm performances are reported in Table 1.
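For readers unfamiliar with coherence scores, the following is a minimal sketch of one common measure, UMass coherence, computed from document co-occurrence counts. It is an illustration only: the function name `umass_coherence` and the toy corpus are assumptions, and the metric reported in Table 1 of the manuscript may be a different one.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (hypothetical helper, not the authors' code).

    top_words: the topic's top words, ordered from most to least probable.
    documents: tokenized corpus, one list of tokens per document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word absent from the corpus; skip to avoid dividing by zero
            # Smoothed log ratio of joint to marginal document frequency.
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Toy corpus (made up for illustration).
docs = [
    ["risk", "portfolio", "option"],
    ["risk", "portfolio"],
    ["option", "volatility"],
]
coherent = umass_coherence(["risk", "portfolio"], docs)
incoherent = umass_coherence(["risk", "volatility"], docs)
```

Higher (less negative) values indicate that a topic's top words tend to appear in the same documents, so averaging this score over topics gives one way to compare models on the same corpus.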
Are there publication biases (e.g. preference for theoretical papers over empirical financial studies)?
Reply: We considered all papers submitted to arXiv. Yes, as mentioned in the paper, some bias may exist, and the dataset may not be representative of the entire quantitative finance literature.
The study could benefit from referencing more bibliometric studies in finance to contextualize its contribution.
https://doi.org/10.18041/1900-3803/entramado.1.8242
Reply: I added the references.
Reviewer 2 Report
Comments and Suggestions for Authors
In this paper, the authors apply text mining techniques to provide an automated approach to the analysis of research papers, i.e., trend analysis in quantitative finance. It is an effective use case of topic modeling. QF papers from the arXiv database are used in the research.
Main drawbacks:
- The structure of the paper can be improved e.g. Related literature section. It should focus on text mining techniques used in topic modeling, and applications of these techniques in paper/trend analysis. It is recommended that these techniques are separated from experiment results in section Topics trend.
- Introduction - If I understand correctly, this is the first attempt to investigate the quantitative finance field automatically (using text mining) on the arXiv dataset. Were there similar attempts to automatically explore QF papers on other datasets?
- The arXiv dataset contains papers from 1997 until 2022. The last 2 years are not considered. I suggest that you include 2023 and 2024 in the analysis.
- Preprocessing – The hyperlinks are left in the text (even http occurs as a word with high frequency). Please consider removing http and all the links from the text, as they usually carry no meaning.
General and technical remarks:
Figure 4 – poor quality, not clear and readable
Please make sure to name axes on all figures (e.g. Figure 2, Figure 5, etc.)
Figure 5 – for better comparison try to put same/similar axes on 3 plots (e.g. x-axes 0 to 40). Also, enlarge the graph so that median lines, q1 and q2 are clearly visible.
Table 3. Please explain the meaning of [ ] in topics (e.g. Daniel Kahneman)
Table 3. occurrences should not be a decimal number, but integer
Referencing – it is not common to start a sentence with a reference in brackets [] (e.g. page 2, [9], [1]). For the sake of readability, use instead: In research []….
Please check for typos, e.g. page 9 paragraph 2 – ‘T’ at the end of paragraph, page 10 – ‘To cluster this list of lists of topics...’
Please check grammar – e.g. page 8 - Since IT is not simple to assess
Author Response
The structure of the paper can be improved e.g. Related literature section. It should focus on text mining techniques used in topic modeling, and applications of these techniques in paper/trend analysis. It is recommended that these techniques are separated from experiment results in section Topics trend.
Reply: I modified Section 4.
Introduction - If I understand correctly, this is the first attempt to investigate the quantitative finance field automatically (using text mining) on the arXiv dataset. Were there similar attempts to automatically explore QF papers on other datasets?
Reply: To the best of my knowledge, I have included all relevant papers in the Introduction.
The arXiv dataset contains papers from 1997 until 2022. The last 2 years are not considered. I suggest that you include 2023 and 2024 in the analysis.
Reply: Incorporating data from 2023 and 2024 into the empirical analysis is demanding, as it necessitates redoing the entire analysis from the beginning.
Preprocessing – The hyperlinks are left in the text (even http occurs as a word with high frequency). Please consider removing http and all the links from the text, as they usually carry no meaning.
Reply: In a future release of the code, I will remove additional words that do not contribute meaningful information and update the main results of the empirical study. In my view, and based on the papers within each cluster, these few words are unlikely to significantly impact the empirical results.
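As an illustration of the preprocessing step discussed in this exchange, the following is a minimal sketch of how hyperlinks and leftover tokens such as "http" could be stripped before topic modeling. The function name `strip_links` and both regular expressions are assumptions made for this example; they are not taken from the authors' code.

```python
import re

# Matches http(s):// links and bare www. links up to the next whitespace.
# Note: punctuation attached to the end of a link is removed with it.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Matches scheme tokens left behind when a tokenizer has already split a link.
RESIDUE_PATTERN = re.compile(r"\b(?:https?|www)\b", re.IGNORECASE)


def strip_links(text: str) -> str:
    """Remove links and stray link tokens, then collapse whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    without_residue = RESIDUE_PATTERN.sub(" ", without_urls)
    return " ".join(without_residue.split())


sample = "See https://arxiv.org/abs/1234 for details http and www.example.com"
cleaned = strip_links(sample)
```

Applying such a filter before tokenization would keep "http" out of the word-frequency counts; whether it changes the resulting clusters would still need to be checked by rerunning the analysis.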
Figure 4 – poor quality, not clear and readable
Reply: I modified it to increase the quality.
Please make sure to name axes on all figures (e.g. Figure 2, Figure 5, etc.)
Reply: I added the explanation in the caption.
Figure 5 – for better comparison try to put same/similar axes on 3 plots (e.g. x-axes 0 to 40). Also, enlarge the graph so that median lines, q1 and q2 are clearly visible.
Reply: Modified as suggested.
Table 3. Please explain the meaning of [ ] in topics (e.g. Daniel Kahneman)
Reply: The meaning of [] is that the information is not available. I modified the caption.
Table 3. occurrences should not be a decimal number, but integer
Reply: Modified as suggested.
Referencing – it is not common to start a sentence with a reference in brackets [] (e.g. page 2, [9], [1]). For the sake of readability, use instead: In research []….
Reply: Modified as suggested.
Please check for typos, e.g. page 9 paragraph 2 – ‘T’ at the end of paragraph, page 10 – ‘To cluster this list of lists of topics…’
Please check grammar – e.g. page 8 - Since IT is not simple to assess
Reply: Thanks a lot for your suggestions. I reread the paper to look for possible typos.