Application of Natural Language Processing and Genetic Algorithm to Fine-Tune Hyperparameters of Classifiers for Economic Activities Analysis
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors address the problem of economic classification systems according to NACE codes.
The proposed approach investigates the reliability of such codes for economic analysis and decision-making and compares different classification approaches.
The authors investigate different methodologies and classifiers. However, given the number of investigations carried out, the paper as a whole is quite difficult to read. The presentation of the proposed approach would benefit from an overall description that better connects its parts, which are now spread across the different sections of Chapter 2. For instance, the bullet points listing model parameters might be reformulated to focus the reader's attention on the results achieved by the different models. In Section 3, three different justifications for the presence of a wrong code are presented. Although the names of the identified types are informative, the defined types are quite similar. To improve clarity, I suggest adding a brief explanation and including examples.
A final concern is that the analysis in Section 3.2 focuses on the Random Forest Classifier although, according to the considerations drawn in Section 2, the Multilayer Perceptron was selected as the best of the models analyzed.
Comments on the Quality of English Language
Some typos were found in the manuscript, which could benefit from grammar and syntactic revisions. Examples include "actiBities" at line 5 and a double comma at line 387.
Author Response
We appreciate your detailed review and the constructive feedback you've provided. We've implemented the suggested revisions and have included a response document detailing the changes made. Thank you for your valuable input.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Utilizing proofreading services or engaging with editorial language services is strongly recommended, as the manuscript contains typographical errors even in critical sections such as the abstract (e.g., 'actibities') and the conclusions (e.g., 'approaches„'). It is advisable to commence the abstract with a clear statement of what the study proposes, develops, or invents, rather than what it investigates. For instance, mentioning a specific innovation such as an enhanced classification algorithm like NACE-I or ENACE would clarify the contributions of the research. The abstract should succinctly convey the core propositions, improvements, and solutions introduced by the study, and could be expanded, if permissible by the publisher, to distinctly articulate the findings and the research questions addressed. This expansion would reduce the need for subsequent investigation by other researchers.
Attention should be given to the accessibility of visual content; authors must consider color-blind readers when choosing color-coding for graphical representations. Presenting hyperparameters and settings of the artificial neural network in a tabular format would enhance clarity and facilitate a quicker understanding for readers.
The methodology employed appears appropriate, and the approach is notably innovative. It would be beneficial to include additional graphical representations, such as those depicting training history, to provide a comprehensive view of the research process.
In the conclusions, rather than stating intentions such as 'we aim to increase', it would be more effective to assert the actual achievements of the research, e.g., 'the researchers increased the actual accuracy by 20%'. Consistent with best practices in international publishing, the use of first-person pronouns ('we', 'us') should be replaced with third-person constructs ('the researchers'), maintaining an objective and formal tone throughout the document.
Overall, while the study is innovative and presents significant scientific findings, the quality of the manuscript could be enhanced through careful attention to the structure, presentation, and language used in the paper.
Comments on the Quality of English Language
English requires minor editing.
Author Response
We appreciate your comprehensive review and the constructive feedback you provided. We've taken your suggestions into account and made the necessary revisions. Please find attached a response document detailing the changes we've implemented. Thank you again for your valuable input.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Paper Summary: The study aims to enhance the accuracy of classifying economic activities by matching Nomenclature of Economic Activities (NACE) codes using machine learning techniques combined with expert evaluations. The authors utilized a dataset with 20 million records involving economic activities, which include descriptions, prices, and NACE codes. They also employed Apache Spark for distributed data processing and vectorized text using TF-IDF methods. These data collection and pre-processing methods seem to be solid and convincing. A genetic algorithm was also used to optimize the parameters of various classifiers, including Naive Bayes, Decision Tree, Random Forest, and multilayer perceptron. Finally, different machine learning models were evaluated, with the multilayer perceptron showing the best performance, with an accuracy of 71% and an F1-score of 0.73.
Strength: The methods used for financial data analysis in this paper are introduced in sufficient detail. The paper has a clear structure and is easy to read. The experimental results are described in sufficient detail.
Weakness: There is no comparison between the results from the proposed model and those from other baseline models. The authors should find other models in recently published works for comparison.
Conclusion: In general, this is a borderline paper: there is no comparison between the results from this model and those from other published models. The authors should find other models from recently published papers (LLM, BERT-based models, deep LSTM models, etc.) as baselines to compare with the proposed model. As a result, I have to recommend reconsideration after major revision.
Comments on the Quality of English Language
Not applicable.
Author Response
Thank you for your thorough review and constructive feedback. We have addressed your suggestions and attached a response document outlining the revisions made accordingly.
Author Response File: Author Response.pdf