Systematic Comparison of Vectorization Methods in Classification Context
Round 1
Reviewer 1 Report
This study is a comparison of text vectorization methods for natural language processing. Words and paragraphs are examples of things that can be transformed into tokens for forming vectors of numbers. This process allows for a predictive model of language use, such as for creating a powerful chat bot.
The writing in the manuscript is excellent. There were some minor errors in words and sentences, but overall the writing style is very good, along with a very good presentation of the relevant concepts. I enjoyed reading this manuscript. I recommend it for publication after the following recommendations are considered.
Recommendations for improvements to paper:
Section 1 of the Introduction could use citations to references, such as the case for supporting the statement in line 27.
In Section 2, it is not fully clear how the CBOW and skip gram models differ. Also, the Skip-gram model is referred to as just skip gram in line 104. Should it instead be Skip-gram?
In line 114, I think that "tf" should be "TF".
In lines 117-123, I think the "Doc2Vec" method could use further clarification, perhaps just a sentence or two on how it encodes a paragraph.
In line 129, *.txt could be in parentheses, like so:
text file (.txt)
Line 161: the sentence is somewhat awkward because of the way the comma is used. The sentence in line 162 is also awkward.
Line 155: "analysand" should be "analysed"
Line 199: the phrase "punctuation full" is awkward. Perhaps it should be "punctuation unmodified"? And, perhaps "dataset" should be "dataset size"?
Line 220: perhaps add an example of a "stop word".
Line 230, is "gensim library" a Python library?
Line 247, 248: SkipGram is now used instead of Skip-gram or skip gram. The use of the name should be the same across the manuscript.
Section 4 (starting at line 290): "tf" is lower case. Same for "tf-idf". Please verify that the lower case is best and that the use of case is the same throughout the paper.
Line 312: The letter "i" in Naive can be converted to an English letter "i". It currently has a phonetic symbol above the letter.
Figure 4: the legend's text is too small. Also, the figure's axes labels are also too small. The text in this case is also somewhat blurry.
Figures 4 and 5: the x-axis label should be "Number of epochs".
Line 319: there could be a space after the semicolon for "[0.92;0.95]" and "[0.94;0.96]".
The abbreviation NBC should be defined early in the text.
Line 350: should "era" be "epoch"?
Line 366: what does "works" refer to? Is it "methods"?
Tables 5 and 6 are using commas instead of periods to denote a fractional value. Does the journal use a UK style of periods, such as 0.5000?
Line 389: I think it is fine to just show the heading as "Discussion".
Is it possible to have two or more sentences as guidance for the reader, so that the reader has advice on which methods are best for them to use, or possibly the simplest in a particular case?
Author Response
Responses to Reviewers’ comments – Round 1
The authors would like to thank the reviewers for their valuable comments that helped to improve the quality of their manuscript.
Reviewer #1: This study is a comparison of text vectorization methods for natural language processing. Words and paragraphs are examples of things that can be transformed into tokens for forming vectors of numbers. This process allows for a predictive model of language use, such as for creating a powerful chat bot. The writing in the manuscript is excellent. There were some minor errors in words and sentences, but overall the writing style is very good, along with a very good presentation of the relevant concepts. I enjoyed reading this manuscript. I recommend it for publication after the following recommendations are considered.
- Section 1 of the Introduction could use citations to references, such as the case for supporting the statement in line 27.
References added.
- In Section 2, it is not fully clear how the CBOW and skip gram models differ. Also, the Skip-gram model is referred to as just skip gram in line 104. Should it instead be Skip-gram?
One paragraph was added on page 3.
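For illustration, a minimal sketch of the practical difference between the two architectures (a toy example written with the Python gensim library mentioned elsewhere in this review; it is not the authors' code and the corpus is invented):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (not the dataset used in the paper).
sentences = [
    ["text", "vectorization", "maps", "words", "to", "vectors"],
    ["cbow", "predicts", "a", "word", "from", "its", "context"],
    ["skip", "gram", "predicts", "the", "context", "from", "a", "word"],
]

# CBOW (sg=0): the network predicts the centre word from its surrounding words.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-Gram (sg=1): the network predicts the surrounding words from the centre word.
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["vectorization"][:5])       # first values of a 50-dimensional embedding
print(skipgram_model.wv["vectorization"][:5])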
- In line 114, I think that "tf" should be "TF". --> improved
- In lines 117-123, I think the "Doc2Vec" method could use further clarification, perhaps just a sentence or two on how it encodes a paragraph.
One paragraph was added on page 3, at the end of section 2.
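As a complement to that paragraph, a minimal sketch of how Doc2Vec assigns a vector to a whole paragraph (again a toy example with gensim, not the authors' code; the documents and tags are invented):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each paragraph gets a tag; Doc2Vec learns one vector per tag alongside the word vectors.
docs = [
    TaggedDocument(words=["stock", "markets", "rose", "sharply"], tags=["doc0"]),
    TaggedDocument(words=["the", "team", "won", "the", "final"], tags=["doc1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=20)

# Paragraph vector learned for the first document.
print(model.dv["doc0"][:5])

# Vector inferred for an unseen paragraph.
print(model.infer_vector(["markets", "fell", "after", "the", "match"])[:5])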
- In line 129, *.txt could be in parentheses, like so: text file (.txt) --> improved
- Line 161: the sentence is somewhat awkward because of the way the comma is used. The sentence in line 162 is also awkward. --> improved
- Line 155: "analysand" should be "analysed" --> improved
- Line 199: the phrase "punctuation full" is awkward. Perhaps it should be "punctuation unmodified"? And, perhaps "dataset" should be "dataset size"? --> improved
- Line 220: perhaps add an example of a "stop word".
An example of stop word removal was added on page 7.
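For illustration, a minimal sketch of stop word removal (the sentence and the small stop word list below are invented examples, not the list used in the paper):

# Tiny hand-picked stop word list, for illustration only.
stop_words = {"the", "a", "an", "is", "of", "in", "to", "and"}

sentence = "The impact of vectorization is measured in a classification task"
tokens = sentence.lower().split()
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # ['impact', 'vectorization', 'measured', 'classification', 'task']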
- Line 230, is "gensim library" a Python library? --> improved
- Line 247, 248: SkipGram is now used instead of Skip-gram or skip gram. The use of the name should be the same across the manuscript.
Changed to Skip-Gram throughout the manuscript.
- Section 4 (starting at line 290): "tf" is lower case. Same for "tf-idf". Please verify that the lower case is best and that the use of case is the same throughout the paper.
"tf" and "tf-idf" can be written in uppercase as well as in lowercase. We use the form of lowercase letters.
- Line 312: The letter "i" in Naive can be converted to an English letter "i". It currently has a phonetic symbol above the letter. --> improved
- Figure 4: the legend's text is too small. Also, the figure's axes labels are also too small. The text in this case is also somewhat blurry. --> improved
- Figures 4 and 5: the x-axis label should be "Number of epochs". --> improved
- Line 319: there could be a space after the semicolon for "[0.92;0.95]" and "[0.94;0.96]". --> improved
- The abbreviation NBC should be defined early in the text. --> improved
- Line 350: should "era" be "epoch"? --> improved
- Line 366: what does "works" refer to? Is it "methods"? --> improved
- Tables 5 and 6 are using commas instead of periods to denote a fractional value. Does the journal use a UK style of periods, such as 0.5000?
Changed to UK style (periods as decimal separators).
- Line 389: I think it is fine to just show the heading as "Discussion".
We prefer to leave the heading for section 6 as “Discussion and conclusions”.
- Is it possible to have two or more sentences as guidance for the reader, so that the reader has advice on which methods are best for them to use, or possibly the simplest in a particular case?
Added in section 6.
All paragraphs added in the manuscript were marked in blue.
Reviewer 2 Report
The authors of the paper “Vector space modelling in natural language processing” aim at investigating how different approaches to text vectorization commonly used in Text Mining can lead to a different organization of structured data associated with a collection of documents written in natural language, and how these approaches can influence one of the tasks of Text Mining, focusing on text categorization. Before considering the publication of the study, some aspects have to be clarified and the quality of the manuscript has to be improved. In the following are some remarks that can help the authors in the revision of their work.
1) In the abstract, speaking about the alternative solutions usually carried out in Text Mining for text vectorization, the authors claim “The first one focuses on placing words in the context of the language, while the second one deals with full documents and the representation of single sentences, paragraphs, in the context of the documents under consideration”. In my opinion, this sentence may confuse readers, since the Vector Space Model (in accordance with the logic of Bag of Words) aims at projecting document-vectors in a space spanned by the terms belonging to the vocabulary of the document collection but, if necessary, also at projecting term-vectors in a space spanned by the documents of the collection themselves. Conversely, word embedding aims at building a feature space of the terms in accordance with a different theoretical grounding, allowing the projection of each single term in a space that takes into account the syntagmatic and paradigmatic relations of terms (depending on the choice of the CBOW or Skipgram approach), and hence leveraging this representation to build the document-vectors. The claim is repeated again in the introduction, without better stating what the authors refer to. Please try to better focus the abstract (not merely repeating the same sentences reported below), to give some hints to the readers about the topic more extensively developed in the manuscript.
2) Line 24, page 1 - “Automatic content analysis of written statements is a rapidly growing field of science”. Which branch of science? Content Analysis has been developed in the Social Sciences and is commonly used, for example, in Sociology, Psychology or other domains. A computer scientist may find this term not immediately understandable. Why not refer here more generally to automatic text processing and to the main tasks of Text Mining (e.g., text summarization, topic detection, text clustering and classification)? Moreover, it is currently possible to analyse texts written in English, Spanish, Italian, and even in Chinese and Arabic. The reference only to the English language is really limited.
3) The effectiveness of different approaches to text vectorization can be tested in several ways, not only by considering their impact on text classification. There are several studies, for example, aiming at evaluating the influence of data preparation and organization on feature selection and feature extraction. The choice of considering text vectorization in a classification context is a specific decision of the authors that has to be better motivated. Following these remarks (see also 1 and 2), I suggest re-organizing the introduction and making clearer the aim of the study and how it is carried out.
4) Line 68, page 2 - “Performing machine learning process on text data requires representing documents as vectors in a specified feature space”. Every automatic analysis of texts from a quantitative point of view requires the transformation of unstructured data into structured data. Again, the specific reference to machine learning processes may confuse readers not familiar with this domain. I do not know what the cultural background of the authors is, but they mix terms and approaches from very different spheres involved in Text Analysis. Machine learning may refer to the supervised approaches, but there are several unsupervised approaches that cannot be considered in this framework. Is, for example, a hierarchical text clustering approach a machine learning process?
5) The authors refer to two main approaches in text vectorization. The first approach is merely described as “mapping texts to vectors composed of textual representations of words or sentences”, and they claim “the first method is not often used because of the high computational complexity of comparing words in textual form”. First of all, this approach is known as the Vector Space Model (Salton et al., in several classic literature references) and is the most important algebraic model for representing texts written in natural language. Surprisingly, there are no references to this approach in the bibliography of the paper. It is true that other approaches have lately been proposed and that nowadays there is great interest in word embedding, but the idea that this model is not often used is not agreeable. As regards its complexity, there are tons of publications on the curse of dimensionality and on how to deal with this well-known drawback of text representation in Euclidean spaces. Concerning the second approach, defined by the authors as “mapping of texts to vectors of real number”, I have to ask: what is a vector? Are the numbers in a document-vector obtained in the Vector Space Model not real, or are those documents not vectors? It is probably better to refer to discrete and continuous values. As proof of this, TF-IDF is included in the second class of vectorization methods when instead it is possible to use this term weighting in the Vector Space Model too (by the way, TF-IDF has been formalized by Salton et al. too, not by Shahzad & Ramsha, Stephen or Havrlanta and Kreinovich).
6) The entire section “Natural language processing in artificial intelligence” (why this reference to AI?) should contain a formalisation of the different approaches or a more extensive reference to the literature. If I have to read another 10-15 publications to understand how the Vector Space Model works and how CBOW and Skipgram work (what about neural networks and shallow learning?), I would not find any interest in reading this very brief summary of techniques.
7) Where did you retrieve the data from? Kaggle, I suppose… Please cite the source of your data. How were the data pre-processed? This is an interesting point, because you have considered only the impact of text vectorization, when there are several studies concerning the impact of text pre-treatment on the analyses performed on textual datasets. Instead of using screenshots of the windows containing an example of one text and an example of five texts in a category, please consider using only one figure to show what kind of texts you are considering (it is quite unacceptable to use the screenshot of an MS Word file without even masking the red lines of the automatic correction). Similarly, consider pairing tables 1 and 2 into a single table and, instead of using both the average and the median length, use one centrality measure and consider also using a variability measure of the document length. Moreover, how many articles per category are there in your dataset? Have you used all the articles included in the original dataset?
8) Subsection 3.3 should be moved after 3.5, after splitting the latter into one subsection concerning vectorization and one subsection concerning the classification task.
9) The authors should better motivate why they have chosen the KNN and Naïve Bayes classifiers. Instead of reporting the scripts used in the analysis, it would be better to use pseudo-code or to clearly cite the packages/libraries used to perform the analysis.
10) In the results the Vector Space Model completely disappeared, and TF-IDF, listed as an approach to vectorization, became (more correctly, in my opinion) a term weighting to apply or not apply in a comparative study.
11) How has effectiveness been computed? What about precision, recall, F1-measure, accuracy, and other validation measures that can be used? What are epochs? In the discussion of the results, I do not see any valuable difference in using a given classifier with a given vectorization technique with respect to the others.
12) The comparison of your study with two other publications is really limited and poor. Please refer to a wider range of studies.
13) I do not see any reference to the limitations of the study, nor to future directions of your research, in the final section. Concerning the first point, there are a lot of limitations to list. Concerning the second point, what do you plan to do to go deeper into this research issue? Is this paper just an exercise, or are you exploring these topics more extensively?
14) I suggest changing the title of the manuscript to take into account that you are comparing different vectorization techniques and their impact on a text classification task.
Please, take note of this list of references that can help you in improving your study:
- Aggarwal, C. C., A. Hinneburg, and D. A. Keim. 2001. On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory, 420–434. Berlin: Springer
- Balbi, S. 2010. Beyond the curse of multidimensionality: high dimensional clustering in text mining. Italian Journal of Applied Statistics 22 (1):53–63
- Misuraca, G. Scepi, M. Spano (2021). “A comparative study on community detection and clustering algorithms for text categorisation”, Proceedings of 15th International Conference on Statistical Analysis of Textual Data (JADT20) [https://hal.archives-ouvertes.fr/hal-03216888]
- Misuraca, M. Spano (2020). “Unsupervised analytic strategies to explore large document collections”, in D.F. Iezzi, D. Mayaffre, M. Misuraca (eds.), Text Analytics. Advances and Challenges, 17-28. Heidelberg: Springer
- Salton, G., and M. J. McGill. 1986. Introduction to modern information retrieval. New York, NY: McGraw-Hill
- Salton, G., A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Communications of the ACM 18 (11):613–20
Author Response
Manuscript Number: applsci-1709211
Vector space modelling in natural language processing --> Systematic comparison of vectorization methods in classification context
Urszula Krzeszewska, Aneta Poniszewska-Marańda, Joanna Ochelska-Mierzejewska
Institute of Information Technology, Lodz University of Technology, Łódź, Poland
Responses to Reviewers’ comments – Round 1
The authors would like to thank the reviewers for their valuable comments that helped to improve the quality of their manuscript.
Reviewer #2: The authors of the paper “Vector space modelling in natural language processing” aim at investigating how different approaches to text vectorization commonly used in Text Mining can lead to a different organization of structured data associated with a collection of documents written in natural language, and how these approaches can influence one of the tasks of Text Mining, focusing on text categorization. Before considering the publication of the study, some aspects have to be clarified and the quality of the manuscript has to be improved. In the following are some remarks that can help the authors in the revision of their work.
- In the abstract, speaking about the alternative solutions usually carried out in Text Mining for text vectorization, the authors claim “The first one focuses on placing words in the context of the language, while the second one deals with full documents and the representation of single sentences, paragraphs, in the context of the documents under consideration”. In my opinion, this sentence may confuse readers, since the Vector Space Model (in accordance with the logic of Bag of Words) aims at projecting document-vectors in a space spanned by the terms belonging to the vocabulary of the document collection but, if necessary, also at projecting term-vectors in a space spanned by the documents of the collection themselves. Conversely, word embedding aims at building a feature space of the terms in accordance with a different theoretical grounding, allowing the projection of each single term in a space that takes into account the syntagmatic and paradigmatic relations of terms (depending on the choice of the CBOW or Skipgram approach), and hence leveraging this representation to build the document-vectors. The claim is repeated again in the introduction, without better stating what the authors refer to. Please try to better focus the abstract (not merely repeating the same sentences reported below), to give some hints to the readers about the topic more extensively developed in the manuscript.
The sentences in the abstract were improved.
- Line 24, page 1 - “Automatic content analysis of written statements is a rapidly growing field of science”. Which branch of science? Content Analysis has been developed in the Social Sciences and is commonly used, for example, in Sociology, Psychology or other domains. A computer scientist may find this term not immediately understandable. Why not refer here more generally to automatic text processing and to the main tasks of Text Mining (e.g., text summarization, topic detection, text clustering and classification)? Moreover, it is currently possible to analyse texts written in English, Spanish, Italian, and even in Chinese and Arabic. The reference only to the English language is really limited.
The sentences were improved.
- The effectiveness of different approaches to text vectorization can be tested in several ways, not only by considering their impact on text classification. There are several studies, for example, aiming at evaluating the influence of data preparation and organization on feature selection and feature extraction. The choice of considering text vectorization in a classification context is a specific decision of the authors that has to be better motivated. Following these remarks (see also 1 and 2), I suggest re-organizing the introduction and making clearer the aim of the study and how it is carried out.
The sentences in Introduction were improved.
- Line 68, page 2 - “Performing machine learning process on text data requires representing documents as vectors in a specified feature space”. Every automatic analysis of texts from a quantitative point of view requires the transformation of unstructured data into structured data. Again, the specific reference to machine learning processes may confuse readers not familiar with this domain. I do not know what the cultural background of the authors is, but they mix terms and approaches from very different spheres involved in Text Analysis. Machine learning may refer to the supervised approaches, but there are several unsupervised approaches that cannot be considered in this framework. Is, for example, a hierarchical text clustering approach a machine learning process?
The sentences in Section 2 were improved.
- The authors refer to two main approaches in text vectorization. The first approach is merely described as “mapping texts to vectors composed of textual representations of words or sentences”, and they claim “the first method is not often used because of the high computational complexity of comparing words in textual form”. First of all, this approach is known as the Vector Space Model (Salton et al., in several classic literature references) and is the most important algebraic model for representing texts written in natural language. Surprisingly, there are no references to this approach in the bibliography of the paper. It is true that other approaches have lately been proposed and that nowadays there is great interest in word embedding, but the idea that this model is not often used is not agreeable. As regards its complexity, there are tons of publications on the curse of dimensionality and on how to deal with this well-known drawback of text representation in Euclidean spaces. Concerning the second approach, defined by the authors as “mapping of texts to vectors of real number”, I have to ask: what is a vector? Are the numbers in a document-vector obtained in the Vector Space Model not real, or are those documents not vectors? It is probably better to refer to discrete and continuous values. As proof of this, TF-IDF is included in the second class of vectorization methods when instead it is possible to use this term weighting in the Vector Space Model too (by the way, TF-IDF has been formalized by Salton et al. too, not by Shahzad & Ramsha, Stephen or Havrlanta and Kreinovich).
The sentences in Section 2 were improved. Salton et al. reference was added.
- The entire section “Natural language processing in artificial intelligence” (why this reference to AI?) should contain a formalisation of the different approaches or a more extensive reference to the literature. If I have to read another 10-15 publications to understand how the Vector Space Model works and how CBOW and Skipgram work (what about neural networks and shallow learning?), I would not find any interest in reading this very brief summary of techniques.
The VSM method is not used in the study presented in this paper, while the other methods have been described to give at least a brief picture of what they are, how they work, and how they differ.
- Where did you retrieve the data from? Kaggle, I suppose… Please cite the source of your data. How were the data pre-processed? This is an interesting point, because you have considered only the impact of text vectorization, when there are several studies concerning the impact of text pre-treatment on the analyses performed on textual datasets. Instead of using screenshots of the windows containing an example of one text and an example of five texts in a category, please consider using only one figure to show what kind of texts you are considering (it is quite unacceptable to use the screenshot of an MS Word file without even masking the red lines of the automatic correction). Similarly, consider pairing tables 1 and 2 into a single table and, instead of using both the average and the median length, use one centrality measure and consider also using a variability measure of the document length. Moreover, how many articles per category are there in your dataset? Have you used all the articles included in the original dataset?
The dataset itself comes from http://mlg.ucd.ie/datasets/bbc.html, is free to use for research, and is cited, as its authors suggest, with the publication: D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. The text pre-processing process is described later on, in the Text preparation subsection (subsection 3.4).
"How many articles per category are there in your dataset?" --> Table 2 contains all the needed information for every category; each category contains around 2000 texts.
"Have you used all the articles included in the original dataset?" --> All the articles were used.
Tables 1 and 2 were merged.
- Subsection 3.3 should be moved after 3.5, after splitting the latter into one subsection concerning vectorization and one subsection concerning the classification task.
The proposed re-organization of the subsections in section 3 was made.
- The authors should better motivate why they have chosen the KNN and Naïve Bayes classifiers. Instead of reporting the scripts used in the analysis, it would be better to use pseudo-code or to clearly cite the packages/libraries used to perform the analysis.
The motivation was added in subsection 3.6.
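To make the classification step concrete, a minimal sketch with the two classifiers in scikit-learn (the document vectors and labels below are random placeholders, not the authors' data, and the library choice is an assumption):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder dense document vectors (e.g. averaged word embeddings) and category labels.
X = np.random.rand(200, 50)
y = np.random.randint(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
nbc = GaussianNB().fit(X_train, y_train)

print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
print("NBC accuracy:", accuracy_score(y_test, nbc.predict(X_test)))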
- In the results the Vector Space Model completely disappeared, and TF-IDF, listed as an approach to vectorization, became (more correctly, in my opinion) a term weighting to apply or not apply in a comparative study.
The Vector Space Model was not taken into account within this study; only TF-IDF was mentioned, as an additional parameter for the CBOW and Skip-Gram methods. The idea behind it is to see if TF-IDF weighting has an impact on the final classification task. The original title of the paper probably caused a misunderstanding about VSM not being used directly in the comparison.
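To illustrate what "TF-IDF as an additional parameter" can mean in practice, a minimal sketch of building a document vector as a TF-IDF-weighted average of word embeddings (the word vectors below are invented toy values, not vectors from the authors' trained CBOW or Skip-Gram models, and the use of scikit-learn is an assumption):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy 5-dimensional word vectors standing in for embeddings from a trained model.
word_vectors = {
    "market": np.array([0.1, 0.3, 0.2, 0.0, 0.5]),
    "shares": np.array([0.2, 0.1, 0.4, 0.1, 0.3]),
    "fell":   np.array([0.0, 0.2, 0.1, 0.3, 0.1]),
}

docs = ["market shares fell", "market market shares"]

# TF-IDF weight of each word in each document.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()

def doc_vector(doc_index):
    # Weighted average of the word vectors, using the TF-IDF weights of one document.
    row = weights[doc_index].toarray().ravel()
    vecs = [row[i] * word_vectors[w] for i, w in enumerate(vocab) if w in word_vectors]
    return np.sum(vecs, axis=0) / max(row.sum(), 1e-12)

print(doc_vector(0))
print(doc_vector(1))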
- How has effectiveness been computed? What about precision, recall, F1-measure, accuracy, and other validation measures that can be used? What are epochs? In the discussion of the results, I do not see any valuable difference in using a given classifier with a given vectorization technique with respect to the others.
The effectiveness of the methods is measured as classification accuracy.
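For completeness, accuracy here is the share of test documents assigned to the correct category, for example (hypothetical labels, not the paper's results):

from sklearn.metrics import accuracy_score

# Hypothetical true and predicted category labels for six test documents.
y_true = ["sport", "tech", "sport", "business", "tech", "politics"]
y_pred = ["sport", "tech", "business", "business", "tech", "tech"]

# Accuracy = correctly classified documents / all documents.
print(accuracy_score(y_true, y_pred))  # 4 out of 6 correct -> 0.666...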
- The comparison of your study with two other publications is really limited and poor. Please refer to a wider range of studies.
More comparisons were added in section 5, with more references.
Moreover, many studies take into account completely different data sets, with texts of different lengths and different vocabularies, so an additional data set would not add much to our study.
- I do not see any reference to the limitations of the study, nor to future directions of your research, in the final section. Concerning the first point, there are a lot of limitations to list. Concerning the second point, what do you plan to do to go deeper into this research issue? Is this paper just an exercise, or are you exploring these topics more extensively?
New paragraphs were added in section 6.
- I suggest changing the title of the manuscript to take into account that you are comparing different vectorization techniques and their impact on a text classification task.
The title of the manuscript was changed.
All paragraphs added in the manuscript were marked in blue.