Candidate Term Boundary Conflict Reduction Method for Chinese Geological Text Segmentation
Round 1
Reviewer 1 Report
Dear Authors,
Please find my remarks below.
L. 44. Citation needed.
L. 48. abovementioned->above mentioned
L. 198. "the theory of information entropy was learned by Shannon." The verb learn is not correct here, maybe established or created would fit better.
In general, citations should have a space character to the nearest word. E.g. information resource management[29]->information resource management [29]
Please modify all citations accordingly.
L. 213. "It is a good attempt..."-> It is a good opportunity...
L. 241. many word segmentation studies
Equations (4)-(8): These are basic, well-known metrics; it is not necessary to describe them. The F-beta score can be kept, but only if it is used in this study.
L. 425. If possible the datasets should have a link or citation added.
L. 449. Bert-based->BERT-based. It is an acronym.
Table 6. GPT-3 should be added to the results in the table even if it has the uncertainty you described. Perhaps an average of several runs could be shown in the table, marked with an asterisk indicating that these are averaged data. Alternatively, the accuracy should be measured on all three datasets and only the accuracy shown in the table.
L. 455: End of sentence : or . missing?
It is often emphasized how much faster this method works than others. It would be beneficial to compare the methods by runtime as well. It is enough if it is measured on one dataset.
In the citations, four Master's theses are cited. It would be beneficial to cite published papers connected to these studies if possible. Otherwise, a link should be added to these Master's theses, since such works are usually not freely accessible. At least, I did not manage to find them.
Author Response
Response to Reviewer 1 Comments
We sincerely appreciate your careful and conscientious review. Your suggestions are valuable and helpful for revising and improving our paper. Following your suggestions, we have made the following revisions to the manuscript:
Point 1: L. 44. Citation needed.
Response 1: We added the citation.
Point 2: L. 48. abovementioned->above mentioned
Response 2: We changed “abovementioned” to “above mentioned”.
Point 3: L. 198. "the theory of information entropy was learned by Shannon." The verb learn is not correct here, maybe established or created would fit better.
Response 3: We changed “learned” to “established”.
Point 4: The citations should have a space character to the nearest word. E.g. information resource management[29]->information resource management [29]. Please modify all citations accordingly.
Response 4: We modified all citations accordingly.
Point 5: L. 213. "It is a good attempt..."-> It is a good opportunity...
Response 5: We changed “attempt” to “opportunity”.
Point 6: L. 241. many word segmentation studies
Response 6: We changed “segmentations” to “segmentation”.
Point 7: Equations (4)-(8): These are basic, well-known metrics; it is not necessary to describe them. The F-beta score can be kept, but only if it is used in this study.
Response 7: We deleted Equations (4)-(8) and the parameter explanations (Line 381).
Point 8: L. 425. If possible the datasets should have a link or citation added.
Response 8: We provided links to access the datasets in Data Availability Statement.
Point 9: L. 449. Bert-based->BERT-based. It is an acronym.
Response 9: We changed “Bert-based” to “BERT-based”.
Point 10: Table 6. GPT-3 should be added to the results in the table even if it has the uncertainty you described. Perhaps an average of several runs could be shown in the table, marked with an asterisk indicating that these are averaged data. Alternatively, the accuracy should be measured on all three datasets and only the accuracy shown in the table.
Response 10: Thank you for your advice. Because GPT-3 limits both the input length and the number of requests in the word segmentation task, we extracted part of each text, ran 20 repeated experiments on it, and calculated the average word segmentation accuracy. The results are shown in Table 6 and explained below the table.
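The averaging over repeated runs described in Response 10 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the response does not define "accuracy", so a span-based word-level F1 is used here as a stand-in, and the function names `word_spans`, `span_f1` and `average_score` are hypothetical.

```python
from statistics import mean


def word_spans(words):
    """Turn a segmented sentence (list of words) into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def span_f1(gold, predicted):
    """Word-level F1: a predicted word counts as correct only if both of its boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def average_score(gold, runs):
    """Average the score over repeated, possibly differing, segmentations of the same text."""
    return mean(span_f1(gold, run) for run in runs)


# Assumed toy data: one gold segmentation and two model runs that disagree.
gold = ["花岗岩", "分布", "广泛"]
runs = [
    ["花岗岩", "分布", "广泛"],
    ["花岗", "岩", "分布", "广泛"],
]
print(average_score(gold, runs))
```

Reporting the mean over runs, as Reviewer 1 suggests, smooths out the run-to-run variation of a non-deterministic model; the spread across runs could be reported alongside it.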
Point 11: L. 455: End of sentence : or . missing?
Response 11: We added “.” at the end of the sentence.
Point 12: It is often emphasized how much faster this method works than others. It would be beneficial to compare the methods by runtime as well. It is enough if it is measured on one dataset.
Response 12: Thank you for your advice. Regrettably, we are not able to compare our method with others by runtime. Although some of the resources are open and free, it is hard to test them on the same data and obtain a realistic runtime. Two examples follow:
(1) WMSEG achieved state-of-the-art performance on all general datasets. We tested it using its open-source code and found that it places strict restrictions on the input data. Through our experiments, we found that the input file should ideally be no larger than 10 KB and a single line should ideally be no longer than 298 characters; otherwise the text is truncated. Thus, for the 241k-character Rencun, 346k-character Nanpanjiang and 576k-character Xiaoshan texts, we had to split them into over a thousand small files. These time-consuming efforts were unhelpful.
(2) GPT-3 is a good large-scale model, and we called its API for text segmentation. However, because there are limits on both the number of requests and the length of the text, our code had to sleep and wait between calls to avoid crashing. Under these conditions, it is difficult to measure its real running time.
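The kind of splitting described in example (1) can be sketched as below. This is an illustration under stated assumptions, not the preprocessing actually used: the limits of 298 characters per line and 10 KB per file are taken from the response, UTF-8 encoding is assumed, and `split_for_limited_input` is a hypothetical helper name.

```python
def split_for_limited_input(text, max_line_chars=298, max_file_bytes=10 * 1024):
    """Split text into chunks (lists of lines) that respect a per-line character
    limit and a per-file size limit measured in UTF-8 bytes."""
    # Step 1: hard-wrap every line at max_line_chars characters.
    # Empty lines produce no output, and a cut may fall inside a word.
    lines = []
    for raw in text.splitlines():
        for i in range(0, len(raw), max_line_chars):
            lines.append(raw[i:i + max_line_chars])

    # Step 2: greedily pack the wrapped lines into chunks below the byte limit.
    chunks, current, size = [], [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_bytes > max_file_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        chunks.append(current)
    return chunks


# Assumed toy input: two long lines of Chinese characters.
sample = "地" * 1000 + "\n" + "层" * 5000
chunks = split_for_limited_input(sample)
print(len(chunks), "chunks,", sum(len(c) for c in chunks), "lines")
```

Because the hard cut at 298 characters can split a term across two lines, a real pipeline would probably prefer to cut at sentence punctuation, which adds further preparation effort of the kind the response describes.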
Our statement that the method works faster is based mainly on the following aspects:
(1) In terms of preparation for text segmentation, we do not need to build a corpus or train a model. The statistics-based approach is relatively fast.
(2) Setting a threshold based on information entropy is mostly manual and requires extensive experimentation to generalise, whereas our direct local numerical comparison is immediate and fast.
(3) Our procedure does not require data preprocessing: it simply takes text of any length as input and outputs the word segmentation results. We did record the runtime of word segmentation, as shown in the table below. Without using any distributed algorithm to improve efficiency, these results are acceptable and fast for practitioners and researchers in unsupervised word segmentation.
Data | Length (characters) | Size of text | Runtime
Rencun | 241072 | 512 KB | 2.2 min
Nanpanjiang | 345911 | 937 KB | 7.7 min
Xiaoshan | 574679 | 1521 KB | 24 min
Point 13: In the citations, four Master's theses are cited. It would be beneficial to cite published papers connected to these studies if possible. Otherwise, a link should be added to these Master's theses, since such works are usually not freely accessible. At least, I did not manage to find them.
Response 13: We replaced the Master's theses with journal papers published while the authors were students, or provided a freely accessible link.
Author Response File: Author Response.docx
Reviewer 2 Report
The paper presents a high-degree-of-freedom-priority candidate term boundary conflict reduction method (HFCR) for solving the problem of manually setting thresholds on segmentation based on information entropy. According to the Authors, the conducted experiments demonstrate that the HFCR algorithm achieves an average F1-value of more than 95% for single geological text. The topic is interesting and the paper well corresponds with the journal's aim and scope.
The paper is quite well structured. The Authors presented the aim of their work in the Introduction section. The related work was also introduced.
However, there are shortcomings in this paper.
The text presented in Figure 1 should be fitted to the shape (see: boundary conflict exists).
The limitations of the presented approach should be added in the Conclusions section.
Overall, the paper looks good, but it requires improvements in some parts.
Author Response
Response to Reviewer 2 Comments
We sincerely appreciate your careful and conscientious review. Your suggestions are valuable and helpful for revising and improving our paper. Following your suggestions, we have made the following revisions to the manuscript:
Point 1: The text presented in Figure 1 should be fitted to the shape (see: boundary conflict exists).
Response 1: Thank you for your advice. We redrew Figure 1 to make sure the text and shape fit.
Point 2: The limitations of the presented approach should be added in the Conclusions section.
Response 2: Thank you for your advice. We added the limitations of our method in the Conclusions section: “For the limitations such as identical end-word concatenation and inaccurate segmentation of some sentences, the fundamental reason is that there are not sufficiently rich connections between candidate terms in one single text. This is an objective problem that is difficult to solve.”
Author Response File: Author Response.docx
Reviewer 3 Report
The article is well written and organized. I have one major comment: Authors write about geological text segmentation, but it is not well explained in the text. Actually the article would be in the same way accurate without mentioning that it concerns geological texts. Is it possible to improve the article to emphasise more this "geological" aspect?
Author Response
Response to Reviewer 3 Comments
We sincerely appreciate your careful and conscientious review. Your suggestions are valuable and helpful for revising and improving our paper. Following your suggestions, we have made the following revisions to the manuscript:
Point 1: I have one major comment: Authors write about geological text segmentation, but it is not well explained in the text. Actually the article would be in the same way accurate without mentioning that it concerns geological texts. Is it possible to improve the article to emphasise more this "geological" aspect?
Response 1: Thank you for your thoughtful advice. We would like to explain and illustrate this issue from the following aspects:
- As you observed, the ideas and implementation process of this paper are not inherently tied to geological texts. Professional terms and regularities are common in domain-specific texts. We want to establish an unsupervised word segmentation method for domain-specific texts that requires no corpora or labels. We work mainly in the area of geological information and are more familiar with these texts, so we can carry out research and testing more effectively, which helps us judge and verify the performance of the word segmentation algorithm;
- Once we have an effective method for geological texts, we can extend it to other fields such as electricity, medicine and law, and promote the convergence and integration of knowledge across fields. We think this is very interesting and meaningful work;
- The current development of AI is guided mostly by data. Our paper and segmentation algorithm are oriented towards geological texts. To better explain the ideas of this method and summarize the rules, more explanatory and illustrative descriptions related to geological texts should indeed be added. We added relevant content to the technical path in Chapter 3.
Thanks again for your advice!
Author Response File: Author Response.docx
Reviewer 4 Report
Overall
Both the title and the abstract summarize the contribution of the paper clearly and concisely. The introduction leads the reader through the research gap and establishes the rationale for undertaking to fill the research niche. The review of related works is critical and concludes by showing the gap that the authors address. Figures are used well throughout the paper to support the text. The authors have described a logically-organised piece of research, resulting in SOTA results. I have no objections to publication, but suggest that the authors tidy up the language. I provided a few suggestions for improvements in the language.
Language issues
Some language issues are listed below in order to help the authors improve the quality of the language used. The comments in this section do not affect the review decision.
1. Line 10
“to label corpus, models and algorithms” - - > to label corpora, models and algorithms
2. Line 31
“It’s very necessary” - - > It is absolutely necessary OR It is essential
3. Line 42
“large number of human work” - - > large amount of human work
Author Response
Response to Reviewer 4 Comments
We sincerely appreciate your careful and conscientious review. Your suggestions are valuable and helpful for revising and improving our paper. Following your suggestions, we have made the following revisions to the manuscript:
Point 1: Line 10 “to label corpus, models and algorithms” - - > to label corpora, models and algorithms.
Response 1: Thank you very much for your advice. We have changed “corpus” to the correct plural form “corpora”.
Point 2: Line 31 “It’s very necessary” - - > It is absolutely necessary OR It is essential
Response 2: Thank you very much for your advice. We have changed “very necessary” to “essential”.
Point 3: Line 42 “large number of human work” - - > large amount of human work
Response 3: Thank you very much for your advice. We have changed “number” to “amount”.
Thanks again for your advice!
Author Response File: Author Response.docx