RoSummary: Control Tokens for Romanian News Summarization
Round 1
Reviewer 1 Report
In this paper, three results are presented: the creation of a dataset for summarization in Romanian, the training of a model that generates coherent texts, and the introduction of control tokens to easily manipulate the output.
It is an interesting paper on a language with little presence in other studies, but the experiments are rather simple and do not add much to existing work in other languages.
The authors should indicate why the Romanian language is considered low-resource. In terms of linguistic, grammatical, or economic resources?
The authors should explain why the results in Tables 2 and 3 are so similar. These results may indicate that a winning strategy was not obtained. The evaluation by the human experts seems quite consistent.
It would be interesting if the linguistic corpus were extended to more data sources. My opinion is that the authors are close to obtaining relevant results, but the proposed model is not significantly good.
Author Response
Thank you kindly for your review!
In this paper, three results are presented: the creation of a dataset for summarization in Romanian, the training of a model that generates coherent texts, and the introduction of control tokens to easily manipulate the output.
It is an interesting paper on a language with little presence in other studies, but the experiments are rather simple and do not add much to existing work in other languages.
Response: We have introduced an additional exploratory analysis, with detailed examples that highlight the benefits of our approach. The appendices document extensive experimentation. All new additions are marked in blue.
The authors should indicate why the Romanian language is considered low-resource. In terms of linguistic, grammatical, or economic resources?
Response: We have clarified in the introduction that we were referring to a low-resource language in terms of available datasets and models for NLP. This study provides valuable resources and models for follow-up experimentation.
The authors should explain why the results in Tables 2 and 3 are so similar. These results may indicate that a winning strategy was not obtained. The evaluation by the human experts seems quite consistent.
Response: You are right, a winning strategy cannot be accurately determined. It should be noted that the best results for each control token were obtained when beam search was used for decoding, coupled with the large or medium versions of the language model.
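For illustration, a minimal sketch of beam-search decoding with the Hugging Face transformers API; the checkpoint name and the control-token prompt layout are assumptions for this example, not the exact setup from the paper:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed checkpoint name for illustration; substitute the released RoGPT2 model.
MODEL_NAME = "readerbench/RoGPT2-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical prompt: article text, then control tokens, then the task token.
prompt = "<article text here> <NoSentences> 3 <Summary>"
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search decoding, the configuration that gave the best results per control token.
output = model.generate(
    **inputs,
    num_beams=4,
    max_new_tokens=128,
    early_stopping=True,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```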
It would be interesting if the linguistic corpus were extended to more data sources. My opinion is that the authors are close to obtaining relevant results, but the proposed model is not significantly good.
Response: We emphasized this in terms of future work. We have introduced two new experiments on a sample of 100 news items in which we varied the values for <NoSentences> and the combination <NoSentences> - <NoWords>.
Reviewer 2 Report
This paper describes a great contribution to Romanian language resources. The study has been performed thoroughly, with solid evaluations, both automatic and through human judges. All the developed resources are nicely available on Hugging Face and GitHub.
I enjoyed reading the paper and recommend that it be published as is. I was just curious why this was not submitted to LREC, as that seems the first forum for a language resource I would think of. But anyway, this journal is also a fine place.
Thanks for doing all this important work! I hope it will be used!
Author Response
This paper describes a great contribution to Romanian language resources. The study has been performed thoroughly, with solid evaluations, both automatic and through human judges. All the developed resources are nicely available on Hugging Face and GitHub.
Response: Thank you kindly for your review and comments.
I enjoyed reading the paper and recommend that it be published as is. I was just curious why this was not submitted to LREC, as that seems the first forum for a language resource I would think of. But anyway, this journal is also a fine place.
Response: Thank you very much for your appreciation. Unfortunately, the experiments were not ready in time for the LREC 2022 deadline, while the 2023 edition is too far away.
Thanks for doing all this important work! I hope it will be used!
Response: Thank you very much for your appreciation!
Reviewer 3 Report
The paper describes an experiment to use control tokens for abstractive summarization in Romanian. The model is based on the Romanian version of GPT-2. The paper also introduces a related, publicly available dataset. The approach is very interesting and it deserves to be published after fixing minor text issues listed below. In the abstract:
- "on top of which control tokens were considered to specify characteristics for the 5 generated text" - I do not understand. What are control tokens? A reader of the abstract can only guess that it's a special class of tokens used to control text generation. But the next sentence "counts of sentences and of words, token ratio, and n-gram overlap" indicates that you perform some measurements and analytics. Please explain what do you mean by control tokens. - you need to briefly describe BERTScore metrics, what does it capture? It's not so commonly used as well-known ROUGE or BLEU which do not require introduction in my opinion. - values you report for BERTScore / F1, ROUGE-L are four decimal places. Is this justified by statistical significance on such a small dataset? This remark concerns the whole paper, I would consider using 2 max 3 decimal places.
Lines 109-119: I do not understand why the same size limit was not set for all train/dev/test splits. You mention only that the test partition was limited to 715 tokens, but training and dev could be longer and split into 3 fragments. Please explain why; usually, train/dev/test splits should be sampled from the same distribution and be very comparable.
Line 120 (a very long one, by the way!): "Specific tokens were used to indicate the task and the characteristics of the generated text" - what are specific tokens? Is this the same as the control tokens mentioned before? Why a different name now?
Equation (1): you do not explain what the symbols here are; what is w, w_1..m? A token? Lines 121-122: are you referring to token ids? Most transformers operate on subwords, not tokens. Do you really mean tokens? It is not clear what the labels are; please explain and provide examples.
I do not fully understand the architecture in Figure 1: on the left, what do 12x, 24x, and 36x mean? 3 variants of the model? It's not clear. The original GPT-2 had 12 decoder layers, apparently.
Line 215: Romanian is one of the languages of the EU, it is used in international circulation :)
Line 220: "weights for each token in the generated phrase" - where do these weights come from? Are these really token or subword weights?
Lines 221-224: some more explanation of how this is computed would be useful; we only know it is "between embeddings", but what does that mean - embeddings of subwords? Is there any aggregation employed? How are embeddings of texts computed from embeddings of subwords - is average or max pooling employed? How is the comparison performed? Equations (5), (6), and (7) do not explain that, unfortunately.
I have a general remark on the use of control tokens, which is not clear from Sections 3.3 and 4: by introducing control tokens, is your goal only to add metadata usable for generating a better summary (this seems to be the case with your experiments)? You add metadata described by control tokens to existing summaries and check if this additional information is helpful in improving summaries - is this correct? But you do not really check if the model learns the control token semantics. I would generate a random sample of, say, NoSentences or RatioTokens values, feed it to the model, and check output quality. That would allow us to estimate in which value ranges the output is / is not reasonable and, interestingly, whether the model learns the semantics of these control tokens.
Author Response
Thank you kindly for your review and your kind comments!
The paper describes an experiment to use control tokens for abstractive summarization in Romanian. The model is based on the Romanian version of GPT-2. The paper also introduces a related, publicly available dataset. The approach is very interesting and it deserves to be published after fixing minor text issues listed below. In the abstract:
Response: Thank you again for all your suggestions. All changes are marked in blue in the manuscript.
- "on top of which control tokens were considered to specify characteristics for the 5 generated text" - I do not understand. What are control tokens? A reader of the abstract can only guess that it's a special class of tokens used to control text generation. But the next sentence "counts of sentences and of words, token ratio, and n-gram overlap" indicates that you perform some measurements and analytics. Please explain what do you mean by control tokens.
Response: We modified the abstract and specified that control tokens are tokens received by the model in the prompt that indicate the characteristics of the text generated by the model.
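For concreteness, a hypothetical prompt layout; the token names follow the paper's <NoSentences>/<NoWords> conventions, but the values and ordering here are illustrative:

```python
# Hypothetical prompt layout; the control-token values are illustrative.
article = "Guvernul a anunțat noi măsuri economice ..."  # source news text (truncated)
control = "<NoSentences> 2 <NoWords> 40"                 # requested output characteristics
prompt = f"{article} {control} <Summary>"                # the model continues after <Summary>
```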
- You need to briefly describe the BERTScore metric: what does it capture? It's not as commonly used as the well-known ROUGE or BLEU, which do not require an introduction in my opinion.
- The values you report for BERTScore / F1 and ROUGE-L have four decimal places. Is this justified by statistical significance on such a small dataset? This remark concerns the whole paper; I would consider using 2, at most 3, decimal places.
Response: We added explanations about BERTScore and rewrote that paragraph for clarity. We now report BERTScore, ROUGE, and the Spearman and Pearson correlations as percentages, while the rest of the reported values have at most 3 decimal places.
Lines 109-119: I do not understand why the same size limit was not set for all train/dev/test splits. You mention only that the test partition was limited to 715 tokens, but training and dev could be longer and split into 3 fragments. Please explain why; usually, train/dev/test splits should be sampled from the same distribution and be very comparable.
Response: We chose not to apply this technique to the test partition because we did not want to use an augmentation technique that changes the content of the original text.
Line 120 (a very long one, by the way!): "Specific tokens were used to indicate the task and the characteristics of the generated text" - what are specific tokens? Is this the same as the control tokens mentioned before? Why a different name now?
Equation (1): you do not explain what the symbols here are; what is w, w_1..m? A token? Lines 121-122: are you referring to token ids? Most transformers operate on subwords, not tokens. Do you really mean tokens? It is not clear what the labels are; please explain and provide examples.
Response: We changed the term to control tokens so that there is no confusion. There are control tokens for characteristics and for indicating the task that the model will perform. We also changed to subwords instead of token IDs.
I do not fully understand the architecture in Figure 1: on the left, what do 12x, 24x, and 36x mean? 3 variants of the model? It's not clear. The original GPT-2 had 12 decoder layers, apparently.
Response: We added explanations: these are 3 variants of the RoGPT2 model that differ in the number of decoder layers. The original GPT-2 model has four variants that differ in the number of decoder layers; these can be found in "Table 2. Architecture hyperparameters" in the original paper.
Line 215: Romanian is one of the languages of the EU, it is used in international circulation :)
Response: We rewrote that part for clarity - thank you for pointing it out! However, Romanian is not one of the most popular foreign languages studied in the European Union, nor a language that is used frequently worldwide :) https://ec.europa.eu/eurostat/web/products-eurostat-news/-/ddn-20220923-1
Line 220: "weights for each token in the generated phrase" - where do these weights come from? Are these really token or subword weights?
Response: We changed to subwords and explained that token IDs are numerical representations of the subwords obtained after tokenization is applied to the input text.
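As a quick illustration of the relationship between subwords and token IDs (the checkpoint name is assumed):

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoGPT2-large")

text = "Bucureștiul este capitala României."
ids = tokenizer.encode(text)                      # token IDs: one integer per subword
subwords = tokenizer.convert_ids_to_tokens(ids)   # the subwords those IDs represent
print(list(zip(subwords, ids)))                   # note: subwords, not whitespace-separated words
```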
Lines 221-224: some more explanation of how this is computed would be useful; we only know it is "between embeddings", but what does that mean - embeddings of subwords? Is there any aggregation employed? How are embeddings of texts computed from embeddings of subwords - is average or max pooling employed? How is the comparison performed? Equations (5), (6), and (7) do not explain that, unfortunately.
Response: We added details on how precision and recall are calculated.
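For readers of this exchange, a minimal sketch of the greedy-matching computation behind BERTScore precision and recall, assuming L2-normalized subword embeddings as input and omitting the optional idf weighting:

```python
import numpy as np

def bertscore(ref_emb: np.ndarray, cand_emb: np.ndarray):
    """Greedy-matching BERTScore from L2-normalized subword embeddings.

    ref_emb:  (n_ref, d)  subword embeddings of the reference summary
    cand_emb: (n_cand, d) subword embeddings of the generated summary
    """
    sim = ref_emb @ cand_emb.T                 # pairwise cosine similarities
    recall = sim.max(axis=1).mean()            # best match for each reference subword
    precision = sim.max(axis=0).mean()         # best match for each generated subword
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```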
I have a general remark on the use of control tokens, which is not clear from Sections 3.3 and 4: by introducing control tokens, is your goal only to add metadata usable for generating a better summary (this seems to be the case with your experiments)? You add metadata described by control tokens to existing summaries and check if this additional information is helpful in improving summaries - is this correct? But you do not really check if the model learns the control token semantics. I would generate a random sample of, say, NoSentences or RatioTokens values, feed it to the model, and check output quality. That would allow us to estimate in which value ranges the output is / is not reasonable and, interestingly, whether the model learns the semantics of these control tokens.
Response: This was a great suggestion and a valuable addition to our study! Thank you kindly! We introduced two new experiments in which we vary the values for <NoSentences> and the combination <NoSentences> - <NoWords> on a sample of 100 news items; the corresponding results and discussions are present in the new version of the paper.
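A sketch of how such a sweep could look; generate_summary() and sample are hypothetical stand-ins for a beam-search generation wrapper and the 100-item evaluation sample:

```python
import re

def sentence_count(text: str) -> int:
    """Rough sentence count used to check control-token adherence."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

# generate_summary() and sample are hypothetical: a wrapper around
# beam-search generation and the 100-item news sample, respectively.
for requested in range(1, 6):
    counts = [
        sentence_count(generate_summary(article, no_sentences=requested))
        for article in sample
    ]
    print(f"<NoSentences> {requested}: mean generated sentences = {sum(counts) / len(counts):.2f}")
```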
Round 2
Reviewer 1 Report
I thank the authors for the improvements to the paper. As a further improvement, it is suggested that URLs or web addresses be cited as references in the document, for example, http://www.google.es accessed on 11/30/2022 as [12]. The paper constitutes an interesting work for the Romanian language. The linguistic corpus may encourage further work.