Article
Peer-Review Record

Combining Transformer Embeddings with Linguistic Features for Complex Word Identification

Electronics 2023, 12(1), 120; https://doi.org/10.3390/electronics12010120
by Jenny A. Ortiz-Zambrano 1,*,†, César Espin-Riofrio 1,† and Arturo Montejo-Ráez 2,*,†
Reviewer 1:
Reviewer 2:
Submission received: 7 November 2022 / Revised: 9 December 2022 / Accepted: 22 December 2022 / Published: 27 December 2022

Round 1

Reviewer 1 Report

Overview.

The paper proposes to 

1. identify complex words

2. utilize NLP and transformer-based models to improve the prediction of lexical complexity

 

Overall, the premise seems promising, but there appear to be gaps in the explanation of the method and the presentation of the results. The authors need to do a thorough revision to provide more context. The following comments may be useful.

 

Pros.

1. The premise is interesting.

2. The related works section is well-written. 

 

Cons. 

1. The definition of "complex words" is not exactly obvious. The authors should elaborate on this point - use examples where necessary.

 

2. The method section seems inadequate. I would like the authors to provide more specifics about their approach. 

 

3. It seems that the task at hand is to take a dataset with words labeled with complexity level and predict that. Is that right? These things need to be clarified at the method level as well as the results level.

 

4. There needs to be more information on the transformer models and their applicability in the context of this work.

Author Response

Dear reviewer. We really appreciate the recommendations, suggestions, and comments provided after reviewing our manuscript. We have addressed the identified issues as follows:

1. The definition of "complex words" is not exactly obvious. The authors should elaborate on this point - use examples where necessary.

Thank you very much for pointing this out, as it improves the introduction of the problem. We have added the following text to the introduction: A complex word is considered to be one that is difficult to understand for a reader with an average level of literacy. In a more general view [3], lexical complexity prediction (LCP) assigns a continuous complexity value to each word, turning the task into a regression problem instead of a binary classification task.

Also, some examples from the CompLex corpus have been added to help readers better understand the problem to be solved.

2. The method section seems inadequate. I would like the authors to provide more specifics about their approach.

Thank you very much for this remark; we agree with the observation. A new diagram has been included to clarify how the different elements of the system are combined. The diagram is now detailed in the text, and references to the features and classifiers point the reader to the sections where they are described.

3. It seems that the task at hand is to take a dataset with words labeled with complexity level and predict that. Is that right? These things need to be clarified at the method level as well as the results level.

We believe that the inclusion of sample entries from the CompLex dataset, as suggested and addressed in the previous point, will help readers resolve these doubts.

4. There needs to be more information on the transformer models and their applicability in the context of this work.

We expect that the diagram and the explanation added to the manuscript help readers understand how transformer models are applied as text encoders.
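
For illustration only, the following minimal sketch shows this kind of text encoding, assuming the Hugging Face transformers library; the model name and the mean-pooling strategy are assumptions for the example, not necessarily the exact setup used in the paper.

```python
# Minimal sketch of using a pre-trained transformer as a text encoder.
# The model name and mean pooling are illustrative assumptions only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def encode(sentence: str) -> torch.Tensor:
    """Return a fixed-size sentence embedding (mean over token vectors)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state has shape (1, seq_len, hidden_size)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vector = encode("The cell nucleus contains the genetic material.")
print(vector.shape)  # torch.Size([768]) for bert-base-uncased
```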

Reviewer 2 Report

The task of Lexical Complexity Analysis has recently evolved toward approaches that combine classic features (e.g., linguistic, syntactic) with transformer-based approaches that use word/sentence embeddings. In this study, the authors explore lexical complexity prediction (LCP) by combining pre-trained and fine-tuned transformer networks with different types of traditional linguistic features. They feed these features to classical machine learning classifiers. Their best results were obtained by applying Support Vector Machines on an English corpus in an LCP task solved as a regression problem. According to the authors, the results show that linguistic features can be useful in LCP tasks and may improve the performance of deep learning systems.

The article is well organized, well written, and relatively easy to read.

 The methodology seems adequate, since, as the authors rightly say, the transformer architectures are here to stay, not only for this task but for many tasks associated with NLP. However, the article has several weaknesses from the methodological point of view and from the presentation/analysis of results. I will now describe some aspects to review in the article.

 In the Related Work section, I would like the authors to have summarized in a clearer way (possibly in a table) the comparison of results between their work and the referenced works. It would also be important for them to identify the proportion of variance explained by the different models, through the measure R2.

 The description of the dataset to be used is very simplified. They should add some relevant information about the annotation process in [11] and about the 5 Likert scale classes used. This last issue only became clear to me when I went to read the article [11].

One of the aspects that, in my opinion, is not well justified by the authors, and that may jeopardize the results achieved, is the criterion for choosing the classic features. Why only 23 features? In this type of task, in addition to the bag-of-words, hundreds of features are known, and this set could then be reduced through feature selection processes. As an example, frameworks like Linguistic Inquiry and Word Count (LIWC) alone extract about 100 features. It would have been important for the authors to present a baseline (representing the state of the art) regarding the features. My question is: did the authors experiment with more features and then select these ones? In that case, the article has important flaws in terms of the information presented. Or did they simply decide to choose this set of features? The article should include a detailed analysis of this process.

In line 180, the authors indicate the machine learning algorithms used but then fail to indicate the best results obtained with each one. A comparative table would have been important. And beyond indicating that they used X algorithms, it would be essential for them to present the details of the experiments carried out; this information in the text is very vague or non-existent. Have the created models been fine-tuned? What are the chosen parameters? Was feature selection performed?

The authors say they did a lot of experimentation (189-190) and used a variety of training strategies (192-193). What were these? This information, even for the purposes of possible reproduction of their work, must be included in the document.

Another aspect is related to the comparison of results across the various models created. Is the difference between the best and the second best statistically significant (p)? And the same question applies to all models. This information should be included in the article.

In conclusion, since some fundamental issues listed above have not been covered by the authors, I propose a major revision in which those issues should be addressed in the final version.

Author Response

Dear reviewer. We really appreciate the recommendations, suggestions, and comments provided after reviewing our manuscript. We have addressed the identified issues as follows:

In the Related Work section, I would like the authors to have summarized in a clearer way (possibly in a table) the comparison of results between their work and the referenced works.

We understand and share the reviewer's concern, as a clear comparison in terms of performance is desirable to better understand the state of the art from an empirical point of view. Unfortunately, this is problematic for the following reason: not all experiments are comparable, as they work on different datasets (or even with different versions of the data or variations in the definition of the LCP task itself). That is why we added a comparison table of our results against those of the SemEval competition. To aid the comparison, we have added our results at the end of that table.

It would also be important for them to identify the proportion of variance explained by the different models, through the measure R2.

Thank you very much for pointing out that issue. We have added a new paragraph in the conclusions discussing the R2 value obtained.
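
For readers unfamiliar with the measure, here is a minimal sketch of how R2 (the proportion of variance explained) can be computed from a regression model's predictions with Scikit-learn; the values are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch: R2 as the proportion of variance explained.
# The gold scores and predictions below are hypothetical placeholders.
from sklearn.metrics import r2_score

y_true = [0.25, 0.40, 0.10, 0.55, 0.30]  # gold complexity scores
y_pred = [0.28, 0.35, 0.15, 0.50, 0.33]  # model predictions
print(r2_score(y_true, y_pred))          # R2 = 1 - SS_res / SS_tot
```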

The description of the dataset to be used is very simplified. They should add some relevant information about the annotation process in [11] and about the 5 Likert scale classes used. This last issue only became clear to me when I went to read the article [11].

The reviewer is right to request such clarification so that readers do not need to consult the source article. We have indicated that the annotation process was carried out manually by native English speakers and that each entry in the dataset received 20 annotations. We believe that the Likert scale is clear enough to understand how the final complexity score was calculated.
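
As an illustration only (the authoritative description is in [11]), the sketch below shows how a final complexity score could be derived from multiple Likert annotations, assuming the five Likert points are mapped linearly onto [0, 1] and then averaged.

```python
# Illustrative sketch only (see [11] for the actual procedure): deriving a
# complexity score from Likert annotations, assuming the five points are
# mapped linearly onto [0, 1] and then averaged over all annotators.
LIKERT_TO_SCORE = {
    "very easy": 0.00,
    "easy": 0.25,
    "neutral": 0.50,
    "difficult": 0.75,
    "very difficult": 1.00,
}

def complexity_score(annotations):
    """Average the mapped Likert judgements of all annotators."""
    return sum(LIKERT_TO_SCORE[a] for a in annotations) / len(annotations)

# e.g. 20 annotations collected for one target word in context
sample = ["easy"] * 12 + ["neutral"] * 6 + ["difficult"] * 2
print(complexity_score(sample))  # 0.375
```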

One of the aspects that, in my opinion, is not well justified by the authors, and that may jeopardize the results achieved, is the criterion for choosing the classic features. Why only 23 features? In this type of task, in addition to the bag-of-words, hundreds of features are known, and this set could then be reduced through feature selection processes. As an example, frameworks like Linguistic Inquiry and Word Count (LIWC) alone extract about 100 features. It would have been important for the authors to present a baseline (representing the state of the art) regarding the features. My question is: did the authors experiment with more features and then select these ones? In that case, the article has important flaws in terms of the information presented. Or did they simply decide to choose this set of features? The article should include a detailed analysis of this process.

That is a good point. We have not explored other features, so we have added such a possibility as future work.

In line 180, the authors indicate the machine learning algorithms used but then fail to indicate the best results obtained with each one. A comparative table would have been important. And beyond indicating that they used X algorithms, it would be essential for them to present the details of the experiments carried out; this information in the text is very vague or non-existent. Have the created models been fine-tuned? What are the chosen parameters? Was feature selection performed?

We agree that we may not have been clear in the paper regarding the training process. The BERT and RoBERTa models were fine-tuned with final dense layers as the classifier network, using the default architecture of the sequence classification versions of these models as provided by the Hugging Face implementation. No hyperparameter exploration was done, so, again, the default parametrization of that implementation was applied. Regarding the classic machine learning methods, they were trained with the default parameters set by the Scikit-learn implementations. We have added these details to the paper.
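
To make these two training routes concrete, here is a condensed, hypothetical sketch; the model names, data, and the 23-feature matrix are placeholders, and all hyperparameters are left at library defaults, as stated above.

```python
# Condensed, hypothetical sketch of the two training routes described above.
# Model names, data and feature matrices are placeholders; all settings are
# library defaults, as stated in the response.
import numpy as np
from sklearn.svm import SVR
from transformers import AutoModelForSequenceClassification

# Route 1: fine-tune a transformer with its default sequence-classification
# head; num_labels=1 yields a single-output (regression) head.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1
)
# ...training would then run with the default optimizer/Trainer settings.

# Route 2: a classical regressor (default Scikit-learn parameters) over
# transformer embeddings concatenated with linguistic features.
rng = np.random.default_rng(0)
embeddings = rng.random((100, 768))   # placeholder sentence embeddings
linguistic = rng.random((100, 23))    # placeholder: 23 linguistic features
X = np.hstack([embeddings, linguistic])
y = rng.random(100)                   # placeholder complexity scores
svr = SVR().fit(X, y)                 # default parameters, no tuning
print(svr.predict(X[:3]))
```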

The authors say they did a lot of experimentation (189-190) and used a variety of training strategies (192-193). What were these? This information, even for the purposes of possible reproduction of their work, must be included in the document.

This refers to the different configurations of our experiments: with pre-trained or fine-tuned transformer models, with or without linguistic features, and so on. We have clarified this point in the text.

Another aspect is related to the comparison of results across the various models created. Is the difference between the best and the second best statistically significant (p)? And the same question applies to all models. This information should be included in the article.

Unfortunately, we have not performed that analysis, so we will consider this interesting comment for future work. Nevertheless, we believe that the absence of this analysis does not invalidate our findings.

In conclusion, since some fundamental issues listed above have not been covered by the authors, I propose a major revision in which those issues should be addressed in the final version.

Sincerely, thank you very much for your exhaustive review. It helped us polish our research approach and methods. We hope that the new version will be considered for publication.



Round 2

Reviewer 1 Report

My queries have been addressed.

Author Response

Dear reviewer,

We really appreciate your comments and are glad that the changes to the paper fulfil your recommendations.

With our best regards.

Reviewer 2 Report

Thanks to the authors for responding so promptly to the revisions made.

 

In my opinion, I believe that they answered satisfactorily the question of the R2 evaluation metric of the models and the question related to the dataset annotation process.

 

The issue that worries me the most is related to the creation and evaluation of the various models.

The authors acknowledge that the low R2 value indicates that improvements can be made. I suppose that issues not covered in the experiments, such as the few features used or the lack of fine-tuning of the created models, could have an influence on these results.

 

I think that the improvement in this model creation process can positively influence the research carried out by the authors.

 

Because I think that additional experiments have to be implemented, I propose not accepting the article in its current form.

Author Response

Dear reviewer,

We understand that the paper reveals the need for further experimentation, but we consider that, after the suggestions and changes introduced, it can be regarded as a contribution to the understanding of how deep encodings from pre-trained models can be applied with different classification algorithms. We have already started intensive experimentation with fine-tuning approaches.

Of course, more experiments can be made; that is the essence of research, where results point to further approaches to explore. But, as said, we would appreciate it if the reviewer considered this paper for publication, as the experimentation carried out and the proposed approach to lexical simplification can help other researchers explore such a design in their own work. Pre-trained models and classical algorithms do not demand high computing resources and could be a solution in low-resource environments, such as mobile devices.

We have added this reflection in the paper following your concerns.

With our best regards.

Round 3

Reviewer 2 Report

In fact, the experiments carried out could have been more extensive. Nevertheless, the article has positive aspects from the methodological point of view that can be of help to other researchers, and for that reason, and for the reflection added to the document, I suggest the acceptance of the article.
