Peer-Review Record

A Green AI Methodology Based on Persistent Homology for Compressing BERT

Appl. Sci. 2025, 15(1), 390; https://doi.org/10.3390/app15010390
by Luis Balderas 1,2,3,4,*, Miguel Lastra 2,3,4,5 and José M. Benítez 1,2,3,4
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 22 November 2024 / Revised: 17 December 2024 / Accepted: 31 December 2024 / Published: 3 January 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

A reviewer's opinion on the paper "A Green AI methodology based on persistent homology for compressing BERT".

In this paper, the authors present a Green AI methodology based on persistent homology: the proposed persistent BERT compression and explainability (PBCE) methodology.

In the Introduction section, a comprehensive motivation and review of the current state of the art is provided. The authors explain their motivation clearly and summarize the contributions of the submitted paper. Subsequently, a brief introduction to the BERT model is provided.

Section 3 is the core of the paper. It provides the description and analysis of the proposed model. It would be interesting if the authors could be more concrete about Algorithm 1: how it was implemented and tested, how its consistency was verified, and whether a comparative case study of the results of applying the algorithm could be provided.

The results presented in Section 4 seem to be consistent and relevant. The conclusion section summarizes the achieved results. As an interesting addition, the authors could include information about future research directions and the application of the obtained results.

Minor issues:
- Section 2 should be named Previous Works or Related Works.

Major issues:
- listed in the text above.

I recommend accepting the paper after the text has been corrected in line with the provided suggestions.

Author Response

Comment 1: In the Introduction section, a comprehensive motivation and review of the current state of the art is provided. The authors explain their motivation clearly and summarize the contributions of the submitted paper. Subsequently, a brief introduction to the BERT model is provided. Section 3 is the core of the paper. It provides the description and analysis of the proposed model. It would be interesting if the authors could be more concrete about Algorithm 1: how it was implemented and tested, how its consistency was verified, and whether a comparative case study of the results of applying the algorithm could be provided.

Response 1:  Thank you for your interesting comments. Algorithm 1 presents the end-to-end pipeline of our proposed methodology. This includes selecting a representative corpus, performing the simplification process, and evaluating the simplified model on the GLUE benchmark. The implementation is entirely in Python, heavily relying on the Hugging Face Transformers library. To accommodate our specific requirements, we made necessary modifications to the library's underlying neural network structure. The efficacy of our approach is validated in Section 4, where we conduct extensive experiments comparing our method to 14 state-of-the-art BERT simplification techniques. The GLUE Benchmark, with its diverse range of tasks, provides a robust evaluation environment for our simplified models.
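For illustration only, the following sketch outlines a pipeline of the kind described above, using the Hugging Face Transformers and Datasets libraries; the `simplify_bert` function is a hypothetical placeholder for the PBCE simplification step, not the actual implementation.

```python
# Minimal sketch of the pipeline described above (not the authors' code).
# `simplify_bert` is a hypothetical placeholder for the PBCE simplification step.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset


def simplify_bert(model, corpus_texts):
    # Placeholder: in PBCE, persistent homology over neuron activations obtained
    # on a representative corpus guides which neurons are kept.
    return model


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# 1. Select a representative corpus (here, a small SST-2 slice, purely as an example).
corpus = load_dataset("glue", "sst2", split="train[:1000]")["sentence"]

# 2. Apply the simplification procedure.
simplified_model = simplify_bert(model, corpus)

# 3. Fine-tune and evaluate the simplified model on the GLUE tasks (not shown).
```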

Comment 2: The results presented in Section 4 seem to be consistent and relevant. The conclusion section summarizes the achieved results. As an interesting addition, the authors could include information about future research directions and the application of the obtained results.


Response 2: The conclusions section has been expanded to include future research directions. This includes extending our experiments to a broader range of BERT family models (RoBERTa, DistilBERT, DeBERTa, etc.) and exploring the potential of applying these models to fine-tuned versions for specific tasks like sentiment analysis and machine translation.

Comment 3: Section 2 should be named Previous Works or Related Works.

Response 3: Section 2 has been renamed to “Previous Works”.

Reviewer 2 Report

Comments and Suggestions for Authors

I read the article and have a few concerns: 

1. What is meant by RoBERTa [4], DistilBERT [5], ALBERT [6] or TinyBERT [7]? A non-specialist reader needs further explanation.

2. The article should be rewritten without "we" or "our", in the third person, in scientific language. 

3. Lines 31-38 should be placed in the methodology section, not in the introduction.

4. I suggest that authors provide the full name of each acronym when first used. For example: DDK, YOCO-BERT, etc.

5. What are the research questions to be answered by the article?

6. I suggest that the authors introduce a discussion section where they explain the results found and, especially, the implications of the results for other, more or less related, fields. What does this research contribute to?

7. The conclusions should not present the results found. Here it should be written whether or not the article answers the hypotheses and the research questions established at the beginning, together with the limitations of the study and future research.

8. Lines 427-429 should be included in the text. 

 

Author Response

Comment 1: What is meant by RoBERTa [4], DistilBERT [5], ALBERT [6] or TinyBERT [7]? A non-specialist reader needs further explanation.

Response 1: Thank you for your inspiring comments. RoBERTa [4], DistilBERT [5], ALBERT [6] or TinyBERT [7] are variations of the BERT architecture, designed to offer improvements in terms of training efficiency, model size, or performance. These models have been proposed as alternatives to the original BERT, addressing specific shortcomings. The Introduction has been expanded to include additional details and clarifications.

Comment 2: The article should be rewritten without "we" or "our", in the third person, in scientific language. 

Response 2: The document has been revised to use third person and scientific language, minimizing first-person expressions.

Comment 3: Lines 31-38 should be placed in the methodology section, not in the introduction.

Response 3: Lines 31-38 provide a brief overview of our methodology. A comprehensive description of the methodology is presented in Section 3. Nonetheless, we consider that this initial summary helps readers grasp the overall approach before delving into the technical details.

Comment 4: I suggest that authors provide the full name of each acronym when first used. For example: DDK, YOCO-BERT, etc.

Response 4: DDK and YOCO-BERT are two state-of-the-art techniques for BERT compression. It is worth noting that, in some cases, such as DDK, the original authors do not provide the full name of the method. To improve clarity, we have revised the document to include the full names of all acronyms upon their first occurrence.

Comment 5: What are the research questions to be answered by the article?

Response 5: The primary goal of this article is to investigate whether persistent homology can be employed to quantify the importance of neurons within BERT-like encoder models (RQ1). Furthermore, we aim to determine whether persistent homology can serve as a foundation for a method to simplify BERT (RQ2). Additionally, we assess the potential of the method to enhance the explainability of Transformer-based models (RQ3). For the sake of clarity, we have added these three research questions to the Introduction section.

Comment 6: I suggest that the authors introduce a discussion section where they explain the results found and, especially, the implications of the results for other, more or less related, fields. What does this research contribute to?

Response 6: We have added Section 5, dedicated to the discussion of the results. Specifically, we study the distribution of rf values for the selection of the most informative neurons, the results obtained from the simplification of BERT Base and Large along with their comparison with other state-of-the-art methods, and finally, the potential utility of our method as a tool to provide interpretability to language models such as BERT. 

As noted in the previous question, all this research and experimentation contribute to understanding the role of homology theory in selecting crucial neurons in predictions generated by a language model, as well as a new method for simplifying encoders.

Comment 7: The conclusions should not present the results found. Here it should be written whether or not the article answers the hypotheses and the research questions established at the beginning, together with the limitations of the study and future research.

Response 7: In order to improve the overall clarity and conciseness of our conclusions, we have minimized references to specific results such as model size, provided a clear justification for our research questions, and outlined potential future research directions. Of course, when discussing the research questions, some references to the experimental outcomes remain necessary.

Comment 8: Lines 427-429 should be included in the text.

Response 8: In accordance with the journal's guidelines, we have included a list of abbreviations with their full names in the abbreviations section. However, following the suggestion in Comment 4, we have also included these full names within the text.

Reviewer 3 Report

Comments and Suggestions for Authors

In this paper, the authors leveraged the persistent BERT compression and explainability (PBCE) methodology to improve the efficiency of BERT. The topic is interesting and the paper could be published as long as the following issues are addressed:


  1. Could the authors use 3-4 sentences to explain what the Green AI methodology is and what its basic principles are?
  2. Page 2, Lines 74-75, “Persistent homology combined with machine learning also finds applications in the fields of biology and chemistry, particularly in protein analysis”. Citations from the computational biology field are required to support this argument.
  3. Apart from persistent homology theory, some popular dimension reduction methods such as PCA, tICA, UMAP and variational autoencoders (VAE) have been widely used to extract features and analyze data. What is the advantage of persistent homology compared with these dimension reduction methods?
  4. The authors may explain why they focus on 0-dimensional persistent homology. Also, Figure 4 should be 1-dimensional instead of 2, because the birth time is always zero and the data points won't be separated along the x axis.
  5. Could the authors also visualize which parts of the BERT model are pruned the most and which parts are kept intact, such as the Q, K, V parts of the attention mechanism and the distribution of pruned neurons over the BERT layers?
  6. Also, could the authors provide some metrics to quantify the efficiency of the compressed model (e.g., training speed)?

Author Response

Comment 1: Could the authors use 3-4 sentences to explain what the Green AI methodology is and what its basic principles are?

Response 1: Thank you for your insightful comments. The Green AI paradigm refers to AI research that yields novel results without increasing the computational cost and, ideally, reducing it. Green AI encourages using measures of efficiency (carbon emissions, electricity usage, or number of parameters, among others) as evaluation metrics for computational research. The literature describes strategies to reduce the computational cost from both the software and hardware perspectives. In terms of software and algorithm optimization, pruning algorithms such as PBCE are very relevant. More details can be found in the following references:

  • Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2019). Green AI. arXiv [Cs.CY]. Retrieved from http://arxiv.org/abs/1907.10597
  • Bolón-Canedo, V., Morán-Fernández, L., Cancela, B., & Alonso-Betanzos, A. (2024). A review of green artificial intelligence: Towards a more sustainable future. Neurocomputing, 599, 128096. doi:10.1016/j.neucom.2024.128096

For the sake of clarity, we have added more details of the Green AI paradigm to the Introduction section (lines 22-24).

Comment 2: Page 2, Lines 74-75, “Persistent homology combined with machine learning also finds applications in the fields of biology and chemistry, particularly in protein analysis”. Citations from the computational biology field are required to support this argument.

Response 2: In this paragraph, after the sentence “Persistent homology combined with machine learning also finds applications in the fields of biology and chemistry, particularly in protein analysis”, some citations regarding the computational biology field were already included, such as:

  • Pun, C.S.; Lee, S.X.; Xia, K. Persistent-homology-based machine learning: a survey and a comparative study. Artificial Intelligence Review 2022, 55, 5169–5213. https://doi.org/10.1007/s10462-022-10146-z.
  • Routray, M.; Vipsita, S.; Sundaray, A.; Kulkarni, S. DeepRHD: An efficient hybrid feature extraction technique for protein remote homology detection using deep learning strategies. Computational Biology and Chemistry 2022, 100, 107749. https://doi.org/10.1016/j.compbiolchem.2022.107749.
  • Nauman, M.; Ur Rehman, H.; Politano, G.; Benso, A. Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins. Journal of Grid Computing 2019, 17, 225–237. https://doi.org/10.1007/s10723-018-9450-6.
  • Wu, K.; Zhao, Z.; Wang, R.; Wei, G.W. TopP–S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. Journal of Computational Chemistry 2018, 39, 1444–1454. https://doi.org/10.1002/jcc.25213.

Comment 3: Apart from persistent homology theory, some popular dimension reduction methods such as PCA, tICA, UMAP and variational autoencoders (VAE) have been widely used to extract features and analyze data. What is the advantage of persistent homology compared with these dimension reduction methods?

Response 3: PCA, UMAP, and VAE are indeed dimensionality reduction methods that facilitate data analysis and decision-making. To begin with, their goal is different from that of neural network pruning, which is our main concern in this paper. Moreover, the dimensionality reduction process inherently involves a loss of information, which can be crucial when assessing the importance of individual neurons within a neural network. In contrast, persistent homology does not perform any reduction and thus avoids this potential loss of information. In addition, the aforementioned methods rely on metric-based features, whereas persistent homology operates on topological concepts, offering a perspective distinct from traditional dimensionality reduction techniques. Lastly, persistent homology has been less explored in the context of machine learning and, to our knowledge, has never been employed as a tool for neural network pruning. For these reasons, we believe the most faithful and rigorous procedure is to compare it with other, more established approaches.

Comment 4: The authors may explain why they focus on 0-dimensional persistent homology. Also, Figure 4 should be 1-dimensional instead of 2, because the birth time is always zero and the data points won't be separated along the x axis.

Response 4: As mentioned in the previous response, we are not aware of any prior methods that employ persistent homology as a building block for neural network pruning techniques, particularly in language models. Therefore, for simplicity, we considered that zero-dimensional persistent homology could be a suitable candidate, and the obtained results do confirm this hypothesis. With respect to Figure 4, the plots on the right represent the Birth-Death Diagram, or persistence diagram. This diagram is a well-established representation in homology theory and is always depicted in 2D, regardless of the dimension of the persistent homology being studied. More details about persistence diagrams can be found in Edelsbrunner, H., & Harer, J. L. (2022). Computational Topology: An Introduction. American Mathematical Society.
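To make the zero-dimensional setting concrete, the following small sketch (not taken from the paper) computes a zero-dimensional persistence diagram for a toy set of activation-like vectors using the ripser package; in dimension zero every birth is zero, which is why the diagram points differ only along the death axis.

```python
# Sketch only: a zero-dimensional persistence diagram for a toy set of
# activation-like vectors, computed with the ripser package.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
activations = rng.normal(size=(50, 2))  # stand-in for hidden-layer representations

# maxdim=0 restricts the computation to zero-dimensional homology (connected components).
diagram = ripser(activations, maxdim=0)["dgms"][0]

# Each row is a (birth, death) pair; for dimension 0 every birth is 0.
print(diagram[:5])
```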

Comment 5: Could the authors also visualize which parts of the BERT model are pruned the most and which parts are kept intact, such as the Q, K, V parts of the attention mechanism and the distribution of pruned neurons over the BERT layers?

Response 5: Figure 6 illustrates the extent to which each component in BERT's layers contributes more or less information. For example, in BERT Base (Figure 6a), layers 3 and 4 provide a significant amount of information across all components, whereas in layer 5, component V contributes little information. Considering this, three pruning levels (Q1, Q2, and Q3) are defined, and a decision is made regarding which pruning level to apply to each component and layer. This information is summarized in Tables 2 (BERT Base) and 3 (BERT Large). By examining these tables, it can be observed that, in general, the Intermediate component contains the neurons that contribute the most information and are thus the most useful ones. Consequently, the most aggressive pruning level (Q3) is never applied to it. In contrast, the V component provides the least relevant information, making it the most heavily pruned component for both BERT Base and BERT Large.
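As a purely illustrative sketch of this kind of per-component decision (the fractions and relevance scores below are assumed for illustration, not the values used in the paper), one could assign a pruning level to a component and keep only the neurons above the corresponding quantile of their relevance values:

```python
# Illustrative sketch (assumed fractions and scores, not the paper's values):
# assign a pruning level to a component and keep the most informative neurons.
import numpy as np

PRUNE_FRACTION = {"Q1": 0.25, "Q2": 0.50, "Q3": 0.75}  # assumed fraction removed per level


def keep_mask(relevance, level):
    """Boolean mask of the neurons to keep for a given pruning level."""
    threshold = np.quantile(relevance, PRUNE_FRACTION[level])
    return relevance > threshold


relevance = np.random.rand(768)    # stand-in for per-neuron relevance (rf) values
keep = keep_mask(relevance, "Q3")  # an uninformative component gets the heaviest pruning
print(f"kept {keep.sum()} of {keep.size} neurons")
```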

Comment 6: Also, could the authors provide some metrics to quantify the efficiency of the compressed model (e.g., training speed)?

Response 6: To align with Green AI principles, we introduce an efficiency metric (post-simplification parameter count) and prioritize it alongside predictive performance. Tables 4 and 5 compare our results with existing methods. We chose this metric because of its widespread use and its independence from experimental conditions, which makes it easier to compare across different studies. Regarding training speed, it is not directly related to model compression. We start the simplification procedure with a pre-trained model and do not know the details of that learning process. It is possible that in subsequent training the training time of the simplified model will be reduced, since it has a smaller number of parameters to train. In any case, that metric is outside our scope, since it depends on the type of training performed and on factors unrelated to efficiency, such as the hardware on which it is executed.
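As a small sketch of the parameter-count metric mentioned above (not the authors' evaluation code), the number of parameters of a model loaded with Hugging Face Transformers can be counted as follows; applying the same count to the simplified model yields the post-simplification efficiency figure.

```python
# Sketch of the parameter-count efficiency metric (not the authors' evaluation code).
from transformers import AutoModel


def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


original = AutoModel.from_pretrained("bert-base-uncased")
print(f"BERT Base parameters: {count_parameters(original):,}")
# The same count on the simplified model gives the figure reported alongside accuracy.
```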

Reviewer 4 Report

Comments and Suggestions for Authors

An interesting article has been submitted for review. The article presents new research findings related to BERT Large Language Models (LLMs). It introduces an improvement over existing models by further compressing BERT LLMs, reducing the original parameters by 47% for BERT Base and 42% for BERT Large. The results obtained demonstrate advantages when compared with state-of-the-art techniques, and the developed model is suitable for use in such comparisons.

 It should also be noted that a comprehensive literature review has been conducted, including a detailed list of the most recent articles from the last 3–5 years. The reader is thoroughly introduced to the issues being addressed. The article is written in a scientific style, with results and experiments discussed in detail.

 The paper does, however, have some minor shortcomings:

- Line 192: The abbreviation "LHU" is not explained.

- Figure 4: Units are not indicated on the axes.

- Tables 3 and 4: The font does not match the rest of the article.

- Formulas: A dot or comma should be added at the end of formulas, depending on whether the sentence concludes or the explanation of variables continues.

- Conclusions: A discussion on potential directions for further research would be appreciated.

- Line 461: Why is the variable "R" the only one in bold? Is this intentional?

- Figure 6: The illustrations should be labeled as (a) and (b), with the caption specifying which is which.

- Figure A1: Is this figure truly necessary? Could its content be summarized in the text instead? Currently, it takes up an entire page, appears in the middle of the page, and does not add substantial additional information.

- Figure 5: Same issue; axis titles and units are missing.

 

The comments are not critical but should be addressed. After minor revisions, the article can be accepted for publication.

Author Response

Comment 1: Line 192: The abbreviation "LHU" is not explained.

Response 1:  Thank you very much for your comments. Line 192 (Line 204 in the new version of the article) contains the “NHU” abbreviation, which is defined in the next line as “Number of Hidden Units”. For the sake of clarity, we have added this acronym to the Abbreviations section at the end of the article.

Comment 2: Figure 4: Units are not indicated on the axes.

Response 2: Figure 4 represents the persistence diagram or Birth-Death diagram, fully established as an analysis tool for persistent homology. This diagram is dimensionless, as can be seen, for example, in the following book, which is a reference in this field:

Edelsbrunner, H., & Harer, J. L. (2022). Computational topology: An introduction. American Mathematical Society.

Comment 3: Tables 3 and 4: The font does not match the rest of the article.

Response 3: We have checked the font of the tables to ensure it matches the one used in the rest of the article. Nevertheless, we have strictly followed the style guidelines and LaTeX template of the journal.

Comment 4: Formulas: A dot or comma should be added at the end of formulas, depending on whether the sentence concludes or the explanation of variables continues.

Response 4: All formulas have been carefully reviewed and adjusted to ensure they are correctly punctuated with periods or commas as required.

Comment 5: Conclusions: A discussion on potential directions for further research would be appreciated.

Response 5: The conclusions section has been expanded to include future research directions. This includes extending our experiments to a broader range of BERT family models and exploring the potential of applying these models to fine-tuned versions for specific tasks like sentiment analysis and machine translation.

Comment 6: Line 461: Why is the variable "R" the only one in bold? Is this intentional?

Response 6: Thank you for the observation. The variable “R” should not be in bold. We have fixed this.

Comment 7: Figure 6: The illustrations should be labeled as (a) and (b), with the title specifying which is which.

Response 7: Figure 6 has been updated with labels a and b for the two images, and these images are now referenced in the text.

Comment 8: Figure A1: Is this figure truly necessary? Could its content be summarized in the text instead? Currently, it takes up an entire page, appears in the middle of the page and does not add substantial additional information.

Response 8: While the image provided additional clarity, we have opted to exclude it in order to reduce the overall length of Appendix A and avoid unnecessary repetition.

Comment 9: Figure 5. Same axis' titles and units are missing.

Response 9: As stated in the response to Comment 2, the Birth-Death Diagram does not require units. With respect to the images on the left, they depict, as an example, two-dimensional vectors that are part of the zero-dimensional persistent homology process. These vectors are latent representations of the texts, generated by the neural network in a hidden layer. Therefore, there are no axis titles or units to add.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The paper can be accepted in the present form.

Reviewer 3 Report

Comments and Suggestions for Authors

The newest version is publishable.
