Article
Peer-Review Record

Approximate LSTM Computing for Energy-Efficient Speech Recognition

Electronics 2020, 9(12), 2004; https://doi.org/10.3390/electronics9122004
by Junseo Jo 1, Jaeha Kung 2 and Youngjoo Lee 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 30 October 2020 / Revised: 14 November 2020 / Accepted: 18 November 2020 / Published: 25 November 2020
(This article belongs to the Special Issue Circuits and Systems for Approximate Computing)

Round 1

Reviewer 1 Report

  1. More implementation details should be added to describe the hardware details. 
  2. Authors can include the limitations of the proposed method if any. Also, can include future work. 
  3. There are a few language issues which should be rectified.
   

Author Response

We thank all of the editors and reviewers for their constructive comments and valuable suggestions. We have carefully checked all of the specific comments and suggestions made by the reviewers, and the attached revision, described below, addresses all of them. Please note that all revised parts are highlighted in blue in the revised manuscript.

1. More implementation details should be added to describe the hardware details.

Authors’ Response:
In response to this comment, we have added more implementation details to the revised manuscript. More precisely, with a new figure, we have clearly described the internal architecture of the processing element (PE), which is the major part of the proposed hardware design and supports multiple operation modes: the normal LSTM operation and the approximate computation with reduced precision. Please find the additional material in Section 4 of the new submission.


2. Authors can include the limitations of the proposed method if any. Also, can include future work.

Authors’ Response:
Thank you for this comment. Depending on the application, there are some limitations to directly applying the proposed approximate LSTM computing, which is dedicated to reducing the computational cost of the bi-directional LSTM operations used in the speech recognition system. For example, speech recognition for streaming applications [R1] must exploit a one-directional LSTM structure with attention-based processing. In that case, only a few LSTM results in the attention region are involved in identifying a character output, and the proposed similarity-based cell approximation may not be an attractive approach. Instead, we might develop a context-level evaluation to approximate the complex LSTM cell operations, which could be one direction of future work to extend the use cases of approximate computing in speech processing.
In the revised manuscript, we have analyzed the basic limitations of the proposed approach, especially for streaming applications, as described above. Please find the related parts in Sections 5 and 6 of the revised manuscript. The authors highly appreciate this valuable comment.

[R1] ([26] at the revised paper) Jorge, J.; Giménez, A.; Iranzo-Sánchez, J.; Civera, J.; Sanchis, A.; Juan, A. Real-Time One-Pass Decoder for Speech Recognition Using LSTM Language Models. INTERSPEECH, 2019, pp. 3820–3824.


3. There are a few language issues which should be rectified.

Authors’ Response:
Thank you for carefully reviewing the manuscript. During the revision, we carefully checked the manuscript multiple times to remove ambiguity and language issues.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper presents an approximate computing method for long short-term memory (LSTM) operations for energy-efficient end-to-end speech recognition. The method has been validated by designing an approximate LSTM accelerator in a 65-nm CMOS process. Applying the designed unit to a speech recognition system improved its energy efficiency by a large margin. The proposed method and its physical implementation are especially relevant for developing artificial-intelligence-powered Internet of Things systems. The paper is well written and technically sound, and the proposed solution is novel and relevant. I have only a few comments and suggestions.

Comments

  1. The context of research is represented rather poorly. Discuss other recent implementations of deep learning neural networks in hardware such as YodaNN. How the problem of energy efficiency is solved in these works and how is your solution different?
  2. Define MAC –> multiply-accumulate operations.
  3. 155: “reduce the computational complexity of DNN-based speech recognition by more than 49%” -> explain in detail how you arrive at this number.
  4. 186: “8 x8 versions.” -> should be multiplication operations?
  5. Explain Figure 8 in more detail. How do you calculate Normalized number of memory accesses and Normalized number of MAC operations?
  6. What software did you use for prototype design and simulation?
  7. What benchmark data were used for calculation of energy consumption?
  8. Table 1: add area units.
  9. 247: “increasing the energy efficiency by 119%” -> check the number. How can the decrease in energy consumption (aka energy efficiency) exceed 100% ?
  10. Add critical discussion section. Discuss the limitations of the proposed method and any threats to the validity of the results.
  11. Improve conclusions; avoid vague statements; support your claims by the main experimental results; discuss further work.

Author Response

We thank all of the editors and reviewers for their constructive comments and valuable suggestions. We have carefully checked all of the specific comments and suggestions made by the reviewers, and the attached revision, described below, addresses all of them. Please note that all revised parts are highlighted in blue in the revised manuscript.

Comment: This paper presents an approximate computing method for long short-term memory (LSTM) operations for energy-efficient end-to-end speech recognition. The method has been validated by designing an approximate LSTM accelerator in a 65-nm CMOS process. Applying the designed unit to a speech recognition system improved its energy efficiency by a large margin. The proposed method and its physical implementation are especially relevant for developing artificial-intelligence-powered Internet of Things systems. The paper is well written and technically sound, and the proposed solution is novel and relevant. I have only a few comments and suggestions.

Authors’ Response:
The authors highly appreciate your positive evaluation. The manuscript is carefully revised according to the comments as follows.


1. The context of research is represented rather poorly. Discuss other recent implementations of deep learning neural networks in hardware such as YodaNN. How the problem of energy efficiency is solved in these works and how is your solution different?

Answer:
In response to this comment, we have compared the proposed accelerator to prior works in the revised manuscript, including the suggested YodaNN architecture [R2]. The previous YodaNN design focuses on executing a binary neural network (BNN) with flexible hardware components using latch-based scratch-pad memories (SCMs). As the proposed work targets more complex networks that use multiple bits to represent weight values, which normally achieve higher accuracy than BNN models, our method can be regarded as a complexity-reduction scheme based on approximate computing rather than a simplification of the network architecture itself. During the revision, in Section 1, we have conceptually surveyed optimization schemes at different levels and described how approximate computing fits among the recent complexity-reduction schemes for deep learning solutions.

[R2] ([13] at the revised paper) Andri, R.; Cavigelli, L.; Rossi, D.; Benini, L. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights. 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2016, pp. 236–241.


2. Define MAC –> multiply-accumulate operations.

Authors’ Response:
Thank you for carefully reviewing our manuscript. We have defined all abbreviations at their first occurrence. For your information, the meaning of MAC is given in Section 1, line 33.


3. 155: “reduce the computational complexity of DNN-based speech recognition by more than 49%” -> explain in detail how do you arrive at this number.

Authors’ Response:
We apologize for the unclear description. In this work, we measure the computational complexity of a DNN architecture by counting the number of MAC operations. Following this comment, we have explained in Section 3 of the revised manuscript how we define and compute this complexity.
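As a rough illustration of this counting, the per-timestep MAC count of a standard LSTM layer can be estimated from its four gate matrix multiplications. The sketch below uses hypothetical dimensions, not the actual DeepSpeech configuration from the paper:

```python
def lstm_macs_per_step(input_size: int, hidden_size: int) -> int:
    # Each of the four gates (input, forget, cell, output) multiplies the
    # concatenated [input; hidden] vector by an (input+hidden) x hidden matrix.
    return 4 * (input_size + hidden_size) * hidden_size

def lstm_macs(input_size: int, hidden_size: int, timesteps: int) -> int:
    # Total MACs over an utterance of `timesteps` frames.
    return timesteps * lstm_macs_per_step(input_size, hidden_size)

# Hypothetical layer: skipping or approximating less important cells
# reduces the MAC count proportionally.
baseline_macs = lstm_macs(input_size=512, hidden_size=512, timesteps=100)
reduced_macs = int(baseline_macs * (1 - 0.49))  # e.g., a 49% reduction
```

Reporting the reduction as a ratio of these counts is what yields a percentage figure such as "more than 49%".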


4. 186: “8 x8 versions.” -> should be multiplication operations?

Authors’ Response:
You are absolutely right. To avoid confusion, we have corrected the expression.


5. Explain Figure 8 in more detail. How do you calculate Normalized number of memory accesses and Normalized number of MAC operations?

Authors’ Response:
We are sorry that the previous version did not include enough explanation of the evaluation settings. To fairly compare the different optimization schemes, we normalize each method's number of memory accesses and number of MAC operations to those of the baseline model that fully utilizes the LSTM cells and achieves the most accurate result. In Section 5 of the revised paper, we have clearly described how we compute the normalized numbers and present a more intuitive illustration by modifying Figure 8.
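The normalization itself is simply a division by the baseline's counts. A minimal sketch with hypothetical counts (not the paper's measured values):

```python
def normalize_to_baseline(counts: dict, baseline: str) -> dict:
    # Divide every method's raw count by the baseline count, so the
    # baseline maps to 1.0 and other entries show the relative cost.
    base = counts[baseline]
    return {method: value / base for method, value in counts.items()}

# Hypothetical raw MAC counts per method.
mac_counts = {"baseline": 6.4e8, "proposed": 3.2e8}
normalized = normalize_to_baseline(mac_counts, "baseline")
# normalized["baseline"] is 1.0; normalized["proposed"] is 0.5
```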



6. What software did you use for prototype design and simulation?

Authors’ Response:
In this work, we used the TensorFlow framework [R3] to evaluate different LSTM architectures to optimize the DeepSpeech network. In Section 5, we have precisely described the testing environment.

[R3] ([21] at the revised paper) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; others. Tensorflow: A system for large-scale machine learning. 12th USENIX symposium on operating systems design and implementation (OSDI16), 2016, pp. 265–283.


7. What benchmark data were used for calculation of energy consumption?

Authors’ Response:
For evaluating the energy consumption, we perform a post-layout simulation of the prototype LSTM accelerator, which is designed in a 65-nm CMOS technology. More precisely, we synthesized the prototype with Synopsys Design Compiler [R4], whereas the place-and-route steps were performed with Synopsys IC Compiler [R5]. In the revised paper, we have clearly denoted the design environment. Please check the newly included parts in Section 5 of this version.

[R4] ([23] at the revised paper) Dupenloup, G. Automatic synthesis script generation for Synopsys Design Compiler, 2004. US Patent 6,836,877.
[R5] ([24] at the revised paper) Kommuru, H.B.; Mahmoodi, H. ASIC design flow tutorial using Synopsys tools. Nano-Electronics & Computing Research Lab, School of Engineering, San Francisco State University, San Francisco, CA, Spring 2009.


8. Table 1: add area units.

Authors’ Response:
According to this comment, we have added area units in Table 1. Thank you for the careful suggestion.

 

9. 247: “increasing the energy efficiency by 119%” -> check the number. How can the decrease in energy consumption (aka energy efficiency) exceed 100% ?

Authors’ Response:
We believe there is a misunderstanding here. As you mentioned, it is impossible to reduce energy consumption by more than 100%. To fairly analyze the effectiveness of the proposed optimization schemes, however, we adopted the concept of energy efficiency (GOPS/W), which is widely used to express the number of operations allowed within a given power budget [R7]. Energy efficiency therefore increases by either improving the throughput or reducing the power consumption, and as a ratio it can grow by more than 100%. For your information, the proposed method enhances the energy efficiency by more than 100% compared to the baseline DeepSpeech architecture, as we reduce the processing complexity for the same workload through the aggressive skipping and approximation of less important LSTM cells. To prevent further misunderstanding, we have stated the definition of energy efficiency in Section 5 of the revised manuscript.

[R7] ([25] at the revised paper) Moon, S.; Lee, H.; Byun, Y.; Park, J.; Joe, J.; Hwang, S.; Lee, S.; Lee, Y. FPGA-based sparsity-aware CNN accelerator for noise-resilient edge-level image recognition. 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC). IEEE, 2019, pp. 205–208.
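To make the arithmetic concrete, here is a small sketch with hypothetical throughput and power numbers (not the paper's measurements) showing how a GOPS/W ratio can improve by more than 100%:

```python
def energy_efficiency(ops_per_second: float, power_watts: float) -> float:
    # Energy efficiency in GOPS/W: giga-operations per second per watt.
    return ops_per_second / 1e9 / power_watts

# Hypothetical designs: the optimized one is slightly faster and draws
# half the power, so the ratio more than doubles.
baseline = energy_efficiency(200e9, 1.0)    # 200 GOPS/W
optimized = energy_efficiency(219e9, 0.5)   # 438 GOPS/W
improvement_pct = (optimized / baseline - 1.0) * 100.0  # ~119% increase
```

A 119% gain in GOPS/W thus corresponds to doing 2.19x as much work per joule, not to "saving 119% of the energy".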


10. Add critical discussion section. Discuss the limitations of the proposed method and any threats to the validity of the results.

Authors’ Response:
As you mentioned, there are some limitations when directly applying the proposed approximate LSTM computing to other applications. For example, even within speech recognition, the proposed method is not advisable for streaming applications, one of the emerging fields. As reported in [R1], recent streaming speech recognition normally exploits attention-based computations, where only a few LSTM cells are involved in identifying each output character; the aggressive approximation of LSTM cell operations may therefore severely degrade the recognition accuracy. To extend the proposed work to such attention-based LSTM architectures, we may consider context-level approximate computing, which will be one of our future works. In the revised paper, we have described the basic limitations of the proposed method, especially for streaming applications, and suggested research directions in these areas. Please find the corresponding parts in Section 5 of the new submission. The authors highly appreciate this valuable comment, which has improved the quality of our manuscript.

[R1] ([26] at the revised paper) Jorge, J.; Giménez, A.; Iranzo-Sánchez, J.; Civera, J.; Sanchis, A.; Juan, A. Real-Time One-Pass Decoder for Speech Recognition Using LSTM Language Models. INTERSPEECH, 2019, pp. 3820–3824.


11. Improve conclusions; avoid vague statements; support your claims by the main experimental results; discuss further work.

Authors’ Response:
Thank you for this suggestion. We have carefully revised the conclusion to use clearer statements supported by the main experimental results.

Author Response File: Author Response.pdf
