Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)- Based Voice-Activity Detector

This paper aims to design an online, low-latency, and high-performance speech recognition system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve this, we adopt a server-client model and a context-sensitive-chunk-based approach. The speech recognition server manages a main thread and a decoder thread for each client and one worker thread. The main thread communicates with the connected client, extracts speech features, and buffers the features. The decoder thread performs speech recognition, including the proposed multichannel parallel acoustic score computation of a BLSTM acoustic model, the proposed deep neural network-based voice activity detector, and Viterbi decoding. The proposed acoustic score computation method estimates the acoustic scores of a context-sensitive-chunk BLSTM acoustic model for the batched speech features from concurrent clients, using the worker thread. The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency while the user utters long sentences. In Korean speech recognition experiments, the number of concurrent clients is increased from 22 to 44 using the proposed acoustic score computation. When combined with the frame skipping method, the number is further increased up to 59 clients with a small accuracy degradation. Moreover, the average user-perceived latency is reduced from 11.71 s to 3.09–5.41 s by using the proposed deep neural network-based voice activity detector.


Problem Definition
Deep learning with GPUs and considerable speech data has greatly accelerated the advance of speech recognition [1][2][3][4][5]. In line with this advancement, automatic speech recognition (ASR) systems have been widely deployed in various applications such as dictation, voice search, and video captioning [6]. Research on ASR deployment can be classified into (a) on-device systems and (b) server-based systems. For an on-device deployment, various optimizations have been proposed, such as efficient and light architectures of an acoustic model (AM) or a language model (LM), network pruning methods, parameter quantization, speaker-dependent models, and compiler optimizations [7][8][9][10][11][12]. These optimizations, however, introduce a trade-off between ASR accuracy and real-time performance. For a server-based deployment, where an ASR system follows a server-client model, the server performs speech recognition using high-performance resources, whereas the client can run on various devices, including embedded devices; the trade-off between accuracy and real-time performance is therefore less severe. In short, our goal is to deploy an online ASR system using a BLSTM AM, even though such an AM is usually used in an offline ASR system since a vanilla BLSTM requires the whole input speech during decoding. For a real-time online ASR system with more concurrent clients, we aim to solve two challenges, (a) a long decoding time and (b) a long response latency, while maintaining the accuracy performance.

Literature Review
The BLSTM AM of an ASR system achieves better performance than a deep neural network (DNN)- or long short-term memory (LSTM)-based AM. However, the use of a BLSTM AM is limited due to practical issues such as its massive computational complexity, the need for a long speech sequence, and long latency. Therefore, many previous studies on online ASR treated a BLSTM AM as an offline AM with better performance but long latency [23][24][25][26][27][28][29]. In addition, there has been considerable research on fast training and decoding for a BLSTM AM. References [30][31][32] proposed a context-sensitive-chunk (CSC) BLSTM, a windowed BLSTM, and a latency-controlled (LC) BLSTM, respectively. The CSC BLSTM considers a CSC to be an isolated audio sequence, where the CSC comprises a chunk of fixed length and the past and future chunks. The windowed BLSTM is a variant of the CSC BLSTM with jitter training, and it was found that a limited context, rather than an entire utterance, is adequate. The LC BLSTM is decoded in a similar manner to a conventional unidirectional LSTM; however, it additionally decodes a fixed number of future frames for the BLSTM. Reference [33] reduced the computational complexity of the LC BLSTM by approximating the forward or backward BLSTMs as simple networks. Reference [34] utilized the LC BLSTM in an online hybrid connectionist temporal classification (CTC)/attention end-to-end (E2E) ASR. Reference [35] proposed a CTC-based E2E ASR using a chunk-based BLSTM where the chunk size was randomized over a predefined range, and showed accuracy improvement and low latency.
Moreover, there has been considerable research on online ASR systems for increasing the concurrency. Reference [13] proposed a CTC-based E2E ASR system, Deep Speech 2, and used an eager batch dispatch technique of GPU, 16-bit floating-point arithmetic, and beam pruning for real-time and low latency. Reference [14] proposed an online ASR architecture based on time-delay neural network (TDNN) and unidirectional LSTM layers. Reference [15] proposed a recurrent neural network (RNN)-based LM for an online ASR using a quantization of history vectors and the CPU-GPU hybrid scheme. Reference [16] proposed beam search pruning and used the LC BLSTM for an RNN transducer (RNNT)-based E2E ASR. Reference [17] proposed a monotonic chunkwise attention (MoChA) model using hard monotonic attention and soft chunk-wise attention. Reference [18] employed the MoChA-based approach in an E2E ASR using unidirectional LSTM and attention layers for streaming. Reference [19] proposed an online E2E ASR system based on a time-depth separable convolution and CTC for low latency and better throughput. Reference [20] proposed TDNN-based online and offline ASR by proposing a parallel Viterbi decoding based on an optimized, weighted finite-state transducer decoder using GPU, batching multiple audio streams, and so forth.
This paper adopts a CSC BLSTM AM that can decode not a whole input speech but segments of an input speech in order to reduce the decoding latency. To solve the long decoding time, unlike previous works for an online ASR which mostly degrade the accuracy performance [14,15,17,19], we focus on accelerating a GPU parallelization while the accuracy performance is maintained. To solve the long response latency to a user, we also utilize a voice activity detector.

Proposed Method and Its Contribution
This paper presents an online multichannel ASR system employing a BLSTM AM, which is rarely deployed in industry even though it is one of the best performing AMs. Accordingly, our online ASR system is based on the server-client model, where the server performs speech recognition using high-performance resources and the client interacts with the user and server via various devices. For the baseline server-side ASR system, we utilize a CSC BLSTM where the sizes of a chunk and its left and right contexts are 20, 40, and 40 ms, respectively [30]. In fact, the baseline ASR performs in real time; however, the number of concurrent clients is limited. To support more concurrent clients, we propose a method for accelerating the GPU parallelization and reducing the transmission overhead between the CPU and GPU, based on the fact that a CSC is regarded as an isolated sequence. Moreover, we propose a DNN-based voice activity detector (DNN-VAD) for a low-latency response even when the client sends no end-of-utterance message. To evaluate the proposed online ASR system, we use test data recorded from Korean documentary programs. The performance in terms of ASR accuracy, real-time operation, and latency is measured as a function of the syllable error rate (SyllER), the number of concurrent clients, and the elapsed time in seconds until a user receives the recognized text of what they spoke.
The contributions of this paper are summarized as follows:

Fast BLSTM-based online ASR system: A BLSTM-based AM is commonly regarded as suitable only for an offline ASR system since a vanilla BLSTM can be decoded only after the overall input speech is obtained. This paper successfully deploys a CTC-BLSTM-AM-based online ASR system using the proposed multichannel parallel acoustic score computation and DNN-VAD methods. Note that the proposed system can be employed for any language if proper acoustic and language models are prepared for that language, though our experiments are conducted only for Korean.

Parallel acoustic score computation using multichannel data: Even though a BLSTM-based online ASR system can be deployed using a CTC-BLSTM, the number of concurrent clients is still limited due to the massive computational complexity. The proposed acoustic score computation method increases the number of concurrent clients by merging multichannel data and processing the data in parallel, which accelerates the GPU parallelization and reduces the data transfer between the CPU and GPU. In the ASR experiments, the number of concurrent clients is increased from 22 to 44 by the proposed parallel acoustic score computation method.

DNN-based voice-activity detector: An online ASR system needs to send the recognized text to a user as soon as possible, even before the end of a sentence is detected, to reduce the response time. To this end, we propose a DNN-based voice-activity detector that detects short pauses in a continuous utterance. The benefits of the proposed method are (a) the user-perceived latency is reduced from 11.71 s to 5.41 s or 3.09 s, depending on the parameter, with little degradation in accuracy, and (b) it reuses the acoustic model scores already computed for ASR, requiring no auxiliary training and no additional computation during decoding.

ASR performance: The proposed method maintains the accuracy performance, whereas many previous works degrade ASR performance.

Combination with additional optimization: Additional optimization methods that trade accuracy for speed can be applied on top of the proposed ASR system. For instance, we apply a frame skip method during Viterbi decoding. In the ASR experiments, the number of concurrent clients is increased from 44 to 59, while the syllable error rate is degraded from 11.94% to 12.53%.
The rest of this paper is organized as follows. Section 2 describes our configuration of a server-client-based ASR system, the acoustic score computation of a CTC BLSTM AM, and a baseline multichannel acoustic score computation using a CSC BLSTM AM. Next, Section 3 proposes a fast multichannel parallel acoustic score computation to support more clients, and Section 4 proposes a DNN-VAD method for a low-latency response. Section 5 describes our experimental setup and performance comparison. Finally, we conclude our findings in Section 6.

Baseline Multichannel Acoustic Score Computation of a CSC BLSTM AM
Before proposing the parallel multichannel acoustic score computation, we first introduce our server-client-based ASR system and the use of a CTC BLSTM AM [30]. Then, we present the baseline multichannel acoustic score computation method that is used in our CTC BLSTM-based online ASR.

Configuration of a Server-Client-Based ASR System
This section gives a detailed description of our server-client-based ASR system in terms of the main thread and the decoder thread.

Main Thread of the Online ASR Server
When a client requests a connection to the ASR server, a main thread is created for the client. The main thread parses the messages from the client, where a message can be a begin-of-audio, an audio segment, an end-of-audio, or an end-of-connection. If the begin-of-audio message is received, the main thread creates a decoder thread and prepares for speech recognition. If an audio segment is obtained, the main thread extracts a 600-dimensional speech feature vector for each 10-ms audio frame. That is, it extracts 40-dimensional log mel filterbanks for each 10-ms frame and then stacks the features of the past seven frames and the future seven frames. The extracted features are queued into a ring buffer, R_feat, so that speech recognition can be performed whenever the decoding thread is ready. If the end-of-audio message is received, the main thread waits for the decoder thread to finish and then terminates the decoder thread and itself.
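The feature extraction step described above (40-dimensional log mel filterbanks stacked with seven past and seven future frames into 600 dimensions) can be sketched as follows; this is a minimal NumPy sketch, and the edge-padding scheme (repeating the boundary frames) is an assumption, as the paper does not specify one.

```python
import numpy as np

def stack_features(fbank, left=7, right=7):
    """Stack past/future frames around each 40-dim log mel filterbank frame.

    fbank: (T, 40) array of per-frame log mel features.
    Returns a (T, 600) array: 40 dims x (7 past + current + 7 future) frames.
    Edge frames are padded by repetition (an assumption; the paper does not
    state its padding scheme).
    """
    T, D = fbank.shape
    padded = np.concatenate(
        [np.repeat(fbank[:1], left, axis=0), fbank,
         np.repeat(fbank[-1:], right, axis=0)], axis=0)
    # For frame t, concatenate frames t-7 .. t+7 into one 600-dim vector.
    return np.stack(
        [padded[t:t + left + 1 + right].reshape(-1) for t in range(T)])
```

Each 10-ms frame thus carries 150 ms of context (15 frames x 40 dims = 600 dims), matching the feature dimension used throughout the paper.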

Decoder Thread of the Online ASR Server
Whenever a decoder thread is in an idle or ready state, the thread checks whether the size of the speech features in R_feat is larger than a pre-defined minibatch size, T_bat. T_bat indicates the length of the audio samples to be decoded at a time and is defined during the initialization stage of the ASR server. In this paper, we set T_bat as 200 frames, which corresponds to 2 s [14,26,36].
If the decoder thread detects speech features larger than T_bat, the thread obtains the speech features with a length of (T_bat + T_cxt), while those with a length of T_bat are popped from R_feat. T_cxt is the length of the audio samples for the left or right context information [30]; in this paper, we set T_cxt as 40 frames. The obtained features are used to calculate the acoustic scores of a CSC BLSTM AM, where the multichannel parallel decoding is proposed in Section 3. The acoustic scores are used in two ways: (a) DNN-VAD and (b) Viterbi decoding. DNN-VAD, proposed in Section 4, aims to automatically send the decoded text to the client even though no end-of-audio message has been received from the client. If DNN-VAD detects a short pause in an utterance, the thread finds the optimal text up to that time, sends the text to the client, and resets the search space of Viterbi decoding. Viterbi decoding estimates the probabilities of the possible paths using the acoustic scores and an n-gram LM. These processes are repeated until the decoder thread receives an end-of-audio or end-of-connection message.
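One iteration of the decoder-thread loop described above can be sketched as follows; the function names (compute_scores, vad_short_pause, viterbi) are illustrative stand-ins for the acoustic-score computation, DNN-VAD, and Viterbi decoding stages, not the paper's implementation.

```python
from collections import deque

T_BAT, T_CXT = 200, 40  # frames, per the paper (2 s batch + 0.4 s context)

def decode_step(feat_buffer, compute_scores, vad_short_pause, viterbi):
    """One iteration of the decoder-thread loop (simplified sketch).

    feat_buffer: deque of per-frame feature vectors (the ring buffer R_feat).
    compute_scores, vad_short_pause, viterbi: callables standing in for the
    acoustic-score, DNN-VAD, and Viterbi decoding stages.
    Returns the decoded result for this minibatch, or None if not enough
    frames are buffered yet.
    """
    if len(feat_buffer) < T_BAT + T_CXT:
        return None
    # Read T_bat frames plus T_cxt future-context frames...
    window = [feat_buffer[i] for i in range(T_BAT + T_CXT)]
    # ...but only pop T_bat frames, leaving the context for the next batch.
    for _ in range(T_BAT):
        feat_buffer.popleft()
    scores = compute_scores(window)
    if vad_short_pause(scores):
        # Short pause detected: backtrack now and reset the search space.
        return viterbi(scores, backtrack=True)
    return viterbi(scores, backtrack=False)
```

In the real server this loop runs per client, and the acoustic-score stage is delegated to the shared worker thread proposed in Section 3.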

The Use of a CSC BLSTM AM
To efficiently incorporate a BLSTM AM, our ASR system utilizes CSC-based backpropagation through time (BPTT) and decoding [30]. That is, a CSC comprises an audio chunk of fixed length and its left/right context chunks, where the chunk size is much smaller than the input audio length. The ASR system employing the CSC BLSTM AM can reduce the training time and decoding latency because the CSC-based approach regards each CSC as an isolated sequence.
Let us assume the speech features of an overall input sequence of length T_tot, as shown in Figure 2a. As mentioned in Section 2.1.1, the decoder thread obtains a speech feature vector with length T_bat + T_cxt. The features of length T_bat and those of length T_cxt are used as the features to be decoded at the time and as their future context, drawn in yellow and in gray, respectively, in Figure 2b. As shown in Figure 2c, each speech feature vector is split into a set of chunks of length T_chunk. Then, CSC vectors are generated by appending the past features of length T_cxt and the future features of length T_cxt to each split chunk, as shown in Figure 2d. The length (T_win) of a CSC vector and the number (N_seg) of CSC vectors are defined as

T_win = T_chunk + 2 × T_cxt, (1)

N_seg = T_bat / T_chunk. (2)
For an efficient use of GPU parallelization, the CSC vectors are merged into a CSC matrix of the form N_seg × T_win, as shown in Figure 2e. Using the CSC matrix, the acoustic scores are calculated with the CSC BLSTM AM. In this paper, T_chunk and T_cxt are set as 20 and 40 frames, respectively; therefore, T_win and N_seg are 100 frames and 10, respectively.
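The chunking of Figure 2c–e can be sketched as follows; this is a minimal NumPy sketch, and zero-padding the utterance-initial left context is an assumption (the paper does not specify how that context is filled).

```python
import numpy as np

T_CHUNK, T_CXT = 20, 40           # frames, per the paper
T_WIN = T_CHUNK + 2 * T_CXT       # 100 frames per CSC (Equation (1))

def make_csc_matrix(feats, t_bat=200):
    """Split a (t_bat + T_CXT, D) feature window into CSC vectors and merge
    them into an N_seg x (T_WIN * D) matrix (Figure 2c-e).

    feats[:t_bat] are the frames to decode; feats[t_bat:] is future context.
    The left context of the first chunk is zero-padded here (an assumption;
    the paper does not state how the utterance-initial context is filled).
    """
    n_seg = t_bat // T_CHUNK      # Equation (2): N_seg = T_bat / T_chunk
    D = feats.shape[1]
    # Prepend T_CXT zero frames so every chunk has a full left context.
    padded = np.concatenate([np.zeros((T_CXT, D)), feats], axis=0)
    rows = []
    for s in range(n_seg):
        start = s * T_CHUNK       # chunk start minus T_CXT, in padded coords
        rows.append(padded[start:start + T_WIN].reshape(-1))
    return np.stack(rows)         # shape: (N_seg, T_WIN * D)
```

With t_bat = 200 this yields the N_seg = 10 CSC rows of Figure 2e, each covering a 100-frame window, so all chunks of one minibatch can be scored by the BLSTM in a single GPU call.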

Baseline Multichannel Acoustic Score Computation
When the decoder thread is initialized, the run-time memories are allocated. As shown in Figure 3a, the run-time memories comprise (a) three types of CPU memories, M^CPU_feat_vec for a speech feature vector, M^CPU_feat_mat for a CSC matrix, and M^CPU_prob_vec for acoustic scores, and (b) two types of GPU memories, M^GPU_feat_mat for a CSC matrix and M^GPU_prob_vec for acoustic scores. The sizes of the run-time memories are summarized in Table 1. That is, the size of M^CPU_feat_vec is N_bat, which is defined as

N_bat = (T_bat / T_shift) × N_feat_dim, (3)

where T_shift and N_feat_dim are the sizes of the frame shift and the speech feature dimension, which are 10 and 600, respectively, as described in Section 2.1.1. Therefore, N_bat is 12,000. The size of M^CPU_feat_mat or M^GPU_feat_mat is N_win × N_seg, where the size of a CSC, N_win, is defined as

N_win = (T_win / T_shift) × N_feat_dim; (4)

thus, N_win is 6000. The size of M^CPU_prob_vec and M^GPU_prob_vec is N_node, which is the number of output nodes of the BLSTM AM and is set as 19,901 in this paper.

Table 1. Summary of the run-time memories allocated at the CPU and GPU for the baseline multichannel acoustic score computation.

Thread    Name              Device   Size
Decoder   M^CPU_feat_vec    CPU      N_bat
Decoder   M^CPU_feat_mat    CPU      N_win × N_seg
Decoder   M^CPU_prob_vec    CPU      N_node
Decoder   M^GPU_feat_mat    GPU      N_win × N_seg
Decoder   M^GPU_prob_vec    GPU      N_node

Whenever a decoder thread is in an idle state and R_feat contains more speech features than T_bat, the decoder thread obtains a speech feature vector of length T_bat. The speech feature vector is reformed into CSC vectors, which are merged into a CSC matrix. The CSC matrix is then transmitted from the CPU to the GPU. On the GPU, the CSC matrix is normalized using a linear discriminant analysis (LDA)-based transform [37], and the acoustic scores of the matrix are calculated using the BLSTM AM. Next, the acoustic scores are transmitted from the GPU to the CPU and used in the subsequent steps, such as DNN-VAD and Viterbi decoding. The described procedures are shown in Figure 3b.
For each feature with a duration of T_bat per client, the transmission size is N_win × N_seg for a CSC matrix from the CPU to the GPU and N_node for acoustic scores from the GPU to the CPU, as shown in Table 2. Moreover, each transmission occurs N_channel times if the number of concurrent clients is N_channel. Such frequent data transfer tends to degrade the overall computational performance of a system and causes low utilization of the GPU [38,39].

Table 2. Summary of the frequency and transmission sizes between the CPU and GPU for each feature with a duration of T_bat when the number of concurrent clients is N_channel, using the baseline multichannel acoustic score computation.

Transmission   Frequency    Size
CPU → GPU      N_channel    N_win × N_seg
GPU → CPU      N_channel    N_node

Proposed Fast Multichannel Parallel Acoustic Score Computation
Using the baseline acoustic score computation, the number of concurrent clients is restricted due to the frequent data transfer between GPU and CPU and the low parallelization of the GPU [39].
To support more concurrent clients in real time, this section proposes a fast multichannel parallel acoustic score computation method by accelerating the GPU parallelization and reducing the transmission overhead.
As shown in Figure 4, the proposed fast multichannel parallel acoustic score computation is performed with one decoding thread per client and an additional worker thread, whereas the baseline method is performed with no worker thread. When an online ASR server is launched, the server creates a worker thread for GPU parallel decoding and initializes the maximum number (N^CPU_parallel) of concurrent clients and the maximum number (N^GPU_parallel) of GPU parallel decodings. Once the worker thread is initialized, the run-time memories are allocated, as shown in Figure 4a. As shown in Figure 4b, one run-time CPU memory, M^CPU_prob_vec for the acoustic scores, is also allocated when a decoder thread for a client is initialized. The sizes of the run-time memories are summarized in Table 3.
Then, the decoder thread waits for the acoustic scores to be calculated by the worker thread. On the other hand, whenever the worker thread is in an idle state and there are buffered speech feature vectors, the worker thread copies up to N^GPU_parallel of them to the GPU and reforms them into a cascaded CSC matrix,

M^GPU_feat_mat = [CSC(1); CSC(2); . . . ; CSC(k)],

where CSC(i) indicates the CSC-based matrix of the i-th speech feature vector of M^GPU_feat_vec. Then, the matrix is normalized using an LDA-based transform [37], and the acoustic scores are calculated into M^GPU_prob_vec using the CSC BLSTM AM. The acoustic scores are in the following cascaded form:

M^GPU_prob_vec = [prob_1; prob_2; . . . ; prob_k],

where prob_i denotes the acoustic scores of the i-th speech feature vector of M^GPU_feat_vec. Next, the acoustic scores are transmitted from the GPU to the CPU. For instance, if prob_i is for the m-th client, prob_i is stored at the offset index corresponding to the m-th client (Equation (8)). If the waiting decoder thread detects the acoustic scores at M^CPU(W)_prob_vec with the corresponding offset index of Equation (8), the decoder thread copies them into its local memory (M^CPU_prob_vec) and proceeds with the subsequent steps as in the baseline method. The described procedures are shown in Figure 4c.
As shown in Table 4, when the worker thread is ready and M^CPU(W)_feat_vec contains k speech feature vectors, the transmission sizes are N_bat × k for the k speech feature vectors from the CPU to the GPU and N_node × k for the acoustic scores from the GPU to the CPU. In addition, the frequency varies with k, from N_channel / N^GPU_parallel to N_channel. Assuming that the number of concurrent clients is N_channel and the number of speech feature vectors decoded at a time by the proposed method is 1 ≤ k ≤ N^GPU_parallel, the main differences between the baseline and proposed acoustic score computation methods are as follows:

Decoding subject(s): The decoder thread of each client calculates acoustic scores in the baseline method, whereas the additional worker thread does so in the proposed method.

Transmission frequency: The transmission occurs 2 × N_channel times in the baseline method and 2 × N_channel / k times in the proposed method. Therefore, the proposed method reduces the transfer frequency by a factor of k.

Transmission size: For the transmission from the CPU to the GPU, the baseline method transmits N_win × N_seg for each client, whereas the proposed method transmits N_bat × k for each decoding turn; the total size of the data transmitted to the GPU is thus reduced by the proposed method. On the other hand, for the transmission from the GPU to the CPU, the baseline method transmits N_node for each client, whereas the proposed method transmits N_node × k for each decoding turn; the total transmission size to the CPU is equal.

Decoding size at a time: The baseline method decodes one speech feature vector at a time, whereas the proposed method decodes k vectors, which leads to higher GPU parallelization.

Table 4. Summary of the frequency and transmission sizes between the CPU and GPU for each feature with a duration of T_bat when the number of concurrent clients is N_channel and the number of decoded speech feature vectors is k ≤ N^GPU_parallel, using the proposed fast multichannel parallel acoustic score computation.

Transmission   Min. Frequency               Max. Frequency   Size
CPU → GPU      N_channel / N^GPU_parallel   N_channel        N_bat × k
GPU → CPU      N_channel / N^GPU_parallel   N_channel        N_node × k
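The worker-thread batching idea can be sketched as follows; this is a minimal sketch in which blstm_scores stands in for the LDA transform plus the CSC BLSTM forward pass on the GPU, and the names and batching policy are illustrative, not the paper's implementation.

```python
import numpy as np

def batch_acoustic_scores(pending, blstm_scores, n_gpu_parallel=32):
    """Worker-thread batching sketch for the proposed method.

    pending: list of (client_id, feature_vector) pairs buffered by the
    decoder threads. blstm_scores: callable mapping a stacked (k, N_bat)
    feature matrix to a (k, N_node) score matrix; it stands in for the
    LDA transform + CSC BLSTM forward pass on the GPU.

    Returns {client_id: scores}. One CPU->GPU and one GPU->CPU transfer
    serve up to n_gpu_parallel clients, instead of one round-trip each.
    """
    k = min(len(pending), n_gpu_parallel)
    batch = pending[:k]
    feats = np.stack([f for _, f in batch])   # (k, N_bat): one transfer in
    probs = blstm_scores(feats)               # (k, N_node): one transfer out
    # Scatter the cascaded scores back to each client's offset.
    return {cid: probs[i] for i, (cid, _) in enumerate(batch)}
```

Each waiting decoder thread then copies its own row out of the result, mirroring the per-client offset lookup in M^CPU(W)_prob_vec.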

Proposed DNN-Based VAD Method for Low Latency Decoding
Viterbi decoding involves two processes: one estimates probabilities of states in all possible paths, and the other finds an optimal path by backtracking the states with the highest probability. The ASR system yields the results only after both processes are completed, usually at the end of an utterance.
In an online ASR system recognizing long utterances, the end point of an utterance is not known in advance and deciding the back-tracking point affects user experience in terms of response time. If the backtracking is performed infrequently, the user will receive a delayed response, and in the opposite case, the beam search will not find the optimal path that reflects the language model contexts.
In our system, VAD based on an acoustic model for ASR is used to detect short pauses in a continuous utterance, which trigger backtracking. In particular, the acoustic model is built with a deep neural network; hence, we call it DNN-VAD. Here, DNN includes not only a fully connected DNN but also all types of deep models, including LSTM and BLSTM. As explained in the previous sections, our ASR system uses a BLSTM to compute the posterior probability of each triphone state for each frame. By reusing these values, we can also estimate the probability of non-silence for a given frame with little additional computational cost.
Each output node of the DNN model can be mapped to the states of non-silence or silence phones. Let the output of the DNN model at the i-th node be o_i. Then, the speech and silence probabilities of a given frame are computed as

log P_nonsil = max_i o_i, where i ∈ non-silence states, (9)

log P_sil = max_i o_i, where i ∈ silence states, (10)

and the log likelihood ratio (LLR) is computed as

LLR = log P_nonsil − log P_sil. (11)

Each frame at time t is decided to be a silence frame if the LLR is smaller than a predefined threshold:

s(t) = 1 if LLR(t) < θ_LLR, and 0 otherwise. (12)

In addition, for smoothing purposes, the ratio of silence frames in a window of length (2W + 1) is computed and compared with a predefined threshold T_r:

ŝ(t) = 1 if (1 / (2W + 1)) Σ_{τ=t−W}^{t+W} s(τ) > T_r, and 0 otherwise. (13)
All computations from Equation (9) to Equation (13) are performed within each minibatch, and the frame at time t where ŝ(t − 1) = 1 and ŝ(t) = 0 is regarded as a short pause in an utterance. As will be explained in Section 5.3, frequent backtracking reduces the response time but also degrades the recognition accuracy. Thus, a minimum interval is set between detections of short pauses to control this trade-off.
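The computations of Equations (9)–(13) can be sketched as follows, assuming per-frame state log posteriors are available from the acoustic model; the threshold values in the signature are illustrative, not the paper's settings.

```python
import numpy as np

def dnn_vad(log_probs, nonsil_idx, sil_idx, thr_llr=0.0, W=5, T_r=0.5):
    """Frame-wise silence decision following Equations (9)-(13).

    log_probs: (T, N_node) per-frame state log posteriors from the AM.
    nonsil_idx / sil_idx: output nodes mapped to non-silence / silence
    states. thr_llr, W, T_r are illustrative threshold values.

    Returns s_hat: (T,) array, 1 = smoothed silence decision.
    """
    log_p_nonsil = log_probs[:, nonsil_idx].max(axis=1)   # Eq. (9)
    log_p_sil = log_probs[:, sil_idx].max(axis=1)         # Eq. (10)
    llr = log_p_nonsil - log_p_sil                        # Eq. (11)
    s = (llr < thr_llr).astype(float)                     # Eq. (12)
    # Eq. (13): silence ratio in a (2W + 1)-frame window vs. threshold T_r.
    T = len(s)
    s_hat = np.zeros(T, dtype=int)
    for t in range(T):
        lo, hi = max(0, t - W), min(T, t + W + 1)
        s_hat[t] = 1 if s[lo:hi].mean() > T_r else 0
    return s_hat
```

A short pause is then detected at any frame t with s_hat[t − 1] = 1 and s_hat[t] = 0, which is where the decoder thread triggers backtracking.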

Experiment
We select Korean as the target language for the experiments on the proposed methods (though our experiments are based on Korean speech recognition, the proposed methods can be applied to CTC BLSTM-based speech recognition in any language). All experiments are performed on two Intel Xeon Silver 4214 CPUs @ 2.20 GHz and a single NVIDIA GeForce RTX 2080 Ti GPU. Section 5.1 describes the corpus and the baseline ASR system and compares the performance of the ASR systems employing different AMs. Next, Sections 5.2 and 5.3 present the performance of the proposed parallel acoustic score computation method and the DNN-VAD method, respectively.

Corpus and Baseline Korean ASR
We use 3440 h of Korean speech and its transcription data to train the baseline Korean ASR system. The speech data comprise approximately 19 million utterances, which are recorded under various conditions in terms of speaker, noise environment, recording device, and recording script. Each utterance is sampled at a rate of 16 kHz, and no further augmentation methods are adopted. To evaluate the proposed methods, we prepare a test set recorded from documentary programs. The recordings include the voices of narrators and interviewees, with and without various background music and noises. The recordings are manually split into 69 segments, each of which is 29.24 s long on average, totaling 33.63 min.
Each utterance of the training speech data is converted into 600-dimensional speech features. With the extracted speech features, a CSC BLSTM AM is trained using a Kaldi toolkit [40], where the chunk and context sizes of the CSC are 20 and 40 ms, respectively. The AM comprises one input layer, five BLSTM layers, a fully connected layer, and a soft-max layer. Each BLSTM layer comprises 640 BLSTM cells and 128 projection units, while the output layer comprises 19901 units. For the language model, 38 GB of Korean text data is first preprocessed using text-normalization and word segmentation methods [41], and then, the most frequent 540k sub-words are obtained from the text data (For Korean, a sub-word unit is commonly used as a basic unit of an ASR system [42,43]). Next, we train a back-off trigram of 540k sub-words [41] using an SRILM toolkit [44,45].
During decoding, the minibatch size is set to 2 s. Although a larger minibatch size increases the decoding speed owing to the bulk computation of GPU, the latency also increases. We settle into 2 s of minibatch size as a compromise between decoding speed and latency [14,26,36].
To compare with the baseline ASR system, we additionally train two types of AMs: (a) a DNN-based AM and (b) an LSTM-based AM. The DNN-based AM comprises one input layer, eight fully connected hidden layers, and a soft-max layer; each hidden layer comprises 2048 units, and the output layer comprises 19,901 units. The LSTM-based AM consists of one input layer, five LSTM layers, a fully connected layer, and a soft-max layer; each LSTM layer consists of 1024 LSTM cells and 128 projection units, and the output layer consists of 19,901 units. The ASR accuracy is measured using SyllER, which is calculated as

SyllER = (S + D + I) / N × 100 (%),

where S, D, I, and N are the numbers of substituted syllables, deleted syllables, inserted syllables, and reference syllables, respectively. As shown in Table 5, the BLSTM AM achieves an error rate reduction (ERR) of 20.66% on the test set compared with the DNN-based AM and an ERR of 11.56% compared with the LSTM-based AM. Therefore, we employ the BLSTM AM in our baseline ASR system for better ASR accuracy. (For comparison, Korean speech recognition experiments on the same data set using the Google Cloud API achieved an average SyllER of 14.69%.) Next, we evaluate the multichannel performance of the ASR systems employing the three AMs by examining the maximum number of concurrent clients for which an ASR system can perform in real time. That is, multiple clients are connected to an ASR server in parallel, and each client requests decoding of the test set. We then measure the real-time factor (RTF) for each client i as

RTF_i = (the processing time for client i) / (the total duration of the test set).
Next, we regard the concurrent clients as being served in real time if the average real-time factor over the concurrent clients is smaller than 1.0. As shown in Table 6, the BLSTM AM supports 22 concurrent clients for the test set, whereas the DNN- or LSTM-based AMs support more concurrent clients. Hereafter, the experimental comparison is performed only with the LSTM-based AM, as our ASR system is optimized for uni- or bidirectional LSTM-based AMs. Table 6. Comparison of the multichannel performance of the Korean online ASR systems employing the DNN-, LSTM-, and BLSTM-based AMs for the test set. The evaluation metric is the maximum number of concurrent clients for which an ASR system can perform in real time.

Moreover, we evaluate the CPU and GPU usages (%) of the ASR systems using the baseline acoustic score computation method of Section 2 for (a) the LSTM-based AM and (b) the BLSTM AM. The experiments are performed with the test set. From Figure 5, the averaged usages of the CPU and GPU are 83.26% and 51.71% when the LSTM-based AM is employed, and 27.34% and 60.27% when the BLSTM AM is employed. The low GPU usage can result from the frequent data transfer and low GPU parallelization. Moreover, the low CPU usage observed for the BLSTM AM results from the CPU spending a long time waiting for the acoustic score computation to complete.

Experiments on the Proposed Fast Parallel Acoustic Score Computation Method
The proposed fast parallel acoustic score computation method can replace the baseline acoustic score computation method with no AM changes in the baseline Korean ASR system. As shown in Table 7, the ASR system using the proposed method of Section 3 achieves the same SyllER as the system using the baseline method of Section 2.3. No performance degradation occurs because we only modify how the acoustic scores are calculated by accelerating the GPU parallelization. Moreover, the ASR system using the proposed method supports 22 more concurrent clients for the test set compared with the system using the baseline method. Therefore, we conclude that the proposed acoustic score computation method increases the number of concurrent clients with no performance degradation.

To analyze the effects of the proposed acoustic score computation method, we compare the CPU and GPU usages (%) of the ASR systems using the baseline and proposed methods for the test set. The averaged usages of the CPU and GPU are 78.58% and 68.17%, respectively, when the proposed method is used. Comparing Figure 5b and Figure 6a, the averaged usages of the CPU and GPU are improved by 51.24% and 7.90%, respectively. It can be concluded that the proposed method reduces the processing time of the GPU and the waiting time of the CPU by reducing the transfer overhead and increasing the GPU parallelization. In addition, we examine the number (k) of feature vectors decoded in parallel at each time stamp when using the proposed method, as shown in Figure 6b. The number of feature vectors decoded in parallel varies from 2 to 26, depending on the subsequent step, an optimal path search of Viterbi decoding.
For further improvement in the multichannel performance, optimization methods such as beam pruning can be applied. In this study, we apply a simple frame skip method during token passing-based Viterbi decoding; that is, token propagation is performed only at the odd time stamps during Viterbi decoding. In the experiments with the proposed acoustic score computation, the ASR system combined with the frame skip method supports up to 59 concurrent clients, although the SyllER is relatively degraded by 4.94% on the test set compared to the ASR system without the frame skip method, as shown in Table 8. Again, we measure the CPU and GPU usage (%) of the ASR systems employing the proposed method with and without the frame skip method on the test set, as shown in Figure 7a. The average CPU and GPU usages are measured as 47.24% and 71.87%, respectively. Note that the frame skip method unburdens the CPU, and thus the GPU usage improves accordingly. Moreover, Figure 7b compares the number of active hypotheses during Viterbi decoding.
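As a rough illustration of the frame skip idea, the following toy Viterbi performs full token propagation (the expensive transition search) only at the odd time stamps and merely accumulates emission scores otherwise. It is a dense sketch under our own simplifying assumptions, not the paper's token-passing decoder over a WFST search network:

```python
import numpy as np

def viterbi_frame_skip(log_emit, log_trans, skip=True):
    """Toy Viterbi with the frame skip heuristic.

    log_emit: (T, S) log acoustic scores; log_trans: (S, S) log transitions.
    With skip=True, the best-predecessor search runs only at odd time
    stamps; at even stamps each token simply absorbs its emission score.
    Returns the best final log score.
    """
    T, S = log_emit.shape
    tokens = log_emit[0].copy()            # one token per state at t = 0
    for t in range(1, T):
        if skip and t % 2 == 0:
            tokens += log_emit[t]          # no propagation at even stamps
        else:
            # standard token passing: best predecessor for every state
            tokens = np.max(tokens[:, None] + log_trans, axis=0) + log_emit[t]
    return float(np.max(tokens))
```

Halving the number of propagation steps roughly halves the transition-search work per utterance, which is consistent with the reduced CPU load reported above, at the cost of a coarser path search.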
In addition, we examine the number (k) of feature vectors decoded in parallel at each time stamp when the proposed method is combined with the frame skip method, as shown in Figure 7c. The number of feature vectors decoded in parallel varies from 22 to 35. Compared to Figure 6b, more feature vectors are decoded in parallel, owing to the reduced computation during Viterbi decoding when the frame skip method is applied.

Experiments on DNN-VAD
DNN-VAD is used to reduce the time a user waits to receive the ASR results of what they said, by triggering backtracking at a possible pause within the user's utterance. However, frequent backtracking at improper times can degrade the recognition performance. Hence, in the experiments, the minimum interval between two consecutive backtracking points is set to various values. Table 9 shows the segment lengths produced by DNN-VAD and the recognition accuracy with and without DNN-VAD for test set 1. For example, when the minimum interval is limited to 6 s, an utterance is split into 3.6 segments of 8.2 s each, on average, and the word error rate (WER) is 11.13, which is slightly degraded compared to the case in which VAD is not used, where the WER is 11.02.
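One possible reading of the backtracking-point selection is sketched below. The thresholds `pause_th` and `min_pause_frames` are our hypothetical parameters; the paper's DNN-VAD supplies the per-frame speech probabilities, and only the minimum-interval constraint is taken from the text:

```python
def select_backtrack_points(speech_probs, frame_ms=10, pause_th=0.5,
                            min_pause_frames=30, min_interval_s=6.0):
    """Pick backtracking points at short pauses, enforcing a minimum
    interval between two consecutive points.

    speech_probs: per-frame speech probabilities from the VAD.
    Returns the frame indices at which backtracking is triggered.
    """
    min_gap = int(min_interval_s * 1000 / frame_ms)
    points, last, run = [], -min_gap, 0
    for t, p in enumerate(speech_probs):
        run = run + 1 if p < pause_th else 0   # length of the current pause
        # trigger only after a long-enough pause, and only if the minimum
        # interval since the last backtracking point has elapsed
        if run >= min_pause_frames and t - last >= min_gap:
            points.append(t)
            last, run = t, 0
    return points
```

With a larger `min_interval_s`, fewer and longer segments are produced, trading response latency against the risk of backtracking at an improper point.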
As the minimum interval decreases, the number of segments increases and the length of each segment decreases, which means more frequent backtracking and smaller user-perceived latencies. The accuracy degrades only slightly, which indicates that the backtracking points are selected reasonably. An internal investigation confirms that the segments are split mostly at pauses between phrases. To measure the waiting time from the user's viewpoint, the user-perceived latency suggested in Reference [19] is used. The user-perceived latency is measured for each uttered word and estimated empirically as the difference between the timestamp at which a transcribed word becomes available to the user and the timestamp of that word in the original audio. The alignment information in the recognition result is used as the timestamp of a word.
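The per-word latency measurement described above can be expressed as a small helper. This is a sketch under our own assumptions about the input format (word-level alignment times and client-side arrival times on a shared clock), not the instrumentation used in Reference [19]:

```python
def user_perceived_latency(word_alignments, result_arrival_times):
    """Estimate the per-word user-perceived latency: the gap between when
    a transcribed word becomes visible to the user and the word's end
    time in the original audio (taken from the alignment information).

    word_alignments: list of (word, end_time_in_audio_s) pairs.
    result_arrival_times: wall-clock times (s, relative to the start of
    audio capture) at which each word reached the client.
    Returns (per-word latencies, average latency).
    """
    latencies = [arrival - audio_end
                 for (_, audio_end), arrival
                 in zip(word_alignments, result_arrival_times)]
    avg = sum(latencies) / len(latencies)
    return latencies, avg
```

Without DNN-VAD, every word of a segment arrives only after the whole segment is decoded, so the words early in a long segment accumulate a large latency; splitting at pauses bounds this accumulation.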
The average user-perceived latency is 11.71 s for test set 1 without DNN-VAD, which is very long because all results are received only after the end of a segment is sent to the server. When DNN-VAD is applied, the average latency is reduced to 5.41 s with a minibatch of 200 frames and 3.09 s with a minibatch of 100 frames. For a detailed analysis, the histogram of the latency for each word is shown in Figure 8.

Conclusions
In this paper, we presented a server-client-based online ASR system employing a BLSTM AM, a state-of-the-art AM. Accordingly, we adopted a CSC-based training and decoding approach for the BLSTM AM and proposed the following: (a) the parallel acoustic score computation method to support more concurrent clients and (b) DNN-VAD to reduce the time a user waits to receive the recognition results. In the designed server-client-based ASR system, a client captures the audio signal from a user, sends the audio data to the ASR server, receives the decoded text from the server, and presents it to the user. The client can be deployed on various devices, from low to high performance. The server, on the other hand, performs speech recognition using high-performance resources. That is, the server manages a main thread and a decoder thread for each client and an additional worker thread for the proposed parallel acoustic score computation method. The main thread communicates with the connected client, extracts speech features, and buffers them. The decoder thread performs speech recognition and sends the decoded text to the connected client. Speech recognition is performed in three main steps: acoustic score computation using a CSC BLSTM AM, DNN-VAD to detect a short pause in a long continuous utterance, and Viterbi decoding to search for the optimal text using an LM. To handle more concurrent clients in real time, we first proposed the parallel acoustic score computation method, which merges the speech feature vectors collected from multiple clients and computes the acoustic scores on the merged data, thereby reducing the transfer overhead between the CPU and GPU and increasing GPU parallelization. Second, we proposed DNN-VAD, which detects a short pause in an utterance to provide a low-latency response to the user.
The Korean ASR experiments conducted using the broadcast audio data showed that the proposed acoustic score computation method increased the maximum number of concurrent clients from 22 to 44. Furthermore, by applying the frame skip method during Viterbi decoding, the maximum number of concurrent clients was increased to 59, although SyllER was degraded from 11.94% to 12.53%. Moreover, the average user-perceived latencies were reduced to 5.41 and 3.09 s with a minibatch of 200 frames and 100 frames, respectively, when the proposed DNN-VAD was used.

Author Contributions:
The authors discussed the contents of the manuscript. Methodology, validation, writing, editing, and formal analysis, Y.R.O. and K.P.; project administration, J.G.P. All authors have read and agreed to the published version of the manuscript.

Abbreviations
The following abbreviations are used in this manuscript: