Selection of Essential Neural Activity Timesteps for Intracortical Brain–Computer Interface Based on Recurrent Neural Network

Intracortical brain–computer interfaces (iBCIs) translate neural activity into control commands, thereby allowing paralyzed persons to control devices via their brain signals. Recurrent neural networks (RNNs) are widely used as neural decoders because they can learn neural response dynamics from continuous neural activity. Nevertheless, excessively long or short input neural activity for an RNN may decrease its decoding performance. Based on the temporal attention module exploiting relations in features over time, we propose a temporal attention-aware timestep selection (TTS) method that improves the interpretability of the salience of each timestep in an input neural activity. Furthermore, TTS determines the appropriate input neural activity length for accurate neural decoding. Experimental results show that the proposed TTS efficiently selects 28 essential timesteps for RNN-based neural decoders, outperforming state-of-the-art neural decoders on two nonhuman primate datasets (R2 = 0.76 ± 0.05 for monkey Indy and CC = 0.91 ± 0.01 for monkey N). In addition, it reduces the computation time for offline training (by 5–12%) and online prediction (by 16–18%). When visualizing the attention mechanism in TTS, the preparatory neural activity is consecutively highlighted during arm movement, and the most recent neural activity is highlighted during the resting state in nonhuman primates. Selecting only a few essential timesteps for an RNN-based neural decoder provides sufficient decoding performance and requires only a short computation time.


Introduction
Intracortical brain-computer interfaces (iBCIs) aim to improve the daily lives of paralyzed patients by restoring their motor functions [1,2]. An iBCI ascertains the patient's movement intention and generates motor commands for assistive devices, such as computer cursors [3] and the functional electrical stimulation of paralyzed limbs [4]. The iBCI first records neural activity by microelectrode arrays implanted in motor-related brain regions, such as the primary motor cortex. Then, a neural decoder translates neural activity into movement intention or spatial location information. Conventional neural decoding techniques process well-segmented neural activities in previous time windows where task-related information is likely encoded as spiking sequences. Moreover, neural decoding may benefit from continuous neural activity because movement intention often occurs before execution [5]. Therefore, a sequence of spike count vectors from many preceding time windows is usually adopted to capture the neural response dynamics [6] and improve the decoding accuracy [7]. However, an excessively long neural sequence may be polluted by stochastic noise, and a short neural sequence may not contain sufficient information. Furthermore, the length of the neural sequence for a neural decoder presents inter-subject variability and often depends on the task. Thus, the length of the neural sequence, a hyperparameter for decoding, is usually designed manually by trial and error. In the literature, various terminologies are used for the length of a neural sequence, including number of timesteps [8,9], number of taps [6,10], length of the sliding window [4], and time window size [7,11]. We refer to the length of a neural sequence for decoding as the number of timesteps for consistency with most neural decoders and considering its use in neuroscience [12].
Recent advances in recurrent neural networks (RNNs) have led to improved neural decoder designs [9,13] and real-time BCI systems [14]. RNN-based neural decoders learn the neural response dynamics from the neural activity in both the previous and current timesteps (time bins) [6]. These decoders also employ gates to balance the contributions of the current and previous timesteps when learning a neural sequence. Although most RNN hyperparameters have been analyzed and discussed in a large-scale study [15], obtaining the correct number of timesteps for neural decoding remains challenging.
Existing iBCIs often rely on grid search to optimize the number of timesteps. Alternatively, a high number of timesteps can be selected to retain informative timesteps. Table 1 summarizes the various iBCIs that use RNNs as the neural decoders for nonhuman primate models. In most of the studies, the long short-term memory (LSTM) architecture is applied as an RNN for neural decoding. These iBCIs predict the kinematic states based on the neural activity over the previous 2-30 timesteps.

Table 1. Summary of existing iBCIs that use RNNs as neural decoders. The iBCIs were implemented for a nonhuman primate model. The time windows corresponding to the numbers of timesteps are given in parentheses in column #T. In [1], the number of hidden states and timesteps were optimized over a predefined range.

As the number of timesteps varies according to the task and subject, grid search is commonly used to determine the appropriate number of timesteps. Grid search evaluates all the possible numbers of timesteps and selects the number that provides the best fit, as shown in Figure 1a. Grid search evaluates the decoding performance according to the number of timesteps. One possible number of timesteps is designated to a neural decoder, and the model weights of the neural decoder are then optimized. Subsequently, a score is determined to evaluate the effectiveness of the neural decoder for the selected number of timesteps. However, grid search may cause a high computational burden because weight optimization is computationally expensive over all the parameter combinations. Consequently, grid search has low applicability in real-world iBCIs.

Figure 1. Selection of essential timesteps for neural decoding. (a) Conventional method consisting of numerous iterations with high computational cost (R 2 , coefficient of determination).
(b) Proposed temporal attention-aware timestep selection (TTS) method. This one-pass method only requires three steps. First, the neural decoder is trained with the temporal attention module (TAM). Next, essential timesteps (T * ) are selected according to the attention weights. Finally, the neural decoder with a short input sequence of length T * is retrained.
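As a sketch of the grid-search baseline in Figure 1a (the `train_and_score` routine is a hypothetical stand-in for training a decoder with T input timesteps from scratch and returning its validation score), the exhaustive procedure could look like:

```python
def grid_search_timesteps(train_and_score, candidates=range(2, 31)):
    """Exhaustively evaluate every candidate number of timesteps.

    train_and_score: callable that trains a decoder with T input
    timesteps from scratch and returns a validation score (e.g., R^2).
    Each call retrains the full model, which is what makes grid
    search computationally expensive.
    """
    scores = {T: train_and_score(T) for T in candidates}
    best_T = max(scores, key=scores.get)
    return best_T, scores

# Toy usage: pretend longer windows help up to T = 12, then plateau.
best_T, scores = grid_search_timesteps(lambda T: min(T, 12) / 12.0,
                                       candidates=range(2, 21))
```

Every candidate requires a full weight optimization, so the cost grows linearly with the number of candidates, which is the computational burden the TTS avoids.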
Alternatively, a high number of timesteps can be used in a neural decoder [6], assuming that contextual information related to movement intention is likely available over such a long period. To efficiently extract task-related features, RNNs can decode neural activity over several timesteps [6]. However, excessively long neural activity periods cause a high computational burden because an RNN iteratively updates the hidden states at each timestep. The iterative processing in the RNN cannot be implemented in parallel, which increases the computational burden. Moreover, long neural activity periods may introduce stochastic noise, which hinders the decoding performance. Therefore, an adequate number of timesteps can reduce the computational burden and enable real-time iBCIs. However, the lack of meaningful interpretation for each timestep impedes the selection of an adequate number of input timesteps for decoding.
Ensuring that timesteps are interpretable by a neural decoder is essential for selecting an adequate number of input timesteps. In other words, one can easily select essential timesteps according to the interpretable salience of each timestep in a neural sequence. Because the RNN models the temporal dynamics of neural sequences, this architecture is widely used as the basis for neural decoders. In fact, RNNs employ gates to individually process each timestep and balance the contribution of the neural activity in the current and previous timesteps. Nevertheless, the relative importance of the timesteps is difficult to interpret [18]. To improve the interpretability of models, attention modules can be included to highlight salient information [19]. Attention modules have been demonstrated to enhance salient objects in images [20] and localize actions in videos [21]. In particular, a temporal attention module (TAM) highlights the time intervals of event occurrences by exploiting the relations in features over time [22]. The TAM provides a sequence of attention weights in which each element corresponds to a timestep and whose values represent the relevance of the timestep to an event of interest. Furthermore, the TAM highlights salient timesteps relevant to the final decision and ignores noisy timesteps in an input sequence. Therefore, we include the TAM in the RNN-based neural decoder to highlight salient timesteps that contribute to determining the kinematic states for increasing interpretability.
Inspired by the TAM that highlights the time intervals of interest, we propose a temporal attention-aware timestep selection (TTS) method that reduces the design complexity and computational burden in an RNN-based neural decoder. To the best of our knowledge, this is the first method that automatically selects essential timesteps for RNN-based neural decoding. We adopt the TAM to determine the contribution of neural activity at each timestep to the prediction of movement intention, assuming that this contribution varies across timesteps. Determining the relative importance of each timestep in a behavioral task can improve the interpretability of timesteps and facilitate timestep selection for neural decoding.
The main contributions of this study are summarized as follows:
1. We propose a scheme that efficiently determines the essential timesteps for a general RNN-based neural decoder, thus avoiding the time-consuming grid search;
2. We adopt a TAM that interprets the salience of each timestep for predicting movement intention, leading to essential timestep selection;
3. Experimental results reveal that the proposed TTS can determine the essential timesteps for three RNN-based neural decoders while reducing the computation time of offline training and online prediction;
4. The RNN-based neural decoders with a few essential timesteps outperform state-of-the-art neural decoders on two nonhuman primate datasets;
5. The visualization of attention weights demonstrates that only a few neural activity timesteps are emphasized for neural decoding during arm motion and resting.

Behavioral Tasks and Neural Data Collection
In this study, we used publicly available nonhuman primate datasets acquired from monkey Indy [23] and monkey N [24].

Monkey Indy
Monkey Indy (Macaca mulatta) was trained to reach targets randomly appearing on a plane using its left arm, as illustrated in Figure 2a. Monkey Indy successfully reached the target when placing the fingertip on the target for 450 ms. Then, a new target was randomly drawn at any possible location. The fingertip position was acquired by an electromagnetic position sensor, and neural activity was recorded using a 96-channel silicon microelectrode array (Blackrock Microsystems, Salt Lake City, UT, USA), which was chronically implanted in the primary motor cortex (M1). This array was designed to cover the upper-arm and shoulder representation areas. The neuronal activity was sorted using custom software written in C++, where 171-413 units were obtained. The population firing rate was computed at a bin width of 64 ms and served as the input of a neural decoder. The population firing rate, x ∈ R C , had C elements, with C being either the number of channels or single units (single neurons). Data from 37 sessions over 300 days were collected. More details about the corresponding behavioral tasks and neural data collection can be found in [23,25].
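The binning step described above can be sketched as follows (a minimal example; the 64 ms bin width is from the text, the toy spike times are made up):

```python
import numpy as np

def bin_spike_counts(spike_times, duration_s, bin_width_s=0.064):
    """Bin one unit's spike times (in seconds) into counts per 64 ms bin.

    Stacking the counts of all units for one bin yields the population
    firing-rate vector x with C elements used as decoder input.
    """
    n_bins = int(np.ceil(duration_s / bin_width_s))
    edges = np.arange(n_bins + 1) * bin_width_s
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts

# One unit firing at 0.01, 0.05, 0.10, and 0.20 s over a 0.256 s window
counts = bin_spike_counts([0.01, 0.05, 0.10, 0.20], duration_s=0.256)
```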

Monkey N
Monkey N (Macaca mulatta) was trained to perform an instructed delayed reach-to-grasp task, as illustrated in Figure 2b. Monkey N grasped an object using either a side grip or a precision grip and then pulled the object against one of two possible loads by applying either a high or low force. The horizontal displacement of the object was measured using a Hall effect sensor. A 96-channel Utah electrode array (Blackrock Microsystems, Salt Lake City, UT, USA) was chronically implanted to cover the primary motor cortex (M1) and dorsal or ventral premotor cortex. The neuronal activity was sorted using the Plexon Offline Sorter (Plexon, Dallas, TX, USA), where 156 units were obtained. As the trials contained large portions of preparatory delay intervals, they were segmented to contain only the motion execution intervals. Nevertheless, only the data from one session are publicly available. More details about the corresponding behavioral tasks and neural data collection can be found in [24].



Temporal Attention-Aware Timestep Selection for RNN-Based Neural Decoder
We aimed to efficiently determine the essential timesteps for an RNN-based neural decoder. Figure 1b illustrates the proposed TTS. Most kinematic information is assumed to be encoded over a few essential timesteps in a long neural sequence. First, a long neural sequence is fed into the neural decoder with a TAM, which learns the attention weights that provide the meaningful salience of each timestep. The salient timesteps are likely related to movement intention. Then, a statistical analysis enables the determination of the number of timesteps required for the neural decoder. Next, the neural decoder is modified by the selected timesteps and retrained for a few epochs. Finally, a neural decoder with the adequate number of input timesteps is obtained. We share our code and models at https://github.com/nclab-me-ncku/Temporal_Attention_LSTM (accessed on 1 September 2020) [26].

RNN-Based Neural Decoder
For an RNN-based neural decoder, let the input sequence x = {x t−T+1 , · · · , x t } contain the t-th observation of neural activity and the previous T timesteps of neural activity. For ease of notation and without loss of generality, we refer to this input sequence as x = {x 1 , · · · , x T }, where x τ ∈ R C represents neural activity with C elements (population firing rate) at the τ-th timestep.
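Building the input sequences from binned firing rates can be sketched with a sliding window (a minimal example; the firing-rate matrix `rates` and its sizes are assumptions):

```python
import numpy as np

def make_input_sequences(rates, T):
    """Stack T consecutive spike-count vectors for each observation.

    rates: array of shape (n_bins, C), one population firing-rate
    vector per time bin.  Returns an array of shape
    (n_bins - T + 1, T, C): one sequence x = {x_1, ..., x_T} per sample.
    """
    n_bins, C = rates.shape
    idx = np.arange(T)[None, :] + np.arange(n_bins - T + 1)[:, None]
    return rates[idx]

# 5 bins, 2 channels, T = 3 -> 3 overlapping sequences
rates = np.arange(10).reshape(5, 2)
X = make_input_sequences(rates, T=3)
```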
An RNN is adopted to obtain a hidden representation embedded in the observed neural activity. The hidden representation consists of sequence h 1 , · · · , h T of length T, where h τ ∈ R M denotes a hidden state at the τ-th timestep. Once a bidirectional RNN is implemented, the hidden state consists of both forward hidden state → h τ ∈ R M and backward hidden state ← h τ ∈ R M . Mathematically, the hidden state can be computed as follows:

h τ = f (x τ , h τ−1 ), (1)

where f (·) represents the RNN model. The RNN can be the vanilla RNN [27], LSTM [28], or gated recurrent unit (GRU) [29], which have been widely adopted for neural decoding [4,6,7].
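A minimal PyTorch sketch of computing the per-timestep hidden states (the channel count, hidden size, and sequence length are illustrative assumptions):

```python
import torch
import torch.nn as nn

C, M, T = 96, 64, 10            # channels, hidden size, timesteps (assumed)
x = torch.randn(1, T, C)        # one input sequence x = {x_1, ..., x_T}

rnn = nn.GRU(input_size=C, hidden_size=M, batch_first=True)
h, _ = rnn(x)                   # h: (1, T, M), one hidden state per timestep

# A bidirectional RNN concatenates forward and backward hidden states,
# so each h_tau has 2M elements.
birnn = nn.GRU(input_size=C, hidden_size=M, batch_first=True,
               bidirectional=True)
h_bi, _ = birnn(x)              # h_bi: (1, T, 2*M)
```

Swapping `nn.GRU` for `nn.RNN` or `nn.LSTM` yields the other two decoder families mentioned above.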

Temporal Attention Module
The TAM aims to determine the relative importance of each neural activity timestep. The relative importance is then used to select the essential timesteps for neural decoding. To model the relative importance, attention weight a τ is estimated as follows:

e τ = v ⊤ ReLU(W h τ + b), (2)

a τ = exp(e τ ) / Σ k=1..T exp(e k ), (3)

where W, b, v, and ReLU denote a linear transformation matrix, bias term, linear transformation vector, and rectified linear unit, respectively. Equation (3) represents a softmax function that normalizes attention weights in [0,1]. The TAM aggregates the hidden states of all timesteps according to the attention weights to form the following hidden representation:

h = Σ τ=1..T a τ h τ . (4)

This learned hidden representation is then fed into a fully connected layer to predict a kinematic state as follows:

ŷ = d ⊤ h, (5)

where d denotes a linear transformation vector and ŷ can represent position, velocity, or acceleration.
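A sketch of the TAM as a PyTorch module (the attention layer size is an assumption; the scoring and pooling follow the description above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Score each timestep's hidden state and aggregate them.

    Scores each hidden state with v^T ReLU(W h_tau + b), normalizes the
    scores with a softmax over time, and pools the hidden states with
    the resulting attention weights.
    """
    def __init__(self, hidden_size, attn_size=32):
        super().__init__()
        self.W = nn.Linear(hidden_size, attn_size)    # W h_tau + b
        self.v = nn.Linear(attn_size, 1, bias=False)  # v^T (.)

    def forward(self, h):                 # h: (batch, T, hidden)
        e = self.v(F.relu(self.W(h)))     # (batch, T, 1) scores
        a = torch.softmax(e, dim=1)       # attention weights over timesteps
        pooled = (a * h).sum(dim=1)       # (batch, hidden) aggregation
        return pooled, a.squeeze(-1)

tam = TemporalAttention(hidden_size=64)
h = torch.randn(2, 10, 64)
pooled, a = tam(h)                        # weights per sequence sum to 1
```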

Essential Timestep Selection for Neural Decoding
Our aim is to select only essential consecutive timesteps for the RNN-based neural decoder. As the attention weights reflect the relevance of the neural activity information in the corresponding timesteps, the cumulative attention weights determine the importance of the consecutive timesteps for neural decoding. For a high cumulative value, the corresponding consecutive timesteps are suitable for the RNN-based neural decoder. In this case, the neural decoder can accurately predict the kinematic state from the neural activity observed over these few consecutive timesteps while reducing the computation time. Subsequently, the most recent T * consecutive timesteps are selected for neural decoding, where T * ≤ T. Figure 3 schematically illustrates the essential timestep selection process using attention weights.
The value of T * is determined in three steps by using the cumulative percentage of the attention weights. First, as the neural decoder predicts kinematic state ŷ t based on the t-th observation, the corresponding attention weights are a t = [a t 1 , a t 2 , . . . , a t T ] ⊤ . Second, the average attention weight at timestep τ over all the observations (i.e., over clock time) is determined as follows:

ā τ = (1/N) Σ t=1..N a t τ ,

where N is the number of observations. Then, the average attention weights are obtained as ā = [ā 1 , ā 2 , . . . , ā T ] ⊤ . Third, T * is selected as the minimum length that makes the cumulative percentage of the most recent average attention weights, Σ τ=T−T * +1..T ā τ , to be above a predefined threshold. We select this threshold to be 95% based on principal component analysis [30], where the first few principal components represent over 95% of the total variation in a signal. Accordingly, we aim to retain the timesteps that represent 95% of the attention weights. In these timesteps, the corresponding neural activity likely encodes the most relevant information about a kinematic state.
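The three steps can be sketched as follows (the attention matrix `A` is a hypothetical stand-in for the weights collected over N observations; accumulation runs from the most recent timestep backwards, matching the selection of the most recent T * timesteps):

```python
import numpy as np

def select_T_star(A, threshold=0.95):
    """Select the most recent T* timesteps from attention weights.

    A: (N, T) array, one row of attention weights a^t per observation.
    Averages the weights over observations, then accumulates them from
    the most recent timestep backwards until `threshold` of the total
    attention is covered.
    """
    a_bar = A.mean(axis=0)                    # average weight per timestep
    a_bar = a_bar / a_bar.sum()               # cumulative *percentage*
    cumulative = np.cumsum(a_bar[::-1])       # most recent timestep first
    return int(np.searchsorted(cumulative, threshold) + 1)

# N = 2 observations, T = 6; attention concentrated on recent timesteps
A = np.array([[0.02, 0.02, 0.06, 0.20, 0.30, 0.40],
              [0.02, 0.02, 0.06, 0.20, 0.30, 0.40]])
T_star = select_T_star(A)                     # last 4 timesteps cover 95%
```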

Neural Decoder Retraining with Short Input Sequence
Once T * is obtained, the input sequence is modified to x * = {x T−T * +1 , · · · , x T−1 , x T }, which has a shorter length T * (T * < T). The RNN-based neural decoder is then retrained using the modified input sequence. Note that the TAM is no longer used in the retrained neural decoder. The kinematic state is predicted as follows:

ŷ = d ⊤ h T , (9)

where h T denotes the hidden state at timestep T. Whereas (5) aggregates the hidden states at all timesteps to predict the kinematic state, (9) only uses the hidden state at the final timestep. We observed no difference between the two methods in terms of decoding performance. However, most existing RNN-based neural decoders adopt (9) because its computational complexity is lower than that of (5). Similarly, we adopt (9) for the neural decoding with short input sequences.
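A sketch of the retrained decoder, which drops the TAM and predicts from the final hidden state only (layer sizes and the 2-D velocity output are assumptions):

```python
import torch
import torch.nn as nn

class ShortSequenceDecoder(nn.Module):
    """GRU decoder that reads only the last T* timesteps and predicts
    the kinematic state from the final hidden state."""
    def __init__(self, n_channels, hidden_size, n_outputs=2):
        super().__init__()
        self.rnn = nn.GRU(n_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_outputs)  # d^T h_T

    def forward(self, x_star):            # x_star: (batch, T*, C)
        h, _ = self.rnn(x_star)
        return self.head(h[:, -1])        # use only the final timestep

decoder = ShortSequenceDecoder(n_channels=96, hidden_size=64)
y_hat = decoder(torch.randn(4, 5, 96))   # e.g., predicted 2-D velocity
```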

Optimization
The neural decoder aims to predict a kinematic state from neural activity. We use the squared L2 norm (mean squared error) as the loss function:

L = (1/|B|) Σ (x,y)∈B ∥y − ŷ∥ 2 ,

where y and ŷ denote the actual and predicted kinematic states, respectively, and B denotes the mini-batch. To optimize the model weights, we adopt Adam [31] with a learning rate of 10 −4 . The coefficients used for computing the running averages of the gradient and their squares are 0.9 and 0.999, respectively, to achieve a high generalization ability to various RNNs.
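In PyTorch these optimization settings correspond to the following sketch (the stand-in model and tensor sizes are assumptions; only the loss and Adam configuration come from the text):

```python
import torch
import torch.nn as nn

# Stand-in decoder: any of the RNN-based decoders could be used here.
decoder = nn.Sequential(nn.Flatten(), nn.Linear(10 * 96, 2))
criterion = nn.MSELoss()                        # squared L2 norm
optimizer = torch.optim.Adam(decoder.parameters(),
                             lr=1e-4, betas=(0.9, 0.999))

x = torch.randn(8, 10, 96)                      # mini-batch B of sequences
y = torch.randn(8, 2)                           # actual kinematic states

optimizer.zero_grad()
loss = criterion(decoder(x), y)                 # mean squared error
loss.backward()                                 # gradients for Adam
optimizer.step()                                # one weight update
```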

Performance Evaluation and Statistical Evaluation
We evaluated the decoding performance using the coefficient of determination (R 2 ) [25] and Pearson's correlation coefficient (CC) [8]. The coefficient of determination measures the goodness of fit of a neural decoder as follows:

R 2 = 1 − Σ t (y t − ŷ t ) 2 / (N Var(y)),

where Var(·) is the variance. Pearson's correlation coefficient measures the linear correlation between the actual and predicted kinematic states as follows:

CC = σ yŷ / (σ y σ ŷ ),

where σ yŷ represents the covariance between y and ŷ, and σ y and σ ŷ represent the standard deviations of y and ŷ, respectively. For monkey Indy, the proposed TTS was compared with the state-of-the-art method in [25] that uses the coefficient of determination. For monkey N, the proposed TTS was compared with the state-of-the-art method in [8] that uses Pearson's correlation coefficient.
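Both metrics can be computed directly (a sketch on a toy 1-D signal; the arrays are made up for illustration):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / (N * Var(y))."""
    sse = np.sum((y - y_hat) ** 2)
    return 1.0 - sse / (len(y) * np.var(y))

def pearson_cc(y, y_hat):
    """Pearson's correlation: cov(y, y_hat) / (std(y) * std(y_hat))."""
    y, y_hat = y - y.mean(), y_hat - y_hat.mean()
    return np.sum(y * y_hat) / np.sqrt(np.sum(y ** 2) * np.sum(y_hat ** 2))

y = np.array([0.0, 1.0, 2.0, 3.0])       # actual kinematic states
y_hat = np.array([0.1, 0.9, 2.1, 2.9])   # predicted kinematic states
r2 = r_squared(y, y_hat)
cc = pearson_cc(y, y_hat)
```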
For rigorous statistical evaluation, the Shapiro-Wilk test was first performed to assess whether the observations were generated from a normal distribution because it is suggested to be the most powerful normality test [32]. The Friedman test is a non-parametric equivalent of the repeated-measures analysis of variance (ANOVA) and does not assume that the observations are normally distributed. Therefore, the Friedman ANOVA test, followed by a post hoc Wilcoxon signed-rank test, was applied to assess the effect of various lengths T in the timestep selection. The significance levels were set at * p < 0.05, ** p < 0.01, and *** p < 0.001. This statistical evaluation has been widely adopted in the neuroscience community [33]. Furthermore, the Friedman ANOVA test, followed by a post hoc Wilcoxon signed-rank test, was applied to assess the decoding performance compared with state-of-the-art methods.
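This statistical pipeline can be sketched with SciPy (the scores are toy data; each array holds one condition's repeated measures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy decoding scores for 10 runs under three numbers of timesteps.
scores = {10: rng.normal(0.70, 0.01, 10),
          15: rng.normal(0.74, 0.01, 10),
          20: rng.normal(0.76, 0.01, 10)}

# 1) Shapiro-Wilk normality check per condition.
normal = all(stats.shapiro(s).pvalue > 0.05 for s in scores.values())

# 2) Friedman test across the three repeated-measures conditions.
stat, p = stats.friedmanchisquare(scores[10], scores[15], scores[20])

# 3) Post hoc pairwise Wilcoxon signed-rank test (one example pair).
w_stat, w_p = stats.wilcoxon(scores[10], scores[20])
```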

Implementation Details
We implemented three RNN-based neural decoders that are widely used in existing iBCIs: vanilla RNN, LSTM, and GRU. Vanilla RNN predicts the output according to both the input sequence and hidden states. However, it suffers from the vanishing gradient problem during training. LSTM avoids this problem by introducing a forget gate that controls the influence of short- and long-term dependencies. GRU further simplifies the gates to reduce the computational cost. Table 2 lists the settings for the three neural decoders. The neural decoders predicted the velocity of the hand movement for both monkey Indy and monkey N in this study. As training involved random weight initialization, each architecture was executed in five independent runs. The experiments were implemented using the PyTorch library and executed on a computer equipped with an NVIDIA GeForce GTX 1080 Ti GPU. The training and testing protocols for the monkey Indy and monkey N data were the same as those used in [25] and [8], respectively. For monkey Indy, the first 5000 samples of each session were used to establish the training set, and the remaining samples were used as the test set. The neural decoder was independently trained during each session. For monkey N, the data, excluding large portions of preparatory delay intervals, were split into ten folds, where eight folds (1178 samples), one fold (147 samples), and one fold (148 samples) established the training, validation, and test sets, respectively.
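The splitting protocols can be sketched as follows (sample counts come from the text; the shuffling, seed, and exact fold boundaries are assumptions, so `np.array_split` fold sizes may differ by a sample from the counts above):

```python
import numpy as np

def split_monkey_indy(n_samples):
    """First 5000 samples of a session train; the rest test."""
    idx = np.arange(n_samples)
    return idx[:5000], idx[5000:]

def split_monkey_n(n_samples=1473, seed=0):
    """Ten folds: eight train, one validation, one test (approximate
    fold sizes; the paper reports 1178/147/148 samples)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 10)
    train = np.concatenate(folds[:8])
    return train, folds[8], folds[9]

train_i, test_i = split_monkey_indy(8000)
train_n, val_n, test_n = split_monkey_n()
```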

Timestep Selection for Neural Decoders
Given an input neural sequence of length T, the TTS highlighted the neural activities and selected the essential timesteps T * for the RNN-based neural decoder. We quantitatively evaluated the proposed TTS for three RNN-based neural decoders and the effects of various lengths T.
For the quantitative evaluation of the timestep selection, Figure 4 shows the selected timesteps T * of the three RNN-based neural decoders for T = 10, 15, 20 using the monkey Indy data in [23] and the monkey N data in [24]. For monkey Indy, Figure 4a shows that each RNN-based neural decoder selected fewer than seven timesteps. For the vanilla RNN, significantly fewer timesteps (T * = 3.00 ± 0.71) were selected from T = 20 than from T = 10 and 15 (analyzed by the Friedman ANOVA test, followed by the Wilcoxon signed-rank test). For the LSTM, significantly fewer timesteps (T * = 6.70 ± 0.57) were selected from T = 10 than from T = 15 and 20 (analyzed by the Friedman ANOVA test, followed by the Wilcoxon signed-rank test). For the GRU, significantly fewer timesteps (T * = 4.30 ± 0.76) were selected from T = 20 than from T = 10 and 15 (analyzed by the Friedman ANOVA test, followed by the Wilcoxon signed-rank test). There was a significant difference among the groups T = 10, 15, 20 for the vanilla RNN (p < 0.001, Friedman ANOVA test), LSTM (p < 0.001, Friedman ANOVA test) and GRU (p < 0.001, Friedman ANOVA test).
For monkey N, Figure 4b shows that the proposed TTS selected fewer than three timesteps for the vanilla RNN. For the vanilla RNN, significantly fewer timesteps (T * = 2.00 ± 0) were selected from T = 10 than from T = 15 and 20 (analyzed by the Friedman ANOVA test, followed by the Wilcoxon signed-rank test). For the LSTM, significantly fewer timesteps (T * = 9.17 ± 1.04) were selected from T = 10 than from T = 15 and 20 (analyzed by the Friedman ANOVA test, followed by the Wilcoxon signed-rank test). For the GRU, significantly fewer timesteps (T * = 7.96 ± 0.73) were selected from T = 10 than from T = 15 and 20 (analyzed by the Friedman ANOVA test, followed by the Wilcoxon signed-rank test). There was a significant difference among the groups T = 10, 15, 20 for the vanilla RNN (p < 0.001, Friedman ANOVA test), LSTM (p < 0.001, Friedman ANOVA test) and GRU (p < 0.001, Friedman ANOVA test).
Our proposed TTS was compared to Bayesian optimization [34], which is widely used to search effective timesteps for neural decoding. Figure 4 shows that our TTS selected significantly fewer timesteps for the vanilla RNN and GRU than Bayesian optimization (p < 0.001 for T = 10, 15, 20, Friedman ANOVA test, followed by the Wilcoxon signed-rank test). For the LSTM, Bayesian optimization selected fewer timesteps than the TTS (p < 0.001 for T = 15, 20, Friedman ANOVA test, followed by the Wilcoxon signed-rank test). However, the LSTM optimized by Bayesian optimization achieved a lower decoding performance than that optimized by the TTS (see Section 4.4). The TTS was effective in selecting essential timesteps for neural decoders.

Timestep Selection across Multiple Sessions
In Section 4.2, we quantitatively evaluated the effectiveness of TTS for monkey Indy. As the neural activity of monkey Indy was collected across 37 sessions over 300 days, we also evaluated the variation of T * over the sessions. This evaluation could not be conducted for monkey N because only one of its sessions is publicly available. Figure 5 shows T * across the 37 sessions for each RNN-based neural decoder. T * varied slightly across the sessions. Given an input neural sequence of length T = 10, the timestep selection for the vanilla RNN was more stable than that for the LSTM and GRU across all 37 sessions. Even within a single session, T * varied slightly across independent runs, as indicated by the T * distributions defined by the shaded areas of Figure 5. This finding suggests that T * may be affected by the random initial weights of the neural decoder in each independent run. For T = 20, the variation of T * was larger than that for T = 10 in the RNN-based neural decoders. Furthermore, both the GRU and LSTM obtained a more stable T * than the vanilla RNN, which differed from the case of T = 10.

Comparison with State-of-the-Art Methods
When the number of timesteps is reduced to T *, the decoding performance may be affected. Thus, we evaluated the effect of timestep selection on three RNN-based neural decoders. Furthermore, we compared the decoding performance with state-of-the-art methods, including the recurrent exponential-family harmonium (rEFH) [25] for the monkey Indy data and the entire spiking activity-driven quasi-RNN (ESA-driven QRNN) [8] for monkey N. Figure 6a-c shows that when the number of timesteps was reduced from T to T *, all the neural decoders reduced their performance for monkey Indy. However, the decoding performances of both the LSTM and GRU (R2 = 0.74 ± 0.05 for the LSTM and R2 = 0.76 ± 0.05 for the GRU) were significantly better than that achieved by the rEFH in [25] (p < 0.001 for T = 10, 15, 20, Friedman ANOVA test, followed by the Wilcoxon signed-rank test). The vanilla RNN with T * timesteps achieved a lower decoding performance than the rEFH.

For monkey N, the decoding performance of the three RNN-based neural decoders did not decrease when the number of timesteps was reduced to T *, as shown in Figure 6d-f. Both the vanilla RNN and LSTM with T * timesteps (CC = 0.91 ± 0.01 for the vanilla RNN and CC = 0.91 ± 0.01 for the LSTM) significantly outperformed the ESA-driven QRNN [8] (p < 0.001 for T = 10, 15, 20, Friedman ANOVA test, followed by the Wilcoxon signed-rank test). The GRU also achieved significantly better decoding performance (CC = 0.89 ± 0.02) than the ESA-driven QRNN (p < 0.001 for T = 10, p < 0.05 for T = 15, and p < 0.01 for T = 20, Friedman ANOVA test, followed by the Wilcoxon signed-rank test).
We further compared the decoding performance of the neural decoders whose timesteps were optimized by the Bayesian optimization [34] and the TTS. Figure 6 shows that both vanilla RNN and LSTM optimized by the Bayesian optimization achieved lower decoding performance than those optimized by the TTS (p < 0.001 for T = 10, 15, 20, Friedman ANOVA test, followed by the Wilcoxon signed-rank test). The GRU, optimized by the Bayesian optimization, achieved a decoding performance comparable with that achieved by the TTS. However, the Bayesian optimization selected longer timesteps than the TTS for the GRU, as shown in Figure 4. The TTS could not only select fewer timesteps but also facilitate a higher decoding performance than the Bayesian optimization.
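The two performance metrics reported above, the coefficient of determination (R2) for monkey Indy and the correlation coefficient (CC) for monkey N, can be computed as sketched below; the kinematic traces are toy signals, not the recorded data:

```python
# Illustrative computation of the R2 and CC decoding metrics on a
# one-dimensional toy kinematic trace.
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R2 = 1 - SSE / SST."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between true and decoded traces."""
    return np.corrcoef(y_true, y_pred)[0, 1]

# Toy "true" velocity and a noisy "decoded" estimate of it.
t = np.linspace(0, 4 * np.pi, 200)
y_true = np.sin(t)
y_pred = np.sin(t) + 0.1 * np.random.default_rng(1).normal(size=t.size)

r2 = r_squared(y_true, y_pred)
cc = pearson_cc(y_true, y_pred)
print(round(r2, 3), round(cc, 3))
```

For multi-dimensional kinematics (e.g., x and y velocity), both metrics are typically computed per dimension and then averaged.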

Reduced Computation Time
We also evaluated the computational efficiency of the proposed TTS by measuring the computation time listed in Table 3. For a fair comparison, the computation time needed to train a neural decoder for one epoch during the training phase and that needed to predict one kinematic state during the testing phase are reported. Table 3 shows that the proposed TTS reduced the computation time for all the RNN-based neural decoders. The vanilla RNN and LSTM could reduce the computation time by at least 16% for online prediction, possibly enabling real-time iBCI operation. Furthermore, the computation time could be reduced by 5-12% for offline training. The time duration of the timestep selection achieved by our TTS was compared to that of the Bayesian optimization [34], a conventional optimization approach. The rightmost two columns of Table 3 show that the TTS could save 48-81% of the time duration of the timestep selection compared with the Bayesian optimization. This demonstrates the computational efficiency of our TTS method.

Table 3. Computation time of neural decoders using neural activity with T and T * timesteps. The efficiency is given by the computation time (ms) required to complete one epoch of offline training and one online prediction. The reduced numbers of timesteps T * of the vanilla RNN, LSTM, and GRU were obtained from Figure 4, and the corresponding computation time was obtained over 37 sessions. The reduction in the computation time (in percentage) compared to that of T = 10 is shown in parentheses. D BO and D TTS represent the time duration (in seconds) of the timestep selection achieved by the Bayesian optimization and our TTS, respectively. The reduction of the time duration (in percentage) compared to that of the Bayesian optimization is shown in parentheses.
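The per-epoch training time and per-prediction time reported in Table 3 can be measured along the following lines; the decoder here is a toy linear stand-in, not the RNN models used in the experiments:

```python
# Hedged sketch of wall-clock timing for one prediction, with a toy
# linear "decoder" standing in for the actual RNN-based models.
import time
import numpy as np

def time_call(fn, *args, repeats=100):
    """Return the mean wall-clock time of fn(*args) in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - t0) * 1e3 / repeats

rng = np.random.default_rng(0)
W = rng.normal(size=(96, 2))          # 96 channels -> 2-D velocity readout

def predict(x):                        # x: (T, 96) spike-count sequence
    return x @ W

x_full = rng.poisson(2.0, size=(20, 96)).astype(float)  # T = 20 timesteps
x_reduced = x_full[-3:]                                 # keep 3 recent steps
ms_full = time_call(predict, x_full)
ms_reduced = time_call(predict, x_reduced)
print(f"{ms_full:.4f} ms vs {ms_reduced:.4f} ms per prediction")
```

The same timer wrapped around a full pass over the training set gives the per-epoch training time; averaging over many repeats reduces scheduler jitter.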

Visualization of Attention Weights in TTS
The attention weights of the TAM in TTS were visualized across all observations, as shown in Figure 7. Three RNN-based neural decoders were implemented using neural activity over 15 timesteps (i.e., T = 15) for the monkey Indy data [23] and monkey N data [24]. Different attention weights were obtained from the vanilla RNN, LSTM, and GRU. Furthermore, the attention weights also differed between behavioral tasks.
The attention weights implicitly reveal many line segments with a positive slope and various lengths. Figure 3a illustrates the meaning of the line segments. At the first time bin, the TAM highlights the neural activity at timestep T, which is the most recently observed neural activity. At the second time bin, the TAM highlights the neural activity at timestep T − 1, which represents identical neural activity to that highlighted at the first time bin. The top panel of Figure 3a shows that the TAM highlights identical neural activity to that observed in the first time bin (the leftmost part in purple in the bottom panel) across the first three time bins. The attention weights with large values could be fitted with a straight-line segment. This suggests that, although new neural activity is observed, the neural decoder prefers to decode the leftmost neural activity rather than the most recently observed neural activity. A longer line segment indicates that neural activity within a time window is successively highlighted over a longer period. Thus, the highlighted neural activity is likely important over this period for neural decoding. This further implies that most highlighted timesteps are potentially related to the kinematic state and are thus selected as essential timesteps for neural decoding.
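A minimal sketch of a temporal attention module of the kind used in TTS is given below; the additive scoring function and dimensions are illustrative assumptions, and the exact TAM architecture may differ:

```python
# Toy temporal attention over RNN hidden states: one saliency score per
# timestep, normalized by softmax, then used to mix the states.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def temporal_attention(h, w, b, v):
    """h: (T, d) RNN hidden states, one per timestep.

    Returns the attended context vector and the per-timestep attention
    weights (the quantities visualized as maps in Figure 7).
    """
    scores = np.tanh(h @ w + b) @ v   # (T,) one saliency score per timestep
    alpha = softmax(scores)           # attention weights, sum to 1
    context = alpha @ h               # (d,) attention-weighted hidden state
    return context, alpha

rng = np.random.default_rng(0)
T, d = 15, 32                         # e.g., T = 15 input timesteps
h = rng.normal(size=(T, d))
w = rng.normal(size=(d, d)) / np.sqrt(d)
b = rng.normal(size=d)
v = rng.normal(size=d)
context, alpha = temporal_attention(h, w, b, v)
print(alpha.shape, float(alpha.sum()))
```

Because the weights alpha are non-negative and sum to one, plotting them per time bin yields exactly the kind of column-wise saliency map discussed in this section.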
For the vanilla RNN-based neural decoder, multiple timesteps are highlighted at each time bin, as shown in Figure 7a (i.e., numerous attention weights are large in each column of the middle panel) for monkey Indy. Hence, the vanilla RNN relies on multiple previous neural activity data for neural decoding. The vanilla RNN dynamically requires shorter and longer timesteps for neural decoding, as indicated by the green and blue arrows in Figure 7a, respectively. For monkey N, the preparatory neural activity observed at the leftmost time bin in the green-shaded area in Figure 7d is consecutively highlighted during the preparatory period and the arm acceleration period, where monkey N was instructed to grasp and pull an object. In addition to the preparatory neural activity, the two most recent timesteps are highlighted simultaneously; however, only the most recent timestep is highlighted during the arm deceleration period, as indicated by the yellow-shaded areas, where monkey N was allowed to release the object. In the LSTM-based neural decoder, the attention weights have more regular patterns compared with those of the vanilla RNN-based neural decoder for monkey Indy.
Figure 7b shows that approximately five timesteps are highlighted at each time bin. When monkey Indy performed the reaching movement, the attention weights implicitly revealed thick line segments, as indicated by the green-shaded areas. According to the illustration of the line segments in Figure 3a, the thick line segment suggests that the neural activity observed at multiple time bins are consecutively highlighted and are important for neural decoding over this reaching period. In other words, the preparatory neural activity that occurred before the movement onset is consecutively highlighted during the arm movement. In contrast, the neural activity observed at the previous one timestep was highlighted during the resting state, as indicated by the yellow-shaded areas in Figure 7b.
For monkey N, the neural activity observed during the early preparatory period (as indicated by the pink arrows) was consecutively highlighted during the preparatory period, as indicated by the green-shaded areas in Figure 7e. The attention weights implicitly revealed a thinner line segment than that for monkey Indy. This suggests that neural decoding of the instructed delayed reach-to-grasp task highlights fewer timesteps compared with the neural decoding of the reaching task, as can be seen from Figure 7b. The values of the attention weights during the arm acceleration period are small (i.e., the attention weights at all timesteps are almost white), as indicated by the orange arrows. This suggests that the LSTM-based neural decoder treats the neural activity at all timesteps equally, which makes it difficult for TTS to select fewer essential timesteps for neural decoding and leads to the high number of selected timesteps shown in Figure 4b. During the arm deceleration period, the neural activity observed at the most recent timestep is highlighted, as indicated by the yellow-shaded areas in Figure 7e.
For the GRU-based neural decoder, the most recently observed neural activity was highlighted during the resting state for monkey Indy, as indicated by the yellow-shaded areas in Figure 7c. During reaching, the neural activity observed earlier than the movement onset was consecutively highlighted over the movement period, as indicated by the green-shaded areas. The attention weights implicitly revealed thinner line segments than those in the LSTM-based neural decoder, suggesting that the GRU highlights fewer timesteps compared with the LSTM at each time bin for neural decoding. For monkey N, a few recently observed neural activities were highlighted during the preparatory and arm acceleration periods, as indicated by the green-shaded areas in Figure 7f. Compared with the LSTM, the GRU highlighted the neural activity observed at the four most recent timesteps rather than those at farther timesteps (i.e., the early preparatory period indicated by the green arrows in Figure 7e). Similar to the vanilla RNN and LSTM, the GRU highlighted the most recent neural activity during the arm deceleration period (release phase).

Timestep Selection for Varying Recording Conditions
For monkey Indy, T * differed slightly with T, which suggests that T * may be mildly affected by T for the three RNN-based neural decoders; however, these T * values varied within a small range, with T * ∈ [2,4] for the vanilla RNN, T * ∈ [5,7] for the LSTM and T * ∈ [3,6] for the GRU. Overall, the proposed TTS selected essential timesteps even when the input neural sequences covered long periods.
For monkey N, the TTS selected more than seven timesteps for the LSTM and GRU. This result is consistent with that of monkey Indy, where the vanilla RNN required fewer timesteps than the LSTM and GRU. The numbers of timesteps selected for the LSTM and GRU were strongly sensitive to T.
Neural recordings are nonstationary [4]. For instance, a slight movement of the electrodes or changes in the electrode impedance can alter neural recordings [1,35]. These changes affect the mapping between the neural activity and kinematic states, which in turn modify the required timesteps for neural decoding. Regardless of the nonstationary nature of the neural recordings, the proposed TTS can select the essential timesteps, imbuing the neural decoder with robustness against varying recording conditions.

Computational Efficiency and Comparable Decoding Performance
It is known that a long neural sequence may introduce stochastic noise that decreases the decoding performance. Therefore, a neural decoder may benefit from an appropriate number of timesteps. The proposed TTS not only reduces the number of timesteps for neural decoding but also preserves the decoding performance, which remains comparable to that of state-of-the-art methods.
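One plausible way to reduce the number of timesteps from averaged attention weights, offered purely as an illustration rather than the authors' exact selection criterion, is to keep the smallest set of recent timesteps whose cumulative attention mass exceeds a threshold:

```python
# Hypothetical T* selection rule: keep the shortest suffix of timesteps
# whose attention weights account for a given fraction of the total mass.
import numpy as np

def select_t_star(mean_alpha, mass=0.9):
    """mean_alpha: (T,) attention weights averaged over time bins, with
    index T-1 being the most recent timestep. Returns the smallest suffix
    length whose weights sum to at least `mass` of the total."""
    total = mean_alpha.sum()
    cum = np.cumsum(mean_alpha[::-1])  # accumulate from most recent backwards
    return int(np.searchsorted(cum, mass * total) + 1)

# Toy averaged weights: recent timesteps carry most of the attention.
alpha = np.array([0.01, 0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.25, 0.30])
t_star = select_t_star(alpha, mass=0.9)
print(t_star)
```

With the toy weights above, the six most recent timesteps already carry at least 90% of the attention mass, so T* = 6 out of T = 10.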
The proposed TTS could save the computation time of iBCI operation. The reduced computational burden for training is important for the clinical application of iBCIs. In fact, neural decoders typically require daily calibration for long-term use owing to variations in the recording conditions, such as micro-motion of electrodes or changes in an electrode's impedance. The reduced calibration time provides a substantial benefit for the user of being able to operate the iBCI without long waiting periods. Moreover, a short calibration time reduces the power consumption, facilitating clinical application in portable devices.

Interpretation of Attention Weights in TTS
The attention weights of the TAM in TTS indicate whether the neural activity at a given timestep is important for neural decoding. For the vanilla RNN-based neural decoder, the patterns of the attention weights for monkey Indy and monkey N differ. Monkey Indy performed the reaching task in a horizontal plane, whereas monkey N performed the instructed delayed reach-to-grasp task. This suggests that different behavioral tasks might result in different patterns of the attention weights. For the instructed delayed reach-to-grasp task, all findings from the three RNN-based neural decoders suggest that the preparatory neural activity is essential during the preparatory period and the execution of the movement, whereas the most recent neural activity is essential during the release phase. This finding is consistent with observations made in [5,36-38], where preparatory neural activity served as an initial condition for subsequent activity patterns. Figure 4 reveals that the number of timesteps selected by the TTS is sensitive to T for the LSTM- and GRU-based neural decoders; assigning T for robust timestep selection would be a challenging task. Furthermore, Figure 6 reveals that the vanilla RNN-based neural decoder with reduced timesteps achieved a lower decoding performance than that of state-of-the-art methods, so our TTS fails to select essential timesteps for the vanilla RNN-based neural decoder. How to select timesteps that are robust to T and effective for all neural decoders is left for future work.

Conclusions
In this paper, we proposed TTS to select a few essential timesteps for RNN-based neural decoders while reducing the adverse effects of stochastic noise embedded in long neural sequences. The proposed TTS incorporates a TAM to estimate the saliency and relative importance of the input neural activity at each timestep. When an iBCI learns the functional mapping between the neural activity and kinematic states, the TAM is expected to learn to highlight the essential timesteps that contribute to neural decoding. The attention mechanism improves the interpretability of the information in the timesteps. Experimental results demonstrate that the proposed TTS selects a few essential timesteps from a long neural sequence to accurately predict the kinematic states. Using the proposed TTS, RNN-based neural decoders outperformed state-of-the-art methods on two nonhuman primate datasets. Furthermore, the TTS reduced the computation time for offline training and online prediction in the three RNN-based neural decoders. When visualizing the attention weights of the TAM in TTS, the preparatory neural activity observed before the movement onset is consecutively highlighted over the movement period, whereas the most recent neural activity is highlighted during the resting state. Experimental results also revealed that the number of essential timesteps varied over long recording days owing to changes in the recording conditions. Potential future work may incorporate an adaptive mechanism into the proposed TTS to handle changes in the recording conditions for the long-term use of iBCI.