Language Model Using Neural Turing Machine Based on Localized Content-Based Addressing

Abstract: The performance of long short-term memory (LSTM) recurrent neural network (RNN)-based language models has steadily improved on language model benchmarks. Although the recurrent layer is widely used, previous studies have shown that an LSTM RNN-based language model (LM) cannot overcome the limitation on context length. To train LMs on longer sequences, attention mechanism-based models have recently been used. In this paper, we propose an LM using a neural Turing machine (NTM) architecture based on localized content-based addressing (LCA). The NTM architecture is one of the attention-based models. However, the NTM has a problem with content-based addressing: all memory addresses must be accessed to calculate the cosine similarities. To address this problem, we propose an LCA method. The LCA method searches for the maximum of the cosine similarities generated from all memory addresses. Next, a specific memory area including the selected memory address is normalized with the softmax function. The LCA method is applied to a pre-trained NTM-based LM during the test stage. The proposed architecture is evaluated on the Penn Treebank and enwik8 LM tasks. The experimental results indicate that the proposed approach outperforms the previous NTM architecture.


Introduction
A language model (LM) estimates the probability of the current word given the previous word sequence. For a word sequence W = (w_1, w_2, ..., w_N), the probability of the LM is denoted as P(W):

P(W) = P(w_1, w_2, ..., w_N), (1)

where N is the length of W. As the number of words in the word history (w_1, w_2, ..., w_{i−1}) increases, it becomes increasingly difficult to estimate the probability of the current word w_i, because the full word history will not appear in the text corpus. For this reason, the Markov assumption is applied to the LM to compute P(w_i | w_1, ..., w_{i−1}); the length of the word sequence that affects w_i is then limited to (n − 1). The conventional approach to modeling the LM is the n-gram, but it has two problems: the unseen word sequence problem and a limitation on the length of the word history. To address these problems, an LM using a deep neural network (DNN) was proposed to model word sequences [1]. It can predict the probability of an unseen word sequence through the high-dimensional hyperplane of the DNN. However, the DNN-based LM cannot overcome the limitation on the length of the word history.
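As an illustration of the n-gram approach and its unseen word sequence problem, a minimal maximum-likelihood bigram estimator might look as follows. This is a toy sketch, not from the paper; note that any history absent from the corpus receives no probability at all, which is exactly the problem the DNN-based LM addresses.

```python
from collections import Counter

def ngram_probs(corpus, n=2):
    # Maximum-likelihood n-gram estimates: P(w_i | history) =
    # count(history, w_i) / count(history). Toy sketch with no smoothing,
    # so unseen word sequences still get zero probability.
    grams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    hists = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 1))
    return {g: grams[g] / hists[g[:-1]] for g in grams}
```

For example, on the corpus "the cat sat on the mat", the history "the" is followed by "cat" and "mat" with probability 0.5 each.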
To overcome this limitation, a recurrent neural network (RNN) was used to model longer word sequences [2]. In the RNN-based LM, a recurrent hidden layer performs the role of

Related Works
Language modeling has two important problems: (1) the unseen word sequence problem and (2) the limitation on the length of the word history. To solve the unseen word sequence problem, a DNN was used in the LM [1]. In DNN-based LMs, a semantic embedding vector that expresses the semantic information of 8–10 words was used as the input of the DNN to improve model performance [21]. In addition, a method for optimizing the computation in the softmax layer was proposed [22]. However, the DNN-based LM is still limited in the length of the word history: the size of the input layer grows with the number of words in the context.
The RNN was applied to the LM to train on longer word histories [2]. In RNN-based LMs, one word is used as the input, and the output of the recurrent hidden layer at time (t − 1) is also fed back as input. The recurrent hidden layer therefore maintains information about longer word sequences, which relieves the limitation on the length of the word history. Discriminative training [23], class-based clustering [24], and noise-contrastive estimation (NCE) [25] were used in RNN LMs to speed up the softmax-layer computation, because the dimension of the output layer equals the vocabulary size and the number of weight parameters of the RNN-based LM is larger than that of the DNN-based LM. Despite the recurrent hidden layer, the RNN-based LM is difficult to train on longer word sequences owing to gradient vanishing and explosion [26]: the back-propagated error gradients shrink toward zero or grow toward infinity.
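The recurrence described above can be sketched as a single Elman-style RNN step; the weights and dimensions here are hypothetical toy values, not the paper's configuration.

```python
import math

def rnn_step(x, h_prev, Wx, Wh):
    # One Elman-RNN step: h_t = tanh(Wx @ x_t + Wh @ h_{t-1}).
    # Weights are toy matrices given as lists of rows; the recurrent term
    # Wh @ h_{t-1} is what carries information from earlier words forward.
    def matvec(W, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]
    return [math.tanh(a + b) for a, b in zip(matvec(Wx, x), matvec(Wh, h_prev))]
```

Feeding a sequence through repeated calls leaves a trace of every earlier input in the hidden state, which is how the RNN-based LM maintains a longer word history.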
To solve the gradient vanishing problem, the LSTM was proposed [3]. The LSTM is a hidden node structure that consists of one or more memory cells, an input gate, an output gate, and a forget gate. In [27], LMs based on the RNN and the LSTM RNN were compared. The LSTM RNN-based LM achieved an absolute improvement of 0.3% in word error rate (WER) over the RNN-based LM on a 12 M-word English Broadcast News transcription task. Although gradient clipping [28] mitigates gradient explosion, the LSTM still has a context-length issue: the LSTM RNN-based LM cannot handle word or character sequences longer than about 200 tokens [5].
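Gradient clipping, mentioned above as a remedy for gradient explosion, can be sketched as a simple norm rescaling. This is a minimal illustration on a flat gradient vector; real frameworks clip whole parameter groups at once.

```python
import math

def clip_gradient(grad, max_norm=5.0):
    # Rescale a gradient vector so its L2 norm is at most max_norm.
    # Gradients already inside the ball are returned unchanged.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad
```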
To train LMs on longer word or character sequences, the attention mechanism has been used. The attention mechanism is an effective method for selecting important information from longer sequences [29]. In an LM, the attention mechanism lets the model learn how to attend to different blocks of the input sequence [6]. One of the most widely used attention-based models is the Transformer [7], which outperforms LSTMs on LM tasks [9]. The Transformer is an encoder-decoder model that relies on multi-head self-attention [30] and positional encoding [7]. Multi-head attention allows the LM to attend to information in different vector-space representations at different positions of a sequence. To provide position information to the Transformer, positional encoding is used to capture long-range dependency from the sequence order. However, the Transformer encodes longer context information into a fixed-size sequence [10]. In [31], the training dataset was split into shorter chunks that were then used as the input of the Transformer. A drawback of this method is that fixed-length chunks lead to the context fragmentation problem [10].
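The core attention computation described above — softmax-normalized similarities used to weight value vectors — can be sketched as follows. This is a single-head, pure-Python illustration, not the full multi-head Transformer layer.

```python
import math

def attention(query, keys, values):
    # Dot-product attention: the query is scored against each key, the
    # scores are softmax-normalized, and the weights mix the value vectors.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights
```

A query closest to the first key draws most of its output from the first value vector, which is the "selecting important information" behavior the section describes.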
Recently, to solve the context fragmentation problem, Transformer-XL [10], GPT-2 [11], and BERT [12] have been proposed and widely used in various tasks. The Transformer-XL uses the hidden state computed at the previous time step as the previous context information for the current chunk. This chunk-level recurrence allows the Transformer-XL to maintain long-term dependency and mitigate the context fragmentation problem. The Transformer-XL achieved a perplexity of 54.52 on the word-level PTB dataset and 0.99 bits-per-character (BPC) on the character-level enwik8 dataset [10]. GPT-2 is a multi-layer decoder of the Transformer. In [11], the pre-trained GPT-2 consisted of 12 decoder blocks, each with 768 hidden nodes and 12 heads in the multi-head self-attention layer. On LM tasks, GPT-2 achieved a better BPC of 0.93 with 1542 M weight parameters, compared with the result of the Transformer-XL. BERT is a multi-layer bidirectional encoder of the Transformer, proposed for pre-training deep bidirectional vector-space representations by considering left and right context information in all layers. In [12], BERT showed the best performance on NLP tasks such as question answering and named entity recognition.
Another deep learning model based on the attention mechanism is the NTM architecture [8]. The NTM architecture is analogous to the von Neumann machine, with a controller that interacts with an external memory through an attention mechanism. The controller is a deep learning model, and the external memory is a set of M-dimensional real-valued vectors. In [8], experiments showed that the NTM architecture with an LSTM-based controller is capable of learning simple algorithms such as copy, associative recall, and priority sort. In addition, the MDM-NTM architecture improved the read and write operations of the vanilla NTM architecture [14]. In the MDM-NTM architecture, the memory attention mechanism decides where information is stored in the external memory and maintains the order of sequences through a temporal link. In experiments, the MDM-NTM architecture outperformed the vanilla NTM architecture on the bAbI question-answering task, graph traversal, and block puzzle problems. Recently, LMs based on the NTM architecture have been proposed. In [15], the NTM architecture was first used for the LM task and achieved a perplexity of 98.6 on the word-level PTB dataset. It performed better than DNN- and LSTM-based LMs, but worse than the Transformer-XL-based LM. On the character-level PTB corpus, the MDM-NTM architecture using a highway network-based controller achieved a BPC of 1.147 [17], better than the trellis network (BPC 1.158) [32], the AWD-LSTM network (BPC 1.169) [33], and the vanilla Transformer (BPC 1.227) [34].
Despite the success of LMs based on the NTM architecture, content-based addressing in the NTM architecture must access all external memory addresses to calculate the cosine similarities and perform softmax normalization [20]. To address this issue, content-based addressing should select only the external memory addresses related to the input sequence. In this paper, we propose localized content-based addressing (LCA). We describe the NTM architecture in detail in Section 3 and the proposed LCA-NTM architecture in Section 4.

Neural Turing Machine
As shown in Figure 1, the NTM architecture consists of a controller and an external memory [14,18]. The controller is a deep learning model F. The external memory EM is an element of R^{N×M}, where N is the number of M-dimensional real-valued vectors. If the controller F performs no read or write operation on the external memory, the NTM architecture reduces to the topology of the deep learning model. Assuming the number of read vectors is R, the R read vectors r_t^i (i = 1, 2, ..., R) are generated by the read operation at time t, and each has dimension M. The input x_t and the R read vectors r_{t−1}^i generated by the read operations at time (t − 1) are concatenated and used as the input of F. The controller emits an interface vector ξ_t and a controller output vector o_t; ξ_t mediates the interaction between F and EM. After the read operation is performed, the R read vectors r_t^i generated at time t are used as the input of a deep learning model G. The output vector G(r_t^1, ..., r_t^R) has the same dimension as o_t. Then, o_t is added to G(r_t^1, ..., r_t^R) and projected into an output vector y_t. As in other deep learning model topologies, y_t is the final output, and its dimension is the same as that of the target vector.
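The data flow of one NTM time step described above can be sketched structurally. All callables here (`controller`, `read_op`, `write_op`, `G`, `project`) are hypothetical placeholders standing in for trained modules, not the paper's implementation.

```python
def ntm_step(x_t, reads_prev, controller, read_op, write_op, G, project):
    # One NTM time step, following the description above:
    # the controller sees the input concatenated with the previous read
    # vectors and emits (o_t, xi_t); xi_t drives the memory operations;
    # the new reads pass through G, are added to o_t, and are projected.
    ctrl_in = x_t + [v for r in reads_prev for v in r]  # concatenation
    o_t, xi_t = controller(ctrl_in)
    write_op(xi_t)           # update the external memory via xi_t
    reads_t = read_op(xi_t)  # R new read vectors r_t^i
    y_t = project([a + b for a, b in zip(o_t, G(reads_t))])
    return y_t, reads_t
```

The returned `reads_t` is fed back as `reads_prev` at the next step, giving the controller access to memory contents across time.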

Content-Based Addressing and Location-Based Addressing
To generate the read and write weighting vectors, content-based addressing and location-based addressing are performed sequentially in the NTM architecture [14]. The interface vector ξ_t is used for these addressing methods. ξ_t consists of the following elements: k_t, ks_t, g_t, s_t, γ_t, e_t, and v_t. Here, k_t ∈ R^{M×1} is the key vector at time t; ks_t ∈ R is the key strength; g_t ∈ R is the interpolation factor; s_t ∈ R^{3×1} is the shift vector; γ_t ∈ R is the sharpening factor; e_t ∈ R^{M×1} is the erase vector; and v_t ∈ R^{M×1} is the converted input vector at time t.
For content-based addressing, a cosine similarity is measured between k_t and the vector of each external memory address, and the similarities are normalized with the softmax function. The value of content-based addressing for the i-th memory address at time t is

CA_t[i] = exp(ks_t · CS(k_t, EM[i, ·])) / Σ_{j=1}^{N} exp(ks_t · CS(k_t, EM[j, ·])),

where exp is the exponential function, CS is the cosine similarity function, and EM[i, ·] is the M-dimensional real-valued vector of the i-th external memory address. After content-based addressing, location-based addressing is performed. CA_t is interpolated with ω_{t−1} using the scalar g_t, where ω_{t−1} is the weighting vector generated at time (t − 1). The interpolated weighting ω_t^{ip} is then shifted with s_t and sharpened with γ_t. Separate sets of k_t, ks_t, g_t, s_t, and γ_t are assigned to each read and write weighting. In the NTM architecture, the numbers of read and write weightings are R and one, respectively; the dimension of the interface vector ξ_t is therefore determined by the R read weightings and the one write weighting.
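A minimal sketch of the content-based addressing step above, with the external memory as a list of M-dimensional rows. This is a toy pure-Python illustration; the `1e-8` stabilizer in the cosine is an assumption of this sketch, not specified in the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity CS(u, v); the 1e-8 term is an assumed numerical
    # stabilizer against zero-norm rows.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def content_addressing(key, key_strength, memory):
    # CA_t[i] = softmax over i of ks_t * CS(k_t, EM[i, .]),
    # computed over all N memory addresses.
    scores = [key_strength * cosine(key, row) for row in memory]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Note that every address contributes to the denominator, which is exactly the O(N) cost that the LCA method in Section 4 localizes.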

Read and Write Operations
For the read operation, the NTM architecture attends to the specific external memory area related to the input vector and generates read vectors. A read vector is an M-dimensional real-valued vector, defined as the weighted summation over all vectors of the external memory. Specifically, the i-th read vector at time t is defined as

r_t^i = (EM_t)^T ω_t^{r,i},

where (EM_t)^T is the transposed external memory at time t, and ω_t^{r,i} is the i-th N-dimensional read weighting vector at time t. The read weighting vector is the attention vector generated by the addressing procedure of Section 3.1.
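The read operation above reduces to a weighted sum over memory rows; a minimal sketch:

```python
def read_vector(memory, read_weighting):
    # r_t^i = (EM_t)^T . w_t^{r,i}: each of the N memory rows contributes
    # in proportion to its attention weight, yielding an M-dim read vector.
    n_cols = len(memory[0])
    return [sum(w * row[j] for w, row in zip(read_weighting, memory))
            for j in range(n_cols)]
```

A one-hot weighting reads a single row exactly; a softer weighting blends rows.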
For the write operation, the NTM architecture generates a write weighting vector to determine the external memory addresses and stores the input vector at the selected memory addresses. We define the write operation as

EM_t = EM_{t−1} ∘ (OM − ω_t^w (e_t)^T) + ω_t^w (v_t)^T,

where EM_t and EM_{t−1} are the external memories at times t and (t − 1), respectively; ∘ is the element-wise product; OM is a matrix of the same size as the external memory with all elements equal to 1; (e_t)^T is the transposed erase vector at time t, which determines the ratio at which information stored in the external memory is erased; (v_t)^T is the transposed converted input vector at time t; and ω_t^w is the N-dimensional write weighting vector at time t. The write weighting vector is the attention vector generated by the addressing procedure of Section 3.1.
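The erase-then-add write operation above can be sketched row by row (toy pure-Python illustration):

```python
def write_memory(memory, write_weighting, erase, add):
    # EM_t = EM_{t-1} o (OM - w_t^w e_t^T) + w_t^w v_t^T:
    # row i is first scaled down element-wise by w[i] * erase[j],
    # then w[i] * add[j] is written into it. Rows with w[i] = 0 are
    # left untouched.
    return [[row[j] * (1.0 - w * erase[j]) + w * add[j]
             for j in range(len(row))]
            for row, w in zip(memory, write_weighting)]
```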

Neural Turing Machine Using Localized Content-Based Addressing
Vanilla content-based addressing, as described in Section 3.1, must access all memory addresses to calculate the cosine similarities and uses all of these similarities to generate the attention vector. Consequently, the attention vector can include weights for memory addresses that are not related to the input vector. These unnecessary memory areas are then reflected in the read and write operations on the external memory, which results in performance deterioration.
Hence, we introduce the NTM architecture using LCA (LCA-NTM). As shown in Figure 2, cosine similarities are calculated between the key vector k_t and each external memory address. Subsequently, the LCA selects the memory address with the maximum cosine similarity. This is the main difference from vanilla content-based addressing, which does not select the maximum cosine similarity. If all the cosine similarities are negative, vanilla content-based addressing is used instead. After finding the maximum cosine similarity, softmax normalization is performed over the d memory addresses on each side of the selected memory address to generate the content-based addressing vector CA_t; CA_t[i] is set to 0 wherever softmax is not performed. Algorithm 1 describes the proposed LCA procedure in detail.
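The LCA procedure just described can be sketched as follows; this is an illustration of the described steps, not the authors' exact Algorithm 1.

```python
import math

def localized_content_addressing(similarities, d):
    # LCA sketch: find the address with the maximum cosine similarity,
    # softmax-normalize only the window of d addresses on each side of it,
    # and set every other weight to 0. When all similarities are negative,
    # fall back to vanilla (full) content-based addressing, as in the text.
    n = len(similarities)
    peak = max(similarities)
    if peak < 0:
        window = range(n)  # vanilla content-based addressing
    else:
        c = similarities.index(peak)  # the first maximum wins on ties
        window = range(max(0, c - d), min(n, c + d + 1))
    exps = {i: math.exp(similarities[i]) for i in window}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(n)]
```

The max search is a single O(N) scan and the softmax denominator covers only (2d + 1) addresses, matching the complexity analysis in this section.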
The proposed NTM architecture attends to the specific memory area selected by the LCA, whose result is the content-based addressing vector. Vanilla location-based addressing is then applied to this content-based addressing vector, producing the final weighting vector. The proposed LCA method is applied to the pre-trained NTM-based LM during the test stage.
The i-th read vector at time t is defined as

r_t^i = (EM_t)^T ω_t^{LCA-r,i},

where ω_t^{LCA-r,i} is the i-th N-dimensional read weighting vector at time t. The proposed NTM architecture stores the converted input vector at the memory addresses selected by the write weighting vector. We define the write operation in the proposed LCA-NTM architecture as

EM_t = EM_{t−1} ∘ (OM − ω_t^{LCA-w} (e_t)^T) + ω_t^{LCA-w} (v_t)^T,

where ω_t^{LCA-w} is the N-dimensional write weighting vector at time t. The LCA searches for the maximum cosine similarity. The similarities do not need to be sorted for this search; a deterministic algorithm must examine every value once to find the maximum, which takes O(N), where N is the number of vectors in the external memory. The time complexity of the softmax normalization used in the LCA is O(2d + 1), because its denominator depends only on the selected memory area of (2d + 1) addresses. Therefore, the overall time complexity of the LCA is O(N + 2d + 1).

Experiments and Discussion
To evaluate the proposed LCA-NTM architecture, the vanilla NTM-based LM was trained, and we evaluated the pre-trained NTM-based LM using the LCA method. We trained and tested all LMs on the character-level PTB LM task [16] and the enwik8 LM task [35]. We compared the proposed NTM architecture with state-of-the-art LMs and the MDM-NTM architecture.

Experimental Environment
The PTB dataset is composed of sentences collected from the Wall Street Journal news domain. The character-level PTB dataset was used in the experiments. However, the character-level PTB dataset does not mark spaces between words, which makes it difficult to recognize a specific word from the character sequence. Hence, a space marker (-space-) and a beginning-of-sentence marker (-bos-) were added to the character-level PTB dataset, bringing the total number of characters used in the experiments to 50. The character-level PTB dataset contained 4.88, 0.38, and 0.43 M characters in the training, validation, and test sets, respectively. The PTB LM task experiment was repeated five times to verify the stability of the hyper-parameters across the different LMs and to test their generalization.
The enwik8 dataset contains 100 M characters of unprocessed Wikipedia text, with 206 distinct characters in total. Following previous studies, we split the enwik8 dataset into 90, 5, and 5 M characters for the training, validation, and test sets, respectively. The enwik8 LM task experiment was repeated three times.
We used a 3.40 GHz Intel Xeon E5-2643 v4 CPU and four Nvidia GTX 1080 Ti GPUs. We used two evaluation metrics: BPC and training time. BPC is the average number of bits required to encode one character [31], where a bit is the unit of entropy; we define BPC as loss/log(2), i.e., the cross-entropy loss in nats divided by the natural logarithm of 2. To evaluate the inference time of each model, we measured the inference time per batch.
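The BPC definition above amounts to a unit conversion from nats to bits; a minimal sketch:

```python
import math

def bits_per_character(nat_loss):
    # BPC = loss / log(2): average cross-entropy per character, converted
    # from nats (natural log) to bits.
    return nat_loss / math.log(2)
```

For instance, a model that assigns uniform probability over 206 characters has a per-character loss of ln(206) nats, i.e., log2(206) ≈ 7.69 BPC.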

Experimental Results in Character-Level Penn Treebank LM Task
The LSTM RNN is the baseline LM in the experiments. We trained the baseline LM in PyTorch with the following hyper-parameters: the number of nodes in the embedding layer was 50, the number of hidden layers was 3, the dimension of each hidden layer was 1024, the learning rate was initialized at 1 × 10^−1, the number of epochs was 300, the batch size was 6, the weight decay was 1 × 10^−6, and the length of back-propagation through time (BPTT) was 120. For training the Transformer-based LM, we used the following hyper-parameters (we used the following open-source code to train the LSTM RNN and Transformer-based LMs: https://github.com/pytorch/examples/blob/master/word_language_model): the number of nodes in the embedding layer was 50, the number of heads in the encoder and decoder was 4, the number of hidden layers was 3, the dimension of each hidden layer was 1024, the learning rate was initialized at 1 × 10^−3, the number of epochs was 300, the batch size was 6, the weight decay was 1 × 10^−6, and the length of the input chunks was 120.
Moreover, we compared the MDM-NTM-based LM with the trellis and AWD-LSTM networks. The hyper-parameters of the trellis and AWD-LSTM networks were the same as those used in previous studies. However, the number of batches was 6 and the BPTT length was 120 for the experiments. In addition, we only applied a dropout factor to the hidden layers, and not the embedding, input, or output layers.
To train the MDM-NTM-based LM, we used the LSTM RNN as the controller. The following hyper-parameters were used for the experiments: the number of nodes in the embedding layer was 50, the number of hidden layers was 3, and the dimensions of the hidden layers were 1024, 512, and 512. We used an external memory consisting of 1024 vectors of 512 dimensions each. The learning rate was initialized at 1 × 10^−3 and reduced on the plateau of the objective function by a factor of 1 × 10^−1. The number of epochs was 300, the batch size was 6, the weight decay was 1 × 10^−7, and the character sequence length was 120. Table 1 shows the evaluation results of MDM-NTM-based LMs on the PTB LM task. All MDM-NTM-based LMs had a faster inference time than the trellis network. We measured the performance of the MDM-NTM architecture according to the number of read vectors, doubling the number of read vectors to evaluate their effect. The MDM-NTM-based LM using a single read vector demonstrated a higher performance, with a BPC of 1.5986, than the MDM-NTM-based LMs using two and four read vectors. Each BPC sample of the experimental results on the PTB LM task is shown in Table A1. The analysis of the performance of the MDM-NTM-based LM according to the number of read vectors led to three important findings. (1) The number of weight parameters increased with the number of read vectors. To generate the read vectors, the controller outputs the key vectors; the number of key vectors is the same as the number of read vectors, and each key vector is M-dimensional, where M is the dimension of a vector in the external memory. Furthermore, the number of read vectors affects the dimension of the input layer in the controller, because the read vectors generated at time (t − 1) are used as the input to the controller. (2) BPC increased with the number of read vectors.
When the MDM-NTM-based LM used one to four read vectors, the performance decreased, implying that the PTB LM task was too small to train all the weight parameters of the MDM-NTM-based LM with two or more read vectors. (3) The inference time was not proportional to the number of read vectors. We assumed that a larger number of read vectors would require a longer inference time; however, the MDM-NTM-based LM using two read vectors showed a faster inference time than that using one read vector.
We evaluated the performance of the MDM-NTM architecture according to the weight decay. Weight decay reduces model overfitting by imposing increasingly large penalties as the weight parameters grow [36]. To implement weight decay, for weight parameters W, the penalty term (1/2)λWW^T is added to the loss function; as λ increases, an increasingly large penalty is imposed on W. We used two λ values, 1 × 10^−5 and 1 × 10^−7, in the experiments. Table 1 presents the evaluation results of the MDM-NTM architecture according to weight decay. When we used λ = 1 × 10^−5 for the weight decay, the MDM-NTM-based LMs using one and two read vectors demonstrated BPCs of 1.6998 and 1.7341, respectively. The performance of the MDM-NTM-based LM using λ = 1 × 10^−5 for the weight decay was lower than that using 1 × 10^−7.
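The weight-decay penalty described above can be sketched for a flattened weight vector. This is a minimal illustration; frameworks usually fold the equivalent gradient term directly into the optimizer update.

```python
def l2_penalty(weights, lam):
    # Weight-decay term (1/2) * lambda * ||W||^2 added to the loss;
    # a larger lambda imposes a larger penalty on large weights.
    return 0.5 * lam * sum(w * w for w in weights)
```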
Two important findings were observed after analyzing the performance of the MDM-NTM-based LM according to the weight decay. (1) BPC increased with the weight decay. When λ used in the MDM-NTM-based LM with a single read vector was raised from 1 × 10^−7 to 1 × 10^−5, the performance degraded, that is, BPC increased from 1.5986 to 1.6998. When λ is extremely high, the model is trained to underfit; when λ is extremely low, the model is trained to overfit. Therefore, the MDM-NTM-based LM using λ = 1 × 10^−5 for the weight decay underfitted in the experiments. (2) The inference time was not proportional to λ. We assumed that the inference time would be the same for any value of λ, because the number of weight parameters does not change; however, the experimental results demonstrated that the inference times of the models differed.
We measured the performance of the MDM-NTM architecture according to the number of vectors in the external memory; Table 1 presents the evaluation results. When we used 1024 as the number of vectors in the external memory, the MDM-NTM-based LM demonstrated the highest performance, with a BPC of 1.5986. The inference time decreased when the number of vectors in the external memory decreased.
Three important findings were observed regarding the performance of the MDM-NTM-based LM according to the number of vectors in the external memory. (1) The number of weight parameters remained the same even when the number of vectors in the external memory decreased. All MDM-NTM-based LMs used the same controller, and all vectors in the external memory had the same dimension; the number of weight parameters depends on the controller and the dimension of the vectors in the external memory, not on the number of vectors. (2) BPC was not proportional to the number of vectors in the external memory. We assumed that the MDM-NTM-based LM would demonstrate the highest performance when the number of vectors in the external memory was 120, because the length of the character sequence was limited and the LM could predict the next character without additional vectors in the external memory. However, the MDM-NTM-based LM using an external memory of 129 vectors exhibited the same BPC result as that using an external memory of 1024 vectors. (3) The inference time was proportional to the number of vectors in the external memory. The time complexity of content-based addressing is O(N) because of the denominator of the softmax normalization; hence, the time spent on softmax normalization influenced the inference time.
We applied the proposed LCA mechanism to the pre-trained MDM-NTM architecture during the evaluation stage; the pre-trained MDM-NTM architecture that showed the highest performance in Table 1 was used. As shown in Table 2, when the number of selected memory addresses was 257, the LCA-NTM-based LM achieved a BPC of 1.5648, a higher performance than that of the MDM-NTM-based LM. Two important findings were observed regarding the performance of the LCA-NTM-based LM. (1) The BPC of the proposed architecture was higher than that of the MDM-NTM-based LM, except when the number of selected memory addresses was 257. We analyzed the errors with respect to the cosine similarities and discovered that negative values or values close to zero existed alongside positive values in the cosine similarities of the selected memory area. In the LCA, these cosine similarities were used to generate the attention vector, so the proposed NTM architecture yielded worse results than the MDM-NTM-based LM. Furthermore, although many of the cosine similarity values were approximately unity, the maximum cosine similarity was always selected in the LCA; if two maximum cosine similarities exist, the LCA selects only the first one. These drawbacks led to the performance degradation of the proposed LCA-NTM architecture.
(2) The inference time of the proposed NTM architecture in the test stage was twice that of the MDM-NTM-based LM. To obtain the maximum cosine similarity, we used a search algorithm for the LCA. The time complexity of the previous content-based addressing was O(N) because the denominator of the softmax normalization had to be computed, whereas the time complexity of the LCA was O(N + 2d + 1).

Experimental Results in enwik8 LM Task
We compared the results achieved by the proposed LCA-NTM architecture with those of the Transformer [10]. For the baseline LM, we also used the previously reported result of the LSTM RNN-based LM [37]. For training the MDM-NTM-based LM, we used the LSTM RNN as the controller. The dimension of the hidden layers was 1024, and the number of hidden layers was 4. We used an external memory consisting of 128 vectors of 256 dimensions each. In addition, the batch size was 20. We could not train or evaluate a large-scale MDM-NTM-based LM because the Nvidia GTX 1080 Ti GPU has an 11-GB memory capacity, and the batch size would have been too small for training a large-scale MDM-NTM-based LM.
We evaluated the performance of the MDM-NTM architecture according to the number of read vectors; Table 3 shows the evaluation results. When we used four read vectors, the MDM-NTM-based LM demonstrated the best BPC of 1.3922. The inference time increased as the number of read vectors increased. Each BPC sample of the experimental results on the enwik8 LM task is shown in Table A2. The analysis of the performance of the MDM-NTM-based LM according to the number of read vectors led to two important findings. (1) BPC decreased with the number of read vectors: when the MDM-NTM-based LM used one to four read vectors, the performance improved.
(2) The inference time was proportional to the number of read vectors. As assumed, a larger number of read vectors required a longer inference time: the MDM-NTM-based LM using one read vector showed a faster inference time than those using two or more read vectors.
Furthermore, we applied the proposed LCA mechanism to the pre-trained MDM-NTM architecture during the test stage; the pre-trained MDM-NTM architecture that showed the highest performance in Table 3 was used. As shown in Table 4, when the number of selected memory addresses was 97, the LCA-NTM-based LM achieved a BPC of 1.3887, a higher performance than that of the MDM-NTM-based LM. An important finding was observed regarding the performance of the LCA-NTM-based LM: the BPC of the proposed architecture was higher than that of the MDM-NTM-based LM, except when the number of selected memory addresses was 97. The performance improvement was less significant than that observed on the PTB LM task. We analyzed the errors with respect to the cosine similarities and discovered that more negative values existed than in the PTB LM task; consequently, the proposed NTM architecture often applied vanilla content-based addressing rather than the LCA. These drawbacks limited the performance improvement of the proposed LCA-NTM architecture. Table 4. Evaluation results of the LCA-NTM-based LM according to the number of selected memory addresses on the enwik8 LM task (nWP, number of weight parameters; nRV, number of read vectors; nVEM, number of vectors in the external memory; nSVEM, number of selected vectors in the external memory; WD, weight decay; IT, inference time (ms/batch); µ, mean of BPC results; σ, standard deviation of BPC results).

Conclusions and Future Work
We presented an LM using the LCA-NTM architecture. The LCA method selects the memory address with the maximum cosine similarity; this differs from vanilla content-based addressing, which does not search for the maximum cosine similarity. The specific memory area including the selected memory address is then normalized with the softmax function. On the PTB LM task, when the number of selected memory addresses was 257, the LCA-NTM-based LM achieved a BPC of 1.5648. On the enwik8 LM task, when the number of selected memory addresses was 97, the LCA-NTM-based LM achieved a BPC of 1.3887. These results indicate that the proposed approach outperformed the MDM-NTM-based LM.
In future work, we intend to modify the LCA-NTM architecture to select multiple addresses according to the cosine similarity. In addition, we will implement methods to improve the inference time of the LCA-NTM architecture. We will also evaluate the LCA-NTM architecture on web-scale LM tasks such as WikiText-103 and One Billion Word.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: