A Curriculum Batching Strategy for Automatic ICD Coding with Deep Multi-Label Classification Models

The International Classification of Diseases (ICD) has an important role in building applications for clinical medicine. Extremely large ICD coding label sets and imbalanced label distribution bring the problem of inconsistency between the local batch data distribution and the global training data distribution into the minibatch gradient descent (MBGD)-based training procedure for deep multi-label classification models for automatic ICD coding. The problem further leads to an overfitting issue. In order to improve the performance and generalization ability of the deep learning automatic ICD coding model, we proposed a simple and effective curriculum batching strategy in this paper for improving the MBGD-based training procedure. This strategy generates three batch sets offline through applying three predefined sampling algorithms. These batch sets satisfy a uniform data distribution, a shuffling data distribution and the original training data distribution, respectively, and the learning tasks corresponding to these batch sets range from simple to complex. Experiments show that, after replacing the original shuffling algorithm-based batching strategy with the proposed curriculum batching strategy, the performance of the three investigated deep multi-label classification models for automatic ICD coding all have dramatic improvements. At the same time, the models avoid the overfitting issue and all show better ability to learn the long-tailed label information. The performance is also better than a SOTA label set reconstruction model.


Introduction
The International Classification of Diseases (ICD) is a healthcare classification system initiated by the World Health Organization [1]. It is mainly used for uniformly coding diseases, symptoms, processes, injuries and other features contained in electronic medical records (EMRs) [2]. The coding results are widely used for epidemiological studies [3], the billing and reimbursement for medical services [4] and diagnostic information retrieval [5].
The ICD coding task requires professionals to master both medical knowledge and the ICD coding system. The coding process is time-consuming and laborious. In addition, different professionals may have different understandings of the complex ICD coding system. This may also lead to inconsistent coding results. Consequently, automatic ICD coding based on machine learning techniques has become an interesting research area in recent years [6,7].
Automatic ICD coding is usually regarded as a multi-label classification problem. The performance of automatic ICD coding is directly affected by extremely large label sets and imbalanced label distribution [8,9]. Recently, researchers have challenged these issues, mainly from two perspectives. One is reconstructing the label sets (ReLS) through designing different label-combining strategies [7,8,10]. The new label sets consist of multigranularity labels. Another is building multi-level label hierarchical classifiers based on hierarchical joint learning mechanisms [9,11]. The main idea of these methods is to reduce the label space dimension and alleviate the imbalanced label distribution issue.
Deep multi-label classification models are widely used for automatic ICD coding, and the minibatch gradient descent (MBGD) is a widely used optimization algorithm for training the models [7]. In each iteration during a training procedure, the gradient updates of the model parameters are approximated, based on the examples in one batch. The studies show that a larger batch size results in a lower frequency of gradient updates and better gradient update approximation, but it may hurt generalization and result in finding poorer local optima [12].
In this paper, we further find that, during training, deep multi-label classification models for automatic ICD coding, which split the training data into batches with a simple but commonly used shuffling algorithm in MBGD, always result in the local data distribution of the batches (or the local batch data distribution) being inconsistent with the global training data distribution (see Figure 1). This phenomenon causes the local statistics of batches to vary substantially from the global statistics of the training data [13]. It will also hurt the model performance and generalizability (see Table 1).  To challenge the inconsistency problem, we propose a simple and effective curriculum batching strategy to conveniently replace the shuffling algorithm for splitting the training data into batches. Assuming that the uniform data distribution (UDD) is easier to be learned by deep multi-label classification models for automatic ICD coding than the shuffling data distribution (SDD), and the SDD is easier to be learned than the original imbalanced training data distribution (IDD), the proposed strategy applies three specific sampling algorithms to generate three batch sets with the three types of mentioned data distributions. The models then iterate each batch set, in turn, and learn the ICD coding experience from the easiest batch set to the hardest batch set.
A series of experimental results show that the proposed strategy can effectively avoid the overfitting issue and dramatically improve the automatic ICD coding performance compared with the three representative deep multi-label classification models and one state-of-the-art ReLS method for automatic ICD coding. Moreover, the improved models also have a better ability to learn the long-tailed label information. To the best of our knowledge, this is the first work to introduce a curriculum learning method for automatic ICD coding with deep multi-label classification models. In summary, the contribution of this paper is fourfold: • This paper finds that the shuffling algorithm in the MBGD always causes the local data distribution of batches to be inconsistent with the global training data distribution, which hurts the model's performance and generalizability; • This paper alleviates the problems of poor generalization ability and low classification accuracy of the deep learning model in the field of ICD automatic coding; • This paper greatly improves the learning ability of the automatic ICD coding model for data with long-tailed labels; • This paper introduces the curriculum learning method into automatic ICD coding of deep multi-label classification model for the first time.

Automatic ICD Coding
Automatic ICD coding is a widely discussed, hot research topic in the field of medicine and healthcare [6,7]. Many statistical machine learning methods have been applied to challenge this problem [6,14]. In recent years, with the widespread success of deep learning, it has also been introduced to solve the problem of automatic ICD coding [7,15]. The automatic ICD coding problem is formulated as a multi-label classification problem [16,17].
Extremely large label sets and imbalanced data distribution are two main issues directly impacting coding performance. Mullenbach presented a Convolutional Attention for Multilabel Classification (CAML) model incorporating an attention technique with convolutional neural networks [18]. Sadoughi extended this approach by adding a maximum pooling layer across all the multiple convolution layers [19]. To further improve the performance of CAML, Li integrating a multi-filter CNN architecture with a residual CNN architecture, which is called a multi-filter residual convolutional neural network [16]. Bhutto present a Deep Recurrent Convolutional Neural Network with Hybrid Pooling (DRCNN-HP), which considers the different lengths as well as the dependency of the ICD code-related text chunks [20]. Another current solution idea is to reduce the label space dimensions to alleviate the imbalanced data distribution. The label set reconstruction method [7,8,10] and the multi-label hierarchical classifier-based method [9,11] are two representative approaches. Differently from the previous studies, in this paper we explored further, from the perspective of how these problems affect the local model training procedure.

Mini-Batch Gradient Descent
Mini-batch gradient descent is the widely used optimization algorithm to train deep neural models [21,22], including the deep multi-label classification models for automatic ICD coding [7]. How to effectively use MBGD to train models is always a concern in machine learning and other related fields [23,24]. There are many aspects of research, including the effect of batch normalization [25], batch size and learning rate settings [12,26], etc. We studied the inconsistency problem between the local batch data distribution and the global training data distribution and attempted to avoid the risk of overfitting and to improve the model generalization ability.

Curriculum Learning
Elman first proposed a "starting small strategy" approach to learn a model with a curriculum [27]. Two methods were developed, including an incremental input method and an incremental memory method. The motivation behind these methods was that compound sentences make it a little difficult for neural networks to learn grammatical concepts in the early phases of training. Therefore, these methods make the models learn gradually, by varying the percentage of simple and compound sentences used during training or constraining the model's capacity in the early phases of training [27].
Bengio et al. [28] revisited Elman's approach and formalized the approach of starting training on easy examples first and then gradually introducing more complex examples during the training. This type of approach was named as "curriculum learning", and it can achieve significant improvements in generalization. The core of curriculum learning is to develop a curriculum strategy to train the models from simple to complex.
Consequently, we proposed a simple and effective curriculum batching strategy for automatic ICD coding with deep multi-label classification models in order to solve the inconsistency problem mentioned earlier, improve the model generalization ability and avoid the risk of overfitting. The curriculum batching strategy generates and sequentially utilizes three types of batch sets composed of examples from easy-to-learn to hard-to-learn.

Methodology
In this section, we first review the original MBGD-based training procedure for deep multi-label classification models for automatic ICD coding. Then, we introduce the proposed curriculum batching strategy and how to integrate it conveniently into the MBGDbased training procedure, followed by the description of three specific sampling algorithms for obtaining batch sets satisfying the UDD, SDD and IDD.

MBGD-Based Training Procedure
The MBGD-based training procedure for deep multi-label classification models for automatic ICD coding is shown in Figure 2. This procedure consists of two modules: the data batch process and model training. In module one, the training data set D is divided into several mini-batches (b 1 , b 2 , . . . , b K−1 , b K ), which constitute the batch set B. Second, each mini-batch b is sent to module two in turn. Every example (x, y) in mini-batch b contains an electronic case x and its corresponding ICD codes y, a real example (x, y) is shown in Figure 3. Third, the examples are sent to the model training module. The deep multi-label ICD coding model uses the electronic case x to predict its ICD codes and compares the predicted output with the real output y. Finally, module 2 calculates the losses for each sample and updates the model. The procedure is repeated until the model converges.
To be specific, there are six data or parameter processes and/or computational steps in the procedure, including:

1.
Batching Strategy: In this data process, a shuffling algorithm is commonly applied [26]; firstly, each example (x, y) in training data D is randomly sorted to generate a list. Then, the batch set B(b 1 , b 2 , . . . , b K−1 , b K ) is formed by scanning the list with a sliding window with the size of M, i.e., the batch size. Thus, K, the number of batches b in B, equals |D|/M . This data process sometimes may be performed more than once; 2.
Parameter Initialization: In this parameter process, θ 0 are usually drawn randomly from a distribution (e.g., uniform) as the initialization parameters; 3.
Loss Computation: Many kinds of deep multi-label classification models for automatic ICD coding can be applied in this step, including TextCNN, TextRNN and TextRCNN, which are compared in this paper. We uniformly note them as Gradient Update Estimation: According to θ k , f θ k (x i ) and the examples in B k , the kth iteration's gradient updates, ∆θ k , can be estimated by the average of the gradients of the examples in B k , as shown in Figure 2; 5.
Optimized Parameter Output: In general, the steps from 3 to 5 will loop E epochs until the model converges.  To be specific, there are six data or parameter processes and/or computational steps in the procedure, including: 1. Batching Strategy: In this data process, a shuffling algorithm is commonly applied [26]; firstly, each example ( , ) in training data is randomly sorted to generate a list. Then, the batch set ℬ( 1 , 2 , … , −1 , ) is formed by scanning the list with a sliding window with the size of , i.e., the batch size. Thus, , the number of batches in ℬ, equals ⌈| |⁄ ⌉.This data process sometimes may be performed more than once; 2. Parameter Initialization: In this parameter process, 0 are usually drawn randomly from a distribution (e.g., uniform) as the initialization parameters; 3. Loss Computation: Many kinds of deep multi-label classification models for automatic ICD coding can be applied in this step, including TextCNN, TextRNN and Tex-tRCNN, which are compared in this paper. We uniformly note them as ( ). ( ) means that the model ( ) takes the example as the input with the th  To be specific, there are six data or parameter processes and/or computational steps in the procedure, including: 1. Batching Strategy: In this data process, a shuffling algorithm is commonly applied [26]; firstly, each example ( , ) in training data is randomly sorted to generate a list. Then, the batch set ℬ( 1 , 2 , … , −1 , ) is formed by scanning the list with a sliding window with the size of , i.e., the batch size. Thus, , the number of batches in ℬ, equals ⌈| |⁄ ⌉.This data process sometimes may be performed more than once; 2. Parameter Initialization: In this parameter process, 0 are usually drawn randomly from a distribution (e.g., uniform) as the initialization parameters; 3. Loss Computation: Many kinds of deep multi-label classification models for automatic ICD coding can be applied in this step, including TextCNN, TextRNN and Tex-tRCNN, which are compared in this paper. We uniformly note them as ( ). ( ) means that the model ( ) takes the example as the input with the th Theoretically, ∆θ k is the unbiased estimation of the kth iteration's gradient updates. Consequently, a larger batch size leads to a better gradient update approximation and results in a lower frequency of gradient updating. However, there is no "free lunch", in that it may also hurt generalization and result in finding poorer local optima [12].
Moreover, the commonly used batching strategy mentioned above makes each B k carry different local statistics information from the global statistics information of B, due to the inconsistency between the local data distribution of B k and the global data distribution of B. In addition, the extremely large label sets and imbalanced data distribution problems of automatic ICD coding also easily lead to a local overfitting on a small number of frequently occurring labels, due to the imbalanced label distribution of D. To challenge these problems, we propose a simple and effective curriculum batching strategy to conveniently replace the shuffling algorithm in the MBGD-based training procedure.

Curriculum-Batching Strategy
In the original MBGD-based training procedure for deep multi-label classification models for automatic ICD coding, B is generated by the batching strategy introduced in Section 3.1, and the generated B is used repeatedly by the following processes in the procedure and, sometimes, B may be regenerated. To challenge the problems mentioned earlier, we propose a curriculum batching strategy (CBS) data batch process module in order to generate batch sets B UDD , B SDD and B IDD with three types of data distribution, i.e., UDD, SDD and IDD, respectively. The structure of the new module is shown in Figure 4.
in that it may also hurt generalization and result in finding poorer local optima [12].
Moreover, the commonly used batching strategy mentioned above makes each ℬ carry different local statistics information from the global statistics information of ℬ, due to the inconsistency between the local data distribution of ℬ and the global data distribution of ℬ. In addition, the extremely large label sets and imbalanced data distribution problems of automatic ICD coding also easily lead to a local overfitting on a small number of frequently occurring labels, due to the imbalanced label distribution of . To challenge these problems, we propose a simple and effective curriculum batching strategy to conveniently replace the shuffling algorithm in the MBGD-based training procedure.

Curriculum-Batching Strategy
In the original MBGD-based training procedure for deep multi-label classification models for automatic ICD coding, ℬ is generated by the batching strategy introduced in Section 3.1, and the generated ℬ is used repeatedly by the following processes in the procedure and, sometimes, ℬ may be regenerated. To challenge the problems mentioned earlier, we propose a curriculum batching strategy (CBS) data batch process module in order to generate batch sets ℬ , ℬ and ℬ with three types of data distribution, i.e., UDD, SDD and IDD, respectively. The structure of the new module is shown in Figure  4·  The data batch process CBS can conveniently replace the original data batch process in Figure 2 (as depicted in Figure 4) except that the CBS will generate three batch sets, including ℬ , ℬ and ℬ , which are sampled from the original training data, respectively, based on specific sampling algorithms. ℬ is generated through a stratified sampling with replacement (SSR); ℬ is generated based on the original shuffling algorithm; and ℬ is generated by a probability sampling with replacement (PSR). The The data batch process CBS can conveniently replace the original data batch process in Figure 2 (as depicted in Figure 4) except that the CBS will generate three batch sets, including B UDD , B SDD and B IDD , which are sampled from the original training data, respectively, based on specific sampling algorithms. B UDD is generated through a stratified sampling with replacement (SSR); B SDD is generated based on the original shuffling algorithm; and B IDD is generated by a probability sampling with replacement (PSR). The specific SSR and PSR methods will be described later; the original shuffling algorithm follows the batching strategy introduced in Section 3.1.
The intuitive assumption of the CBS is that B UDD is easier to be learned by the deep multi-label ICD coding models than B SDD , and, in the same way, B SDD is easier than B IDD . Moreover, B UDD makes the models have an equally likely opportunity to learn the examples corresponding to each type of label, which can alleviate the imbalanced data distribution problem [29,30]. B SDD allows models to learn every example corresponding to each type of label. B IDD expects models to learn the ICD coding experience from examples through keeping the consistency between the local data distribution of B k and the global data distribution of B. So far, we have designed a curriculum strategy. According to the idea of curriculum learning [28,31], B UDD , B SDD and B IDD will be learned sequentially.
SSR for generating B UDD : The labels of the ICD coding system are diverse, extremely large and imbalanced. These issues make the examples in D have unequal opportunities to be learned by the models, which easily lead to a local overfitting on a small number of frequently occurring labels in a fixed number of loops.
Stratified sampling has the ability to ensure that every characteristic of the population is properly represented in one sample [32]. Consequently, we designed a simple SSR to generate a batch set satisfying the requirement that every example corresponding to each type of label in D has an equal chance to be learned by the models.
First, SSR samples M labels randomly, where M equals the batch size. Then for each sampled label, SSR samples one example with a replacement randomly from the examples in D whose corresponding labels contain the sampled label. Finally, the M examples constitute one batch. The above procedure will be repeated N times, where N is the size of the batch set B UDD . The sampled N batches form the batch set B UDD . In this paper, N is uniformly set to |D|/M . PSR for generating B IDD : Each B k in B UDD and B SDD still carries different local statistics information from the global statistics information of D. In order to ensure the consistency between the local batch data distribution and the global training data distribution, we designed a simple PSR to generate a batch set satisfying the requirement that every example in B is drawn from the global data distribution of D with a replacement.
Differently from the SSR, the PSR firstly draws M labels according to the label distribution of B, where M equals the batch size. Then, for each drawn label, the PSR samples one example randomly from the examples in D whose corresponding labels contain the drawn label. Finally, the sampled M examples constitute a batch. After repeating above procedure N times, the batch set B IDD is formed. In this paper, N , i.e., the size of the batch set B IDD , is uniformly set to |D|/M , too.

Experimental Dataset
In our experiments, the third version of the Medical Information Mart of Intensive Care (MIMIC-III) dataset [33] was adapted for evaluating the effectiveness of our proposed curriculum batching strategy for automatic ICD coding with multi-label classification models. As described in Section 3.1 above, we used one discharge summary (electronic case) of the NOTEEVENTS in MIMIC-III and its corresponding ICD-9 codes to form an example. Next, we conducted a series of data cleaning and preprocessing for all the samples to obtain the experimental dataset D. The specific preprocesses were as follows: • All the punctuation, numbers and stop words were removed from all the examples; • The discharge summaries were segmented into tokens by using the space as the separator, and then we built a vocabulary V based on these tokens; • The TF-IDF values of each word in V are calculated, based on all the examples, and only the top 10,000 words are kept to be used to build the final vocabulary V.
After preprocessing, the basic information of D is listed in Table 2. As shown in Table 2, D contains 55,177 different electronic medical records with 6918 different diseases in the ICD-9 code. Furthermore, in our experiments, D is further divided by a shuffling algorithm into a training dataset D train , a validation dataset D val and a test dataset D test in the ratio of 7:1:2.

Evaluation Metrics
In this paper, widely used micro-measures for multi-label classification tasks, including Precision micro (P micro ), Recall micro (R micro ), F1-Score micro (F1 micro ) and F1-Score macro (F1 macro ),

Experimental Settings
In our experiments, we firstly trained TextCNN, TextRNN and TextRCNN for automatic ICD coding on D train and tuned them on D val ; three models were implemented, based on an open-source tool named NeuralNLP [34]. Finally, we applied the best model on D test to get the results. A shuffling algorithm was applied for generating batches, and the batch size was set to different values to observe the sensitivity of the models. Adam was used as the optimizer, and the learning rate η and the epochs E were set to 0.008 and 150, respectively. The results are listed in Table 3. CBS was applied on TextCNN, TextRNN and TextRCNN for the automatic ICD coding through directly replacing the original shuffling algorithm-based batching strategy. We refer to them by "TextCNN+CBS", "TextRNN+CBS" and "TextRCNN+CBS". B UDD , B SDD and B IDD were built offline previously, and they were switched to be utilized in the training procedure every 50 epochs. For the sake of fairness, the other settings are the same as the settings described earlier, and the results are listed in Table 4. One ReLS method mentioned in [10] was applied directly to TextCNN, TextRNN and TextRCNN as a reference comparison model, and we named the results TextCNN+ReLS, TextRNN+ReLS and TextRCNN+ReLS. The settings of the models are the same as the settings mentioned earlier. Their results are listed in Table 5. Moreover, every experiment described above was run three times, and the average of the results is reported.

Overall Results
First of all, comparing F1 micro in Table 3 with that in Table 4 shows clearly that, after changing the original batching strategy to CBS, the performance of TextCNN, TextRNN and TextRCNN improved by more than double. In particular, the F1 macro increased from less than 5% to around 50% (F1 macro is more affected by the long-tailed label). Moreover, the results shown in Tables 4 and 5 demonstrate that, with a simple and effective curriculum batching strategy substitution, the results of TextCNN+CBS, TextRNN+CBS and TextR-CNN+CBS are much better than TextCNN+ReLS, TextRNN+ReLS and TextRCNN+ReLS, which use a complex label set reconstruction method. These results demonstrate that our proposed curriculum batching strategy for automatic ICD coding with multi-label classification models is effective.
All the results state that feeding the training data to the deep multi-label ICD coding model with an easy-to-learn data distribution first, and then, gradually, with a hard-to-learn data distribution can effectively improve the model's performance.
In order to illustrate the ability of our proposed strategy for automatic ICD coding with deep multi-label classification models without harming the generalization ability of the models, we also applied the models TextCNN+CBS, TextRNN+CBS and TextRCNN+CBS directly on D train . The results are listed in Table 6. By comparing the F1 micro in Tables 4  and 6, it is easy to find that the performance of the models on the training data is comparable to that on the test data. It means that TextCNN+CBS, TextRNN-+CBS and TextRCNN+CBS do not suffer from the overfitting problem.

Detailed Analysis
It can be seen from the results listed in Tables 3-5 that the proposed curriculum batching strategy for automatic ICD coding with deep multi-label classification models improves the results of F1 micro as well as the results of P micro and R micro . This result is different from the results obtained by the models TextCNN, TextRNN, TextRCNN, TextCNN+ReLS, Tex-tRNN+ReLS and TextRCNN+ReLS. They improve the P micro performance while sacrificing their R micro performance. This further demonstrates that, after applying our proposed CBS to the training procedure, the generalization ability of the trained models has improved.
Our results are also consistent with the results of other existing research. Larger batch sizes result in a lower frequency of gradient updates and better gradient update approximation, but they hurt the performance (as shown in Tables 3-5). Therefore, the batch size should not be too large, and, of course, it should also not be too small. In our experiments, when the batch size is set to 1000, the results are the best for TextCNN+CBS and TextRCNN+CBS, and when it is set to 500, the results are the best for TextRNN+CBS.
Moreover, we plotted the training procedures for the nine models whose F1 micro are shown in bold in Tables 3-5 in order to observe the convergence of every model. The training procedures are grouped according to their base models, i.e., TextCNN, TextRNN and TextRCNN, and presented in Figures 5-7, respectively. In the three figures, the Xaxis is the number of epochs and the Y-axis is the loss value; the loss is calculated using BCEWITHLOGITSLOSS in PyTroch, and its calculation formula is as follows: It can be seen in these figures that all the models converge after 150 epochs. It can also be observed from Figures 5-7 that TextCNN+CBS, TextRNN+CBS and TextRCNN+CBS converge the fastest, respectively, in their group. TextCNN+ReLS, Tex-tRNN+ReLS and TextRCNN-+ReLS have the slowest rate of convergence, but they have better F1 micro than their base models, TextCNN, TextRNN and TextRCNN. The results show that our proposed curriculum batching strategy for automatic ICD coding with multi-label classification models can help the models to find their better local optima quickly through a gradual learning process without hurting their generalization ability.
Moreover, the loss values of TextCNN+CBS, TextRNN-+CBS and TextRCNN+CBS, in Figures 5-7 at the 50th epoch and the 100th epoch, all have two interesting change points showing a steep rise and then a steep drop. These change points occur when B UDD switches to B SDD , and B SDD switches to B IDD , respectively. That means when the local data distribution changes dramatically, the loss will be affected. In other words, the performance of the models would be hurt at those points. However, owing to our proposed curriculum batching strategy, the models can quickly adapt to the new data distributions. It not only keeps good prediction performance, but also avoids the risk of falling into a local optimal solution.
Healthcare 2022, 10, x FOR PEER REVIEW 11 of 16 curriculum batching strategy, the models can quickly adapt to the new data distributions.
It not only keeps good prediction performance, but also avoids the risk of falling into a local optimal solution.          In this paper, the proposed strategy is based on an intuitive assumption that B UDD is easier to be learned by the deep multi-label ICD coding models than B SDD , and B SDD is easier to be learned than B IDD . The implication of this assumption is that the data distribution of B IDD is more like the data distribution of D than that of B SDD , and the data distribution of B SDD is more like that of D than that of B UDD .
In order to further verify the effectiveness of the proposed curriculum batching strategy, the data distribution similarities between the three generated batch sets and D, i.e., the learning difficulties of B UDD , B SDD and B IDD , are evaluated by using the Kullback-Leibler Divergence (KLD) and the Jensen-Shannon Distance (JSD). The evaluation results are listed in Table 7. It can clearly see that the data distribution of B UDD is the least like that of D. B SDD is the next, and the data distribution of B IDD is the most like that of D. Table 7. Learning difficulties of B UDD , B SDD and B IDD evaluated by KLD and JSD, respectively. The smaller the KLD value is, the more difficult to be learned the batch set is. The smaller the JSD value is, the less difficult to be learned the batch set is.

Method
B

Ablation Experiments of CBS
To prove the effectiveness of the curriculum in the CBS, we designed and implemented ablation experiments for the three models, and the experimental results are shown in Table 8. It can be clearly observed that all the performance indicators of the three models are steadily improving with the increase in the curricula.

Evaluating the Long-Tailed Label Learning Performance
The improvements in the automatic ICD coding performance benefit from our proposed curriculum batching strategy. The improvements are also related to the ability of the proposed strategy to promote the learning of the long-tailed labels (labels with a low frequency in D, as shown in Figure 8). In the early phase of the training, we used B UDD , which satisfies the uniform distribution, to train the models. This not only provided a simple learning task, but also allowed the information in the long-tailed labels in the MIMIC-III dataset to be learned fully. To evaluate this, we set a threshold γ = 219. If the frequencies of labels were lower than γ, we considered these labels the long-tailed labels. Then, we calculated the precisions and recalls of the long-tailed labels of the models, and the results are listed in Table 9. It vividly shows that the proposed curriculum batching strategy improves the long-tailed label learning ability of deep multi-label classification models for automatic ICD coding.

Conclusions and Future Work
In this paper, we observed that extremely large ICD coding label sets and imbalanced label distribution lead to inconsistency between the local batch data distribution and the global training data distribution, and this inconsistency brings the overfitting problem into the mini-batch gradient descent algorithm-based training procedure for deep multi-label classification models for automatic ICD coding. Therefore, we proposed and investigated a simple and effective curriculum batching strategy to replace the original shuffling algorithmbased batching strategy in the training procedure. The proposed strategy emphasizes that the model training procedure should be gradual, rather than random; the proposed curriculum batching strategy realizes a gradual model training procedure by generating three batch sets offline through applying three specific sampling algorithms. These batch sets satisfy the uniform data distribution, the shuffling data distribution and the original training data distribution, respectively. The learning tasks corresponding to these batch sets range from easy to hard. A series of experiments demonstrate the effectiveness of the proposed batching strategy. After replacing the original shuffling algorithm-based batching strategy with the proposed curriculum batching strategy, the performance of the three investigated deep multi-label classification models for automatic ICD coding all improve dramatically. At the same time, the models avoid the overfitting issue and all show better ability to learn the long-tailed label information. In addition, we also compared the performance with a state-of-the-art label set reconstruction model for automatic ICD coding. The performance of ours is consistently higher than that of the state-of-the-art model. These results further prove the effectiveness of the proposed curriculum batching strategy for automatic ICD coding with multi-label classify-cation models.
Possible directions for future work may include choosing the course of the CBS strategy adaptively according to the convergence of the multi-label classification model during the training. Currently, the order of courses and the number of times they are used are fixed. Another direction is to combine transfer learning and pre-train the model with data sets in the medical field.