1. Introduction
Polar codes, proposed by Arikan in [
1], are the first error correcting codes to provably achieve the symmetric capacity with low complexity in binaryinput discrete memoryless channels (BDMCs). This result means that Shannon’s random codes, which achieve channel capacity, are replaced by a practical code with a lowcomplexity decoding algorithm [
1]. Due to its better performance compared to turbo and lowdensity paritycheck (LDPC) codes at short lengths, it has been adopted as the error correcting code for control signals in the enhanced mobile broadband (eMBB) scenario of the third generation partnership project (3GPP) 5G standard [
2].
Decoding schemes of polar codes are commonly used with successive cancellation (SC) [
1] and belief propagation (BP) algorithms [
1]. SC decoding has a relatively low computational complexity and high latency, while BP decoding has a high throughput and computational complexity [
3]. In addition, both schemes perform poorly at finite lengths, e.g., hundreds to thousands of bits. SC list (SCL) decoding with concatenated a cyclic redundancy check (CRC) code was proposed later [
4], and it was proved that similar performance can be obtained as compared to turbo and LDPC codes in finite lengths. That is, polar codes have been shown to be practically available. Recently, decoders using deep learning (DL) have been proposed to replace traditional decoders of polar codes [
5,
6,
7,
8,
9,
10,
11].
DL has made a great success in computer vision [
12], machine translation [
13], speech and image recognition [
14], and many other fields. Its influence has become widespread and has reached communication systems, where analytic solutions have been preferred. There have been a number of DL approaches [
5,
15,
16] to physical layer communications. Among them, we have got a special interest in deep neural networks (DNN)based decoding of channel codes. As a pioneering work, Gruber et al. employed feedforward deep neural networks to learn the maximum a posteriori (MAP) decoding of polar codes [
5]. That approach was extended to convolutional neural networks (CNN) and recurrent neural networks (RNN) and it was shown that better performance can be obtained by the advanced structures of DNN [
6]. Such small neural network decoders (NNDs) are used to form a decoder for longer polar codes in combination with BP processing [
7]. Nachmani et al. designed an NND called the ‘neural BP decoder’ that basically performs weighted BP decoding with variable weights [
8]. A proper training finds the weight values that compensate finite length impairments of the BP algorithm and the performance is therefore improved for high density parity check codes such as Bose, Chaudhuri, and Hocquenghem (BCH) codes. Bennatan et al. proposed a syndromebased NND for linear block codes [
9]. For CRCpolar codes, a neural BP algorithm was proposed by Doan et al. [
10]. In [
17], Tandler et al. proposed an ordering of data for efficient training of the decoder of convolutional codes. Unlike the problems of applying general deep learning, NND has an advantage that it is very easy to generate a training data set. In addition, the NNDs are capable of oneshot decoding because the received signal is decoded only once through the hidden layers, and can approach the optimal decoding performance with low latency.
Previous studies [
5,
6,
7,
8,
9,
10,
11,
17] considered learning an NND for a specific code of fixed length and rate. However, wireless standards normally use a class of codes of multiple parameter sets, since a receiver is required to have either multiple decoders each of which is specialized to a code or a decoder which is flexible to decode many codes. For NND, a decoder is determined by the values of the weights. To support multiple codes, the same number of sets of weight values should be stored even though the NND hardware is commonly used. Because the supervised learning method [
5,
6] trains a decoder with data from a specific code and channel, a straightforward approach is to train all the decoders separately. However, we thought if codes are closely related, then decoder training can be aided by the results of the training of another decoder.
In this paper, we consider the problem of training a set of ratecompatible polar codes that are expurgated from the same mother code. Exploiting the inclusion property of ratecompatible polar codes which are defined by a single polar coding sequence as adopted in a 5G New Radio (NR) [
2], we propose an efficient decoder training method via transfer learning. We compare the complexity and performance of the proposed method with conventional separate learning. We also tackle an underfitting problem where the given model complexity is not sufficiently high to be well trained by conventional training methods. It is shown that transfer learning from low to highrate codes can train a highrate code decoder better than the conventional training methods.
The rest of this paper is organized as follows. In
Section 2, we briefly introduce the NND framework, basics of polar codes, and the conventional NND for polar codes. The characteristics of ratecompatible polar codes and their training data are investigated and the proposed transfer learningbased training method of NNDs for the ratecompatible polar codes is then presented in
Section 3. Performance comparison with the conventional separate learning is given in
Section 4.
Section 5 concludes the paper.
3. Training of NNDs via Transfer Learning
Since the decoding function for the NND [
5,
6] is a classification into all codewords, the number of classes (i.e., the size of the code) grows exponentially with the code dimension
K. Naturally, the training data should be sufficiently larger than the number of classes. Therefore, the training complexity is a major bottleneck for a long code although the generation of a data set is easy. This limits the training problem to only codes of a small
K. Even though the code is short, training complexity increases when we need to support ratecompatible codes.
If a set of polar codes is constructed based on expurgation that uses a single polar coding sequence [
2], a low rate code is included in a higher rate code within the family of codes. In this section, we introduce a method of generating training data using the inclusion relationship between codewords according to the code rate of polar codes. Then, we propose an efficient training method for the NNDs of multiple polar codes in the context of transfer learning. We also suggest to train a single decoder via transfer learning method to solve an underfitting problem due to low model complexity.
3.1. Inclusion Relation of Training Data
In this subsection, we investigate the inclusion relation of the polar codes defined by a single polar coding sequence. To support ratecompatibility of polar codes, multiple codes with rate
$\mathbb{R}=\{{R}_{i}={K}_{i}/N\},\phantom{\rule{3.33333pt}{0ex}}1\le i\le T$ should be used. Assume without loss of generality
${R}_{i}<{R}_{j}$ for
$i<j$. Because the polarization is channelsensitive, the optimal construction for
${R}_{i}$ requires the optimal channel parameter
${\alpha}_{i}$ and the corresponding order
${S}_{{\alpha}_{i}}$. However, to reduce the complexity of description, a unified sequence
S can be used for multiple ratecompatible codes with small performance penalty as adopted in 5G NR [
2]. Let us define such a set of codes as follows.
Definition 1. Let $\mathbb{C}=\left\{{C}_{i}\right\}$, $1\le i\le T$ be a ratecompatible set of polar codes supporting $\mathbb{R}$ where ${C}_{i}=C(N,{K}_{i})$ and $C(N,{K}_{i})$ is the polar code of dimension ${K}_{i}$ defined by the unified polar coding sequence S.
It is manifest that the inclusion relation ${I}_{{K}_{i}}\subset {I}_{{K}_{j}}$ for $i<j$ holds because ${I}_{K}=S[1:K]$ for ratecompatible polar codes based on a single polar coding sequence. Let ${G}_{{I}_{{K}_{i}}}$ be the submatrix of G.
Proposition 1. Polar code ${C}_{i}$ is linear if the frozen vector ${\mathbf{u}}_{{I}_{{K}_{i}}^{c}}=\mathbf{0}$.
Proof. Assume
${\mathbf{u}}_{{I}_{{K}_{i}}^{c}}=\mathbf{0}$ and for
${\mathbf{c}}^{\left(i\right)}\in {C}_{i}$,
which proves that
${C}_{i}$ is a binary linear code with the generator matrix
${G}_{{I}_{{K}_{i}}}$. □
Proposition 2. Assume all ${C}_{i}\in \mathbb{C},1\le i\le T$, are linear by setting ${\mathbf{u}}_{{I}_{{K}_{i}}^{c}}=\mathbf{0}$. For $i<j$, we have ${C}_{i}\subset {C}_{j}$.
Proof. For a codeword
${\mathbf{c}}^{\left(j\right)}\in {C}_{j}$
since
${\mathbf{u}}_{{I}_{{K}_{j}}^{c}}=\mathbf{0}$. Due to (
2) for a codeword
${\mathbf{c}}^{\left(i\right)}\in {C}_{i}$,
where
$\mathbf{0}$ is the zero vector of length
${K}_{j}{K}_{i}$ and there exists a permutation matrix
P satisfying the last equality. It has been proved that codeword
${\mathbf{c}}^{\left(i\right)}$ corresponding to message
${\mathbf{u}}_{{I}_{{K}_{i}}}$ is the codeword of
${C}_{j}$ for message
$\left[{\mathbf{u}}_{{I}_{{K}_{i}}}\mathbf{0}\right]P$. □
Figure 2 exhibits an example of the inclusion relationship of polar codes
$C(8,2),\phantom{\rule{3.33333pt}{0ex}}C(8,3)$, and
$C(8,4)$ with generator matrix
${G}_{8}={F}^{\otimes 3}$ and
$S=(8,7,6,4,5,3,2,1)$. Note that
$C(8,2)\subset C(8,3)\subset C(8,4)$. Since
${C}_{i}\subset {C}_{j}$ holds for
$i<j$ and
${I}_{{K}_{i}}\subset {I}_{{K}_{j}}$, data points generated for the code of dimension
${K}_{i}$ can be valid data points for the code of
${K}_{j}$. The set of training data for
$\mathbb{C}$ can be made to have an inclusion relation between the data for individual codes. The data used to train the NND for
${C}_{i}$ is a valid subset of data for training the NND for
${C}_{j}$. Therefore, we apply transfer learning to train an NND for
${C}_{i+1}$ by adopting the NND trained for
${C}_{i}$ as the initial state. Transfer learning can be applied recursively to the sequence of NND training in the increasing order of rate. In the next subsection, we describe the procedure of transfer learning for training NNDs of ratecompatible polar codes in detail.
3.2. Transfer Learning for NNDs of Rate Compatible Polar Codes
Our problem is to train
$\left\mathbb{C}\right=M$ NNDs where the elements in
$\mathbb{C}$ are equally long ratecompatible polar codes as defined in
Section 3.1. A naive approach is to train them independently, but a more efficient way can be considered. Complexity of NND training is counted in the number of minibatches. Let
${M}_{b}^{tot}={\sum}_{i}^{T}{M}_{b}^{\left(i\right)}$ be the total complexity, where
${M}_{b}^{\left(i\right)}$ is the complexity used for training the
${C}_{i}$NND. We set the size of the minibatch
${l}_{b}$ to 128 throughout the paper. We consider the application of transfer learning [
20,
21] for decoder training, which has been used when similar problems and solutions exist on an NN model, where a trained model can be reused to boost the training of another problem. For a given
${M}_{b}^{tot}$, we pursue a more efficient training in terms of performance of the NNDs.
As noted in the previous section, each data point used to train the decoder for ${C}_{i}$ is a valid data point for ${C}_{i+1}$. So we assume that the decoder for ${C}_{i}$ may be a good initial state of the training phase of ${C}_{i+1}$. In other words, the learned state of an NND is transferred to the NND for a code of a higher dimension at the beginning of a training session. If transfer learning is effective, the overall complexity of the training may be reduced by this approach. In order to reduce the complexity, the training should be planned well. We train the NNDs in increasing order of rate. So we start from the code of the lowest rate and the training procedure is described in detail below.
In Algorithm 1, the training procedure of NNDs via transfer learning is described. NNDs are trained in increasing order of rate using a single NN model. The NN model used is either the MLP or the LSTM described in
Section 2.3. The NN model is initialized with random weights first. Let
${C}_{i}$NND be the decoder for code
${C}_{i}$. The training data is generated for training of
${C}_{i}$NND and the decoder is then trained
${M}_{b}^{\left(i\right)}$ minibatches. How to generate data is described in Algorithm 2. The learned state of
${C}_{i}$NND is stored and transferred to the training of
${C}_{i+1}$NND as its initial state. MSE is used for the cost function and SGD is used as the optimizer. For the training of a low rate code NND where
${K}_{i}<{K}_{T}$, redundant output nodes are labeled as
$0.5$ which remains neutral between 0 and 1. This procedure is repeated until the last decoder,
${C}_{T}$NND training is finished. The entire procedure can viewed as a multistep training of an NN toward a good
${C}_{T}$NND, during that the state of NN is sampled as
${C}_{i}$NND.
Each NND can be tested concurrently along with its training. A test evaluates the BER of the decoder where a bit error is counted when the message mismatches with the output truncated to the message length
${K}_{i}$. When data is generated, the
$\mathbf{CollectData}(\xb7)$ function defined in Algorithm 2 collects
${N}_{sample}$ data points each of which is a pair of a message
$\mathbf{m}$ and a received signal
$\mathbf{y}$ or the output of an AWGN channel. A message is randomly generated from the entire set of messages. The message is encoded by
${C}_{i}$ and modulated with BPSK, and then sent through an AWGN channel where a Gaussian noise vector
$\mathbf{z}$ is added to the transmitted vector
$\mathbf{x}$. Random sampling of the received vector is repeated without replacement
${N}_{sample}$ times. For the AWGN channel, the channel parameter
${E}_{b}/{N}_{0}$ is chosen empirically.
Algorithm 1 Train NNDs via transfer learning for ratecompatible polar codes. 
Input: Rate $\mathbb{R}$=(${R}_{1},\dots ,{R}_{T})$, code length N, numbers of minibatches ${M}_{b}^{\left(i\right)}$’s, polar coding sequence S 1:
Initialize the ${C}_{1}$NND with random weights  2:
fori = 1 : T do  3:
${\mathcal{X}}_{i}\leftarrow \mathbf{CollectData}\phantom{\rule{3.33333pt}{0ex}}\left(i\right)$  4:
Train ${C}_{i}$NND with ${M}_{b}^{\left(i\right)}$ minibatches randomsampled from ${\mathcal{X}}_{i}$  5:
Store ${C}_{i}$NND  6:
if i < T then  7:
Initialize ${C}_{i+1}$NND with ${C}_{i}$NND  8:
end if  9:
end for  10:
return All NNDs

Algorithm 2$\mathbf{CollectData}\phantom{\rule{3.33333pt}{0ex}}\left(i\right)$. 
Input: index i, code length N, polar coding sequence S, training ${E}_{b}/{N}_{0}$, data size ${N}_{sample}$ 1:
Generate all messages $\mathcal{M}=({\mathbf{m}}_{1},\dots ,{\mathbf{m}}_{{2}^{{K}_{i}}})$ of length ${K}_{i}$  2:
Empty data set $\mathcal{X}$  3:
for$j=1:{N}_{sample}$do  4:
Initialize $\mathbf{u}$ as the zero vector  5:
Select a message ${\mathbf{m}}^{\prime}$ randomly from $\mathcal{M}$  6:
Determine the information and frozen vectors (${I}_{k}\leftarrow S[1:K]$ and ${\mathbf{u}}_{{I}_{K}}\leftarrow \mathbf{m}\u2018$)  7:
Make the transmitted signal ($\mathbf{c}\leftarrow \mathbf{u}{G}_{N}$, $\mathbf{x}\leftarrow ({x}_{1},{x}_{2},\dots ,{x}_{N})$ with ${x}_{k}={(1)}^{{c}_{k}}$)  8:
Channel operation ($\mathbf{y}\leftarrow \mathbf{x}+\mathbf{z}$ where $\mathbf{z}$ is iid Gaussian with variance ${\sigma}^{2}=\frac{1}{2R}\frac{1}{({E}_{b}/{N}_{0})}$)  9:
Add ${X}_{j}=(\mathbf{y},{\mathbf{m}}^{\prime})$ to $\mathcal{X}$  10:
end for  11:
return$\mathcal{X}$

The benefit of transfer learning lies in the efficiency of training. In order to get a well trained set of NNDs, the total complexity ${M}_{b}^{tot}$ should be properly distributed among the ${M}_{b}^{\left(i\right)}$, $i=1,\dots ,T$. Faster learning due to transfer learning saves training complexity for small K so that NNDs for a larger K can be trained more. To show the advantage of the proposed learning method effectively, the uniform allocation ${M}_{b}^{\left(i\right)}={M}_{b}$, is considered for the conventional separate learning.
3.3. Training of Individual NND via Transfer Learning
In this subsection, we consider the training problem of a single NND with a limited model. According to previous results [
5], polar NNDs have been well trained from the MLP model when
N is 16 or smaller. Similar performance was achieved on LSTM models with a lower model complexity but higher training complexity [
6]. Assume we want to train a
${C}_{k}$NND individually from an NN model. If the model complexity is not sufficiently high, the model might underfit even though the data size and
${M}_{b}$ are large. We propose to use transfer learning to solve the underfitting problem. It will be shown that multistep training with a proper sequential application of data sets can train the
${C}_{k}$NND better at the same complexity. We simply run Algorithm 1 with a given total training complexity
${M}_{b}^{tot}$. However, training data
${\mathcal{X}}_{i}$’s and
${M}_{b}^{\left(i\right)}$ are sequentially applied from a certain value of
$i<k$.
4. Numerical Results
In this section, we numerically evaluate our proposed transfer learning technique for ratecompatible polar codes. The training results are compared with those of separate learning in terms of performance. We assume the BPSK modulation and the AWGN channel for all simulations. Codes with parameters
$(N,K)=(16,38)$ and
$(32,716)$ were used. The polar coding sequence defined in [
2] was used in the construction of such codes. As mentioned an MLP and an LSTM model is used for the corresponding NNDs. The structures of the NN models were specified in
Section 2.3. The detailed parameter setting is shown in
Table 1. A 643216 MLP and an LSTM have similar complexities in terms of the number of trainable parameters. For the training, the dropout and learning rate are set to 0.1 and 0.0009, respectively. As noted, MSE is used for the loss function as in
Section 2. The SGD method with ADAM optimizer [
22] is used. The training is implemented using TensorFlow. The hyperparameters of NN training are listed in
Table 2. Polar codes of parameter sets
$(N,K)=(16,38)$,
$(32,712)$, and
$(32,1116)$ are used for training multiple MLP and LSTMbased NNDs.
We generate
${N}_{sample}={10}^{6}$ training data points for each training session. The training data of the proposed method is generated at
${E}_{b}/{N}_{0}=4\phantom{\rule{3.33333pt}{0ex}}\mathrm{dB}$ according to Algorithm 2. We did not rigorously optimize the training
${E}_{b}/{N}_{0}$ to simplify comparison. On the other hand, the separate learning generates data with the training ratio
p [
6], which is the portion of codewords used to generate training data, compared to the entire code. In this simulation, we set
$p=(0.4,0.6,0.8,1.0)$. That is, the selecting from all messages in Algorithm 2 is changed from the selecting a smaller message set to
p. For each
p, we took 5 different training
${E}_{b}/{N}_{0}$ points from −2.0 to 6.0 dB. We train the NND using the total of 20 training data and select the parameters that show the best test performance. We assume the complexity of training with a minibatch of a fixed size is similar among the codes of the same length. To train the NNDs for a given set of ratecompatible codes, uniform allocation of complexity or the number of weight updates is considered. For both the proposed and separate learning methods, the numbers of weight updates are assigned as
${M}_{b}^{\left(i\right)}=5000$ for
$N=16$ and
${M}_{b}^{\left(i\right)}=\mathrm{50,000}$ for
$N=32$. Note that we are interested in a training setting with constrained computing resources. The trained decoders are evaluated in terms of BER for the considered communication system. A test set has
${10}^{5}$ data points for each
${E}_{b}/{N}_{0}$ point ranging from 0 to 6 dB.
The NNDs trained by separate learning perform closely to MAP decoding for
$N=16$ and all rates if
${M}_{b}^{\left(i\right)}$ are sufficiently large without training ratio adjustment (
$p=1.0$). However, the proposed method trains the NNDs quicker.
Figure 3 shows the BER of MLP and LSTMbased decoders for
$(N,K)=(16,38)$ polar codes. At low rates, both learning methods perform similarly. However, as the code rate increases, the proposed method shows better decoding performance than the separate leaning. Especially, when
$K=8$, MLP attains a coding gain of 0.6 dB and LSTM gets 0.5 dB with the proposed learning method. Via transfer learning, good performance can be achieved with smaller
${M}_{b}^{\left(i\right)}$, i.e., less learning complexity, as the code rate increases. If
${M}_{b}^{\left(i\right)}$ is increased at all code rates, the decoding performances of all NNDs come close to MAP decoder.
Figure 4 and
Figure 5 show the BER performance of different learning methods for polar codes of
$N=32$. Unlike the case of
$N=16$, NNDs are not trained to achieve MAP performances. The BER of the MLP and LSTMbased decoders for
$(N,K)=(32,712)$ is shown in
Figure 4. Separate learning can achieve a better performance with an adjustment of
p down to 0.4. while the NND underfits for
$p=1.0$. For low rates, performances are similar between the proposed and the separate learning methods as for
$N=16$. However, as the code rate increases, the proposed method outperforms the conventional one. The MLPbased decoder with proposed method achieves a performance gain of 0.5 dB for
$K=8$, 0.7 dB for
$K=9$, 0.9 dB for
$K=10$, and 1.0 dB for
$K=11$ at a BER of
${10}^{3}$ over the separate learning. For
$K=12$, the separate learning fails to train the decoder well even for large
${M}_{b}^{\left(6\right)}=\mathrm{300,000}$. On the other hand, the proposed method shows a much better error performance already at
${M}_{b}^{\left(6\right)}=\mathrm{50,000}$. The performance gain of LSTMbased decoder with the proposed method is 0.2 dB for
$K=8$, 0.2 dB for
$K=9$, 0.3 dB for
$K=10$, 0.5 dB for
$K=11$, and 0.5 dB for
$K=12$. For
$K=12$, the performance of the separate learning does not improve as
${M}_{b}^{\left(6\right)}$ increases even to
$\mathrm{300,000}$ from
$\mathrm{50,000}$. As a result, we confirm that the proposed method mitigates the underfitting problem of the separate learning.
Figure 5 exhibits the BER performance of MLP and LSTMbased decoders for
$(N,K)=(32,1116)$. We employ a smaller 643216 MLP whose number of parameters is similar to that of the LSTMbased decoder with single cell of 256 units for fair comparison. For
$K\le 12$, there is no difference in performance of the LSTMbased decoder between the two learning methods. The proposed method performs better for
$K\ge 13$. For
$K=16$, separate learning does not train the LSTMbased decoder well at rather small
${M}_{b}^{\left(6\right)}=\mathrm{50,000}$ although the performance eventually improves up to
${M}_{b}^{\left(6\right)}=\mathrm{300,000}$. On the other hand, the proposed method trains the NND faster without showing error floor at
${M}_{b}^{\left(6\right)}=\mathrm{50,000}$, already. It has been confirmed that NNDs can be trained by the proposed transfer learning method in a lower complexity than the separate learning method. As you can see, the MLPbased decoder does not learn at all for both the proposed and separate learning methods, whereas the LSTMbased decoder performs fairly well. It seems that the lower triangular structure of the generator matrix
${G}_{N}$ induces a desired but hidden sequential processing that is better learnable by the LSTM model under the model complexity constraint than the MLP model.