Adversarial Optimization-Based Knowledge Transfer of Layer-Wise Dense Flow for Image Classification

Deep-learning techniques for knowledge transfer are needed to advance and optimize knowledge distillation. Here, we develop a new adversarial optimization-based knowledge transfer method that uses a layer-wise dense flow distilled from a pre-trained deep neural network (DNN). The knowledge transferred to a target DNN through adversarial loss functions consists of multiple flow-based knowledge items that are densely extracted, with overlap, from the pre-trained DNN to enrich the existing flow-based knowledge. We propose a semi-supervised learning-based knowledge transfer that uses these multiple items of dense flow-based knowledge. The proposed loss function comprises a supervised cross-entropy loss for ordinary classification, an adversarial training loss for the target DNN and the discriminators, and a Euclidean distance-based loss on the dense flow. For both the pre-trained and target DNNs, we adopt a residual network (ResNet) architecture. We propose (1) adversarial-based knowledge optimization, (2) an extended flow-based knowledge transfer scheme, and (3) a combined layer-wise dense flow in an adversarial network. The results show that the target ResNet trained with the proposed method achieves higher accuracy than target ResNets trained with prior knowledge transfer methods.


Introduction
In the past few years, as deep-learning technology has advanced dramatically, state-of-the-art deep neural network (DNN) models have found applications in several fields, ranging from computer vision to natural language processing [1][2][3][4][5][6][7][8][9][10]. Modern DNNs are based on the convolutional neural network (CNN) structure [11]. Examples such as AlexNet [12], GoogleNet [13], VGGNet [14], the residual network (ResNet) [15,16], the densely connected convolutional network (DenseNet) [17], and EfficientNet [18] have achieved increased accuracy by adding more layers. Consequently, top-performing DNNs generally have deep and wide architectures with an enormous number of parameters, which significantly increases the training time and the computational cost. Moreover, it is challenging to reach a global or even a good local optimum when training a complex DNN from scratch on a large dataset such as ImageNet [19]. Transfer learning [20] is a reasonable candidate to address this limitation because it reuses the knowledge gained from solving one task when solving other similar tasks. When wide and deep DNNs are successfully trained, they usually contain a wealth of knowledge within their learned parameters. Therefore, well-known transfer learning based on CNN structures [21,22] directly reuses most of the pre-trained convolutional layers, which automatically learn hierarchical feature representations. Notably, CNN-based transfer learning allows accurate network models to be built quickly and easily by taking advantage of previous learning rather than beginning from scratch. In addition, although transfer [...]
The proposed method differs from the previous work on dense flow-based knowledge transfer [34] in two major ways. First, the flow-based knowledge in [34] is transferred by $l_2$-norm-based training when the dense flow extracted from the layers is transferred to a target model, whereas in this study the knowledge is transferred by adversarial optimization in a GAN-based structure. Second, the densely extracted flow-based knowledge in [34] is transferred to a target model sequentially, step by step, whereas this study uses layer-wise concurrent training when transferring the dense flow.

Generative Adversarial Networks
The GAN, comprising a generator and a discriminator, was proposed in [32] to capture the distribution of given data. The role of the generator is to generate newly synthesized (fake) data, and that of the discriminator is to determine whether an input sample is real data or fake data from the generator. The two are designed to compete in an adversarial optimization manner such that the generator captures the distribution of the given data and the discriminator correctly decides whether a sample is real or fake. Let G and D be the sets of weights of the generator and discriminator, respectively. Then, the min-max optimization problem to train G and D can be defined as follows:
$$\min_{G} \max_{D} \; L_{GAN}(G, D). \tag{1}$$
In (1), $L_{GAN}$ is defined as
$$L_{GAN}(G, D) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_{z}}[\log(1 - D(G(z)))], \tag{2}$$
where $P_{data}$ and $P_{z}$ denote the distributions of the real data and of the input noise of the generator, respectively. In practice, (1) is usually solved by iterating the minimization and maximization steps. However, it is challenging to find the optimal solution of (1) because of undesired saddle points. Therefore, several studies have sought to stabilize the convergence of the GAN algorithm, such as unrolled GANs [35], Wasserstein GANs [36], and least-squares GANs [37]. In addition, to improve the convergence and robustness of the optimization of (1), heuristic techniques such as feature matching, minibatch discrimination, and one-sided label smoothing were introduced in [38]. Another advanced GAN architecture, the robust deep convolutional generative adversarial network (DCGAN) [39], has been successfully applied to image processing tasks such as object removal and vector arithmetic [39], super-resolution [40,41], and denoising [42,43].
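To make the value function of the GAN concrete, the following minimal Python sketch estimates it from discriminator outputs on a small batch. The function name `gan_value` and the sample values are illustrative assumptions, not part of the original work; the discriminator maximizes this quantity and the generator minimizes it.

```python
import math

def gan_value(d_real, d_fake):
    """Empirical estimate of the GAN value function.

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives the value -2 * log(2).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value towards 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The gap between `undecided` and `confident` is what the discriminator gains by separating the two distributions, and what the generator tries to remove.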

Output-Distribution-Based Knowledge Transfer Using Adversarial Networks
Based on the general GAN architecture, an adversarial network-based approach to the knowledge transfer framework (KTF) was first introduced in [30]. The relaxed output probability of neural networks described in [27] was used as the pre-trained knowledge in that method. The GAN-based work in [30] considered two optimization procedures for knowledge transfer: solving a maximization problem for the discriminator and solving a minimization problem for the target DNN. These procedures update the discriminator and the target DNN for adversarial training in the KTF. The discriminator's goal is to distinguish whether a relaxed output distribution is provided by the pre-trained DNN or by the target DNN. In contrast, the target DNN, which plays the same role as the generator in the original GAN structure [32], is adversarially trained so that its relaxed output probability resembles that of the pre-trained DNN. Therefore, the knowledge transmitted through the discriminator leads the target DNN to produce output-layer probabilities similar to those of the pre-trained DNN. In addition, following [27], the cross-entropy loss for supervised training is included in the KTF. Experimentally, the authors showed that the adversarial network-based approach transfers output-distribution-based knowledge from the pre-trained DNN to the target DNN more effectively than the $l_2$-norm-based knowledge transfer [27], which does not use an adversarial training loss. However, the target DNN optimized using [30] is usually inferior to the original pre-trained DNN, especially when a wider and deeper neural network is used as the source DNN.

FSP-Based Knowledge Transfer Using Adversarial Network
An adversarial network-based KTF technique using FSP-based knowledge distillation was proposed in [31] to improve on the traditional adversarial network-based KTF that uses relaxed output distribution-based knowledge distillation [30]. As distilled knowledge, three flow-based items of source knowledge are extracted from the pre-trained DNN in the form of FSP matrices based on the input and output of each residual block in the ResNet structure. In [29], the FSP matrix was devised for flow-based knowledge distillation across two layers. Let $F(x; W) \in \mathbb{R}^{h \times w \times m}$ and $H(x; W) \in \mathbb{R}^{h \times w \times n}$ be two different feature maps with an input x and weights W in a ResNet. Then, the FSP matrix $G^W(x; W) = (g^W_{ij}) \in \mathbb{R}^{m \times n}$ between F and H is defined by
$$g^W_{ij} = \frac{\langle F_{:,:,i}, H_{:,:,j} \rangle_F}{h \times w}, \tag{3}$$
where $F_{:,:,i}$ and $H_{:,:,j}$ denote the i-th and j-th $h \times w$ matrices of F and H, respectively, and $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product between two matrices of the same size. Next, for knowledge transfer, the multiple items of flow-based source knowledge are transferred using an adversarial optimization procedure between the target ResNet and three discriminators. The target ResNet is trained to build FSP matrices that deceive the discriminators. Simultaneously, the discriminators are trained to distinguish FSP matrices created by the pre-trained ResNet from those created by the target ResNet. The adversarial optimization-based KTF approach [31] is thus a form of semi-supervised learning in which the target ResNet can (i) capture the distribution of the FSP-based source knowledge and (ii) simultaneously use a known dataset with true labels. According to the results of [31], the classification accuracy of the target ResNet trained using [31] is better than that of the existing adversarial optimization-based knowledge transfer method [30] because of the rich FSP-based source knowledge.
In addition, a target ResNet trained using [31] captures the original pre-trained knowledge distribution more accurately than one trained with the $l_2$-norm-based knowledge transfer method using FSP-based distilled knowledge [29].
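The FSP matrix can be computed in a few lines. The sketch below is a minimal NumPy illustration, assuming the normalization by $h \times w$ used in the FSP definition of [29]; the function name `fsp_matrix` and the random feature maps are hypothetical and serve only to show the shapes involved.

```python
import numpy as np

def fsp_matrix(F, H):
    """FSP matrix between feature maps F (h, w, m) and H (h, w, n).

    Entry (i, j) is the Frobenius inner product of channel i of F and
    channel j of H, divided by h * w (normalization assumed from [29]).
    """
    h, w, _ = F.shape
    assert H.shape[:2] == (h, w), "F and H must share the same spatial size"
    # Sum over both spatial axes, keep the two channel axes.
    return np.einsum("hwi,hwj->ij", F, H) / (h * w)

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 16))   # e.g. input of a residual block
H = rng.standard_normal((8, 8, 32))   # e.g. output of a residual block
G = fsp_matrix(F, H)                  # shape (16, 32)
```

Because the spatial axes are summed out, the size of the result depends only on the channel counts m and n, which is why the discriminators that receive these matrices can be sized from the filter configuration alone.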

Proposed Method
The proposed method is an adversarial training scheme that uses densely distilled flow-based knowledge from a pre-trained DNN and can efficiently optimize the KTF network for image classification tasks. The pre-trained information for the dense flow describes how detailed features in the lower layers are converted into abstracted features in the higher layers. Optimizing the proposed distilled-knowledge transfer requires efficient transmission of this dense flow-based information to the target DNN through multiple discriminators. In this section, we present the main concepts of the proposed method, which improve classification performance over prior KTF methods in terms of the knowledge transfer scheme. Figure 1 shows the adversarial-based KTF architecture, where flow-based knowledge is extracted from a pre-trained network. This knowledge is used to update the target and discriminator networks in an adversarial-optimization manner. The flow-based knowledge considered in this study is represented as an FSP matrix [29], which captures the direction between the input and output features of a residual module of the pre-trained ResNet. Specifically, in the adversarial-based KTF of Figure 1, the target network is trained so that its flow-based result deceives the discriminator network, whereas the discriminator network is trained to distinguish the flow-based knowledge created by the pre-trained network from the flow-based result of the target network. The notation for the adversarial-based knowledge optimization in Figure 1 is as follows. Let T, R, and D be the weights of the target, pre-trained, and discriminator networks, respectively. Let $G^T$ and $G^R$ denote the flow-based FSP matrices of the target and pre-trained networks, respectively.
$D(\cdot)$ denotes the discriminator output, a probability between zero and one, and its input is the FSP matrix of either the target or the pre-trained network. An output of one indicates that the input FSP matrix was created by the pre-trained network, and an output of zero indicates that it was generated by the target network. The loss function of the adversarial-based knowledge optimization for positive parameters α, β, and γ is given in (4).

Adversarial-Based Knowledge Optimization
In (4), the component loss terms are defined by (5) and (6). Here, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The adversarial-based optimization problem for knowledge transfer is then given as (8). To solve (8) on a mini-batch B, (8) is rewritten as (9). According to (9), (5) and (6) can be represented as (10) and (11), respectively, where $T(\cdot)$ denotes the output probability of the target network. For a mini-batch of size N, the corresponding loss term can be rewritten as (12). Regarding (12), we adopt an $l_2$-norm-based loss term, which has a blurring (softening) effect [31], when constructing the adversarial optimization-based loss function of (4). This term is expected to provide the target network with more information about the pre-trained source knowledge than flow-based knowledge without softening.
Based on the aforementioned loss functions, we can train the numerous discriminators and the target DNN of the adversarial network-based KTF fully and simultaneously to optimize the densely distilled knowledge transfer.
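As a rough illustration of how the three ingredients named in this section (an adversarial term, an $l_2$/Frobenius term on FSP matrices, and a supervised cross-entropy term) could be combined on one mini-batch, consider the NumPy sketch below. It is a sketch under stated assumptions, not the authors' exact formulation: the GAN-style form of the adversarial term, the averaging over the mini-batch, and in particular which of α, β, and γ weights which term are all assumed here, so the weights appear under the hypothetical names `w_adv`, `w_fsp`, and `w_ce`.

```python
import numpy as np

def adversarial_term(d_on_pretrained, d_on_target):
    # Mini-batch mean of log D(G^R) + log(1 - D(G^T)); D = 1 means "pre-trained".
    return float(np.mean(np.log(d_on_pretrained) + np.log(1.0 - d_on_target)))

def fsp_l2_term(G_R, G_T):
    # Mini-batch mean of the squared Frobenius norm of G^R - G^T.
    # Both arrays have shape (N, m, n).
    return float(np.mean(np.sum((G_R - G_T) ** 2, axis=(1, 2))))

def cross_entropy_term(probs, labels):
    # probs: (N, classes) output probabilities T(x); labels: (N,) class indices.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def total_loss(w_adv, w_fsp, w_ce, d_R, d_T, G_R, G_T, probs, labels):
    # Assumed weighting; the source does not show which symbol weights which term.
    return (w_adv * adversarial_term(d_R, d_T)
            + w_fsp * fsp_l2_term(G_R, G_T)
            + w_ce * cross_entropy_term(probs, labels))

N = 4
rng = np.random.default_rng(0)
G_R = rng.standard_normal((N, 16, 16))
G_T = G_R.copy()                  # identical flow, so the FSP distance is zero
d_R = np.full(N, 0.5)             # an undecided discriminator
d_T = np.full(N, 0.5)
probs = np.full((N, 10), 0.1)     # uniform predictions over 10 classes
labels = np.array([0, 3, 5, 9])
loss = total_loss(1.0, 1.0, 1.0, d_R, d_T, G_R, G_T, probs, labels)
```

In this layout the discriminator would ascend the adversarial term while the target network descends the weighted sum, which mirrors the min-max structure described in the text.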

Knowledge Transfer Scheme for Densely Distilled Flow-Based Knowledge
In our previous work [34], we introduced a dense flow-based knowledge transfer learning scheme for deep neural networks, where flow-based knowledge was densely extracted, with overlap, from a pre-trained ResNet. Compared to the original flow-based knowledge [29], the target ResNet obtained using dense flow-based training yielded higher performance owing to the rich information in the extended flow-based features. Notably, when training a target ResNet in [34], the densely extracted knowledge was delivered to the target ResNet sequentially, step by step. The dense flow was transferred using an $l_2$-norm-based loss function between the pre-trained and target ResNets. According to [34], concurrent flow-based training yielded lower accuracy than the bottom-up sequential training scheme when transferring the densely extracted pre-trained knowledge in a KTF. Therefore, the traditional $l_2$-distance-based similarity measure is limited in its ability to transmit several densely extracted knowledge items simultaneously.
Applying an adversarial network-based architecture, which uses discriminator networks rather than the $l_2$-norm-based training approach, to the knowledge transfer of dense flow can be an effective way to address this limitation. This is because a target network trained adversarially can capture the distribution of the pre-trained knowledge more accurately than one trained with the $l_2$-norm, which usually produces blurriness in image restoration. Therefore, even for densely extracted knowledge items, the adversarial training method can handle concurrent transfer of the densely distilled knowledge, whereas the traditional $l_2$-norm-based method cannot. Figure 2 shows the proposed adversarial network-based structure for concurrent knowledge transfer of the densely distilled flow-based knowledge for the popular ResNet model with three residual blocks in a KTF. Specifically, six FSP matrices $G^R_{i,j}$ ($i = 0, 1, 2$; $j = 1, 2, 3$; $i < j$) are extracted with dense overlap from the pre-trained ResNet, and the same number of FSP matrices $G^T_{i,j}$ are generated from the target ResNet. Then, the target ResNet is trained using the same number of discriminators $D_{i,j}$ such that its flow-based features become as close as possible to the actual features of the pre-trained ResNet by deceiving the discriminators. Meanwhile, the discriminators are trained to distinguish FSP matrices extracted from the pre-trained ResNet from those generated by the target ResNet. Notably, a single discriminator is assigned to each pair of FSP matrices from the pre-trained and target ResNets. For the discriminator architecture, we adopted a multi-layer perceptron (MLP)-based discriminator with M linear units [31] rather than the popular CNN-based discriminator [39], considering the computational bottleneck of the layer-wise dense flow (LDF)-based knowledge transfer scheme.
Here, a single linear unit comprises a fully connected layer, a batch normalization layer, and a leaky rectified linear unit [31]. Thus, the flow-based ResNet layers are densely trained, as more enhanced information can be transmitted to the target ResNet fully and simultaneously using the overlapping flow-based features and the densely designed discriminators.
Figure 2. Adversarial concurrent knowledge transfer structure using the layer-wise dense flow. In this figure, $G^R_{i,j}$, $G^T_{i,j}$, and $D_{i,j}$ refer to the FSP matrix of the pre-trained ResNet, the FSP matrix of the target ResNet, and the discriminator, respectively. Only the variables in the dotted box marked "Trainable Variables" are used for training.

Adversarial-Based Loss Functions for Knowledge Transfer Using Layer-Wise Dense Flow
In Section 3.1, we presented the loss functions of the adversarial-based optimization for distilled-knowledge-based transfer. Furthermore, dense flow-based feature extraction can enhance the original flow-based knowledge distillation, as described in Section 3.2. Therefore, by applying LDF-based knowledge transfer to the adversarial-based knowledge optimization, the proposed loss functions for the dense flow can be derived as follows. Consider pre-trained and target ResNets that each consist of M residual blocks. Let $G^T_{i,j}$ and $G^R_{i,j}$ be the FSP matrices between the input feature of the (i + 1)-th residual block and the output feature of the j-th residual block in the target and pre-trained ResNets, respectively, and let $D_{i,j}$ be the discriminator for $G^T_{i,j}$ and $G^R_{i,j}$, where $i = 0, 1, \cdots, M - 1$ and $j = 1, 2, \cdots, M$ such that $i < j$. For each such pair (i, j), the adversarial loss $L^{adv}_{i,j}$ and the FSP loss $L^{FSP}_{i,j}$ are defined in (13) and (14), respectively. Then, together with the supervised cross-entropy loss function with true labels, the final loss function $L_{LDF}$ of the LDF-based KTF for semi-supervised knowledge transfer is defined in (15), where α, β, and γ denote positive control parameters for the adversarial-based knowledge optimization. To optimize the densely designed discriminators, we first initialize the variables T and D using the well-known Gaussian distribution-based weight initialization. Then, we simultaneously update the discriminators by maximizing the adversarial loss of (15) with respect to D while freezing the variables T of the target ResNet. In this stage, each discriminator learns to make an optimal binary decision as to whether a flow-based feature is generated by the target ResNet or by the pre-trained ResNet.
Next, we update the target ResNet by applying stochastic gradient descent to (15) with respect to T for fixed D. The target ResNet tries to generate LDF-based features similar to the real LDF-based features of the pre-trained ResNet. Simultaneously, the target ResNet is trained to perform an ordinary classification task using the true labels.
Thus, adversarial-based knowledge transfer using dense flow is implemented by alternately updating D and T until the number of iterations reaches a predefined threshold. The entire learning procedure for the proposed method is summarized in Algorithm 1.
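The alternating schedule described above can be sketched as a small Python skeleton. This is only an outline of the control flow, assuming one discriminator update followed by one target update per iteration; the function names and the stub update callbacks are hypothetical stand-ins for real optimizer steps and do not reproduce Algorithm 1.

```python
def train_alternating(num_iterations, update_discriminators, update_target):
    """Alternate a discriminator step and a target step in each iteration."""
    log = []
    for step in range(num_iterations):
        # Step 1: ascend the adversarial loss w.r.t. D while T is frozen.
        log.append(("D", update_discriminators(step)))
        # Step 2: descend the full loss w.r.t. T while D is frozen.
        log.append(("T", update_target(step)))
    return log

# Stub updates standing in for real gradient steps (illustrative only).
def fake_update_discriminators(step):
    return f"ascent step {step}"

def fake_update_target(step):
    return f"descent step {step}"

log = train_alternating(3, fake_update_discriminators, fake_update_target)
```

Passing the two updates as callbacks keeps the schedule independent of the framework used for the networks, so the same loop could drive any number of discriminators inside `update_discriminators`.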

Experiments
In this section, we analyze the proposed method using two benchmark datasets, CIFAR-10 and CIFAR-100 [44]. For CIFAR-10, we adopted a ResNet structure with three residual modules with {16, 32, 64} filters [29] for the pre-trained and target DNNs in the KTF. For CIFAR-100, we used a wide ResNet structure with {64, 128, 256} filters, four times as many as for CIFAR-10, considering the small number of training images per class. In this experiment, there are six discriminators: $D_{0,1}$, $D_{0,2}$, $D_{0,3}$, $D_{1,2}$, $D_{1,3}$, and $D_{2,3}$. Each discriminator is a multilayer perceptron [31]. The number of linear units in each discriminator for CIFAR-10 and CIFAR-100 is listed in Table 1. The number of MLP-based linear units in each discriminator $D_{i,j}$ is determined by the size of the corresponding $G_{i,j}$. For example, in CIFAR-10, ordering the Gramian matrices by their number of elements gives $G_{0,1} < G_{0,2} = G_{1,2} < G_{0,3} = G_{1,3} < G_{2,3}$, because their dimensions are $G_{0,1} \in \mathbb{R}^{16 \times 16}$, $G_{0,2}, G_{1,2} \in \mathbb{R}^{16 \times 32}$, $G_{0,3}, G_{1,3} \in \mathbb{R}^{16 \times 64}$, and $G_{2,3} \in \mathbb{R}^{32 \times 64}$. For this reason, the numbers of linear units of the discriminators were ordered accordingly, and their final values, chosen based on several experiments, are given in Table 1. Similarly, for CIFAR-100, the number of linear units of the discriminators was set experimentally according to the Gramian matrix dimensions, as shown in Table 1.
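The six discriminator indices and the Gramian matrix dimensions quoted above follow directly from the filter configuration. The short Python sketch below enumerates the pairs (i, j) with i < j and derives each matrix shape for the CIFAR-10 setting. The helper name `dense_fsp_shapes` is hypothetical, and the per-block input channel counts [16, 16, 32] are an assumption chosen to match the dimensions stated in the text.

```python
def dense_fsp_shapes(in_channels, out_channels):
    """Shapes of the densely extracted FSP matrices G_{i,j} with i < j.

    in_channels[i]:  channels of the input feature of residual block i + 1.
    out_channels[k]: channels of the output feature of residual block k + 1.
    """
    num_blocks = len(out_channels)
    shapes = {}
    for i in range(num_blocks):                  # i = 0, ..., M - 1
        for j in range(i + 1, num_blocks + 1):   # j = i + 1, ..., M
            shapes[(i, j)] = (in_channels[i], out_channels[j - 1])
    return shapes

# CIFAR-10 configuration with {16, 32, 64} filters (input channels assumed).
shapes = dense_fsp_shapes([16, 16, 32], [16, 32, 64])

# Order the pairs by the number of matrix elements, smallest first.
by_size = sorted(shapes, key=lambda pair: shapes[pair][0] * shapes[pair][1])
```

Sorting by element count reproduces the ordering used to size the discriminators: the smallest matrix is $G_{0,1}$ and the largest is $G_{2,3}$.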
The experimental conditions for the proposed method were as follows. The values of α and β in the loss function (15) are listed in Table 2, and γ was set to 0.01 in all experiments. The target ResNet and the discriminators were both optimized with the RMSProp algorithm [45]. Each target ResNet in the KTF was trained for 64,000 iterations with a batch size of 256. Here, $lr_T$ and $lr_D$ denote the initial learning rates for training the target ResNet and the discriminators, respectively. In our experiments, both learning rates followed a step schedule: each was reduced to 0.1 times its initial value after 32,000 iterations and to 0.01 times its initial value after 48,000 iterations.
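The step schedule for the learning rates can be written as a small function. This is a minimal sketch: the function name `learning_rate` is hypothetical, the boundaries are assumed to apply from iterations 32,000 and 48,000 onward, and the initial value used in the example is a placeholder rather than a value from Table 2.

```python
def learning_rate(initial_lr, iteration):
    """Step schedule: x0.1 from 32,000 iterations, x0.01 from 48,000 iterations."""
    if iteration >= 48_000:
        return initial_lr * 0.01
    if iteration >= 32_000:
        return initial_lr * 0.1
    return initial_lr

# Placeholder initial rate; the actual values are those listed in Table 2.
schedule = [learning_rate(0.1, it) for it in (0, 31_999, 32_000, 48_000, 63_999)]
```

The same function can serve both $lr_T$ and $lr_D$, since the text applies one schedule to the two initial rates.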

Dense Flow-Based Knowledge Distribution
This section discusses the ability of the proposed method to deliver LDF-based distilled knowledge in the KTF. To evaluate how well the target ResNet learned the pre-trained knowledge, we used the LDF-based knowledge distribution as a performance metric. Figure 3 shows the Gramian matrix distributions of (i) the pre-trained ResNet, which transmits the LDF-based knowledge, and (ii) the target ResNet, which receives the transferred knowledge. In addition, for reference, we include the Gramian distributions of the original ResNet trained without pre-trained knowledge. In Figure 3, we used CIFAR-100 as the training dataset and adopted a 32-layer ResNet and an 8-layer ResNet as the pre-trained and target DNNs in the KTF, respectively.
We observed that all Gramian matrix distributions of the original 8-layer ResNet, which is trained without knowledge transfer, differ significantly from those of the pre-trained ResNet. In contrast, with the proposed method, the distributions of the target ResNet are largely similar to those of the pre-trained ResNet. Furthermore, as shown in Figure 3, the knowledge associated with low-level features closely matches the distributions from the pre-trained ResNet. Although the distributions of the pre-trained and target ResNets are generally in agreement, the knowledge based on high-level features, namely $G_{0,3}$, $G_{1,3}$, and $G_{2,3}$, is matched slightly less well than the distributions for $G_{0,1}$, $G_{0,2}$, and $G_{1,2}$.

Evaluation of the Proposed Method for Dense Flow
In these experiments, we compare the performance of the proposed adversarial-based method with that of the existing $l_2$-norm-based method in terms of knowledge transfer of the LDF-based distilled knowledge. The parameters in (15) and the hyper-parameters related to the learning rates are given in Table 2. For the same pre-trained ResNet, Table 3 shows the training results on the CIFAR-10 dataset using a 14-layer target ResNet, and Table 4 presents the training results on the same dataset using a 20-layer target ResNet.
In addition, Table 5 shows the results of using a 32-layer pre-trained ResNet and a 14-layer target ResNet on the CIFAR-100 dataset. In Tables 3-5, the proposed method achieves better accuracy than the prior $l_2$-norm-based methods [34], which follow either a sequential or a concurrent training approach based on a dense flow scheme. Here, sequential training involves repeated bottom-to-top transfer of the dense flow between the pre-trained and target DNNs, whereas the concurrent approach transmits the dense flow into the target DNN simultaneously. Furthermore, because the pre-trained DNN model used in [34] and the one used in our experiment differ in performance, we compared the relative difference between the performance of the target and pre-trained DNNs. The relative difference is calculated from $Acc_P$ and $Acc_T$, which denote the accuracies of the pre-trained and target DNNs, respectively. For CIFAR-10, there was no significant difference in classification accuracy between the 26-layer pre-trained ResNet adopted in [34] ($Acc_P$ = 91.91%) and the pre-trained ResNet used in our experiment ($Acc_P$ = 91.79%). In contrast, the 32-layer pre-trained ResNet in [34] performs worse on CIFAR-100 than the 32-layer pre-trained ResNet used in our experiment, although the two have the same number of layers. This is because the pre-trained ResNet model used in [34] did not adopt any data pre-processing, resulting in $Acc_P$ = 64.69%, whereas we adopted a pre-trained ResNet with the data pre-processing of [15], resulting in $Acc_P$ = 74.70%. Following [15], we padded each image by 4 pixels, took a random 32 × 32 crop, and applied random flipping; the same pre-processing was used for knowledge transfer.
Tables 3-5 show that the proposed adversarial training yields a better relative difference than the previous $l_2$-norm-based training experiments [34]. In addition, deeper and more complex ResNet structures yield better accuracy in terms of knowledge transfer with the proposed method.

Comparison of Knowledge Transfer Performance
In addition, as shown in Tables 6 and 7, we compare the results of the existing knowledge transfer methods with those of the proposed method. In both existing methods, flow-based knowledge was chosen as the distilled knowledge of the pre-trained DNNs. The difference between these two techniques is the design of the loss function used for knowledge transfer in a KTF. In [3], knowledge transfer was performed using the $l_2$-loss as the cost function of the flow-based distilled knowledge. In the previous method [31], an adversarial loss was used in the cost function to transfer the flow-based knowledge. In contrast to these two methods, we proposed an adversarial-based knowledge transfer method coupled with the layer-wise overlapping dense flow.
Each performance value in Tables 6 and 7 is the average of the three highest values from five experiments. The results indicate that both existing methods, [3] and [31], achieve better classification accuracy than all original ResNets trained without knowledge transfer. However, the target ResNet obtained using the proposed approach outperforms both methods. As mentioned in Section 4.2, a deeper and more complex network structure obtains a larger performance gain.
Table 8 reports the total number of floating-point operations (FLOPs) required for inference with the pre-trained and target ResNets. The FLOPs ratio (T/R) in Table 8 is the ratio of the total number of FLOPs in the target ResNet to that in the pre-trained ResNet. First, the CIFAR-10 results in Table 6 show that the pre-trained and target DNNs in the KTF perform similarly when the target DNN is a 14-layer ResNet, whereas the target requires 50% or less of the FLOPs for inference, as shown in Table 8. In addition, the 20-layer target ResNet obtained using the proposed method is superior to the pre-trained 26-layer ResNet, which has more layers, and it also provides improved accuracy compared to the two existing knowledge transfer methods. For CIFAR-100, as shown in Tables 7 and 8, the 14-layer target ResNet achieves accuracy 1.2% higher than that of the 32-layer pre-trained ResNet while requiring only 37.6% of its inference complexity. In particular, the classification accuracy of the 20-layer ResNet in Table 7 is 77.32%, which is similar to or slightly higher than the 77.29% of the 1001-layer ResNet [46].

Conclusions
In this study, we proposed an adversarial-based knowledge transfer approach using densely distilled layer-wise flow-based knowledge of a pre-trained deep neural network for image classification tasks. The proposed knowledge transfer framework is composed of a pre-trained ResNet from which LDF-based knowledge is extracted, a target ResNet that receives the extracted knowledge, and densely placed discriminators that transfer the knowledge through adversarial optimization. In particular, to process the LDF-based knowledge distilled from the pre-trained ResNet, the proposed framework was implemented as a semi-supervised learning technique that uses numerous discriminators for adversarial training and true labels for conventional training. In addition, we designed several adversarial-based loss functions suitable for densely distilled flow-based knowledge transfer. Among these, an $l_2$ distance-based loss function on the densely generated FSP matrices was included to deliver more LDF-based feature information to the target ResNet while maintaining stability through adversarial optimization-based knowledge transfer. With the devised loss functions and the adversarial-based knowledge transfer scheme, the proposed method can concurrently update the numerous discriminators and the target ResNet.
To validate the performance of the proposed method in terms of knowledge transfer accuracy, we used the CIFAR-10 and CIFAR-100 benchmark datasets and considered various ResNet architectures with different numbers of layers for the pre-trained source and target models. For all LDF distributions, the results demonstrated that the proposed approach transferred the rich pre-trained information of the dense flow, spanning low-level detailed and high-level abstract knowledge, more accurately than the existing $l_2$-norm-based approach. Furthermore, the small target ResNet obtained from the proposed layer-wise concurrent training yielded higher accuracy than the existing knowledge transfer methods considered in this study and even than the original complex pre-trained ResNet. In future work, we plan to use more complicated CNN-based architectures to further analyze the effect of knowledge distributions so that the parameters of the discriminators can be dynamically optimized in the adversarial learning process for a flow-based feature that has a two-dimensional image shape. We will also apply the proposed knowledge transfer to DNN models that have a different form from the ResNet and analyze the results.