EXPANSE: A Deep Continual / Progressive Learning System for Deep Transfer Learning

Deep transfer learning techniques try to tackle the limitations of deep learning, the dependency on extensive training data and the training costs, by reusing obtained knowledge. However, the current DTL techniques suffer from either catastrophic forgetting dilemma (losing the previously obtained knowledge) or overly biased pre-trained models (harder to adapt to target data) in finetuning pre-trained models or freezing a part of the pre-trained model, respectively. Progressive learning, a sub-category of DTL, reduces the effect of the overly biased model in the case of freezing earlier layers by adding a new layer to the end of a frozen pre-trained model. Even though it has been successful in many cases, it cannot yet handle distant source and target data. We propose a new continual/progressive learning approach for deep transfer learning to tackle these limitations. To avoid both catastrophic forgetting and overly biased-model problems, we expand the pre-trained model by expanding pre-trained layers (adding new nodes to each layer) in the model instead of only adding new layers. Hence the method is named EXPANSE. Our experimental results confirm that we can tackle distant source and target data using this technique. At the same time, the final model is still valid on the source data, achieving a promising deep continual learning approach. Moreover, we offer a new way of training deep learning models inspired by the human education system. We termed this two-step training: learning basics first, then adding complexities and uncertainties. The evaluation implies that the two-step training extracts more meaningful features and a finer basin on the error surface since it can achieve better accuracy in comparison to regular training. EXPANSE (model expansion and two-step training) is a systematic continual learning approach applicable to different problems and DL models.


Introduction
In recent years, deep learning (DL) has successfully addressed many challenging applications, in particular problems involving nonlinear datasets. Recent advances in deep learning methods have found applications in widely different areas such as image processing, natural language processing (NLP), numerical data analysis and prediction, and voice recognition. However, deep learning comes with restrictions, such as expensive training (in time and computation) and the need for extensive labeled training data [1].
Since the start of the Machine Learning (ML) era, transfer learning has been an attractive research topic. Before the rise of deep learning models, transfer learning was known as domain adaptation and, because of the nature of ML algorithms, focused on homogeneous datasets and how to relate such sets to each other [2,3]. Traditional ML models depend less on dataset size, and their training is usually less costly than that of deep learning models since they have mostly been designed for linear problems. Therefore, the motivation for using transfer learning in deep learning is higher than ever in the AI (Artificial Intelligence) and ML fields since it can address the two restraints of extensive training data and training costs.
Recent transfer learning methods for deep learning aim to reduce training time and cost, as well as the need for extensive training datasets, which can be hard to collect in some areas such as medical imaging. Moreover, a model pre-trained for a specific job can be run on a simple edge device, such as a cellphone, with limited processing capacity and limited training time [4]. Also, developments in DTL are opening the door to more intuitive and sophisticated AI systems since they treat learning as a continuous task. A great example of this idea is Google's DeepMind project and advancements such as progressive learning [5]. All this is bringing DTL to the forefront of research in artificial intelligence and machine learning.
In this paper, first, the definition of DTL is reviewed, followed by the taxonomy of DTL. Then, selected recent practical studies of DTL are listed, categorized, and summarized. Moreover, two experimental evaluations of DTL and their conclusions are reviewed. Last but not least, we discuss the limitations of today's DTL techniques and possible ways to tackle them.

Deep Learning
Deep learning (DL), based on deep neural networks (DNNs), is a machine learning subcategory that can deal with nonlinear datasets. DNNs consist of layers of stacked nodes with activation functions and associated weights; the layers are fully or partially connected and are usually trained (their weights adjusted) by backpropagation and optimization algorithms. During the past two decades, DNNs have developed rapidly and are used in many aspects of our daily lives today. For instance, Convolutional Neural Network (CNN) layers have improved deep learning models for visual tasks since 2011, and as of today, most DL models use CNN layers [1]. For more details about machine learning and deep learning, please refer to [1]; this paper focuses on deep transfer learning, and we assume that the reader has a thorough understanding of machine learning and deep learning.

Deep Transfer Learning (DTL)
Deep transfer learning is about using the knowledge obtained from another task and dataset (even one not strongly related to the target task or dataset) to reduce learning costs. In many ML problems it is impossible to gather the large amount of labeled data that most DL models require. For instance, at the beginning of the Covid-19 pandemic, or even a year into it, providing enough labeled chest X-Ray data for training a deep learning model was still challenging, whereas models using deep transfer learning detected the disease with very high accuracy from a limited training set [17,18]. Another application is running machine learning on edge devices such as phones for various tasks, taking advantage of deep transfer learning to reduce the need for processing power.
An untrained DL model starts from randomly initialized weights, and during the expensive training process an optimization algorithm adjusts those weights toward the best values for a specific task (dataset). Remarkably, [6] showed that initializing those weights from a network trained on even a very distant dataset improves training performance compared to random initialization.
Deep transfer learning differs from semi-supervised learning since, in DTL, the source and target datasets can have different distributions and merely be related to each other, while in semi-supervised learning, the source and target data are from the same dataset, only the target set does not have the labels [2]. DTL is also not the same as Multiview learning since Multiview learning uses two or more distinct datasets to improve the quality of one task, e.g., video datasets can be separated into image and audio datasets [2]. Last but not least, DTL differs from Multitask learning despite many shared similarities. The most fundamental difference is that in Multitask learning, the tasks use interconnections to boost each other, and knowledge transfer happens concurrently between related tasks. In contrast, in DTL the target domain is the focus, the knowledge has already been obtained for target data from source data, and the two do not need to be related or function simultaneously [2].

From Transfer Learning to Deep Transfer Learning, Taxonomy
Deep transfer learning (DTL) methods can be categorized in different ways by various criteria, similar to transfer learning methods in general. DTL methods can be divided into homogeneous and heterogeneous categories based on the homogeneity of source and target data [2]. However, this categorization is subjective and relative, so it can be drawn differently. For example, a dataset of X-Ray photos can be considered heterogeneous to a dataset of tree species photos when the comparison domain is limited to image data only. In contrast, it can be considered homogeneous to the same tree species photo dataset when the comparison domain also includes audio and text datasets. Also, DTL methods can be categorized into three groups based on the label setting: (i) transductive, (ii) inductive, and (iii) unsupervised [2]. Briefly, the setting is transductive when only the source data is labeled, inductive when both source and target data are labeled, and unsupervised when none of the data is labeled [2].
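The label-setting rule above can be written down as a small lookup. This is only an illustrative sketch of the three cases as stated in this section; the function name is hypothetical, and the case in which only the target data is labeled is not covered by the description above, so the sketch rejects it rather than guessing.

```python
def dtl_label_setting(source_labeled: bool, target_labeled: bool) -> str:
    """Map the availability of labels to the DTL label-setting category.

    Follows the three cases described in the text. The fourth combination
    (only the target data labeled) is not described there, so it raises.
    """
    if source_labeled and target_labeled:
        return "inductive"
    if source_labeled and not target_labeled:
        return "transductive"
    if not source_labeled and not target_labeled:
        return "unsupervised"
    raise ValueError("only-target-labeled is not covered by this categorization")
```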
[2] and [7] mention and define another categorization of DTLs through the aspect of applied approaches. They similarly categorized DTLs into four groups: (i) instance-based, (ii) feature-based / mapping-based, (iii) parameter-based / network-based, and (iv) relational-based / adversarial-based approaches. Instance-based transfer learning approaches are based on using selected parts of instances (or all) in source data and applying different weighting strategies to be used with target data. Feature-based approaches map instances (or some features) from both source and target data into more homogeneous data. Further, the [2] survey divides the feature-based category into asymmetric and symmetric feature-based transfer learning subcategories. "Asymmetric approaches transform the source features to match the target ones. In contrast, symmetric approaches attempt to find a common latent feature space and then transform both the source and the target features into a new feature representation." [2] The network-based (parameter-based) methods are about using the obtained knowledge in the model (network) with different combinations of pre-trained layers: freezing some and/or finetuning some and/or adding some fresh layers. Relational/adversarial-based approaches focus on extracting transferable features from both source and target data either using the logical relationship or rules learned in the source domain or by applying methods inspired by generative adversarial networks (GAN) [2,7]. Figure 1 shows the taxonomy of the above-mentioned categories [2].
Other than the network-based and adversarial-based approaches, all other categories have been explored deeply during the last couple of decades for different ML techniques known as domain adaptation or transfer learning [2,3]. However, most of those techniques are still applicable to deep transfer learning (DTL) as well. Network-based (parameter-based) approaches are the most applied techniques in DTL since they can tackle the domain adaptation between source and target data by adjusting the network (model). In other words, deep transfer learning is mainly focused on network-based approaches. Remarkably, network-based approaches in deep learning models can even tackle the adaptation of very distant source and target data [2,7].
In deep transfer learning (DTL), different techniques are applied for network-based approaches, although generally, they are combinations of pretraining, freezing, finetuning, and/or adding fresh layer(s). A deep learning network (DL model) trained on source data is called a pre-trained model and consists of pre-trained layers. Freezing and finetuning are techniques that use some or all layers of a pre-trained model to train the model on target data. Freezing some layers means that the parameters/weights of those layers keep their pre-trained values and do not change during training. Finetuning means that the parameters/weights, for the whole network or for some selected layers, are initialized with the pre-trained values instead of random values and are then trained further. Another recent DTL technique is based on freezing a pre-trained model and adding new layers to that model for training on target data; Google's DeepMind project introduced this technique in 2016 as Progressive Learning / progressive neural networks (PNNs) [5,8].
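The difference between finetuning, freezing, and adding fresh layers can be sketched with a toy model in which each layer is just a weight matrix plus a trainable flag. This is a minimal illustration under simplifying assumptions, not an implementation of any cited method: the dict layout and all function names are hypothetical, and the gradient values are supplied by the caller rather than computed by backpropagation.

```python
import copy
import random


def make_layer(n_in, n_out, rng):
    """A dense layer as a plain dict: random weights plus a trainable flag."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return {"weights": weights, "trainable": True}


def finetune(pretrained):
    """Finetuning: start from the pre-trained weights, keep every layer trainable."""
    model = copy.deepcopy(pretrained)
    for layer in model:
        layer["trainable"] = True
    return model


def freeze_early_layers(pretrained, n_frozen):
    """Freezing: the first n_frozen layers stay constant, the rest are finetuned."""
    model = copy.deepcopy(pretrained)
    for index, layer in enumerate(model):
        layer["trainable"] = index >= n_frozen
    return model


def add_fresh_layer(pretrained, n_out_new, rng):
    """Progressive-style setup: freeze everything, append one fresh trainable layer."""
    model = copy.deepcopy(pretrained)
    for layer in model:
        layer["trainable"] = False
    n_in_new = len(model[-1]["weights"])  # output width of the last pre-trained layer
    model.append(make_layer(n_in_new, n_out_new, rng))
    return model


def sgd_step(model, grads, lr=0.01):
    """Apply one gradient step, skipping every frozen layer."""
    for layer, grad in zip(model, grads):
        if not layer["trainable"]:
            continue
        for r, row in enumerate(layer["weights"]):
            for c in range(len(row)):
                row[c] -= lr * grad[r][c]
```

In this sketch the only thing that distinguishes the three techniques is which layers the update step is allowed to touch, which is also why frozen layers cannot forget and cannot adapt.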
The concept of progressive learning mimics human skill learning, in which a new skill is added on top of previously learned skills that serve as its foundation. For example, a child learns how to run after learning to crawl and walk, using all the skills obtained in the process. Similarly, in contrast to finetuning techniques, PNNs prevent catastrophic forgetting in DTL by freezing the whole pre-trained model and learning (adjusting to) the new task by training the newly added layers on top of previously trained layers [5,8].
In deep learning models, usually, the earlier layers do the feature extraction at a high level of detail, further layers towards the end extract the information and conceptualize the given data, and the later layers do the classifications or predictions. For instance, in an image-related model, the earlier CNN layers extract the edges, corners, and tiny patches of a given image. Further layers put those details together to detect objects or faces, and the later layers, usually fully connected layers, do the classification [9]. Given this process, the most effective and efficient approach for DTL, to our knowledge, is to freeze the earlier and middle layers from a related pre-trained model and finetune the later layers for the new task/dataset [10]. Similarly, in progressive learning the new layers are added to the last part of a pre-trained model.
Nonetheless, some other research in this area uses combined and more sophisticated methods to tackle transfer learning in deep learning, such as ensembled networks, weighting strategies, etc. [2]. However, to our knowledge, a search for recent advancements in DTL for practical tasks mostly yields network-based methods and a limited number of adversarial-based approaches.

Review of Recent Advancements in DTL
We limited our selection to the last five years of published studies on deep transfer learning for various tasks and data types. Table 1 shows the list of works selected from hundreds of reviewed publications, sorted by their DTL approaches. We used the systematic literature review (SLR) technique [11] to find and select these thirty-eight publications. The inclusion criteria for our selection process are as follows: a) published in the past five years, b) reproducible (detailed implementation and models), c) applied to practical ML problems, and d) generalizable. We found that the reviewed studies mostly fall into three categories of network-based approaches, and a few into the adversarial-based approach, all of which are explained in the previous section. We name these approaches as follows (Figure 2): (i) Finetuning: finetuning a pre-trained model on target data; (ii) Freezing CNN layers: the earlier CNN layers are frozen, and only the later fully connected layers are finetuned; (iii) Progressive learning: some or all layers of a pre-trained model are selected and used frozen, and some fresh layers are added to the model to be trained on target data; and (iv) Adversarial-based: extracting transferable features from both source and target data using adversarial or relational methods.
The most common DTL method is to take a model trained on a dataset highly related to the target data and finetune it on the target data (finetuning). The simplicity of applying this technique makes it the most popular DTL method in our selection; 21 of 38 selected works have used this method. This method can improve training on target data in various ways, such as reducing training costs and tackling the need for an extensive target dataset. However, it is still prone to catastrophic forgetting. Nevertheless, it is a very effective DTL method for many tasks and datasets in various fields such as medical, mechanics, art, physics, security, etc. Also, it has been applied to both image datasets and tabular (numerical) datasets as listed in Table 1.
The second most popular approach in DTL is freezing the CNN layers in a pre-trained model and finetuning only the later fully connected layers (Freezing CNN layers). CNN layers extract features from the given dataset, and the fully connected layers are responsible for classification; in this method, the latter are finetuned to the new task for target data. [33-42] are sample research publications that have used this method for different data types such as image and tabular data, as listed in Table 1. This technique is specific to models consisting of CNN layers; however, it can be extended to other deep learning models by assuming that the earlier and middle layers act similarly to CNN layers for feature extraction.
Using well-known models such as VGG-Net, Alex-Net, and Res-Net, which have already been trained on ImageNet datasets [50], is a general approach for both of the techniques mentioned above since they are easily accessible and are pre-trained to high accuracy. It is worth mentioning that such training can take days of processing time even with clusters of GPUs/TPUs, and the mentioned methods skip the pre-training step by simply downloading a publicly available pre-trained model.
[43-46] are based on the progressive learning method, also known as progressive neural networks (PNNs), described earlier. [44] evaluates the effectiveness of progressive learning for common natural language processing (NLP) tasks: sequence labeling and text classification. Through evaluation and comparison of applying PNNs to various models, datasets, and tasks, they show how PNNs improve DL models' accuracy by avoiding the catastrophic forgetting that occurs in finetuning techniques. [43,45,46] use PNNs for image and audio datasets and similarly find tangible improvements in comparison to other DTL techniques.
[47] and [48] are examples of adversarial-based approaches that we found in the literature. In [47], conditional generative adversarial networks (CGAN) are used to expand the limited target data of chest X-Ray images for a DTL model that detects Covid-19. [48] applies domain adversarial training to obtain the shared features between multiple source datasets.
Moreover, we found some tailored DTL methods for specific tasks and datasets, such as [49]. The method proposed in [49], as the authors describe it, is based on a "three-layer sparse auto-encoder to extract the features of raw data, and applies the maximum mean discrepancy term to minimizing the discrepancy penalty between the features from training data and target data." They tailor that method to smart industry fault diagnosis problems and achieve 99.82% accuracy, which is better than other approaches like deep belief network, sparse filter, deep learning, and support vector machine. Such tailored DTL approaches are usually not easy to generalize to different tasks or datasets. Nonetheless, they can open the door to interesting and new techniques in deep transfer learning's future.

Experimental Analyses of Deep Transfer Learning
In this section we review two remarkable experimental evaluations of DTL techniques. The tests' setup, analysis, and conclusions are noteworthy for applying DTL techniques in different scenarios. "What is being transferred in transfer learning?" [51] is a recent experimental study which uses a series of tests on the visual domain and deep learning models and tries to investigate what makes a successful transfer and which part of the network is responsible for that. To do so, they analyze networks in four different cases: (i) pre-trained network, (ii) random initialized network, (iii) finetuned network on target domain after pretraining on source domain, (iv) trained network from random initialization [51]. Moreover, to characterize the role of feature reuse, they use a source (pre-train) domain containing natural images (IMAGENET), and a few target (downstream) domains with decreasing visual similarities from natural images: DOMAINNET real, DOMAINNET clipart, CHEXPERT (medical chest X-Rays) and DOMAINNET quickdraw [51].
The study shows that feature reuse plays a key role in deep transfer learning: compared to randomly initialized models, a model pre-trained on IMAGENET shows the largest performance improvement on the real domain, which shares similar visual features (natural images) with IMAGENET. Also, they run a series of experiments in which the image blocks are shuffled (with different block sizes). These experiments show that feature reuse plays a very important role in transfer learning, particularly when the target domain shares visual features with the source domain. However, they find that feature reuse is not the only reason for the success of deep transfer learning, since even for distant targets such as CHEXPERT and quickdraw they still observe performance boosts from deep transfer learning. Additionally, in all cases pre-trained models converge much faster than random initialized models [51].
Further, they manually analyze common and uncommon mistakes in the training of randomly initialized versus pre-trained models. They observe that data samples marked incorrect in the pre-trained model and correct in the randomly initialized model are mostly ambiguous samples. On the other hand, the majority of the samples that a pre-trained model marked correct and a randomly initialized model marked incorrect are straightforward samples. This means that a pre-trained model has a stronger prior, and it is harder to adapt to the target domain. Moreover, using centered kernel alignment to measure feature similarities, they conclude that the initialization point drastically impacts feature similarity, and two networks with high accuracy can have different feature spaces. Also, they discover similar results for distance in parameter space, in which two random-initialized models are farther from each other than two pre-trained models [51].
In regard to performance barriers and basins in the loss landscape, they conclude that the network stays in the same basin of solution when finetuning a pre-trained network. They reach this conclusion by training pre-trained models from two random runs as well as training random initialized models twice and comparing. Even when a random initialized model is trained two times with the same random values, the models end up in different basins [51].
Module criticality is an interesting analysis of deep learning models. Usually, in a deep CNN model each CNN layer is considered a module, while in some models a component of the network can be considered a module. To measure the criticality of a module, it is possible to take a trained model, re-initialize one module at a time, and compare the resulting drops in model accuracy. Adopting this technique, the authors of [51] discovered: (i) fully connected layers (near the model output) become critical for the pre-trained (P-T) model, and (ii) module criticality increases moving from the input side of the model towards the output, which is consistent with the concept of earlier layers (near the input) extracting more general features while later layers have features that are more specialized for the target domain.
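The re-initialization procedure for measuring module criticality can be sketched as a short loop. This is a simplified illustration of the idea as described above, not the exact protocol of [51]: the model is assumed to be a dict from module name to parameters, `evaluate` is a hypothetical caller-supplied function returning accuracy, and the toy numbers in the usage part are arbitrary placeholders rather than measurements.

```python
def module_criticality(trained, initial, evaluate):
    """Return the accuracy drop caused by re-initializing each module in turn.

    trained, initial: dicts mapping module name -> parameters (same keys).
    evaluate: function taking such a dict and returning an accuracy.
    """
    baseline = evaluate(trained)
    drops = {}
    for name in trained:
        probe = dict(trained)        # shallow copy; parameter values are not mutated
        probe[name] = initial[name]  # reset only this module to its initial values
        drops[name] = baseline - evaluate(probe)
    return drops


# Toy usage with made-up numbers (placeholders, not results from any study).
# Each "module" is a single float and accuracy falls as it moves away from 1.0.
TOY_SENSITIVITY = {"conv1": 0.05, "conv2": 0.10, "fc": 0.40}


def toy_evaluate(model):
    penalty = sum(TOY_SENSITIVITY[n] * abs(model[n] - 1.0) for n in model)
    return 0.95 - penalty


toy_trained = {"conv1": 1.0, "conv2": 1.0, "fc": 1.0}
toy_initial = {"conv1": 0.0, "conv2": 0.0, "fc": 0.0}
toy_drops = module_criticality(toy_trained, toy_initial, toy_evaluate)
most_critical = max(toy_drops, key=toy_drops.get)
```

A module with a large drop is critical in this sense; a module whose reset barely changes accuracy contributes little that is specific to the trained solution.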
[52], titled "Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types", is another experimental analysis of transfer learning in visual tasks. Three factors of influence are investigated in this study: (i) image domain, the difference in image domain between source and target tasks; (ii) task type, the difference in task type; and (iii) dataset size, the size of the source and target training sets. They perform over 1200 transfer learning experiments on 20 datasets spanning seven diverse image domains (consumer, driving, aerial, underwater, indoor, synthetic, closeups) and four task types (semantic segmentation, object detection, depth estimation, keypoint detection) [52].
They use data normalization (e.g., illumination normalization) and augmentation techniques to improve models' accuracy. They adopt the recent high-resolution backbone HRNetV2, which consists of 69M parameters. This backbone is easily adjustable for different datasets by simply replacing the head of the backbone. To make a fair comparison, they pre-trained their models (to be used for transfer learning) from scratch and evaluated their performance using top-1 accuracy on the ILSVRC'12 validation set [52].
The transfer learning experiments are mainly divided into two settings: (i) transfer learning with a small target training set and (ii) with the full target set. The evaluation of transfer learning models is based on the gain obtained from finetuning from a specific source model compared to finetuning from ILSVRC'12 image classification, with the main question being "are additional gains possible, by picking a good source?". Furthermore, they added a series of experiments on multi-source training to investigate its impact on a specific task [52].
Such an exhaustive experimental analysis resulted in the following observations: (i) in all experiments, transfer learning outperforms training from scratch (random initialization); (ii) for 85% of target tasks there exists a source task which tops ILSVRC'12 pre-training; (iii) the most transfer gain happens when the source and target tasks are in the same image domain (within-domain), which is even more important than source size; (iv) positive transfer gain is possible when the source image domain includes the target domain; (v) although multi-source models bring good transfer, they are outperformed by the largest within-domain source; (vi) "for 65% of the targets within the same image domain as the source, cross-task-type transfer results in positive transfer gains"; (vii) as naturally expected, the larger datasets positively transfer towards the smaller datasets; (viii) transfer effects are stronger for a small target training set, which helps the process of choosing the transfer learning model by testing several models with a small section of target data [52].
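The evaluation idea above, comparing each candidate source against the ILSVRC'12 baseline and choosing a source from trials on a small target subset, can be sketched as follows. This is a hedged illustration only: the gain is taken here as a plain score difference, which is an assumption and not necessarily the exact metric of [52], and all names and scores are hypothetical placeholders.

```python
def transfer_gains(scores_by_source, baseline_name):
    """Gain of each candidate source over the baseline pre-training.

    scores_by_source: dict mapping source name -> target-task score obtained
    after finetuning from that source (higher is better).
    """
    baseline = scores_by_source[baseline_name]
    return {
        source: score - baseline
        for source, score in scores_by_source.items()
        if source != baseline_name
    }


def pick_source(scores_by_source, baseline_name):
    """Pick the source with the largest positive gain, else keep the baseline."""
    gains = transfer_gains(scores_by_source, baseline_name)
    if not gains:
        return baseline_name
    best = max(gains, key=gains.get)
    return best if gains[best] > 0 else baseline_name


# Placeholder scores, e.g. from quick finetuning runs on a small target subset.
example_scores = {"ilsvrc12": 0.60, "source_a": 0.66, "source_b": 0.58}
chosen = pick_source(example_scores, "ilsvrc12")
```

Because transfer effects are reported to be stronger for small target training sets, such quick trials on a subset are a cheap way to rank candidate sources before committing to a full training run.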

Discussion
The Deep Transfer Learning (DTL) research field is thriving because of the motivation to handle the limitations of Deep Learning (DL) models, which are the dependency on extensive labeled data and training costs. The main idea is to use obtained knowledge from source data in the training process on target data. Another possible impactful outcome of the DTL research line is to achieve continual learning, which brings Artificial General Intelligence [1] a step closer to reality. Continual learning can be achieved simply through a chain of transfer learning processes while the end model is still valid on all previous training sources.
As we reviewed in previous sections, model-based (network-based) approaches are the most commonly used approaches in DTL since deep learning models have the capacity to be adjusted to transfer knowledge. However, there are two main constraints in such approaches: the catastrophic forgetting dilemma and an overly biased pre-trained model.
In the case of finetuning a pre-trained model, there is a high chance of drastic weight changes throughout the whole model, resulting in the catastrophic forgetting dilemma. Therefore, the obtained knowledge could be partially or even completely wiped out, resulting in unsuccessful training and no possibility of continual learning. This constraint limits the success of the finetuning approach to tightly related source and target data. A very well-known technique to reduce the forgetting effect is to add a limited number of source samples to the target training data.
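Adding a limited number of source samples to the target training data can be sketched as a simple data-mixing step. This is a minimal illustration, assuming each dataset is a list of samples; the function name, the `source_fraction` parameter, and its default value are hypothetical choices and are not taken from the reviewed studies.

```python
import random


def mix_in_source_samples(target_data, source_data, source_fraction=0.1, seed=0):
    """Return the target data plus a small random subset of source samples.

    source_fraction is the number of added source samples relative to the
    size of the target set (capped by the number of available source samples).
    """
    rng = random.Random(seed)
    n_source = min(len(source_data), int(round(source_fraction * len(target_data))))
    replayed = rng.sample(list(source_data), n_source)
    mixed = list(target_data) + replayed
    rng.shuffle(mixed)  # avoid presenting all source samples in one contiguous run
    return mixed
```

The mixed list can then be used in place of the target training set during finetuning, so that every epoch also revisits part of the source distribution.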
The technique of freezing the pre-trained CNN layers tries to tackle catastrophic forgetting by freezing the obtained knowledge in earlier layers and finetuning the fully connected later layers to achieve transfer learning for target data. Given that earlier layers in DL models extract detailed features and that more abstract knowledge is extracted towards the output [9], freezing the earlier layers limits the ability of the model to learn any new features from target data, which is known as an overly biased pre-trained model. Having extensive source data or access to a model pre-trained on a large dataset is critical for a successful transfer using this technique. In this way, there is a high chance that the pre-trained model has already learned any possible detailed features and can perform on target data simply by finetuning the later layers. However, even with the first obstacle tackled, this solution is still imperiled by catastrophic forgetting in the later layers. Despite the limitations mentioned above, this technique is still successful in the case of related source and target data and tasks.
Progressive learning tries to find a middle ground between catastrophic forgetting and a biased model by adding new layer(s) to the end of a frozen pre-trained model. This technique is successful in the case of task transfer for related source and target data. It cannot deal with distant source and target data since the earlier layers are frozen and cannot learn new features; however, the newly added layer helps the model adjust to a new task.
A possible solution to address both catastrophic forgetting and an overly biased pre-trained model in DTL is to increase the learning capacity of a pre-trained model by vertically expanding it. In another research paper we propose expanding the model vertically in training on target data, adding new nodes on frozen pre-trained layers throughout the model instead of adding new layer(s) to the end of the model [53]. The vertical expansion increases the model learning capacity while keeping the previously obtained knowledge intact. Therefore, not only do we achieve successful transfer learning, but our final model is also still valid on source data, opening the door to deep continual learning [53].
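The idea of adding new nodes to a frozen pre-trained layer can be sketched on a single weight matrix. This is a rough illustration only and should not be read as the method of [53]: how the new nodes are wired and initialized there is not described in this paper, so the zero initialization, the per-weight trainable mask, and the function names below are all assumptions made for the sake of the example.

```python
def expand_layer(weights, extra_in, extra_out, init=0.0):
    """Widen a dense layer while keeping the pre-trained block intact.

    weights: list of rows, one row per output node (n_out x n_in).
    extra_in: number of new input connections (new nodes in the previous layer).
    extra_out: number of new nodes added to this layer.
    Returns the enlarged matrix and a same-shaped mask that is False for the
    pre-trained (frozen) weights and True for the newly added (trainable) ones.
    """
    n_out, n_in = len(weights), len(weights[0])
    new_weights, trainable = [], []
    for r in range(n_out + extra_out):
        row, mask_row = [], []
        for c in range(n_in + extra_in):
            is_pretrained = r < n_out and c < n_in
            row.append(weights[r][c] if is_pretrained else init)
            mask_row.append(not is_pretrained)
        new_weights.append(row)
        trainable.append(mask_row)
    return new_weights, trainable


def masked_update(weights, trainable, grads, lr=0.1):
    """Gradient step that only changes the newly added weights."""
    for r, row in enumerate(weights):
        for c in range(len(row)):
            if trainable[r][c]:
                row[c] -= lr * grads[r][c]
```

In contrast to appending a layer at the end of the model, this kind of widening gives every expanded layer some trainable capacity while the pre-trained block of weights is never overwritten.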

Conclusion
This paper reviews the taxonomy of deep transfer learning (DTL) and the definitions of different approaches. Also, we review, list, categorize and analyze over thirty recent applied DTL research studies. Then, we investigate the methodology and limitations of the three most common model-based deep transfer learning methods: (i) Finetuning, (ii) Freezing CNN Layers, and (iii) Progressive Learning. These techniques have proven their ability and effectiveness for various machine learning problems. The simplicity of finetuning publicly available models pre-trained on extensive datasets is the reason for it being the most common transfer learning technique. Moreover, two thorough experimental studies in DTL are summarized; their discoveries clarify the details of a successful deep transfer learning approach for different scenarios. Last but not least, the limitations of current DTL methods, namely the catastrophic forgetting dilemma and overly biased pre-trained models, are discussed along with possible solutions.

Fig. 1 Taxonomy of Transfer Learning which is extendable to Deep Transfer Learning as well.

Fig. 2 Most common Deep Transfer Learning approaches.

Table 1: List of selected recent deep transfer learning (DTL) publications.