Deep Transfer Learning for Modality Classification of Medical Images

Medical images are valuable for clinical diagnosis and decision making. Image modality classification is an important first step, as it helps clinicians access the required medical images in retrieval systems. Traditional methods of modality classification depend on the choice of hand-crafted features and demand clear prior domain knowledge. Feature learning approaches can detect the visual characteristics of different modalities efficiently, but they are limited by the number of training samples. To overcome the shortage of labeled data, on the one hand, we take deep convolutional neural networks (VGGNet, ResNet) with different depths pre-trained on ImageNet, fix most of the earlier layers to preserve generic features of natural images, and train only their higher-level portion on ImageCLEF to learn domain-specific features of medical figures. Then, we train from scratch deep CNNs with only six weight layers to capture more domain-specific features. On the other hand, we employ two data augmentation methods to help the CNNs realize their full potential in characterizing image modality features. The final prediction is given by our voting system based on the outputs of the three CNNs. Evaluating our proposed model on the subfigure classification task in ImageCLEF2015 and ImageCLEF2016, we obtain new state-of-the-art results (76.87% in ImageCLEF2015 and 87.37% in ImageCLEF2016), which imply that CNNs, combined with our proposed transfer learning methods and data augmentation techniques, can identify the modalities of medical images more efficiently.


Introduction
With the ease of Internet access, the size of the medical literature has grown exponentially over the past few years [1]. Medical images in articles provide basic knowledge for visualizing body parts, treating them, and tracking disease, which makes the clinical care and diagnosis of diseases practicable [2][3][4]. Different medical imaging technologies produce an enormous number of images with various medical modalities and other image types, such as Computerized Tomography, X-ray, or generic biomedical illustrations [5]. To help clinicians and researchers retrieve the required images, many tools have been developed to formulate and execute queries based on visual content [6]. Content-based medical image retrieval systems, such as OPENi [7], could be improved by filtering out non-relevant image types using the modality information [6,8], but not all medical images are annotated appropriately. To overcome the limited number of labeled images with reliable modality information, one option is to assign modalities to all medical images manually. However, this is both time consuming and costly. Another possibility is to perform automatic modality classification of the images using feature engineering methods [9][10][11][12][13][14] (see Section 2). The performance of these approaches is good but limited by the choice of "hand-crafted" features and the need for clear prior domain knowledge. Learning features from data has become popular both in academia and in industry, because many interesting priors can be conveniently captured by a learner [15].
Convolutional neural networks (CNNs) are designed to learn features from data that come in the form of multiple arrays, for example, a color image. They have achieved many practical successes [16][17][18][19] (see Section 2) on the ImageNet benchmark [20] and have recently been widely adopted by the computer vision community [21]. When large-scale training datasets are available, CNNs are capable of learning more expressive representations of image data in general object recognition [20]. However, in real-world applications (e.g., medical image modality classification), it is expensive or impossible to re-collect the needed training data and train CNNs from scratch. With a limited number of highly variable samples, CNNs may be hindered in disentangling the factors of variation [22].
Transfer learning between task domains is therefore desirable. For the task of medical image classification, the training dataset is not large (thousands of images), so it is a good choice [23,24] to pre-train a CNN on a very large dataset (e.g., ImageNet, which contains 1.2 million natural images in 1000 categories), and then use the pre-trained CNN either as an initialization for further fine-tuning [25,26] or as a fixed feature extractor [27] (see Section 2). Since the current medical dataset is small, it is likely best to train only a linear classifier rather than to fine-tune the pre-trained CNN, due to overfitting concerns. Because the medical dataset is very different from the original natural dataset of ImageNet, another classifier needs to be trained from activations somewhere earlier [28] in the network. Features extracted from the top of the pre-trained network may be too dataset-specific and would not be able to distinguish medical images.
In this article, we first train from scratch a deep CNN without too many layers on medical data to capture domain-specific information. Then, we explore another transfer learning framework to capture both generic and domain-specific features. We preserve generic characteristics by fixing most layers of deeper CNNs pre-trained on ImageNet, and learn new specific representations by replacing and retraining the classifier (top layers) of the pre-trained network on medical datasets. To address the greatest challenge, the small scale of the current dataset, we employ two methods of data augmentation to help the CNNs reach their full potential and further improve modality classification performance. Evaluating our proposed model on the subfigure classification task in ImageCLEF2015 and ImageCLEF2016, we obtain better performance than the state-of-the-art visual methods: 76.87% in ImageCLEF2015 and 87.37% in ImageCLEF2016.

Related Work
Because of the importance of detecting the modalities of medical images, a large body of research has addressed the task of modality classification, including feature engineering methods [9][10][11][12][13][14] and deep learning-based approaches [25,27]. Since deep learning-based methods do not need hand-crafted features, they have shown promising potential for the modality classification task.
Over the years, various effective feature engineering techniques for medical image classification have been developed. De Herrera et al. [9] combine SIFT (Scale Invariant Feature Transform) [29] with BoC (Bag-of-Colors) [30] features to represent medical images. Pelka et al. [10,14] extract eight kinds of low-level features from images to train a multiclass linear SVM and obtain a state-of-the-art visual result (60.91%) in ImageCLEF2015. Koitka et al. [11] apply many state-of-the-art visual descriptors to describe an image with color, texture, and shape information. Valavanis [12] adopts various visual features, such as Bag-of-Visual-Words [31] and Quad-Tree [32] BoC (Bag-of-Colors). Li, P. et al. [13] also apply a hierarchical classifier using multiple visual descriptors. The performance of these approaches depends on the quality of features hand-crafted by domain experts, and it is hard for such features to capture the large number of possible input variations well.
CNNs have led to a series of breakthroughs for image classification [16][17][18][19][20]. There are four key ideas [21] behind CNNs: local connections, shared weights, pooling, and the use of many layers. Different feature maps are responsible for detecting distinctive local motifs. Sharing the same weights among units at different locations allows the same pattern to be detected in different parts of the image. The role of the pooling layer is to merge semantically similar features into one, reduce the dimension of the representation, and create invariance to small shifts and distortions. With the growing popularity of CNNs, deeper and deeper networks have been proposed, e.g., AlexNet [16], VGGNet [17], GoogLeNet [18], and ResNet [19]. The initial landmark breakthrough of Krizhevsky et al. [16] was achieved by their AlexNet CNN with eight weight layers. Simonyan et al. [17] propose VGGNet with 16 weight layers to investigate how the depth of CNNs affects their accuracy in the large-scale image recognition setting. Szegedy et al. [18] introduce the GoogLeNet architecture with more weight layers but far fewer parameters than AlexNet and VGGNet. He et al. [19] present deep residual networks (ResNet) with a depth of up to 152 weight layers, which address the degradation problem by introducing a deep residual learning framework. CNNs generally require a large-scale dataset to reach their full potential, and it is difficult to acquire large, expertly labeled training datasets given the time and labor cost involved. Take ImageCLEF medical [5,33,34] as an example: it provides thousands of labeled medical images for modality classification, far fewer than the ImageNet dataset [20], which contains 1.2 million natural images.
Our previous work [22] in ImageCLEF2013 was the first attempt to train multiple CNNs from scratch to learn features from medical images for describing their modalities, and it achieved a competitive result. Since deep CNNs take several weeks to train across multiple GPUs on ImageNet, it is common for researchers to release their final network checkpoints for the benefit of others, who can use the networks as fixed feature extractors or for fine-tuning. Koitka et al. [27] extract visual features from the top of the pre-trained ResNet [19] to train another classifier to predict modality, and achieve state-of-the-art performance (85.38%) in ImageCLEF2016. Kumar et al. [25] combine fine-tuned AlexNet [16] and GoogLeNet [18] to distinguish subtle differences between image modalities. Zhang et al. [26] use the synergic signal system to combine dual ResNets, which are pre-trained on large-scale natural images and fine-tuned on medical figures. However, it is hard to capture nuances between modalities by fine-tuning CNNs with an enormous number of parameters when there are not sufficient training samples.
The ImageCLEF dataset is small and very different from the ImageNet dataset. We therefore keep most of the earlier layers fixed (due to overfitting concerns) and retrain only the higher-level portion of the network. This is motivated by the observation that the earlier layers of a CNN contain more generic features (e.g., edge detectors or color blob detectors) that may be useful for the current task, whereas later layers become progressively more specific to the details of the classes contained in the ImageNet dataset. To address the difficulty of learning from an imbalanced dataset with limited samples, other CNN-based methods for modality classification have expanded the training dataset [22,25,27].

Methods
This section describes the architecture of our proposed model, which comprises three types of deep convolutional neural networks (CNNs) with different depths and a voting system (see Figure 1).

Convolutional Neural Networks
We first take two types of very deep CNNs (VGGNet-16 and ResNet-50, shown in Figure 1) with different depths that have been pre-trained (initialized) on a natural image dataset (ImageNet). Then we train from scratch a "shallower" CNN (CNN-6) on the medical dataset. The softmax function is applied at the final layer to output the prediction probabilities, which determine the class of the image.
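As a minimal illustration of that final softmax step, the sketch below converts raw class scores into probabilities and picks the most probable class. It uses only the Python standard library; the function names and the three-class scores are hypothetical and are not taken from our Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits):
    """Return the index of the class with the highest softmax probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical scores for a 3-class toy problem (the real task has 30 classes).
scores = [1.0, 3.0, 0.5]
probs = softmax(scores)
predicted = predict_class(scores)
```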
We use the following CNNs, each with different capabilities, to explore the central importance of network depth:

CNN-6
This CNN has only six weight layers, similar to [22,35,36]. The first two convolutional layers contain 32 kernels of size 3 × 3, and the next two convolutional layers have 64 kernels of size 3 × 3. The second and fourth convolutional layers are each followed by a max pooling layer of dimension 2 × 2 with a dropout of 0.25. Then a fully connected layer with 512 neurons and a dropout of 0.5 is followed by a fully connected layer with 30 neurons. The ReLU activation function is applied to all four convolutional layers and the first fully connected layer. We use Glorot uniform initialization [37] for the weights and train the model from scratch.
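To make the size of CNN-6 concrete, the following sketch counts the trainable parameters of its four convolutional layers and of the final 30-way layer, using the standard formula (kernel area × input channels + 1 bias) × output channels. It assumes a 3-channel RGB input; the helper names are illustrative. The first fully connected layer is left out because its size depends on the flattened feature map, which in turn depends on padding choices not stated here.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel for a square-kernel conv layer."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron for a fully connected layer."""
    return (n_in + 1) * n_out

# (kernel size, input channels, output channels); RGB input is an assumption.
conv_layers = [(3, 3, 32), (3, 32, 32), (3, 32, 64), (3, 64, 64)]
conv_counts = [conv_params(k, c_in, c_out) for k, c_in, c_out in conv_layers]

# Final classifier: 512 neurons feeding 30 modality classes.
classifier_params = dense_params(512, 30)
```

Under these assumptions the convolutional part is small (tens of thousands of parameters), so most of the capacity of CNN-6 sits in the first fully connected layer.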

VGGNet-16
This deeper CNN, proposed by the Visual Geometry Group [17], has a depth of 16 weight layers. It not only achieves excellent accuracy on the ImageNet classification task [20] but is also applicable to other image recognition datasets. Very small 3 × 3 filters are used in all convolutional layers to reduce the number of parameters in such a deep network.

ResNet-50
This extremely deep residual network was presented by He et al. [19] and obtains state-of-the-art results on the ImageNet classification task [20]. As preliminary work in modality classification, we use ResNet-50, a deep residual network with a depth of 50 weight layers.
The pre-trained VGGNet and ResNet are designed for 1000 classes; therefore, we replace the last fully connected layer with one of 30 neurons to output thirty posterior probabilities. We implement our methods in Python, using the Keras library for the deep CNNs. For our experiments, we load the weights of pre-trained CNNs provided by Keras.

Transfer Learning
It is natural to use transfer learning to apply the knowledge gained while solving the problem of natural image recognition to the different problem of medical image classification. One transfer learning method is to remove the last fully connected layer on top of the DNN pre-trained on ImageNet, because this layer's outputs are the 1000 class scores for a different task, and to treat the rest of the network as a fixed feature extractor for the current dataset. The extracted features are used to train a linear classifier (e.g., Softmax or SVM). Another transfer learning method is not only to replace and retrain the classifier on top of the network on the new dataset, but also to fine-tune the weights of the pre-trained network by continuing the back-propagation.
Consider two facts: first, the scale of the medical dataset (thousands of images) is much smaller than that of the natural dataset (millions); second, the two datasets contain images from completely different domains, that is, they have different data distributions. We therefore employ a form of transfer learning similar to the first one described above. We adjust the method by fixing most of the earlier layers to preserve generic information and retraining from scratch only the last fully connected layer(s) of VGGNet-16 and ResNet-50 to capture domain-specific features. Then, we train CNN-6 from scratch on the medical dataset to capture more domain-specific information.
Specifically, the VGGNet-16 and ResNet-50 we use are pre-trained on the ImageNet [20] natural image dataset. After taking the pre-trained CNNs, we first replace the last fully connected layer with one of 30 neurons. Then, we use Glorot uniform initialization [37] to reinitialize the weights of the last three fully connected layers of VGGNet-16 and the last one of ResNet-50, and fix all other layers of the networks. We use the Adamax [38] optimizer and the categorical cross-entropy loss function to train the model on the ImageCLEF dataset over shuffled mini-batches of 32.
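Let X be the medical training dataset of n images. Training the top layer(s) from scratch is an iterative process that finds the weights w that minimize the CNN's empirical loss:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} l\big(f(x_i, w),\, y_i\big) \qquad (1)$$

where $x_i$ is the i-th image of X, $f(x_i, w)$ is the predicted probability $\hat{y}_i$ of the class of $x_i$ given w, $y_i$ is the ground-truth class of the i-th image, and $l(\hat{y}_i, y_i)$ is a penalty function for predicting $\hat{y}_i$ instead of $y_i$. For the categorical cross-entropy used here, the penalty takes the form

$$l(\hat{y}_i, y_i) = -\sum_{c=1}^{30} y_{i,c} \log \hat{y}_{i,c} \qquad (2)$$

where c indexes the 30 modality classes and $y_{i,c}$ is 1 for the ground-truth class and 0 otherwise. The updated weights are calculated from the gradient of the loss L when applied to the mini-batch using the current weights. We use Adamax [38] to compute individual adaptive learning rates that control the size of the updates to the weights.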
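The empirical loss above can be sketched in a few lines of standard-library Python. This is only an illustration under the assumption of one-hot ground-truth labels, in which case the categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The function names, the `eps` clipping constant, and the toy probabilities are hypothetical.

```python
import math

def cross_entropy(probs, true_class, eps=1e-12):
    """Penalty for one image: -log of the probability given to the true class."""
    return -math.log(max(probs[true_class], eps))  # eps guards against log(0)

def empirical_loss(batch_probs, labels):
    """Mean penalty over a mini-batch of predicted probability vectors."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)

# Toy mini-batch: two images, three classes (the real task has 30 classes).
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = empirical_loss(batch_probs, labels)
```

A confident, correct prediction contributes a penalty near zero, while a low probability on the true class is penalized heavily; the optimizer then adjusts only the trainable (non-frozen) weights to reduce this mean.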

Data Augmentation
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations [16,25]. Furthermore, some image categories are represented by few annotated examples; thus, we introduce new images to counteract the imbalance of the dataset. The datasets used are described below:
DS_Original: The original training collection distributed for the subfigure classification task in ImageCLEF2015 and ImageCLEF2016, described in Section 4.1.
DS_Aug1: Similar to [10,27], we use the sub-collection of 1800 single-modality figures from the Modality Classification task in ImageCLEF2013, without the 'COMP' category, to expand the original training sets of ImageCLEF2015 and ImageCLEF2016.
DS_Aug2: Batches of label-preserved images generated with real-time data augmentation, including rotation, zoom, shift, and flip transformations.
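As a simplified illustration of label-preserving transformations, the sketch below applies random horizontal and vertical flips to an image stored as a nested list of pixel values. It covers only the flip transformations; rotation, zoom, and shift require interpolation and are left to the library. The function names are hypothetical and this is not our Keras-based pipeline.

```python
import random

def flip_horizontal(image):
    """Mirror each row (left-right flip); returns a new nested list."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Reverse the row order (top-bottom flip); returns a new nested list."""
    return [row[:] for row in reversed(image)]

def random_flip(image, rng):
    """Apply each flip independently with probability 0.5; the label is unchanged."""
    if rng.random() < 0.5:
        image = flip_horizontal(image)
    if rng.random() < 0.5:
        image = flip_vertical(image)
    return image

toy_image = [[1, 2], [3, 4]]  # 2 x 2 grayscale toy image
augmented = random_flip(toy_image, random.Random(0))
```

Because the transformation is drawn anew each time an image is served, the network rarely sees exactly the same input twice, at the cost of extra computation per batch.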

Voting System
The Voting System receives the intensities computed for each modality by the three CNNs. The combination of the outputs of the CNNs produces the final intensity for each modality. Our Voting System is an adaptation of the weighted majority vote from Kuncheva et al. [39], in which we use a combination rule called average vote and give different weights to the intensities produced by each CNN (see Formula (3)).
$$y = \arg\max_{x} \; \sigma\big(\{\, w_k \, p_k(x) \,\}_{k=1}^{3}\big) \qquad (3)$$

where y is the predicted class label, $p_k(x)$ is the probability predicted for class x by the k-th model, σ(·) returns the mean of the input predicted probabilities of the models, each weighted by $w_k$, and argmax(·) returns the class x at which this weighted average is maximal.
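The weighted average vote can be sketched as follows. This is a minimal standard-library illustration, not our implementation: the function name and the toy probability vectors are hypothetical, and dividing by the sum of the weights is an added normalization that does not change which class attains the maximum.

```python
def weighted_average_vote(model_probs, weights):
    """Fuse per-model class probabilities by a weighted average and take the argmax."""
    n_classes = len(model_probs[0])
    total_weight = sum(weights)
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total_weight
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused

# Toy outputs of three models for one image over three classes.
cnn6 = [0.7, 0.2, 0.1]
vggnet16 = [0.6, 0.3, 0.1]
resnet50 = [0.3, 0.6, 0.1]
models = [cnn6, vggnet16, resnet50]

equal_label, _ = weighted_average_vote(models, [1, 1, 1])
weighted_label, fused = weighted_average_vote(models, [0.1, 0.2, 0.7])
```

In this toy case the equally weighted vote follows the two shallower models, whereas giving the deepest model a weight of 0.7 lets its prediction prevail.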

Experiments
In this section, we describe the baseline methods, which achieved the highest accuracies in the subfigure classification task of ImageCLEF2015 and ImageCLEF2016, for comparison with our proposed model. Then, we present the experimental results of our approaches and of the baselines.

Datasets
To facilitate research and development in this field, the Image Cross-Language Evaluation Forum (ImageCLEF) has run the medical task since 2004. The subtask of subfigure classification was first introduced in ImageCLEF2015 [33] and continued in ImageCLEF2016 [34]; it is similar to the modality classification subtask organized in ImageCLEF2013 [5]. This subtask aims to classify images into the 30 modalities of the hierarchy. The images of the training and test datasets are subfigures extracted from compound figures from the medical literature. Figure 2 shows six subfigures from a compound figure with three different modalities. For this subtask, visual and textual methods are possible; however, visual features play a major role when making predictions based on cross-media [5,33,34]. In this article, we focus on implementing and evaluating visual approaches for the task of subfigure classification.
For our experiments, we use the ImageCLEF2015 and ImageCLEF2016 subfigure classification datasets [33,34], created from a subset of PubMed Central. The task provides training and test data containing subfigures extracted from compound figures of the biomedical literature in PubMed (see Figure 2). Similar to the modality classification task organized in ImageCLEF2013, the thirty hierarchical modality classes proposed by [41], excluding compound images (COMP), are used in the subfigure classification task, as shown in Figure 3. In ImageCLEF2015 [33], the training set contains 4532 figures and the test set 2244 figures. In ImageCLEF2016 [34], the training set is expanded to 6776 figures and the test set to 4166 figures.
Before feeding figures into the CNNs, we resize them to a square of 224 × 224 pixels. After loading an image in PIL (Python Imaging Library) format, we convert the PIL image instance to a NumPy array. Using the preprocessing tools of Keras, we prepare inputs of shape (N, 3, 224, 224) for the CNNs, where N is the number of instances.

Baselines
This section describes the baseline methods, and their results in both ImageCLEF2015 and ImageCLEF2016.
FHDO BCSG [10,14] (the FHDO Biomedical Computer Science Group at the University of Applied Sciences and Arts, Dortmund) obtained the best visual result in ImageCLEF2015, with an accuracy of 60.91%, labeled as Baseline_2015 (see Table 1). They extracted eight kinds of low-level features from images and fed them to a multiclass Support Vector Machine (SVM) classifier with a linear kernel.
FHDO BCSG [27] achieved the best result of 85.38% in ImageCLEF2016, labeled as Baseline_2016 (see Table 2). They leveraged ResNet-152 [19] to extract features and fed them to a classifier.

Experimental Setup
We compare the accuracies of the three CNNs (CNN-6, VGGNet-16, ResNet-50) with the baselines (Baseline_2015 and Baseline_2016) and also inspect the performance of our proposed model based on the voting system.
In accordance with the evaluation criterion of the benchmark, we evaluate our approach by 30-class classification accuracy in all experiments, unless otherwise stated. Cross-validation is generally used to select the optimal CNN training parameters. Given that CNN training can take an extensive amount of time, we use the small original training set to run fully independent experiments with 10-fold cross-validation (10FCV) for model selection (see Tables 1 and 2). When training multiple CNNs on DS_Aug2, however, the number of epochs is reduced, because the running time is too long (see Section 4.4.3) and the validation accuracy does not change much after 5 epochs.
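For readers unfamiliar with 10-fold cross-validation, the sketch below shows one simple way to split sample indices into ten train/validation partitions using only the standard library. The function name, the fixed seed, and the interleaved fold assignment are illustrative assumptions; only the training-set size of 4532 (ImageCLEF2015) comes from the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 4532 is the size of the ImageCLEF2015 training set.
splits = list(k_fold_indices(4532, k=10))
```

Each sample appears in exactly one validation fold, so every image contributes to model selection once while the test set stays untouched.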
Most of the code is adapted from our previous work [22,35,36] and is implemented with the Keras neural network library, running on top of Theano. After loading the Theano version of the weights, excluding the top layer(s), we add new fully connected layer(s) at the top of the CNNs (VGGNet-16 and ResNet-50) and initialize their weights using Glorot uniform initialization [37]. All default parameters are used, except for those mentioned in Section 3. Our networks are trained on one NVIDIA Tesla K20c GPU (4 GB memory) in a 64-bit Dell computer with two 2.40 GHz CPUs, 64 GB of main memory, and Ubuntu 12.04.

Deep Transfer Learning
We obtained good performance [36] using CNN-6 in the Compound Figure Detection Task [33,34]. The first experiment, training CNN-6 from scratch (with random initialization) on the subfigure classification task, is designed to obtain results rapidly rather than to reach optimal performance. As in our previous work [22,35,36], combining more networks provides more performance gain; however, considering the high cost of training networks on DS_Aug2 (described in Section 3.3), we train only five networks for each CNN. From Tables 1 and 2, we can see that the results of CNN-6 are promising: 66.13% in ImageCLEF2015 and 81.86% in ImageCLEF2016. These results demonstrate that CNN-6, even with only six weight layers, can capture domain-specific information from the medical dataset, which benefits from the training approach. Since CNN-6 is effective in this task, we attempt to train deeper CNNs. We train five networks for each of CNN-6, VGGNet-16, and ResNet-50, using the parameters described in Table 3. Tables 1 and 2 show that the accuracy of the three CNNs is positively correlated with network depth on all three datasets described in Section 3.3. These results give clear evidence of the central importance of network depth. In both ImageCLEF2015 and ImageCLEF2016, ResNet-50 has a higher accuracy than the baseline, which indicates that it plays a leading role in our proposed model.
Unlike in ImageCLEF2016, where the accuracies are similar, all pre-trained CNNs achieve better performance than the baseline method in ImageCLEF2015, although the baseline uses several state-of-the-art traditional feature engineering methods. In particular, CNN-6 with only six layers beats Baseline_2015, which provides more evidence that the feature learning method is also very effective in the current task.

Data Augmentation
The original training set of the subfigure classification task in ImageCLEF2015 (4532 images) and in ImageCLEF2016 (6776 images) is much smaller than the ImageNet training set (about 1.2 million images). To address this problem, we augment the dataset from two perspectives: bringing new images into the original training set (DS_Aug1) and transforming original images (DS_Aug2).
Comparing the results horizontally in Tables 1 and 2, we find that our proposed model and the three CNNs achieve higher accuracies after introducing new images (DS_Aug1) and creating new label-preserved images (DS_Aug2). Further evidence of the positive effects of the two data augmentation strategies is that, after expanding the data, ResNet-50 surpasses the baseline not only in ImageCLEF2015 but also in ImageCLEF2016. Specifically, on the one hand, introducing single-modality figures from ImageCLEF2013 effectively increases the data variety to some extent, so the accuracies of our proposed models increase by more than 3 percentage points (see Tables 1 and 2). On the other hand, we implement image data transformation in real time with the Keras ImageDataGenerator API, using the following parameters: random rotations ([0, 20] degrees), random horizontal and vertical shifts ([0, 20]% of total width and height), random zoom ([80%, 120%]), and random horizontal and vertical flips. Although we did not tune the transformation parameters, our proposed model achieved acceptable results: accuracies increased from 76.07% to 76.87% in ImageCLEF2015, and from 86.07% to 87.37% in ImageCLEF2016 (see Tables 1 and 2). There is room to improve the performance of our model by choosing parameters through a grid search or by introducing new transformation techniques.
Our proposed model takes advantage of the different depths of the networks and the two data augmentation methods to achieve better performance (76.78% in 2015 and 86.92% in 2016) than the baselines (60.91% in 2015 and 85.38% in 2016) described in Section 4.2. With weights of [0.1, 0.2, 0.7] selected by grid search, our fusion models achieve accuracies (76.87% and 87.37%) beyond the baselines (60.91% and 85.38%) in ImageCLEF2015 and ImageCLEF2016.
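A grid search over voting weights can be sketched as follows. The step size of 0.1, the constraint that the weights sum to 1, and the `evaluate` callback (standing in for a validation-accuracy function) are assumptions made for illustration; the text states only that the weights [0.1, 0.2, 0.7] were chosen by grid search.

```python
def weight_grid(n_models=3, steps=10):
    """Yield weight tuples on a grid of step 1/steps whose entries sum to 1."""
    def compose(remaining, slots):
        # Enumerate non-negative integer tuples of length `slots` summing to `remaining`.
        if slots == 1:
            yield (remaining,)
            return
        for first in range(remaining + 1):
            for rest in compose(remaining - first, slots - 1):
                yield (first,) + rest
    for combo in compose(steps, n_models):
        yield tuple(c / steps for c in combo)

def grid_search(evaluate, n_models=3, steps=10):
    """Return the weight tuple that maximizes the hypothetical `evaluate` callback."""
    return max(weight_grid(n_models, steps), key=evaluate)

candidates = list(weight_grid())

# Toy objective that peaks at (0.1, 0.2, 0.7); a real one would measure validation accuracy.
target = (0.1, 0.2, 0.7)
best = grid_search(lambda w: -sum((a - b) ** 2 for a, b in zip(w, target)))
```

With three models and a step of 0.1 there are only 66 candidate weightings, so an exhaustive search is cheap once the per-model probabilities on the validation data have been cached.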

Running Time
The running time of our networks is listed in Table 4. For comparison, we report the running time for training or testing on one sample from the DS_Original dataset, excluding data preprocessing, and record the training time for one epoch. Unsurprisingly, VGGNet-16 needs more running time than CNN-6 (see Table 4), because VGGNet-16 has more parameters. Although ResNet-50 has more layers, it benefits from a smaller number of parameters: its training time is roughly equivalent to that of VGGNet-16, and its testing time is lower. We also record the training time over all samples for one epoch on the two expanded datasets. From Table 5, we can see that training takes much more time when applying common real-time data augmentation with rotation, zoom, shift, and flip transformations.

Conclusions
We have presented a model for medical image modality classification that is composed of three CNNs with different depths, which are combined by weighted averaging of the prediction probabilities. The depth of the network is of central importance for the current task, as demonstrated by the dominant performance of ResNet in ImageCLEF2015 and ImageCLEF2016. Our proposed transfer learning method benefits from generic features captured by CNNs pre-trained on ImageNet, and from domain-specific features captured by the top layers of extremely deep CNNs and by another "shallower" CNN, which are trained from scratch on medical images. Our model, based on this transfer learning method and two data augmentation strategies, can identify the modality of medical images efficiently. We hope to include more powerful CNNs, such as ResNet with 152 layers or other new state-of-the-art image classification models, in our system, and to focus on improving the performance for this task. Furthermore, we plan to explore more complicated fusion strategies, such as using the MKL (Multiple Kernel Learning) algorithm to fuse models at the feature level, or introducing the synergic signal system to fuse results at the model level.

Figure 1. Architecture of our proposed model for subfigure classification. Deep CNNs are denoted as "network name-(depth)".




Figure 3. Thirty modality classes, along with the class codes in brackets.


Table 1. Accuracy of visual methods in ImageCLEF2015.

Table 2. Accuracy of visual methods in ImageCLEF2016.

Table 3. Training parameters in ImageCLEF2015 and ImageCLEF2016.

Table 4. Training and testing time of CNNs on DS_Original dataset.

Table 5. Training time of CNNs on augmented dataset.