Deep Features for Training Support Vector Machines

Features play a crucial role in computer vision. Initially designed to detect salient elements by means of handcrafted algorithms, features are now often learned using different layers in convolutional neural networks (CNNs). This paper develops a generic computer vision system based on features extracted from trained CNNs. Multiple learned features are combined into a single structure to work on different image classification tasks. The proposed system was derived by testing several approaches for extracting features from the inner layers of CNNs and using them as inputs to support vector machines (SVMs) that are then combined by sum rule. Several dimensionality reduction techniques were tested for reducing the high dimensionality of the inner layers so that they can work with SVMs. The empirically derived generic vision system, based on applying a discrete cosine transform (DCT) separately to each channel, is shown to significantly boost the performance of standard CNNs across a large and diverse collection of image data sets. In addition, an ensemble of different topologies taking the same DCT approach and combined with global mean thresholding pooling obtained state-of-the-art results on a benchmark image virus data set.


Introduction
Extracting salient descriptors from images is the mainstay of many computer vision systems. Typically, these handcrafted descriptors are tailored to overcome specific problems in image classification, the goal being to achieve the best classification accuracy possible while maintaining computational efficiency. Some descriptors, such as the Scale Invariant Feature Transform (SIFT) [1], are valued for their robustness but can be too computationally expensive for practical purposes. As a consequence, variants of popular handcrafted descriptors, such as some fast variants of SIFT [2], continue to be created in an attempt to overcome inherent shortcomings.
In contrast to computer vision systems that rely on the extraction of handcrafted descriptors are those that depend on deep learners [3], exemplified in computer vision by Convolutional Neural Networks (CNNs). Deep learning involves designing complex networks composed of specialized layers, and the descriptors or features calculated by these layers are learned from the training samples [4]. Layers in deep learners, like CNNs, are known to discover many low-level representations of the data in the early stages that become useful to subsequent layers in charge of providing higher-level features representing the semantics of the data [5]. Close to the input, edges and textures are usually detected [6]. Higher up, features like contours and image patches are discerned. Layer by layer, representations of the data in deep learners become more and more complex. An advantageous characteristic of these deep features is that they are generalizable: once extracted, they can be treated like other handcrafted features in traditional computer vision systems and applied to many different image problems.
Interest in research investigating feature sets extracted from different layers of pretrained CNNs has grown in recent years. Lower-level features extracted from sets of CNN topologies have been explored in [7] and [8], and top layers in [9] and [10]. In [9], for example, images are represented as strings of CNN features, with similarities compared using novel distance measures. Convolutional features are extracted in [11], where they are used as a filter bank. In [12], deep activation features are extracted from local patches at multiple scales, with convolutional features taken from the seventh layer of a CNN trained on ImageNet. In [13] and [14], features are extracted from the last convolutional layers of a CNN and, in [14], combined with the fully connected (FC) layer. In [15], images are represented using five convolutional layers and two FC layers. Similarly, in [16], convolutional features extracted from multiple layers are combined with FC features. In [17], features are extracted from the penultimate layer of pre-trained CNNs and merged with the outputs of deep layers as well as with CNN scores. Finally, in [18], features are investigated layer by layer and discovered to provide quality information about the texture of images at multiple depths.
This work aims to exploit both the deeper and shallower layers of pre-trained CNNs for representing images with fixed-length feature vectors that can then be used to train a set of Support Vector Machines (SVMs) [19]. Extracting features from the inner layers of CNNs poses a difficulty because these layers are characterized by high dimensionality, making them unsuitable for training statistical classifiers like SVMs. To reduce dimensionality, experiments are run that test the following approaches:
• Classic dimensionality reduction methods: viz., the discrete cosine transform (DCT) and principal component analysis (PCA);
• Feature selection approaches (chi-square feature selection);
• Texture descriptors extracted from local binary patterns (LBP) followed by feature selection;
• Co-occurrence among elements of the channels of inner layers;
• Global pooling measurements.
Experiments demonstrate that combining feature sets extracted from inner and outer CNN layers and applying as many different dimensionality reduction techniques as needed obtains close to, if not, state-of-the-art results on an extensive collection of cross-domain image data sets. In addition, an ensemble of different topologies (DenseNet201 and ResNet50) is tested for virus classification to assess generalizability, and this ensemble obtains state-of-the-art results. Performance differences are verified using the Wilcoxon signed-rank test, and all experiments can be replicated using the MATLAB source code available at https://github.com/LorisNanni.

Feature extraction from convolutional neural networks
In this work, we extract features from Convolutional Neural Networks [20] pretrained on the ImageNet dataset [21]. These features are taken from multiple layers of a CNN and then used to train separate SVMs (see Figure 1). The CNN architectures investigated in this study are GoogleNet (Inception v1) [22], ResNet50 [23], and DenseNet201 [24]. GoogleNet, winner of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2014, is a CNN with twenty-two layers. To create deeper layers, GoogleNet uses 1 × 1 convolution and global average pooling. ResNet50, the winner of the ILSVRC 2015 contest, is a CNN with fifty layers. To overcome the vanishing gradient problem in deep networks, ResNet incorporates residual connections. DenseNet201 is extremely deep with 201 layers. This architecture replaces the residual connection with densely connected convolutional layers that are concatenated rather than added to each other as in ResNet. All layers are interconnected in DenseNet, a technique that produces strong gradient flow and shares low-level information across the entire network.
Unlike many other studies focused on the extraction of features from the output layer, we examine features extracted from deeper layers, as in [17]. The layers considered for extracting features are selected starting from the middle layer of the network and then by considering one layer after every ten, going toward the output layer, with the last four layers always considered. Since deep layers encode high-dimension features, dimensionality reduction methods are also used, as shown in Figure 1, depending on the feature size (i.e., when more than 5000 features are extracted by a given layer).
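One concrete reading of this selection rule can be sketched as follows (a Python illustration, not the authors' MATLAB code; the function name and the exact boundary handling are our assumptions):

```python
def select_layers(n_layers):
    """Pick candidate layer indices following the rule described above:
    start at the middle layer, take one layer after every ten toward the
    output, and always include the last four layers."""
    middle = n_layers // 2
    picked = list(range(middle, n_layers - 4, 10))  # middle, middle+10, ...
    picked += list(range(n_layers - 4, n_layers))   # always the last four
    return sorted(set(picked))
```

For a hypothetical 50-layer network, this yields [25, 35, 45, 46, 47, 48, 49].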
Figure 1. Feature extraction from inner layers. The output of each layer is treated as a feature vector, with a dimensionality reduction method applied when the vector size exceeds 5000. All vectors are then processed by separate SVMs whose scores are summed for a final decision.
All configurations are investigated considering the possible combinations of the following elements:
• Tuning (with/without): either the CNN used to extract features is pre-trained on ImageNet without any tuning, or it is tuned on the given training set;
• Scope of dimensionality reduction (local/global): either dimensionality reduction is performed separately on each channel of a layer (with the results combined), or the reduction is applied to the whole layer;
• PCA postprocessing (with/without): either PCA projection is performed after dimensionality reduction, or PCA is not applied.
The dimensionality reduction methods considered in this work are presented in the remainder of this section.
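The final decision stage (Figure 1) is a sum-rule fusion of the per-layer SVMs. A minimal numpy sketch, assuming each SVM emits a samples × classes score matrix (the function name is ours):

```python
import numpy as np

def sum_rule(score_matrices):
    """Sum-rule fusion: average the class-score matrices produced by the
    per-layer SVMs (rows = samples, cols = classes), then take the argmax
    over classes as the final decision."""
    fused = np.mean(score_matrices, axis=0)
    return fused.argmax(axis=1)
```

Averaging rather than summing changes nothing for the argmax but keeps scores comparable when ensembles of different sizes are later fused.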

Feature Reduction Transforms (PC and DC)
Dimensionality is reduced by applying two classic transforms: PCA and DCT.In the experimental section, PCA is labeled as PC and DCT as DC.
PCA [25] is a well-known unsupervised technique that projects high-dimensional data into a lower-dimensional subspace.This is accomplished by mapping the original feature vectors into a smaller number of uncorrelated directions that preserve the global Euclidean structure.
The DCT [26] balances information packing and computational complexity. Its components tend to be small in magnitude because the most important information lies in the low-frequency coefficients. As with PCA, removing the small coefficients produces only small errors when the transform is inverted to reconstruct the original images.
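As a rough illustration of the per-channel reduction (a numpy sketch, not the paper's code; the function names and the `keep` parameter are our assumptions), reduction amounts to flattening one activation channel and keeping only its leading low-frequency DCT coefficients:

```python
import numpy as np

def dct2_coeffs(x, keep):
    """Type-II DCT of a 1-D signal, keeping only the first `keep`
    low-frequency coefficients (explicit basis for clarity; a real
    system would use an optimized DCT such as scipy.fft.dct)."""
    n = len(x)
    k = np.arange(keep)[:, None]
    basis = np.cos(np.pi * (np.arange(n) + 0.5) * k / n)
    return basis @ x

def reduce_channel(channel, keep):
    """Flatten one activation channel and keep its `keep` leading DCT
    coefficients, mirroring the per-channel ('local') reduction."""
    return dct2_coeffs(channel.ravel().astype(float), keep)
```

For a constant channel, all the energy lands in the first (DC) coefficient, which is why truncating the tail loses little information for smooth activation maps.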

Chi-Square Feature Selection (CHI)
Univariate feature ranking for classification using chi-square tests is a popular feature selection method.In the experimental section, CHI is the label used for chi-square feature selection.
The chi-square test in statistics tests the independence between two events A and B.
If P(AB) = P(A)P(B), then the two events are said to be independent.The same holds when P(A|B) = P(A) and P(B|A) = P(B).
The formula for the chi-square test is

χ²_c = Σ_i (O_i − E_i)² / E_i,  (1)

where c is the degrees of freedom, O_i are the observed values, and E_i are the expected values. The degrees of freedom are the maximum number of logically independent values (the total number of observations minus the number of imposed constraints). Applied to feature selection, chi-square is calculated between every feature variable and the target variable (the occurrence of the feature and the occurrence of the class). If the feature variable is independent of the class, it is discarded; otherwise, it is selected.
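For a binarized feature (occurrence vs. non-occurrence), the statistic above can be computed directly from the contingency counts. A minimal pure-Python sketch (function name ours; real pipelines would use a library routine):

```python
def chi_square(feature, labels):
    """Chi-square statistic between a binarized feature (0/1 occurrence)
    and a class label: sum over all (feature value, class) cells of
    (observed - expected)^2 / expected."""
    classes = sorted(set(labels))
    n = len(labels)
    stat = 0.0
    for f in (0, 1):
        for c in classes:
            observed = sum(1 for x, y in zip(feature, labels)
                           if x == f and y == c)
            expected = (feature.count(f) * labels.count(c)) / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat
```

A feature that perfectly predicts the class scores high (and is kept); one distributed independently of the class scores near zero (and is discarded).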

Local Binary Patterns
This approach to feature reduction is based on the uniform Local Binary Pattern (LBP), a popular texture descriptor. LBP is computed for each pixel value g_c over a local circular neighborhood of radius R containing P pixels, thus:

LBP_{P,R} = Σ_{p=0}^{P−1} s(g_p − g_c) 2^p,  (2)

where s(x) = 1 if x ≥ 0 and 0 otherwise, and g_p are the values of the neighboring pixels. A histogram of the resulting binary numbers describes the texture of a given image.
When calculating (2), two types of patterns are distinguished, those with less than three transitions between 0 and 1, known as uniform patterns, and the remainder, which are called nonuniform.
In this work, P = 8 and R = 1, and only uniform patterns, as already mentioned, are considered. After LBP extraction from each channel of a CNN layer, dimensionality is reduced with the chi-square feature selection method. In the experimental section, the dimensionality reduction method based on LBP combined with chi-square feature selection is labeled LB.
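The uniform LBP histogram for P = 8, R = 1 can be sketched as follows (a numpy illustration under our own naming; it uses the 8 immediate neighbors as an approximation of the unit-radius circle):

```python
import numpy as np

def lbp_uniform_hist(img):
    """Uniform LBP (P=8, R=1) histogram of a 2-D grayscale array.
    Each of the 58 uniform 8-bit patterns (at most two 0/1 circular
    transitions) gets its own bin; all non-uniform patterns share one
    extra bin, for 59 bins in total."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]

    def transitions(p):
        bits = [(p >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

    uniform = [p for p in range(256) if transitions(p) <= 2]
    bin_of = {p: i for i, p in enumerate(uniform)}
    hist = np.zeros(len(uniform) + 1)  # last bin: non-uniform patterns
    h, w = img.shape
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            code = 0
            for i, (dr, dc) in enumerate(offsets):
                code |= int(img[r + dr, c + dc] >= img[r, c]) << i
            hist[bin_of.get(code, len(uniform))] += 1
    return hist / hist.sum()
```

The normalized 59-bin histogram is the texture descriptor that chi-square selection then prunes.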

Deep Co-Occurrence of Deep Representation
A deep co-occurrence representation can be obtained from a deep convolutional layer as proposed in [14]. A co-occurrence is said to occur when the values of two separate activations located inside a given region are greater than a certain threshold. The resulting representation is a tensor with the same dimensions as the activation tensor and can be implemented with convolutional filters.
A convolutional filter can be defined as F ∈ ℝ^{k×k×c×c}, where c is the number of channels in the activation tensor and the size of the co-occurrence window is k = 2r + 1, with r the radius defining the co-occurrence region. The filter weights are initially set to 1, except for those related to a given channel itself, which are initialized to 0 or some very small value ε.
Given the activation tensor A ∈ ℝ^{h×w×c} of size h × w with c channels, produced by the last convolution operator in a CNN, the co-occurrence tensor C ∈ ℝ^{h×w×c} can be computed as a convolution between the thresholded activation tensor and the co-occurrence filters:

C = (F ∗ B) ⊙ B,

where B = [A > Ā] is the binarized activation tensor and Ā is the mean of the activation map produced after the last convolutional layer. For pseudo-code, see [14]. In the experimental section, the representation based on co-occurrence is labeled CoOC.
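A loose, simplified reading of this idea (our own sketch with explicit loops, not the filter-based implementation of [14]) counts, for each active position, how many other active positions fall inside its co-occurrence window:

```python
import numpy as np

def cooccurrence_map(act, radius=1):
    """Simplified per-channel co-occurrence sketch: binarize an (h, w, c)
    activation tensor against its global mean, then for every active
    position count the other active positions inside the
    (2*radius+1)^2 window. Cross-channel filters of [14] are omitted."""
    mask = (act > act.mean()).astype(float)
    h, w, c = mask.shape
    out = np.zeros_like(mask)
    for ch in range(c):
        for r in range(h):
            for col in range(w):
                r0, r1 = max(0, r - radius), min(h, r + radius + 1)
                c0, c1 = max(0, col - radius), min(w, col + radius + 1)
                win = mask[r0:r1, c0:c1, ch].sum() - mask[r, col, ch]
                out[r, col, ch] = win * mask[r, col, ch]
    return out
```

The output has the same shape as the activation tensor, as stated above; the full method additionally mixes channels through the learned-size filter F.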

Global pooling measurements
The input to a global pooling layer is a set of n^(l) activation maps A_i^(l) computed previously by layer l, and the output is one global measurement g_i^(l) for each activation map A_i^(l) (1 ≤ i ≤ n^(l)). The n^(l) measurements then become the inputs to an FC layer. In [18], these pooling measurements are transformed into feature vectors. Two global pooling measurements are used for feature extraction in the experiments presented here: Global Entropy Pooling (GEP) and Global Mean Thresholding Pooling (GMTP).

GEP computes the entropy value of A_i^(l). The probability distribution p_i^(l) of A_i^(l) is calculated first by normalizing the values to [0, 255] and then by computing a histogram from the normalized activation map using 255 bins; p_i^(l) is simply the resulting histogram divided by the sum of its elements, so that Σ_j p_i^(l)[j] = 1 (0 ≤ j ≤ 255). Thus, GEP is defined as:

GEP(A_i^(l)) = − Σ_j p_i^(l)[j] ln(p_i^(l)[j]).  (6)

Unlike GEP, GMTP includes more layer information in the feature extraction process. To compute GMTP, a threshold t^(l) must be obtained by averaging the values of the entire set of activation maps A^(l):

t^(l) = (1 / (n^(l) h^(l) w^(l))) Σ_{i=1}^{n^(l)} Σ_{x=1}^{h^(l)} Σ_{y=1}^{w^(l)} A_i^(l)[x, y],

where x and y denote an element's position in the i-th activation map computed previously by layer l. Whereas n^(l), as already noted, represents the number of activation maps, h^(l) and w^(l) are the height and width of each map. GMTP is then the proportion of elements in each A_i^(l) with values below the threshold t^(l).
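The two measurements can be sketched in a few lines of numpy (function names ours; edge cases such as constant maps are handled with our own assumptions):

```python
import numpy as np

def gep(act_map):
    """Global Entropy Pooling: normalize the activation map to [0, 255],
    histogram it with 255 bins, and return the entropy of the resulting
    probability distribution."""
    a = act_map.astype(float)
    rng = a.max() - a.min()
    norm = (a - a.min()) / rng * 255 if rng > 0 else np.zeros_like(a)
    hist, _ = np.histogram(norm, bins=255, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

def gmtp(act_maps):
    """Global Mean Thresholding Pooling: one value per map, the fraction
    of elements below the mean of the whole stack of activation maps."""
    threshold = act_maps.mean()
    return [(m < threshold).mean() for m in act_maps]
```

Each layer thus contributes one scalar per channel, which is what keeps these descriptors compact enough for SVM training without any further reduction.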

Sequential Forward Floating Selection of Layers/Classifiers
In some of the experiments presented in this work, we examine the performance of a layer selection method (i.e., a classifier selection procedure) using Sequential Forward Floating Selection (SFFS), as described in [27]. Classifier selection with SFFS works by adding to the final ensemble the model that produces the highest increase in performance with respect to the existing subset of models. A backtracking step then replaces the worst model in the current ensemble with a better performing one. Since SFFS requires a training phase to select the best models for the task, we adopt a leave-one-data-set-out selection protocol.
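A generic sketch of the procedure (our own simplification, not the exact protocol of [27]; `evaluate` is assumed to score the sum-rule ensemble of a subset of classifiers on validation data):

```python
def sffs(scores, evaluate, max_size):
    """Sequential Forward Floating Selection over a pool of classifiers.
    `scores[i]` stands for classifier i's validation output; `evaluate`
    maps a list of classifier indices to the ensemble's accuracy."""
    selected = []
    while len(selected) < max_size:
        # forward step: add the classifier giving the best improvement
        remaining = [i for i in range(len(scores)) if i not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda i: evaluate(selected + [i]))
        selected.append(best)
        # backtracking ("floating") step: drop a member if that helps
        while len(selected) > 2:
            drops = [[j for j in selected if j != i] for i in selected]
            best_drop = max(drops, key=evaluate)
            if evaluate(best_drop) > evaluate(selected):
                selected = best_drop
            else:
                break
    return selected
```

The backtracking step is what distinguishes floating selection from plain forward selection: a classifier added early can still be evicted later if the ensemble improves without it.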

Experimental Results
This section describes the experimental results on twelve publicly available medical image data sets:
• CH (CHO data set [28]) contains 327 fluorescence microscope 512×382 images of Chinese Hamster Ovary cells divided into five classes;
• HE (2D HeLa data set [28]) contains 862 fluorescence microscopy 512×382 images of HeLa cells stained with various organelle-specific fluorescent dyes. The images are divided into ten classes of organelles;
• RN (RNAi data set [29]) contains 200 fluorescence microscopy 1024×1024 TIFF images of fly cells (D. melanogaster) divided into ten classes;
• MA (C. elegans Muscle Age data set [29]) contains 237 1600×1200 images for classifying the age of the nematode given twenty-five images of C. elegans muscles collected at four ages;
• TB (Terminal Bulb Aging data set [29]) is the companion data set to MA and contains 970 768×512 images of C. elegans terminal bulbs collected at seven ages;
• LY (Lymphoma data set [29]) contains 375 1388×1040 images of malignant lymphoma representative of three types;
• LG (Liver Gender Caloric Restriction (CR) data set [29]) contains 265 1388×1040 images of liver tissue sections from six-month-old male and female mice on a CR diet;
• LA (Liver Aging Ad-libitum data set [29]) contains 529 1388×1040 images of liver tissue sections from female mice on an ad-libitum diet divided into four classes representing the age of the mice;
• BGR (Breast Grading Carcinoma [30]): a Zenodo data set (record: 834910#.Wp1bQ-jOWUl) that contains 300 1280×960 annotated histological images of twenty-one patients with invasive ductal carcinoma of the breast representing three classes/grades;
• LAR (Laryngeal data set [31]): a Zenodo data set (record: 1003200#.WdeQcnBx0nQ) containing 1320 1280×960 images of thirty-three healthy and early-stage cancerous laryngeal tissues representative of four tissue classes;
• LO (Locate Endogenous data set [32]) contains 502 768×512 images of endogenous cells divided into ten classes. This data set is archived at https://integbio.jp/dbcatalog/en/record/nbdc00296;
• TR (Locate Transfected data set [32]) is a companion data set to LO and contains 553 768×512 images divided into the same ten classes as LO plus one additional class, for a total of eleven classes.
Data sets 1-8 can be found at https://ome.grc.nia.nih.gov/iicbu2008/; data sets 9-10 are on Zenodo and can be accessed by the record numbers provided in parentheses in the data set descriptions. Data sets 11 and 12 are available upon request.
The five-fold cross-validation protocol is applied to all data sets except for LAR, which uses a three-fold protocol.Although the size of the original images is provided above in the data set descriptions, all images were resized to fit the input size for the given CNN model.
In our experiments, we obtained better results by tuning the CNN on each training set, without PCA postprocessing, and by applying the reduction methods locally (i.e., separately on each channel of a given layer). For this reason, most of the results reported in the following tables for the dimensionality reduction methods (unless otherwise specified) are based on tuning the CNNs without PCA postprocessing and on the local application of the methods. As noted in the introduction, the Wilcoxon signed-rank test [33] is the measure used to validate experiments.
Reported in Table 1 is the performance of all the approaches using ResNet50, and reported in Table 2 are the most interesting results using GoogleNet. The row labeled CNN in Tables 1 and 2 reports the performance obtained using the standard standalone CNN, be it ResNet50 (Table 1) or GoogleNet (Table 2). The label TunLayer-x represents features extracted for SVM training using the x-th-to-last layer of the network tuned on the given training set. The best performance for TunLayer-x is obtained with x = 3; for comparison purposes, we also report the performance of Layer-3 on the CNN pre-trained on ImageNet without tuning on the given data sets. The label TunFusLayer is the fusion by sum rule of the TunLayer-x classifiers. A row labeled X+Y indicates the sum rule between X and Y, and the method named g-DC is DC applied globally, as in [17].
The SVM classifiers were tested using LibSVM and MATLAB's fitcecoc. Performance was higher using fitcecoc (except for CoOC) and when the CNN was not tuned. For the last four layers, the SVMs were trained on the original features.
Feature vectors produced by a given layer of the CNN with a dimensionality higher than 5000 were processed by applying the dimensionality reduction techniques in the following ways:
• CoOC, GEP, and GMTP use a single value extracted from each channel (see details in the previous section);
• For the other approaches, the method is first applied separately to each channel; then 1000/(number of channels) features are extracted from each channel;
• For g-DC, all the features from all channels are first concatenated and then reduced to a 1000-dimensional vector by applying DCT.
In the following tables, some classifiers are labeled as follows:
• Ens15CNN, the sum rule among fifteen standard ResNet50 CNNs or fifteen standard GoogleNets. This is a baseline approach since our method is an ensemble of classifiers;
• (DC+GMTP)-2, the approach (DC+GMTP) where the two last layers of the CNN are not used for feeding the SVMs. Note that DC and GMTP are extracted considering two different trainings of the CNN (this is done to increase the diversity of the features extracted by the two methods);
• SFFS(X), the sum-rule combination of X SVMs selected by SFFS.
For GoogleNet, only the most interesting approaches reported for ResNet50 are tested; this is done to reduce computation time. The analysis of the results reported in Tables 1 and 2 leads to the following observations:
• DC clearly outperforms (p-value 0.0001) g-DC on both GoogleNet and ResNet50. Applying DCT separately to each channel boosts performance with respect to a single application of DCT to the whole layer;
• The best methods for reducing the dimensionality of the inner layers are DC, PC, GMTP, and GEP;
• On average, the best approach is (DC+GMTP)-2, i.e., the sum rule between DC and GMTP;
• On average, discarding the SVMs trained on the two last layers slightly improves performance;
• DC outperforms (p-value 0.01) every TunLayer-x; this implies that the inner layers are also useful in the tuned networks. Both TunLayer-3 and DC strongly outperform (p-value 0.01) CNN on all the tested data sets. Using a GoogleNet/ResNet50 directly to classify images does not maximize performance, probably due to overfitting given the size of the training sets.
Notice that we trained an SVM classifier on each of the ten selected layers. Considering the size of GoogleNet and ResNet50, using larger CNNs with so many layers would not be the best choice.
To test the generalizability of our approach, Table 3 reports experiments run on a popular Virus benchmark data set [34] located at http://www.cb.uu.se/_Gustaf/virustexture/. This data set contains 1500 41×41 Transmission Electron Microscopy (TEM) images of viruses belonging to fifteen species and is divided into two different data sets: 1) the object scale data set, so named because the radius of every virus in each image is 20 pixels, and 2) the fixed scale data set, so called because each virus image is represented such that the size of 1 pixel corresponds to 1 nm. The first data set, used in the following experiments, is publicly available. The second is proprietary and unavailable for testing due to copyright issues. It is the object scale data set that is widely reported in the literature.
Regarding the object scale data set, two networks were trained: DenseNet201, the network providing the best performance on this data set in the literature, and ResNet50, because a large number of relevant papers report results using this network. Both CNNs were trained for fifty epochs, with all other parameters the same as in the tests reported above. In Table 3, we report the performance obtained using both LibSVM and fitcecoc classifiers. DenseNet produces better performance using LibSVM. To reduce computation time, the combination (DC+PC+GMTP)-2 is not run on ResNet because, when coupled with DenseNet, it obtains a performance similar to (DC+GMTP)-2. The last column of Table 3 reports the fusion by sum rule between the two CNNs; before the sum, the scores of each ensemble are normalized by dividing them by the number of trained SVMs.
Finally, in Table 4, we compare our approach with the best performances reported in the literature. As can be observed in Table 4, our proposed method obtains state-of-the-art performance. In [34], the reported performance is obtained using the fixed scale data set. Since that data set is not publicly available, direct comparisons with [34] cannot be made. By combining features computed on the object scale and fixed scale data sets, an accuracy of 87.0% is obtained in their work.

Table 4. Comparison with the literature
Note: the method notated with * combines descriptors based on both object scale and fixed scale images.

Conclusion
The objective of this work was to explore the power of using both the intermediate and the last layers of three pre-trained CNNs to represent images with fixed-length feature vectors that can be used to train an ensemble of SVMs. To overcome the high dimensionality of the features extracted from the inner layers, experiments tested many different dimensionality reduction techniques, including two classic feature transforms (DCT and PCA), a feature selection approach (chi-square feature selection), a texture descriptor (local binary patterns) followed by feature selection, and a representation based on the co-occurrence among elements of the channels of inner layers.
The best ensemble reported here is shown to significantly boost the performance of standard CNN on a large and diverse group of image data sets as well as on a popular benchmark virus data set where the ensemble obtained state-of-the-art performance.
As future work, we plan to combine this approach with other deep neural networks and to test different methods for representing the inner layers in a compact way for feeding SVMs.

Table 3. Performance on the Virus data set to test generalizability.