Closing the Performance Gap between Siamese Networks for Dissimilarity Image Classification and Convolutional Neural Networks

In this paper, we examine two strategies for boosting the performance of ensembles of Siamese networks (SNNs) for image classification using two loss functions (Triplet and Binary Cross Entropy) and two methods for building the dissimilarity spaces (FULLY and DEEPER). With FULLY, the distance between a pattern and a prototype is calculated by comparing two images using the fully connected layer of the Siamese network. With DEEPER, each pattern is described using a deeper layer combined with dimensionality reduction. The basic design of the SNNs takes advantage of supervised k-means clustering for building the dissimilarity spaces that train a set of support vector machines, which are then combined by sum rule for a final decision. The robustness and versatility of this approach are demonstrated on several cross-domain image data sets, including a portrait data set, two bioimage data sets, and two animal vocalization data sets. Results show that the strategies employed in this work to increase the performance of dissimilarity image classification using SNNs close the gap with standalone CNNs. Moreover, when our best system is combined with an ensemble of CNNs, the resulting performance is superior to that of the CNN ensemble alone, demonstrating that our new strategy extracts additional information.


Introduction
Interest in classification systems based on (dis)similarity spaces is resurging. Unlike the more common technique of classifying samples within a feature space, (dis)similarity classification estimates the class of an unknown pattern by examining its similarities and dissimilarities with a set of training samples and pairwise (dis)similarities between each of the members. This process has come to involve more than the application of standard distance measures; (dis)similarity classification is also a way to build new spaces.
Though the two terms of similarity and dissimilarity are rarely disambiguated in the literature, classification based on the notion of dissimilarity is an idea first proposed in [1], where the focus was on comparing differences between samples belonging to different classes. Dissimilarity classification can be tackled by using either dissimilarity vectors, as in [2][3][4][5][6], or dissimilarity spaces, as in [7][8][9][10][11][12][13][14]. In the former case, two samples are considered positive if they belong to the same class and negative if they belong to separate classes. The goal of the classifier is to decide which of these two cases a given dissimilarity vector represents. For a more detailed discussion of this approach, see [15].
In contrast, dissimilarity methods that generate dissimilarity spaces, the approach taken here, produce classifiers from within feature vector spaces. Unlike traditional feature vectors, which represent samples as measured across all features, vectors in a dissimilarity space describe each sample by its dissimilarities to a set of reference prototypes. Systems built with these strategies are compared, fused, and evaluated against previous work on dissimilarity classification. The versatility and robustness of the best ensemble developed using these techniques are demonstrated on five cross-domain image data sets representing medical imaging problems, animal vocalizations (spectrograms), and portrait images.

Proposed Approach
The basic system can be described as follows. The inputs into the system, as in [12][13][14], are the original images and HASC descriptors [19], which are extracted to produce a new processed image. If the original image is in color, HASC is applied separately to each band; if it is grey level, the HASC image is replicated three times to build an image with three bands.
Starting with the vector space representations, step 1 of the training process, as illustrated in Figure 1, begins by generating a set of clusters that produce a set of prototypes. The prototypes are the centroids generated by k-means on the vector space representations. In step 2, a dissimilarity space is generated by an SNN that learns a distance measure from the prototypes, one that minimizes differences between pairs within the same class while maximizing differences between pairs from different classes; this process produces a feature vector on which an SVM is trained. In the testing stage, an unknown pattern is projected onto the dissimilarity space learned by the SNN, which generates the feature vector that is then fed into the trained SVM for a decision (the SVM hyperparameters were not optimized; we used a generic setting: radial basis function kernel, C = 1000, gamma = 0.1).
The SNN, as illustrated in Figure 2, combines two identical deep learners whose outputs are subtracted, producing a feature vector (the absolute value of the difference) that is passed to a sigmoid and a loss function as in [12][13][14]. In this way, the FC layer and sigmoid predict the dissimilarity of the two input images (Inputs 1 and 2). The feature vector (FC) is computed by subtracting the outputs (F1 and F2). Unlike [12][13][14], which used binary cross entropy, two different loss functions are tested here (binary cross entropy and triplet loss), and the CNN subnets are optimized with Adam and some Adam variants.
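The training pipeline above can be sketched in a few lines of Python. This is a minimal toy illustration, not the paper's implementation: a plain Euclidean distance stands in for the learned Siamese distance d(x, p), the k-means routine is a naive stdlib version, and the SVM is omitted (the dissimilarity vectors produced here would be its training input).

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: returns k centroids used as prototypes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):
            if c:  # recompute centroid as the cluster mean
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids

def dissimilarity_vector(x, prototypes, dist):
    """Project pattern x onto the dissimilarity space: one distance per prototype."""
    return [dist(x, p) for p in prototypes]

# Euclidean distance stands in for the learned Siamese distance d(x, p).
euclid = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
```

The resulting k-dimensional vectors (one per training image) are what the SVM is trained on in the actual system.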
Though some variations are indicated in Figures 1 and 2, they only show the output of one SNN fed into one SVM. In [12][13][14] and this work, many SNNs and SVMs are trained, tested, and combined. Eight CNN topologies form the backbone of the SNNs. These are the identical topologies described in [14] (for the reader's convenience, the table in [14] that details the topologies is reprinted in the Appendix A). Thus, a large number of SNNs are trained using the different topologies, the two loss functions, and the Adam optimization algorithms. Each of these systems is tested, fused, and evaluated to build the best-performing system empirically.
The pseudocode for each step in Figure 1 can be found in the following sources: [12][13][14] (see as well the companion source code for this paper available at https://github.com/LorisNanni (accessed on 25 August 2021)).
Below, we focus on the new techniques proposed in this work: the application of two methods for generating the dissimilarity space (Section 2.1), the two different loss functions (Section 2.2) and the Adam optimization methods, including a new one proposed here (Section 2.3).

Methods for Generating the Dissimilarity Spaces
Both methods for generating the dissimilarity space follow the same basic process used in [12][13][14]: first, k-means is applied on a vector space representation of the training images, with prototypes calculated as the k centroids of the clusters produced. Second, a feature vector F ∈ R^k is extracted by calculating the distances of image x from each of the prototypes, where the i-th component is given as F_i = d(x, p_i), with d the learned distance between image x and prototype p_i. The resulting feature vector is fed into the SVM.
The two methods for generating the dissimilarity space are labeled FULLY and DEEPER. With FULLY, the distance between a pattern and a prototype is obtained directly by comparing the two images using the Siamese network. With DEEPER, each pattern is described using a deeper layer than the fully connected backbone network of the Siamese network. To reduce the high dimensionality of this deeper layer, the Discrete Cosine Transform (DCT) is applied separately to each channel of that layer (see Section 2.2). Finally, the distance between a pattern and a prototype is given by the cosine distance. In other words, the backbone of the Siamese network is used as the feature extractor.
For the sake of space, the layers used in DEEPER are reported in the MATLAB toolbox available at https://github.com/LorisNanni (accessed on 25 August 2021) (for the reader's convenience, these layers are also reported in the Appendix A of this paper). This step is not optimized. We have chosen the layer before the last ReLU or fully connected layer to prevent overfitting the results rather than selecting layers optimized for each data set. Optimal layers could have been discovered using a leave-one-data-set-out procedure, but this was not feasible given the computational power of our GPUs. In Figure 3, we report the scheme of DEEPER.
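The cosine distance used by DEEPER to compare a pattern's DCT-reduced features with a prototype's can be sketched as follows (a minimal stdlib version; the toolbox's own implementation may differ):

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two feature vectors: 1 - cos(angle between them).

    Returns 0 for parallel vectors and approaches 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

In DEEPER, u would be the concatenated DCT features of the pattern and v those of a prototype; one such distance per prototype fills the dissimilarity vector.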

DCT Dimensionality Reduction
Because DEEPER uses a deeper layer compared to the fully connected backbone to generate the dissimilarity space, a method is needed to reduce dimensionality on each channel (with results combined) of the deeper layer. DCT [20] is the dimensionality transform selected here because (1) its components are typically small in magnitude (most information is located in the low-frequency coefficients), and (2) it balances information packing and computational complexity.
DCT can be expressed as

B_xy = α_x α_y Σ_{p=0}^{N−1} Σ_{q=0}^{N−1} A_pq cos[π(2p + 1)x / 2N] cos[π(2q + 1)y / 2N], 0 ≤ x, y ≤ N − 1,

where α_x = 1/√N for x = 0 and α_x = √(2/N) for 1 ≤ x ≤ N − 1 (and likewise for α_y); N is the number of rows/columns of the image (the input of the CNN is a square matrix); p and q are the pixel indices of the input image A; and x and y are the indices of the DCT matrix B. Each channel is reduced to a dimension of 9 × 9. All the features extracted from each channel are concatenated into a single vector that represents a given pattern/prototype.
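A direct (unoptimized) stdlib implementation of this 2-D DCT, together with the low-frequency truncation used by DEEPER, might look as follows. The `keep` parameter plays the role of the 9 × 9 block; the O(N^4) loops are for clarity only, since practical code would use a fast transform.

```python
import math

def dct2(A):
    """2-D DCT-II of a square N x N matrix, following the formula above."""
    N = len(A)
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    B = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            s = 0.0
            for p in range(N):
                for q in range(N):
                    s += A[p][q] \
                        * math.cos(math.pi * (2 * p + 1) * x / (2 * N)) \
                        * math.cos(math.pi * (2 * q + 1) * y / (2 * N))
            B[x][y] = alpha(x) * alpha(y) * s
    return B

def reduce_channel(A, keep=9):
    """Keep only the low-frequency keep x keep block of coefficients."""
    B = dct2(A)
    return [row[:keep] for row in B[:keep]]
```

For a constant input, all the energy concentrates in the B[0][0] (DC) coefficient, which is exactly the "information packing" property motivating the choice of DCT here.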

Binary Cross Entropy Loss (Cross)
In the training phase, every pair of images in the training set is fed into the backbone of the Siamese architecture to obtain a feature vector F. Calculated next is Z = |F_1 − F_2|, where F_1 and F_2 are the feature vectors of the two images in the pair. Z is passed through a fully connected layer and a sigmoid function that returns the probability Y that the two images belong to the same class. Cross is then used for the two-class problem.
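The loss computation for a single pair can be sketched as below. This is a toy illustration: the weights `w` and bias `b` of the fully connected layer are hypothetical placeholders, not values from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_bce(f1, f2, w, b, same_class):
    """Binary cross entropy loss on one image pair.

    Z = |F1 - F2| component-wise, then an FC layer (w, b are toy
    placeholder parameters) and a sigmoid giving Y = P(same class)."""
    z = [abs(a - c) for a, c in zip(f1, f2)]
    y = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
    t = 1.0 if same_class else 0.0
    return -(t * math.log(y) + (1.0 - t) * math.log(1.0 - y))
```

Identical feature vectors give Z = 0 and Y = 0.5 before training; the loss then drives the FC layer and backbone so that same-class pairs push Y toward 1 and different-class pairs toward 0.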
In the testing phase, for every sample in the training set, we compute F. Then, we compute N centroids using k-means clustering. Every image in the training set is expressed as the vector of the distances between its features and the centroids. An SVM is then trained on those vectors. We then apply this inference algorithm to the images in the test set.

Triplet Loss (Triplet)
With Triplet, we take three images as inputs, labelled A (anchor), P (positive), and N (negative). It is assumed that A and P share the same label while A and N have different labels.
In the training phase, for every triplet in the training set, the feature vectors F_A, F_P, and F_N are computed and then passed through a sigmoid to obtain Y_A, Y_P, and Y_N. At that point, the loss function is

L = max(0, |Y_A − Y_P|_2^2 − |Y_A − Y_N|_2^2 + ξ),

where ξ is a positive number and |x|_2 is the Euclidean norm of the vector. In other words, the loss function encourages the network to create similar representations for samples in the same class and different representations for samples in different classes. ξ is the margin; the value used is 1 because, in the fixed-margin tests carried out, it was the one that returned the best results. In the testing phase, the process is exactly the same as described for the testing phase of cross-entropy loss.
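A minimal sketch of this margin loss, assuming the standard squared-Euclidean form of the triplet loss (the sigmoid outputs Y_A, Y_P, Y_N are passed in directly):

```python
def triplet_loss(ya, yp, yn, margin=1.0):
    """Triplet loss: max(0, ||ya - yp||^2 - ||ya - yn||^2 + margin).

    Zero when the anchor is already closer to the positive than to
    the negative by at least the margin."""
    d_ap = sum((a - p) ** 2 for a, p in zip(ya, yp))
    d_an = sum((a - n) ** 2 for a, n in zip(ya, yn))
    return max(0.0, d_ap - d_an + margin)
```

With margin = 1 (the value used in this work), the loss vanishes only once same-class representations are pulled together and different-class ones pushed apart beyond the margin.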

Adam Variants
Introduced in [20,21], the widely used optimization method Adam (referred to as Base Adam in the experimental section) takes advantage of adaptive gradient and momentum to compute adaptive learning rates for each parameter. It makes use of the gradient at the current step, the exponential moving average of the gradient (first order moment), and the exponential moving average of the square of the gradient (second order moment).
Thus, the first moment m_t and the second moment u_t are defined as

m_t = ρ_1 m_{t−1} + (1 − ρ_1) g_t,
u_t = ρ_2 u_{t−1} + (1 − ρ_2) g_t^2,

where the hyperparameters ρ_1 and ρ_2 represent the exponential decay rates for the first and second moments (set to 0.9 and 0.99, respectively), g_t is the gradient at time t, and the square of g_t is calculated component-wise. The moments are initialized as m_0 = u_0 = 0.
To avoid small values of the moving averages due to their initialization to zero, Adam includes bias-corrected versions of the first and second order moments:

m̂_t = m_t / (1 − ρ_1^t), û_t = u_t / (1 − ρ_2^t).

The parameter update is computed as follows:

θ_t = θ_{t−1} − λ m̂_t / (√û_t + ε),

where λ is the learning rate and ε is a very small positive number used to avoid any division by zero (usually set to 10^−8). The operations are component-wise.
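One Adam update can be sketched directly from these equations (a didactic stdlib version operating on flat parameter lists; the hyperparameter defaults match the settings used in this work):

```python
import math

def adam_step(theta, g, m, u, t, lr=1e-4, rho1=0.9, rho2=0.99, eps=1e-8):
    """One component-wise Adam update at iteration t (1-based)."""
    # exponential moving averages of the gradient and its square
    m = [rho1 * mi + (1 - rho1) * gi for mi, gi in zip(m, g)]
    u = [rho2 * ui + (1 - rho2) * gi * gi for ui, gi in zip(u, g)]
    # bias correction compensates for the zero initialization
    m_hat = [mi / (1 - rho1 ** t) for mi in m]
    u_hat = [ui / (1 - rho2 ** t) for ui in u]
    theta = [th - lr * mh / (math.sqrt(uh) + eps)
             for th, mh, uh in zip(theta, m_hat, u_hat)]
    return theta, m, u
```

At t = 1 the bias correction cancels the (1 − ρ) factors exactly, so the very first step has magnitude ≈ λ regardless of the gradient scale.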
As noted in [22], Adam performs reasonably well in practice compared to other adaptive learning methods; however, Adam does not utilize the change in immediate past gradient information, a utilization that is incorporated in [22,23].

DGrad
This variant, proposed in [23], makes use of the absolute difference between the current gradient g_t and the moving average of the element-wise squares of the gradients:

Δag_t = |g_t − avg_t|,

where avg_t is the moving average of the component-wise squares of the gradient. The absolute difference Δag_t is then normalized by its maximum component as follows:

Δg_t = Δag_t / max(Δag_t).

Then, ξ_t is defined as

ξ_t = Sig(Δg_t),

where Sig(Δ) is the sigmoid function

Sig(Δ) = 1 / (1 + e^{−Δ}).

Each parameter of the network is finally updated following the equation

θ_t = θ_{t−1} − λ ξ_t m̂_t / (√û_t + ε),    (12)

where m̂_t and û_t are the first and second order moments seen in Adam.
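The per-parameter scaling factor ξ_t can be sketched as follows. This is a reading of the DGrad equations above rather than the reference implementation of [23], so take it as an assumption-laden illustration:

```python
import math

def sig(x):
    """Sigmoid Sig(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def dgrad_scale(g, avg):
    """Per-parameter DGrad factor xi_t (sketch of the equations above).

    g: current gradient; avg: moving average of the squared gradients.
    Multiplying the Adam step by xi_t damps parameters whose gradient
    sits close to its recent history."""
    dag = [abs(gi - ai) for gi, ai in zip(g, avg)]   # delta-ag_t
    mx = max(dag) or 1.0                             # guard division by zero
    dg = [d / mx for d in dag]                       # normalized to [0, 1]
    return [sig(d) for d in dg]                      # xi_t in [0.5, sig(1)]
```

Since the normalized differences lie in [0, 1], ξ_t lies between 0.5 and Sig(1) ≈ 0.731, so DGrad always attenuates the Adam step, most strongly where the gradient matches its moving average.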

DecayDGrad (New)
This DGrad variant introduces a learning rate decay, both locally and over the whole training process. The local decay is achieved with a periodic impulse imp_t, where s = 10 is the period (the number of iterations between each impulse). The impulse imp_t is then multiplied by a global decay factor d_t, where n_iter is the total number of iterations in the training process. The parameter c = 0.25, multiplied by n_iter, determines the iteration at which d_t assumes its maximum value. The parameter ξ_t is therefore obtained by modulating the DGrad factor with imp_t · d_t, and each parameter of the network is updated as shown in (12). Notice that imp_t only takes values in the range 0 to 1, and its maximum value is assumed at iterations that are multiples of s. The purpose of these restraints is to attenuate the value calculated by DGrad locally, namely progressively over the span of s iterations, to get a better evaluation of the local minimum, thereby avoiding an eventual overshoot of the global minimum.
The reason behind the learning rate decay factor d_t is to keep the learning rate high in the initial part of the training, which accelerates training and avoids the memorization of noisy data, while at the same time extending the decay in later iterations. In this way, DGrad can learn complex patterns, as shown in [24]. The plot of d_t and imp_t · d_t is reported in Figure 4.

Data Sets
The following five image data sets, representing very different classification tasks, were selected to demonstrate the versatility of the proposed method:
• BIRDz [25]: This balanced data set is a real-world benchmark for bird species vocalizations. The testing protocol is ten runs using the data split in [25]. The audio tracks were extracted from the Xeno-Canto Archive (http://www.xeno-canto.org/ (accessed on 25 August 2021)). BIRDz contains a total of 2762 acoustic samples from eleven North American bird species, along with 339 unclassified audio samples (consisting of noise and unknown bird vocalizations). The bird classes vary in size from 246 to 259. Each observation is represented by five spectrograms: (1) constant frequency, (2) frequency modulated whistles, (3) broadband pulses, (4) broadband with varying frequency components, and (5) strong harmonics.
• CAT [26,27]: This data set has ten balanced classes of cat vocalizations, each containing ~300 samples, for a total of 2962 samples taken from Kaggle, YouTube, and Flickr. The testing protocol is 10-fold cross-validation. The average duration of each sample is 4 s.
• InfLar [28]: This data set contains eighteen Narrow-Band Imaging (NBI) endoscopic videos of eighteen different patients with laryngeal cancer. The videos were retrospectively analyzed and categorized into four classes (informative, blurred, containing saliva or specular reflections, and underexposed). The average video length is 39 s. The videos were acquired with an NBI endoscopic system (Olympus Visera Elite S190 video processor and an ENF-VH rhino-laryngo videoscope) with a frame rate of 25 fps and an image size of 1920 × 1072 pixels. A total of 720 video frames, 180 for each of the four classes, were extracted and labeled. The testing protocol is three-fold cross-validation with data separated at the patient level to ensure that the frames from the same class were classified based on the features characteristic of each class and not due to features linked to the individual patient (e.g., vocal fold anatomy).
• RPE [29]: This is a medical image classification data set that addresses the problem of distinguishing the maturation of human stem cell-derived retinal pigmented epithelium. RPE is based on 195 images that were divided into sixteen subwindows. These subwindows were then assigned to one of four classes: (1) Fusifors, (2) Epithelioid, (3) Cobblestone, and (4) Mixed. Subwindows that were out of focus or that contained background information exclusively were discarded. This division of the images into subwindows and the exclusion process produced a total of 1862 images.
• Port [30]: This data set contains 927 paintings from six different art movements: (1) High Renaissance, (2) Impressionism, (3) Northern Renaissance, (4) Post-Impressionism, (5) Rococo, and (6) Ukiyo-e. Ten-fold cross-validation is the testing protocol.
The same testing protocol presented in the papers introducing each data set is used in the experimental section, with accuracy being the performance indicator.

Experimental Results
The default settings in the MATLAB framework for Siamese networks were used to train the SNNs in all experiments to ensure no overfitting for any given data set. For Adam optimization and its variants, the number of iterations was set to 3000 with no stop criterion, the gradient decay factor to 0.9, the squared gradient decay factor to 0.99, and the learning rate to 0.0001.
The first run of experiments is reported in Table 1. In these tests, we used all the data sets. Each performance cell in Table 1 contains three rows of values for each data set:

1. Top: the performance obtained using the method named FULLY for SVM input;
2. Middle: the performance obtained using the method named DEEPER for SVM input;
3. Bottom: the fusion by average rule of the SVMs in 1 and 2.

The last row in Table 1 reports the average performance of each approach in that column. The clustering method is k-means for all methods, and the number of prototypes is in the set {15, 30, 45, 60}. Thus, four networks are trained using the four numbers of prototypes in the set; the four SVMs trained in this way are combined by average rule.
For the sake of computation time, we used a single network topology in this test, which is the first topology tested in [14] and the Siamese topology recommended by Mathworks (see the Appendix A).

The columns of Table 1 report the following approaches:
• Cross: Binary Cross Entropy loss function coupled with base Adam (this is the best approach proposed in [14]);
• CrossDD: Binary Cross Entropy loss function coupled with our new Adam variant DecayDGrad;
• Triplet: Triplet loss function coupled with base Adam;
• X + Y (columns 5 and 6): the fusion between X and Y.
From the results reported in Table 1, the following conclusions can be drawn:
• Triplet produces a result that is similar to Cross on three data sets but performs better than Cross on InfLar and worse on CAT;
• The fusion between Cross and Triplet boosts the performance of the base loss functions, except in the case of CAT;
• The fusion among all the different approaches (see the bottom cells in the columns Triplet + Cross and Triplet + Cross + CrossDD) produces the best average performance.

Table 2 reports results using combinations of the two loss functions on all eight topologies. Because running experiments on all five data sets was computationally too expensive, we chose to run them only on InfLar and Port because they represent very different application problems.
In each cell of Table 2, the following four results are reported:

1. Top: Cross loss function coupled with FULLY for SVM input (the best approach proposed in [14]);
2. Upper: Triplet loss function coupled with FULLY for SVM input;
3. Lower: fusion by average rule among Cross coupled with FULLY, Cross coupled with DEEPER, Triplet coupled with FULLY, and Triplet coupled with DEEPER;
4. Bottom: the fusion by average rule of SVMs 1 and 2 described for the method reported at the bottom of Table 1, but with the addition of CrossDD coupled with both FULLY and DEEPER.
The last row of Table 2 reports the fusions of #4 above for the numbered topologies.
In [13], we showed that combining more than four networks using the same topology (but varying the clustering algorithm) failed to improve performance. Examining Table 2, we discovered that changing the loss function and the method for building the dissimilarity space is beneficial when making an ensemble. We also observed that for all topologies except #6 on the Portrait data set (Port), the best performance is not obtained by the Cross loss coupled with FULLY (as was the case in [14]); instead, on average, the new method DEEPER succeeds in boosting performance. Finally, we learned that adding CrossDD, our new Adam variant, to the ensemble for InfLar generally does not increase performance; CrossDD works very well with the first topology but performs worse with the other topologies. On Port, however, the addition of CrossDD generally does improve performance.
In Table 3, we compare our best results on InfLar and Port with the best ensembles reported in [12][13][14], which tested ensembles of SNNs and CNN subnets using all eight topologies. In addition, the performance of four well-known CNNs is reported for baseline comparison, along with their fusion (eCNN) by average rule. The fine-tuning of the CNNs pretrained on ImageNet was performed with the following training options: batch size: 30; max epoch: 20 (for all the networks, with no freezing). The row "Fusion x-y + eCNN" is the sum rule between Fusion x-y (see Table 2) and eCNN. Before the fusion, the scores of Fusion x-y and eCNN are normalized to mean 0 and standard deviation 1.
Table 2. Performance varying the network topologies (topologies are described in [14] and reprinted in the Appendix A; boldface represents the best performance).

As can be observed in Table 3, the proposed ensembles outperform previous methods based on Siamese networks and boost the performance of the ensemble of CNNs. On the InfLar data set, the performance of the best standalone topology (see Table 2) is 92.78, which is comparable with the performance obtained by a CNN; however, on the Port data set, where our new Adam variant increased performance, the performance gap between the CNNs and the Siamese networks is still significant. The approach proposed in this work also greatly improves on previous Siamese methods applied to this data set.
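The score normalization and sum-rule fusion used for "Fusion x-y + eCNN" can be sketched as follows (a minimal stdlib illustration of the z-score-then-sum scheme, not the paper's MATLAB code):

```python
import math

def zscore(scores):
    """Normalize a list of scores to mean 0 and standard deviation 1."""
    mu = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
    return [(s - mu) / sd for s in scores]

def sum_rule(*score_lists):
    """Sum-rule fusion: z-score each classifier's scores, then add them."""
    normalized = [zscore(s) for s in score_lists]
    return [sum(col) for col in zip(*normalized)]
```

Normalizing before summing keeps one classifier's score scale from dominating the fusion, which is why the raw SNN-ensemble and eCNN scores are standardized first.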
Finally, in Table 4 we report the training time (in seconds) of the Siamese networks on the InfLar data set, considering the different topologies. The training time was computed using a GTX 1080. Both loss functions used here are considered in Table 4.
Table 4. Computation time for training a single Siamese network; each column reports the computation time of a given topology network, numbered 1-8 (topologies are described in [14] and reprinted in the Appendix A).

Conclusions
This paper proposes an image classification system that, like several recent studies, generates dissimilarity spaces from which features are extracted and trained on a set of SVMs. The objective of this study was to produce a high performing ensemble of Siamese networks based on combining different topologies, loss functions, and optimization methods (with one new Adam variant proposed here) from which features could be extracted for training the SVMs.
Results on five cross-domain image data sets demonstrate the superior power of the proposed approach compared with previous works using ensembles of Siamese networks. Comparison with the state-of-the-art confirms that the fusion of different topologies, loss functions, and optimization methods is a feasible way of generating a robust and highly generalizable image classification system.
In the future, we intend to validate our approach on additional cross-domain image data sets and investigate more techniques for building an ensemble of Siamese networks.
Data Availability Statement: All data sets are publicly available and the source code is located at https://github.com/LorisNanni/Closing-the-performance-gap-between-siamese-networks-fordissimilarity-image-classification-and-conv (accessed on 24 August 2021).