
Closing the Performance Gap between Siamese Networks for Dissimilarity Image Classification and Convolutional Neural Networks

Department of Information Engineering (DEI), University of Padova, 35131 Padova, Italy
Department of Information Technology and Cybersecurity, Missouri State University, 901 S, National Street, Springfield, MO 65804, USA
Department of Computer Science and Engineering (DISI), University of Bologna, Via dell’Università 50, 47521 Cesena, Italy
Author to whom correspondence should be addressed.
Sensors 2021, 21(17), 5809;
Received: 2 August 2021 / Revised: 18 August 2021 / Accepted: 25 August 2021 / Published: 29 August 2021
(This article belongs to the Special Issue 800 Years of Research at Padova University)


In this paper, we examine two strategies for boosting the performance of ensembles of Siamese networks (SNNs) for image classification using two loss functions (Triplet and Binary Cross Entropy) and two methods for building the dissimilarity spaces (FULLY and DEEPER). With FULLY, the distance between a pattern and a prototype is calculated by comparing two images using the fully connected layer of the Siamese network. With DEEPER, each pattern is described using a deeper layer combined with dimensionality reduction. The basic design of the SNNs takes advantage of supervised k-means clustering for building the dissimilarity spaces that train a set of support vector machines, which are then combined by sum rule for a final decision. The robustness and versatility of this approach are demonstrated on several cross-domain image data sets, including a portrait data set, two bioimage and two animal vocalization data sets. Results show that the strategies employed in this work to increase the performance of dissimilarity image classification using SNN are closing the gap with standalone CNNs. Moreover, when our best system is combined with an ensemble of CNNs, the resulting performance is superior to an ensemble of CNNs, demonstrating that our new strategy is extracting additional information.

1. Introduction

Interest in classification systems based on (dis)similarity spaces is resurging. Unlike the more common technique of classifying samples within a feature space, (dis)similarity classification estimates the class of an unknown pattern by examining its similarities and dissimilarities with a set of training samples and pairwise (dis)similarities between each of the members. This process has come to involve more than the application of standard distance measures; (dis)similarity classification is also a way to build new spaces.
Though the two terms of similarity and dissimilarity are rarely disambiguated in the literature, classification based on the notion of dissimilarity is an idea first proposed in [1], where the focus was on comparing differences between samples belonging to different classes. Dissimilarity classification can be tackled by using either dissimilarity vectors, as in [2,3,4,5,6], or dissimilarity spaces, as in [7,8,9,10,11,12,13,14]. In the former case, two samples are considered positive if they belong to the same class and negative if they belong to separate classes. The goal of the classifier is to decide which of these two cases a given vector was calculated on. For a more detailed discussion of this approach, see [15].
In contrast, dissimilarity methods that generate dissimilarity spaces, the approach taken here, produce classifiers from within feature vector spaces. Unlike traditional feature vectors representing samples as measured across all features, representation from feature vector spaces is the distance between pairs of samples. In [1], which introduced this approach, the authors applied prototype selection for training classifiers on dissimilarity spaces. The dissimilarity representations were used as a vector space. This method was applied to image retrieval by [8] using a prototype-based dissimilarity space. In [10], a compact representation based on prototype selection methods was derived from deep convolutional features and learned distance measures.
A loss function commonly used in dissimilarity classification is the Maximum Mean Discrepancy (MMD). In [11], the application of MMD enabled the source and target data in the dissimilarity space to harness the intra-class and inter-class distributions to produce a pairwise matcher. This version of MMD was also shown to work well across several data sets. A modification of the contrastive loss function for a Siamese Neural Network (SNN) [16,17] was proposed in [18] for brain image classification. The correlation distance of this variant of the loss function predicted the output features of image pairs. This method was expanded for audio classification in [12,13]. The audio samples, represented as spectrograms, were transformed by clustering methods into a set of centroids that generated dissimilarity spaces via SNN. The audio samples were then projected into the dissimilarity spaces to obtain a vector space representation that could be used to train Support Vector Machines (SVMs). An improved version of this method was developed for generic image classification in [14], where dissimilarity spaces were produced by a set of clustering methods and a set of SNNs with different CNN backbones. This approach was shown to compete well against state-of-the-art classifiers on several image data sets and obtained the highest classification score on one of them.
This work further expands [14] by proposing additional techniques for improving the performance of an ensemble of SNNs. As in the earlier work, each Siamese network, composed of eight different CNN topologies, generates a dissimilarity space whose features train an SVM, and the SVMs are then combined by sum rule. The strategies investigated here for improving performance further are the following:
  • Two different loss functions are used to train the Siamese networks: the binary cross entropy loss function and the triplet loss function.
  • Two different approaches for building the dissimilarity spaces are proposed for extracting features: the first is based on the fully connected layer and the latter on a deeper layer where the size of each channel is reduced by the Discrete Cosine Transform (DCT).
  • SNNs are optimized using different variants of Adam, with a new Adam variant proposed in this work.
Systems built with these strategies are compared, fused, and evaluated with previous work on dissimilarity classification. The versatility and robustness of the best ensemble developed using these techniques are demonstrated on five cross-domain image data sets representing medical imaging problems, animal vocalizations (spectrograms), and portrait images.

2. Proposed Approach

The basic system can be described as follows. The inputs into the system, as in [12,13,14], are the original images and HASC descriptors [19], extracted to produce a new processed image. If the original image is in color, HASC is applied separately on each band; if it is grey level, the HASC image is replicated three times to build an image with three bands.
Starting with the vector space representations, step 1 of the training process, as illustrated in Figure 1, begins by generating a set of clusters that produce a set of prototypes. The prototypes are centroids generated by k-means on the vector space representations. In step 2, a dissimilarity space is generated by an SNN that learns a distance measure from the prototypes that minimizes differences between pairs within the same class while maximizing differences between pairs from different classes, a process that produces a feature vector that is used to train an SVM. In the testing stage, an unknown pattern is projected onto the dissimilarity space that was learned by the SNN, which generates the feature vector that is then fed into the trained SVM for a decision. The SVM hyperparameters were not optimized; a generic setting was used (radial basis function kernel; C = 1000; gamma = 0.1).
The SNN, as illustrated in Figure 2, combines two identical deep learners whose outputs are subtracted, which produces a feature vector (the absolute value of the difference) that is passed to a sigmoid and a loss function as in [12,13,14]. In this way, the FC layer and sigmoid predict the dissimilarity of the two input images (Inputs 1 and 2). The feature vector (FC) is computed by subtracting the outputs (F1 and F2) as follows:
$$FC = |F_1 - F_2|$$
Unlike [12,13,14], which used binary cross entropy, two different loss functions are tested here (binary cross entropy and triplet loss function), and the CNN subnets are optimized with Adam and some Adam variants.
Though some variations are indicated in Figure 1 and Figure 2, they only show the output of one SNN fed into one SVM. In [12,13,14] and this work, many SNNs and SVMs are trained, tested, and combined. Eight CNN topologies form the backbone of the SNNs. These are the identical topologies described in [14] (for the reader’s convenience, the table in [14] that details the topologies is reprinted in Appendix A). Thus, a large number of SNNs are trained using the different topologies, the two loss functions, and the Adam optimization algorithms. Each of these systems is tested, fused, and evaluated to build the best-performing system empirically.
The pseudocode for each step in Figure 1 can be found in the following sources: [12,13,14] (see as well the companion source code for this paper available at (accessed on 25 August 2021)).
Below, we focus on the new techniques proposed in this work: the application of two methods for generating the dissimilarity space (Section 2.1), the two different loss functions (Section 2.2) and the Adam optimization methods, including a new one proposed here (Section 2.3).

2.1. Methods for Generating the Dissimilarity Spaces

Both methods for generating the dissimilarity space follow the same basic process used in [12,13,14]: first, k-means is applied on a vector space representation of the training images, with prototypes calculated as the k centroids of the clusters produced. Second, a feature vector $F \in \mathbb{R}^k$ is extracted by calculating the distances of image $x$ from each of the prototypes, where the $i$-th component $F_i$, the distance between $x$ and prototype $p_i$, is given as $F_i = d(x, p_i)$. The resulting feature vector $F$ is fed into the SVM.
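The projection of a pattern onto the prototypes can be sketched in a few lines of Python. This is only an illustration: the Euclidean distance stands in for the distance d that the Siamese network learns, and the names `euclidean`, `dissimilarity_vector`, and the toy prototypes are hypothetical, not taken from the companion code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (stand-in for d)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dissimilarity_vector(x, prototypes, dist=euclidean):
    """Return F = [d(x, p_1), ..., d(x, p_k)] for a single pattern x."""
    return [dist(x, p) for p in prototypes]

# Hypothetical toy data: k = 3 prototypes in a 2-D vector space.
prototypes = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
F = dissimilarity_vector([3.0, 4.0], prototypes)
print(F)  # [5.0, 0.0, 5.0]
```

Each training image yields one such k-dimensional vector, and these vectors (rather than the raw images) are what the SVM sees.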
The two methods for generating the dissimilarity space are labeled FULLY and DEEPER. With FULLY, the distance between a pattern and a prototype is obtained directly by comparing the two images using the Siamese network. With DEEPER, each pattern is described using a deeper layer than the fully connected backbone network of the Siamese network. To reduce the high dimensionality of this deeper layer, the Discrete Cosine Transform (DCT) is applied separately to each channel of that layer (see the DCT Dimensionality Reduction subsection below). Finally, the distance between a pattern and a prototype is given by the cosine distance. In other words, the backbone of the Siamese network is used as the feature extractor.
For the sake of space, the layers used in DEEPER are reported in the MATLAB toolbox available at (accessed on 25 August 2021) (for the reader’s convenience, these layers are also reported in Appendix A of this paper). This step is not optimized. We have chosen the layer before the last ReLU or fully connected layer to prevent overfitting the results rather than selecting layers optimized for each data set. Optimal layers could have been discovered using a leave-one-out data set, but this procedure was not feasible given the computational power of our GPUs. In Figure 3 we report the scheme of DEEPER.

DCT Dimensionality Reduction

Because DEEPER uses a deeper layer compared to the fully connected backbone to generate the dissimilarity space, a method is needed to reduce dimensionality on each channel (with results combined) of the deeper layer. DCT [20] is the dimensionality transform selected here because (1) its components are typically small in magnitude (most information is located in the low-frequency coefficients), and (2) it balances information packing and computational complexity.
DCT can be expressed as
$$DCT_{image}(x, y) = \frac{1}{\sqrt{2N}}\, C(x)\, C(y) \sum_{p,q=1}^{N} Image(p, q) \cos\left[\frac{(2p+1)\,x\,\pi}{2N}\right] \cos\left[\frac{(2q+1)\,y\,\pi}{2N}\right],$$
$$C(u) = \begin{cases} \frac{1}{\sqrt{2}}, & u = 0 \\ 1, & u > 0 \end{cases}$$
where $N$ is the number of rows/columns of the image (the input of the CNN is a square matrix); $p$ and $q$ are the pixel indices of the input image; $x$ and $y$ are the indices of the DCT matrix.
Each channel is reduced to a dimension of 9 × 9. All the features extracted from each channel are concatenated into a single vector that represents a given pattern/prototype.
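A minimal pure-Python sketch of the DEEPER descriptor follows. It rests on several assumptions that are not stated in the text: the 9 × 9 reduction is taken to mean keeping the 9 × 9 lowest-frequency DCT coefficients, pixel indices are 0-based (the usual DCT-II convention), and the names `dct2_lowfreq`, `describe`, and `cosine_distance` are hypothetical. The toy channels are 4 × 4, so only a 4 × 4 block is kept here.

```python
import math

def dct2_lowfreq(channel, keep=9):
    """2-D DCT of a square N x N channel; returns the keep x keep block of
    lowest-frequency coefficients, flattened row by row."""
    n = len(channel)
    k = min(keep, n)
    scale = 1.0 / math.sqrt(2.0 * n)

    def c(u):
        return 1.0 / math.sqrt(2.0) if u == 0 else 1.0

    # cos_tab[x][p] = cos((2p + 1) * x * pi / (2N)), with 0-based p (assumption).
    cos_tab = [[math.cos((2 * p + 1) * x * math.pi / (2 * n)) for p in range(n)]
               for x in range(k)]
    coeffs = []
    for x in range(k):
        for y in range(k):
            total = sum(channel[p][q] * cos_tab[x][p] * cos_tab[y][q]
                        for p in range(n) for q in range(n))
            coeffs.append(scale * c(x) * c(y) * total)
    return coeffs

def describe(feature_map, keep=9):
    """Concatenate the reduced DCT coefficients of every channel."""
    descriptor = []
    for channel in feature_map:
        descriptor.extend(dct2_lowfreq(channel, keep))
    return descriptor

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy activations: 2 channels of size 4 x 4 each.
pattern = [[[1.0] * 4 for _ in range(4)],
           [[float(p + q) for q in range(4)] for p in range(4)]]
prototype = [[[2.0] * 4 for _ in range(4)],
             [[float(p * q) for q in range(4)] for p in range(4)]]
d = cosine_distance(describe(pattern), describe(prototype))
```

For a constant channel, all the energy ends up in the first coefficient, which is why truncating to the low-frequency block loses little information on smooth activations.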

2.2. Loss Functions

2.2.1. Binary Cross Entropy Loss (Cross)

In the training phase, every pair of images in the training set is fed into the backbone of the Siamese architecture to obtain a feature vector $F$. Calculated next is $Z = |F_1 - F_2|$, where $F_1$ and $F_2$ are the feature vectors of the two images in the pair. $Z$ is passed through a fully connected layer and a sigmoid function that returns the probability $Y$ that the two images belong to the same class. Cross is then used for the two-class problem.
In the testing phase, for every sample in the training set, we compute $F$. Then, we evaluate $N$ centroids using k-means clustering. Every image in the training set is expressed as the vector of the distances between its features and the centroids. After that, we train an SVM on those vectors. We then apply this inference algorithm to the images in the test set.
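The pair-scoring step of the training phase can be sketched as follows. This is an illustration only: the fully connected layer is reduced to a single weight vector and bias, and the names `same_class_probability` and `binary_cross_entropy`, as well as the toy weights, are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def same_class_probability(f1, f2, weights, bias):
    """Z = |F1 - F2| passed through one fully connected unit and a sigmoid."""
    z = [abs(a - b) for a, b in zip(f1, f2)]
    logit = sum(w * zi for w, zi in zip(weights, z)) + bias
    return sigmoid(logit)

def binary_cross_entropy(label, prob, eps=1e-12):
    """label is 1 for a same-class pair and 0 otherwise."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Identical feature vectors give Z = 0, so the logit equals the bias.
y = same_class_probability([0.2, 0.7], [0.2, 0.7], weights=[-1.5, -1.5], bias=0.0)
loss = binary_cross_entropy(1, y)
```

With a zero bias the untrained unit returns 0.5 for identical inputs, so the loss for a true same-class pair is ln 2; training moves the weights and bias so that small values of Z map to probabilities near 1.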

2.2.2. Triplet Loss (Triplet)

With Triplet, we take three images as the inputs, labelled A, P, and N. It is assumed that A and P have the same label and A and N have different labels.
In the training phase, for every triplet in the training set, feature vectors $F_A, F_P, F_N$ are computed and then passed through a sigmoid to obtain $Y_A, Y_P, Y_N$. At that point, the loss function is:
$$L = \max\left(\|Y_A - Y_P\|_2 - \|Y_A - Y_N\|_2,\; -\xi\right),$$
where $\xi$ is a positive number and $\|x\|_2$ is the Euclidean norm of the vector $x$. In other words, the loss function encourages the network to create similar representations for samples in the same class and different representations for samples in different classes. $\xi$ is the margin; the value used is 1 because it returned the best results in the fixed-margin tests that were carried out.
In the testing phase, the process is exactly the same as described for the testing phase of cross-entropy loss.
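A minimal sketch of a margin-based triplet loss on plain Python lists is given below. It assumes the common hinge form max(‖Y_A − Y_P‖₂ − ‖Y_A − Y_N‖₂ + ξ, 0) with ξ = 1, which can differ from other ways of writing the loss by a constant offset that does not affect the gradients. The function names and toy vectors are hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(y_a, y_p, y_n, margin=1.0):
    """Hinge-form triplet loss (assumed convention, see lead-in)."""
    return max(l2(y_a, y_p) - l2(y_a, y_n) + margin, 0.0)

# Positive coincides with the anchor and the negative is far away: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Positive is far and the negative coincides with the anchor: large penalty.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
print(easy, hard)  # 0.0 6.0
```

The loss vanishes once the negative is at least one margin farther from the anchor than the positive, so only triplets that violate the margin contribute to training.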

2.3. Adam Variants

Introduced in [20,21], the widely used optimization method Adam (referred to as Base Adam in the experimental section) takes advantage of adaptive gradient and momentum to compute adaptive learning rates for each parameter. It makes use of the gradient at the current step, the exponential moving average of the gradient (first order moment), and the exponential moving average of the square of the gradient (second order moment).
Thus, the first moment $m_t$ and the second moment $u_t$ are defined as:
$$m_t = \rho_1 m_{t-1} + (1 - \rho_1)\, g_t$$
$$u_t = \rho_2 u_{t-1} + (1 - \rho_2)\, g_t^2$$
where the hyperparameters $\rho_1$ and $\rho_2$ represent the exponential decay rates for the first and second moments (set to 0.9 and 0.99, respectively), $g_t$ is the gradient at time $t$, and the square on $g_t$ is calculated component-wise. The moments are initialized as $m_0 = u_0 = 0$.
To avoid small values of the moving averages due to being initialized to zero, Adam includes a bias-corrected version of the first and second order moments:
$$\hat{m}_t = \frac{m_t}{1 - \rho_1^t}$$
$$\hat{u}_t = \frac{u_t}{1 - \rho_2^t}$$
The parameter update is computed as follows:
$$\theta_t = \theta_{t-1} - \lambda \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \epsilon},$$
where $\lambda$ is the learning rate and $\epsilon$ is a very small positive number used to avoid any division by zero (usually set to $10^{-8}$). All operations are component-wise.
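The Adam update described above can be written as a short component-wise routine. This is a sketch on plain Python lists rather than the MATLAB implementation used in the experiments; the function name `adam_step` and the toy values are hypothetical, while the default hyperparameters follow the values quoted in this paper.

```python
import math

def adam_step(theta, grad, m, u, t, lr=1e-4, rho1=0.9, rho2=0.99, eps=1e-8):
    """One Adam update at iteration t (t starts at 1). Returns (theta, m, u)."""
    m = [rho1 * mi + (1 - rho1) * g for mi, g in zip(m, grad)]
    u = [rho2 * ui + (1 - rho2) * g * g for ui, g in zip(u, grad)]
    # Bias correction compensates for the zero initialization of m and u.
    m_hat = [mi / (1 - rho1 ** t) for mi in m]
    u_hat = [ui / (1 - rho2 ** t) for ui in u]
    theta = [th - lr * mh / (math.sqrt(uh) + eps)
             for th, mh, uh in zip(theta, m_hat, u_hat)]
    return theta, m, u

theta, m, u = adam_step([1.0, -1.0], grad=[0.5, -2.0], m=[0.0, 0.0], u=[0.0, 0.0], t=1)
```

On the first iteration the bias-corrected moments reduce to the gradient and its square, so every parameter moves by almost exactly one learning rate in the direction opposite to its gradient, regardless of the gradient's magnitude.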
As noted in [22], Adam performs reasonably well in practice compared to other adaptive learning methods; however, Adam does not utilize the change in immediate past gradient information, a utilization that is incorporated in [22,23].

2.3.1. DGrad

This variant, proposed in [23], makes use of the absolute difference between the current gradient $g_t$ and the moving average of the element-wise squares of the gradients:
$$\Delta ag_t = |g_t - avg_t|$$
where $avg_t$ is the moving average of the component-wise squares of the gradient.
The absolute difference $\Delta ag_t$ is then normalized by its maximum component as follows:
$$\widehat{\Delta ag}_t = \frac{\Delta ag_t}{\max(\Delta ag_t)}$$
Then, $\xi_t$ is defined as:
$$\xi_t = Sig(4 \cdot \widehat{\Delta ag}_t)$$
where $Sig(\cdot)$ is the sigmoid function:
$$Sig(x) = \frac{1}{1 + e^{-x}}$$
Each parameter of the network is finally updated following the equation:
$$\theta_{t+1} = \theta_t - \lambda \cdot \xi_t \frac{\hat{m}_t}{\sqrt{\hat{u}_t} + \epsilon}$$
where $\hat{m}_t$ and $\hat{u}_t$ are the bias-corrected first and second order moments seen in Adam.
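The DGrad scaling factor and update can be sketched as below. The moving average of the squared gradients is passed in as a plain argument (`avg_sq`) because how it is maintained is not restated here; the guard for an all-zero difference and the names `dgrad_factor` and `dgrad_update` are assumptions made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dgrad_factor(grad, avg_sq):
    """xi_t = Sig(4 * normalized |g_t - avg_t|), computed component-wise."""
    delta = [abs(g - a) for g, a in zip(grad, avg_sq)]
    peak = max(delta)
    # Guard against division by zero when every difference is 0 (assumption).
    normalized = [d / peak if peak > 0 else 0.0 for d in delta]
    return [sigmoid(4.0 * d) for d in normalized]

def dgrad_update(theta, m_hat, u_hat, xi, lr=1e-4, eps=1e-8):
    """Adam-style step with the per-parameter factor xi applied to the step."""
    return [th - lr * x * mh / (math.sqrt(uh) + eps)
            for th, x, mh, uh in zip(theta, xi, m_hat, u_hat)]

xi = dgrad_factor(grad=[1.0, 0.0], avg_sq=[0.0, 0.0])
theta = dgrad_update([0.0, 0.0], m_hat=[1.0, 1.0], u_hat=[1.0, 1.0], xi=xi)
```

Because the normalized difference lies in [0, 1], the factor ranges from 0.5 to about 0.98: parameters whose gradient departs most from the running average take nearly a full Adam step, while the others take roughly half a step.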

2.3.2. DecayDGrad (New)

This DGrad variant introduces a learning rate decay, both locally and in the whole training process. The local decay can be achieved with a periodic impulse, defined as follows:
$$imp_t = e^{-\left(2 \times \frac{mod(t,\, s)}{s}\right)^2}$$
where $s = 10$ is the period (number of iterations between each impulse).
The impulse $imp_t$ is then multiplied by a global decay factor $d_t$, shown in the equation:
$$d_t = e^{-2 \times \frac{(t - c \cdot n_{iter})^2}{n_{iter}^2}}$$
where $n_{iter}$ is the total number of iterations in the training process. The parameter $c = 0.25$, multiplied by $n_{iter}$, determines the iteration at which $d_t$ assumes its maximum value.
The parameter $\xi_t$ is therefore defined as:
$$\xi_t = Sig(4 \cdot \widehat{\Delta ag}_t) \cdot imp_t \cdot d_t.$$
Each parameter of the network is updated as shown in (12).
Notice that $imp_t$ only takes values in the range 0 to 1, and it reaches its maximum at iterations that are multiples of $s$. The purpose of these restraints is to attenuate the value calculated by DGrad locally, namely progressively in the span of $s$ iterations, to get a better evaluation of the local minimum, thereby avoiding an eventual overshoot of the global minimum.
The reason behind the learning rate decay factor $d_t$ is to keep the learning rate high in the initial part of the training, which accelerates training and avoids the memorization of noisy data while at the same time extending the decay in later iterations. In this way, DGrad can learn complex patterns, as shown in [24]. The plot of $d_t$ and $imp_t \cdot d_t$ is reported in Figure 4.
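The two decay terms can be evaluated with a few lines of Python. The sketch assumes the impulse is exp(−(2·mod(t, s)/s)²) and the global factor is exp(−2·(t − c·n_iter)²/n_iter²), which is how the definitions are read here; the function names are hypothetical, and the 3000 iterations in the example match the training setting reported in the experimental section.

```python
import math

def impulse(t, s=10):
    """Periodic local decay: equals 1 when t is a multiple of s, then shrinks."""
    return math.exp(-(2.0 * (t % s) / s) ** 2)

def global_decay(t, n_iter, c=0.25):
    """Global decay factor: peaks at t = c * n_iter with value 1."""
    return math.exp(-2.0 * (t - c * n_iter) ** 2 / n_iter ** 2)

def decay_factor(t, n_iter, s=10, c=0.25):
    """Product imp_t * d_t that multiplies the DGrad factor."""
    return impulse(t, s) * global_decay(t, n_iter, c)

n_iter = 3000
start = decay_factor(0, n_iter)      # exp(-0.125), about 0.88
peak = decay_factor(750, n_iter)     # 750 is a multiple of 10 and c * n_iter
end = decay_factor(2990, n_iter)     # late in training, much smaller
```

Within each period of 10 iterations the impulse falls from 1 to exp(−3.24), roughly 0.04, before resetting, while the global term slowly lowers the envelope after the first quarter of training.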

3. Data Sets

The following five image data sets, representing very different classification tasks, were selected to demonstrate the versatility of the proposed method:
  • BIRDz [25]: This balanced data set is a real-world benchmark for bird species vocalizations. The testing protocol is ten runs using the data split in [25]. The audio tracks were extracted from the Xeno-Canto Archive (accessed on 25 August 2021). BIRDz contains a total of 2762 acoustic samples from eleven North American bird species, along with 339 unclassified audio samples (consisting of noise and unknown bird vocalizations). The bird classes vary in size from 246 to 259. Each observation is represented by five spectrograms: (1) constant frequency, (2) frequency modulated whistles, (3) broadband pulses, (4) broadband with varying frequency components, and (5) strong harmonics.
  • CAT [26,27]: This data set has ten balanced classes of cat vocalizations, with each one containing ~300 samples for a total of 2962 samples taken from Kaggle, YouTube, and Flickr. The testing protocol is 10-fold cross-validation. The average duration of each sample is 4 s.
  • InfLar [28]: This data set contains eighteen Narrow-Band Imaging (NBI) endoscopic videos of eighteen different patients with laryngeal cancer. The videos were retrospectively analyzed and categorized into four classes (informative, blurred, containing saliva or specular reflections, and underexposed). The average video length is 39 s. The videos were acquired with an NBI endoscopic system (Olympus Visera Elite S190 video processor and an ENF-VH rhino-laryngo videoscope) with a frame rate of 25 fps and an image size of 1920 × 1072 pixels. A total of 720 video frames, 180 for each of the four classes, were extracted and labeled. The testing protocol is three-fold cross-validation with data separated at the patient level to ensure that the frames from the same class were classified based on the features characteristic of each class and not due to features linked to the individual patient (e.g., vocal fold anatomy).
  • RPE [29]: This is a medical image classification data set that intends to distinguish the maturation of human stem cell-derived retinal pigmented epithelium. RPE is based on 195 images that were divided into sixteen subwindows. These subwindows were then assigned to one of four classes: (1) Fusiform, (2) Epithelioid, (3) Cobblestone, and (4) Mixed. Subwindows that were out of focus or that contained background information exclusively were discarded. This division of images into four and the exclusion process produced a total of 1862 images.
  • Port [30]: This data set contains 927 paintings from six different art movements: (1) High Renaissance, (2) Impressionism, (3) Northern Renaissance, (4) Post-Impressionism, (5) Rococo, and (6) Ukiyo-e. Ten-fold cross-validation is the testing protocol.
The same testing protocol presented in the papers introducing each data set is used in the experimental section, with accuracy being the performance indicator.

4. Experimental Results

The default settings in the MATLAB framework for Siamese networks were used to train the SNNs in all experiments to ensure no overfitting for any given data set. For Adam optimization and its variants, the number of iterations was set to 3000 with no stop criterion, the gradient decay factor to 0.9, the squared gradient decay factor to 0.99, and the learning rate to 0.0001.
The first run of experiments is reported in Table 1. In these tests, we used all the data sets. Each performance cell in Table 1 contains three rows of values for each data set:
  • Top: The performance obtained using the method named FULLY for SVM input;
  • Middle: The performance obtained using the method named DEEPER for SVM input;
  • Bottom: The fusion by average rule of the SVMs in rows 1 and 2 (FULLY and DEEPER).
The last row in Table 1 reports average performance of each approach of that column.
The clustering method is k-means for all methods, and the number of prototypes is in the set (15, 30, 45, 60). Thus, four networks are trained using the four numbers of prototypes in the set; the four SVMs trained in this way are combined by average rule.
For the sake of computation time, we used a single network topology in this test, which is the first topology tested in [14] and the Siamese topology recommended by MathWorks (see Appendix A).
The columns of Table 1 report the following approaches:
  • Cross: Binary Cross Entropy loss function coupled with base Adam (this is the best approach proposed in [14]);
  • CrossDD: Binary Cross Entropy loss function coupled with our new Adam variant DecayDGrad;
  • Triplet: Triplet loss function coupled with base Adam.
  • X + Y (columns 5 and 6): the fusion between X and Y.
From the results reported in Table 1, the following conclusions can be drawn:
  • Triplet produces a result that is similar to Cross on three data sets but performs better than Cross on InfLar and worse on CAT;
  • The fusion between Cross and Triplet boosts the performance of the base loss functions, except in the case of CAT;
  • The fusion among all the different approaches (see bottom cells in the column Triplet+Cross and Triplet+Cross+CrossDD) produces the best average performance.
Table 2 reports results using combinations of the two loss functions on all eight topologies. Because running experiments on all five data sets was computationally too expensive, we chose to run them only on InfLar and Port because they are very different application problems.
In each cell of Table 2, the following four results are reported:
  • Top: Cross function coupled with FULLY for SVM input (the best approach proposed in [14]);
  • Upper: Triplet loss function coupled with FULLY for SVM input;
  • Lower: Fusion by average rule among Cross coupled with FULLY, Cross coupled with DEEPER, Triplet coupled with FULLY, and Triplet coupled with DEEPER;
  • Bottom: This is the fusion by average rule of SVMs 1 and 2 described for the method reported at the bottom of Table 1 but with the addition of CrossDD coupled with both FULLY and DEEPER.
The last row of Table 2 reports the fusions of #4 above for the numbered topologies.
In [13], we showed that combining more than four networks using the same topology (but varying the clustering algorithm) failed to improve performance. Examining Table 2, we discovered that changing the loss function and the method for building the dissimilarity space is beneficial when making an ensemble. We also observed that for all topologies except #6 in the Portrait data set (Port), the best performance is not obtained by contrastive loss coupled with FULLY (as was the case in [14]); instead, on average, the new method DEEPER succeeds in boosting performance. Finally, we learned that adding CrossDD, our new Adam variant, to the ensemble for InfLar generally does not increase performance; CrossDD works very well with the first topology but performs worse with the other topologies. On Port, however, the addition of CrossDD generally does improve performance.
In Table 3, we compare our best results on InfLar and Port with the best ensembles reported in [12,13,14] that tested ensembles of SNNs and CNN subnets using all eight topologies. In addition, the performance of four well-known CNNs is reported for baseline comparison, along with their fusion (eCNN) by average rule. The fine-tuning of the CNNs pretrained on ImageNet was performed with the following training options: batch size: 30; max epoch: 20 (for all the networks with no freezing). The row “Fusion x-y + eCNN” is the sum rule between Fusion x-y (see Table 2) and eCNN. Before the fusion, the score of Fusion x-y and eCNN are normalized to mean 0 and standard deviation 1.
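The normalization and sum-rule fusion described above can be sketched as follows. The sketch assumes that each classifier's scores are standardized over all of its outputs (the paper does not say whether the statistics are pooled or computed per sample) and uses hypothetical score matrices with one row per test image and one column per class.

```python
import statistics

def zscore(scores):
    """Standardize a score matrix to mean 0 and standard deviation 1."""
    flat = [v for row in scores for v in row]
    mu = statistics.mean(flat)
    sd = statistics.pstdev(flat)
    return [[(v - mu) / sd for v in row] for row in scores]

def sum_rule(*score_sets):
    """Element-wise sum of several score matrices of identical shape."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*score_sets)]

def predict(scores):
    """Index of the highest fused score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Hypothetical scores on very different scales (2 samples, 2 classes).
siamese_scores = [[0.9, 0.1], [0.2, 0.8]]
cnn_scores = [[2.0, 1.0], [0.5, 3.0]]
fused = sum_rule(zscore(siamese_scores), zscore(cnn_scores))
labels = predict(fused)
print(labels)  # [0, 1]
```

Standardizing first keeps a classifier with a larger numeric score range from dominating the sum.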
As can be observed in Table 3, the proposed ensembles outperform previous methods based on Siamese networks and boost the performance of the ensemble of CNNs. On the data set InfLar, the performance of the best standalone topology (see Table 2) is 92.78, which is comparable with the performance obtained by a CNN; however, on the Port data set, where our new Adam variant increased performance, the performance gap between the CNNs and Siamese networks is still significant. The approach proposed in this work also greatly improves previous Siamese methods applied to this data set.
Finally, in Table 4 we report the training time (in seconds) of the Siamese networks on the InfLar data set for the different topologies. The training time was measured on a GTX 1080 GPU. Both loss functions used here are considered in Table 4.

5. Conclusions

This paper proposes an image classification system that, like several recent studies, generates dissimilarity spaces from which features are extracted and trained on a set of SVMs. The objective of this study was to produce a high performing ensemble of Siamese networks based on combining different topologies, loss functions, and optimization methods (with one new Adam variant proposed here) from which features could be extracted for training the SVMs.
Results on five cross-domain image data sets demonstrate the superior power of the proposed approach compared with previous works using ensembles of Siamese networks. Comparison with the state-of-the-art confirms that the fusion of the different topologies, loss functions, and optimization methods is a feasible way of generating a robust and highly generalizable image classification system.
In the future, we intend to validate our approach on additional cross-domain image data sets and investigate more techniques for building an ensemble of Siamese networks.

Author Contributions

Conceptualization, A.L. and L.N.; methodology, L.N.; software, G.M., D.S. and L.N.; validation, L.N.; formal analysis, A.L.; resources—S.B.; writing—original draft preparation, A.L., S.B. and L.N.; and writing—review and editing A.L., S.B. and L.N. All authors have read and agreed to the published version of the manuscript.


This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data sets are publicly available and the source code is located at (accessed on 24 August 2021).


The authors are grateful to NVIDIA Corporation for supporting this research with the donation of a Titan Xp GPU.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

For ease of reference, a description of the eight Siamese networks in [14] is reprinted below, listing the layers of CNN Siamese Networks 1–8. The bold layers are those used for feeding DCT.
Siamese Network 1
| Layers | Activations | Learnable | Filter Size | Num. of Filters |
|---|---|---|---|---|
| Input Layer | 224 × 224 | | | |
| 2D Convolution | 215 × 215 × 64 | 6464 | 10 × 10 | 64 |
| ReLU | 215 × 215 × 64 | 0 | | |
| Max Pooling | 107 × 107 × 64 | 0 | 2 × 2 | |
| 2D Convolution | 101 × 101 × 128 | 401,536 | 7 × 7 | 128 |
| ReLU | 101 × 101 × 128 | 0 | | |
| Max Pooling | 50 × 50 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 47 × 47 × 128 | 262,272 | 4 × 4 | 128 |
| ReLU | 47 × 47 × 128 | 0 | | |
| Max Pooling | 23 × 23 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 19 × 19 × 64 | 204,864 | 5 × 5 | 64 |
| ReLU | 19 × 19 × 64 | 0 | | |
| Fully Connected | 4096 | 94,638,080 | | |
Siamese Network 2
| Layers | Activations | Learnable | Filter Size | Num. of Filters |
|---|---|---|---|---|
| Input Layer | 224 × 224 | 0 | | |
| 2D Convolution | 220 × 220 × 64 | 1664 | 5 × 5 | 64 |
| LeakyReLU | 220 × 220 × 64 | 0 | | |
| 2D Convolution | 216 × 216 × 64 | 102,464 | 5 × 5 | 64 |
| LeakyReLU | 216 × 216 × 64 | 0 | | |
| Max Pooling | 108 × 108 × 64 | 0 | 2 × 2 | |
| 2D Convolution | 106 × 106 × 128 | 73,856 | 3 × 3 | 128 |
| LeakyReLU | 106 × 106 × 128 | 0 | | |
| 2D Convolution | 104 × 104 × 128 | 147,584 | 3 × 3 | 128 |
| LeakyReLU | 104 × 104 × 128 | 0 | | |
| Max Pooling | 52 × 52 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 49 × 49 × 128 | 262,272 | 4 × 4 | 128 |
| LeakyReLU | 49 × 49 × 128 | 0 | | |
| Max Pooling | 24 × 24 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 20 × 20 × 64 | 204,864 | 5 × 5 | 64 |
| LeakyReLU | 20 × 20 × 64 | 0 | 5 × 5 | |
| Fully Connected | 2048 | 52,430,848 | | |
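Siamese Network 3
| Layers | Activations | Learnable | Filter Size | Num. of Filters |
|---|---|---|---|---|
| Input Layer | 224 × 224 | | | |
| 2D Convolution | 55 × 55 × 128 | 6400 | 7 × 7 | 128 |
| Max Pooling | 27 × 27 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 23 × 23 × 256 | 819,456 | 5 × 5 | 256 |
| ReLU | 23 × 23 × 256 | 0 | | |
| 2D Convolution | 19 × 19 × 128 | 819,328 | 5 × 5 | 128 |
| Max Pooling | 9 × 9 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 7 × 7 × 64 | 73,792 | 3 × 3 | 64 |
| ReLU | 7 × 7 × 64 | 0 | | |
| Max Pooling | 3 × 3 × 64 | 0 | 2 × 2 | |
| Fully Connected | 4096 | 2,363,392 | | |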
Siamese Network 4
| Layers | Activations | Learnable | Filter Size | Num. of Filters |
|---|---|---|---|---|
| Input Layer | 224 × 224 | | | |
| 2D Convolution | 218 × 218 × 128 | 6400 | 7 × 7 | 128 |
| Max Pooling | 54 × 54 × 128 | 0 | 4 × 4 | |
| ReLU | 54 × 54 × 128 | 0 | | |
| 2D Convolution | 50 × 50 × 256 | 819,456 | 5 × 5 | 256 |
| ReLU | 50 × 50 × 256 | 0 | | |
| 2D Convolution | 48 × 48 × 64 | 147,520 | 3 × 3 | 64 |
| Max Pooling | 24 × 24 × 64 | 0 | 2 × 2 | |
| 2D Convolution | 22 × 22 × 128 | 73,856 | 3 × 3 | 128 |
| ReLU | 22 × 22 × 128 | 0 | | |
| 2D Convolution | 18 × 18 × 64 | 204,864 | 5 × 5 | 64 |
| Fully Connected | 4096 | 84,938,752 | | |
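Siamese Network 5
| Layers | Activations | Learnable | Filter Size | Num. of Filters |
|---|---|---|---|---|
| Input Layer | 224 × 224 | | | |
| 2D Convolution | 215 × 215 × 64 | 6464 | 10 × 10 | 64 |
| Max Pooling | 107 × 107 × 64 | 0 | 2 × 2 | |
| ReLU | 107 × 107 × 64 | 0 | | |
| 2D Convolution | 26 × 26 × 128 | 401,536 | 7 × 7 | 128 |
| ReLU | 26 × 26 × 128 | 0 | | |
| 2D Convolution | 9 × 9 × 128 | 409,728 | 5 × 5 | 128 |
| ReLU | 9 × 9 × 128 | 0 | | |
| 2D Convolution | 6 × 6 × 64 | 131,136 | 4 × 4 | 64 |
| ReLU | 6 × 6 × 64 | 0 | | |
| Fully Connected | 4096 | 9,441,280 | | |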
Siamese Network 6
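| Layers | Activations | Learnable | Filter Size | Num. of Filters |
| --- | --- | --- | --- | --- |
| Input Layer | 224 × 224 | | | |
| 2D Convolution | 218 × 218 × 64 | 3200 | 7 × 7 | 64 |
| Max Pooling | 109 × 109 × 64 | 0 | 2 × 2 | |
| ReLU | 109 × 109 × 64 | 0 | | |
| 2D Convolution | 107 × 107 × 128 | 73,856 | 3 × 3 | 128 |
| Max Pooling | 53 × 53 × 128 | 0 | 2 × 2 | |
| ReLU | 53 × 53 × 128 | 0 | | |
| 2D Convolution | 53 × 53 × 64 | 8256 | 1 × 1 | 64 |
| ReLU | 53 × 53 × 64 | 0 | | |
| 2D Convolution | 51 × 51 × 128 | 73,856 | 3 × 3 | 128 |
| ReLU | 51 × 51 × 128 | 0 | | |
| Max Pooling | 25 × 25 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 25 × 25 × 128 | 16,512 | 1 × 1 | 128 |
| ReLU | 25 × 25 × 128 | 0 | | |
| 2D Convolution | 22 × 22 × 64 | 131,136 | 4 × 4 | 64 |
| Max Pooling | 11 × 11 × 64 | 0 | 2 × 2 | |
| ReLU | 11 × 11 × 64 | 0 | | |
| Fully Connected | 4096 | 31,723,520 | | |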
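The Activations column for Siamese Network 6 is consistent with unpadded, stride-1 convolutions and 2 × 2 max pooling with stride 2. These stride and padding values are inferred from the listed sizes and are not given in the table, so the following is only a sketch that traces the spatial size through the layers under that assumption (ReLU layers are omitted because they leave the size unchanged).

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, window: int = 2, stride: int = 2) -> int:
    """Spatial output size of a square max-pooling layer."""
    return (size - window) // stride + 1


# Siamese Network 6 as (layer kind, kernel or window size).
# Stride 1 / no padding for convolutions and stride 2 for pooling are assumptions.
layers = [
    ("conv", 7), ("pool", 2),
    ("conv", 3), ("pool", 2),
    ("conv", 1), ("conv", 3), ("pool", 2),
    ("conv", 1), ("conv", 4), ("pool", 2),
]

size = 224
trace = []
for kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    trace.append(size)

print(trace)  # [218, 109, 107, 53, 53, 51, 25, 25, 22, 11]

# Flattened 11 x 11 x 64 activations feeding the 4096-unit fully connected layer.
fc_learnable = size * size * 64 * 4096 + 4096
print(fc_learnable)  # 31723520
```

The traced sizes and the fully connected parameter count match the values listed for Siamese Network 6.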
Siamese Network 7
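| Layers | Activations | Learnable | Filter Size | Num. of Filters |
| --- | --- | --- | --- | --- |
| Input Layer | 224 × 224 | | | |
| Dropout Layer | 224 × 224 | 0 | | |
| 2D Convolution | 218 × 218 × 64 | 3200 | 7 × 7 | 64 |
| Max Pooling | 109 × 109 × 64 | 0 | 2 × 2 | |
| 2D Convolution | 105 × 105 × 128 | 204,928 | 5 × 5 | 128 |
| Max Pooling | 52 × 52 × 128 | 0 | 2 × 2 | |
| 2D Convolution | 48 × 48 × 64 | 204,864 | 5 × 5 | 64 |
| Max Pooling | 24 × 24 × 64 | 0 | 2 × 2 | |
| 2D Convolution | 22 × 22 × 256 | 147,712 | 3 × 3 | 256 |
| Max Pooling | 11 × 11 × 256 | 0 | 2 × 2 | |
| Fully Connected | 4096 | 16,781,312 | | |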
Siamese Network 8
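| Layers | Activations | Learnable | Filter Size | Num. of Filters |
| --- | --- | --- | --- | --- |
| Input Layer | 224 × 224 | | | |
| 2D Convolution | 215 × 215 × 32 | 3232 | 10 × 10 | 32 |
| Max Pooling | 107 × 107 × 32 | 0 | 2 × 2 | |
| ReLU | 107 × 107 × 32 | 0 | | |
| 2D Grouped Convolution | 101 × 101 × 64 | 50,240 | 7 × 7 | 64 |
| 2D Convolution | 97 × 97 × 128 | 204,928 | 5 × 5 | 128 |
| Max Pooling | 48 × 48 × 128 | 0 | 2 × 2 | |
| ReLU | 48 × 48 × 128 | 0 | | |
| 2D Grouped Convolution | 46 × 46 × 256 | 147,712 | 3 × 3 | 256 |
| Fully Connected | 4096 | 2,218,790,912 | | |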
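The table for Siamese Network 8 does not state how many groups the two grouped convolutions use. Assuming the standard grouped-convolution count, in which each filter sees only the input channels of its own group, the group count can be back-computed from the Learnable column. The sketch below is an inference from the listed numbers rather than a setting reported by the authors, and the helper names are illustrative.

```python
def grouped_conv_params(kernel: int, in_channels: int, filters: int, groups: int) -> int:
    """Weights plus biases when each filter sees in_channels / groups input channels."""
    if in_channels % groups or filters % groups:
        raise ValueError("channels and filters must be divisible by groups")
    return kernel * kernel * (in_channels // groups) * filters + filters


def infer_groups(kernel: int, in_channels: int, filters: int, learnable: int) -> list:
    """Return every group count that reproduces the given learnable-parameter count."""
    return [
        g for g in range(1, in_channels + 1)
        if in_channels % g == 0 and filters % g == 0
        and grouped_conv_params(kernel, in_channels, filters, g) == learnable
    ]


# First grouped convolution: 7 x 7, 32 input channels, 64 filters, 50,240 learnables.
first = infer_groups(7, 32, 64, 50240)
# Second grouped convolution: 3 x 3, 128 input channels, 256 filters, 147,712 learnables.
second = infer_groups(3, 128, 256, 147712)

print(first, second)  # [2] [2]
```

Under this counting rule, both grouped layers are consistent with two groups.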


  1. Pękalska, E.; Duin, R.P. The Dissimilarity Representation for Pattern Recognition—Foundations and Applications; World Scientific: Singapore, 2005.
  2. Cha, S.; Srihari, S. Writer Identification: Statistical Analysis and Dichotomizer. In Proceedings of the SSPR/SPR, Alicante, Spain, 1 September 2000.
  3. Oliveira, L.; Justino, E.; Sabourin, R. Off-line Signature Verification Using Writer-Independent Approach. In Proceedings of the 2007 International Joint Conference on Neural Networks, Orlando, FL, USA, 29 October 2007; pp. 2539–2544.
  4. Hanusiak, R.K.; Oliveira, L.; Justino, E.; Sabourin, R. Writer verification using texture-based features. Int. J. Doc. Anal. Recognit. 2011, 15, 213–226.
  5. Zottesso, R.H.D.; Costa, Y.M.G.; Bertolini, D.; Oliveira, L.E.S. Bird species identification using spectrogram and dissimilarity approach. Ecol. Inform. 2018, 48, 187–197.
  6. Souza, V.L.F.; Oliveira, A.; Sabourin, R. A Writer-Independent Approach for Offline Signature Verification using Deep Convolutional Neural Networks Features. In Proceedings of the 2018 7th Brazilian Conference on Intelligent Systems, São Paulo, Brazil, 22–25 October 2018; pp. 212–217.
  7. Pękalska, E.; Duin, R.P. Dissimilarity representations allow for building good classifiers. Pattern Recognit. Lett. 2002, 23, 943–956.
  8. Nguyen, G.; Worring, M.; Smeulders, A. Similarity learning via dissimilarity space in CBIR. In Proceedings of the MIR’06, Santa Barbara, CA, USA, 26–27 October 2006.
  9. Theodorakopoulos, I.; Kastaniotis, D.; Economou, G.; Fotopoulos, S. HEp-2 cells classification via sparse representation of textural features fused into dissimilarity space. Pattern Recognit. 2014, 47, 2367–2378.
  10. Hernández-Durán, M.; Calaña, Y.P.; Vazquez, H.M. Low-Resolution Face Recognition with Deep Convolutional Features in the Dissimilarity Space. In Proceedings of the IWAIPR, Chiang Mai, Thailand, 7–10 January 2018.
  11. Mekhazni, D.; Bhuiyan, A.; Ekladious, G.; Granger, É. Unsupervised Domain Adaptation in the Dissimilarity Space for Person Re-identification. In Proceedings of the ECCV, Glasgow, Scotland, 23 August 2020.
  12. Nanni, L.; Rigo, A.; Lumini, A.; Brahnam, S. Spectrogram classification using dissimilarity space. Sensors 2020, 10, 4176.
  13. Nanni, L.; Brahnam, S.; Lumini, A.; Maguolo, G. Animal sound classification using dissimilarity spaces. Appl. Sci. 2020, 10, 8578.
  14. Nanni, L.; Minchio, G.; Brahnam, S.; Maguolo, G.; Lumini, A. Experiments of image classification using dissimilarity spaces built with siamese networks. Sensors 2021, 21, 1573.
  15. Costa, Y.M.G.; Bertolini, D.; Britto, A.S.; Cavalcanti, G.D.C.; Oliveira, L. The dissimilarity approach: A review. Artif. Intell. Rev. 2019, 53, 2783–2808.
  16. Chicco, D. Siamese neural networks: An overview. In Artificial Neural Networks. Methods in Molecular Biology; Cartwright, H., Ed.; Springer Protocols: New York, NY, USA, 2020; pp. 73–94.
  17. Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.; LeCun, Y.; Moore, C.; Säckinger, E.; Shah, R. Signature Verification Using A “Siamese” Time Delay Neural Network. Int. J. Pattern Recognit. Artif. Intell. 1993, 7, 669–688.
  18. Agrawal, A. Dissimilarity learning via Siamese network predicts brain imaging data. arXiv 2019. Available online: (accessed on 25 August 2021).
  19. San Biagio, M.; Crocco, M.; Cristani, M.; Martelli, S.; Murino, V. Heterogeneous auto-similarities of characteristics (hasc): Exploiting relational information for classification. In Proceedings of the IEEE Computer Vision (ICCV13), Sydney, Australia, 1–8 December 2013; pp. 809–816.
  20. Feig, E.; Winograd, S. Fast algorithms for the discrete cosine transform. IEEE Trans. Signal. Process. 1992, 49, 2174–2193.
  21. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. CoRR 2015, 1412, 6980.
  22. Dubey, S.; Chakraborty, S.; Roy, S.K.; Mukherjee, S.; Singh, S.K.; Chaudhuri, B. diffGrad: An Optimization Method for Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4500–4511.
  23. Nanni, L.; Maguolo, G.; Lumini, A. Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks. arXiv. 2021. Available online: (accessed on 25 August 2021).
  24. You, K.; Long, M.; Jordan, M.I. How Does Learning Rate Decay Help Modern Neural Networks. arXiv. 2019. Available online: (accessed on 25 August 2021).
  25. Zhang, S.-H.; Zhao, Z.; Xu, Z.; Bellisario, K.; Pijanowski, B.C. Automatic Bird Vocalization Identification Based on Fusion of Spectral Pattern and Texture Features. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal, Calgary, AB, Canada, 15–20 April 2018; pp. 271–275.
  26. Pandeya, Y.R.; Kim, D.; Lee, J. Domestic cat sound classification using learned features from deep neural nets. Appl. Sci. 2018, 8, 1949.
  27. Pandeya, Y.R.; Lee, J. Domestic Cat Sound Classification Using Transfer Learning. Int. J. Fuzzy Logic. Intell. Syst. 2018, 18, 154–160.
  28. Moccia, S.; Vanone, G.O.; Momi, E.D.; Laborai, A.; Guastini, L.; Peretti, G.; Mattos, L.S. Learning-based classification of informative laryngoscopic frames. Comput. Methods Programs Biomed. 2018, 158, 21–30.
  29. Nanni, L.; Paci, M.P.; Santos, F.L.C.d.; Skottman, H.; Juuti-Uusitalo, K.; Hyttinen, J. Texture descriptors ensembles enable image-based classification of maturation of human stem cell-derived retinal pigmented epithelium. PLoS ONE 2016, 11, e0149399.
  30. Liu, S.; Yang, J.; Agaian, S.S.; Yuan, C. Novel features for art movement classification of portrait paintings. Image Vision Comput. 2021, 108, 104121.
Figure 1. Schematic of the basic dissimilarity architecture using one SNN with the output fed into one SVM.
Figure 2. Schematic of SNN.
Figure 3. Schematic of DEEPER.
Figure 4. Plot of $d_t$ and $imp_t \cdot d_t$.
Table 1. Performance of the two tested loss functions (boldface represents the best performance).
| Cross | CrossDD | Triplet | Triplet + Cross | Triplet + Cross + CrossDD |
Table 2. Performance obtained by varying the network topology (topologies are described in [14] and reprinted in Appendix A; boldface represents the best performance).
| Topology 1 | 86.94 | 70.99 |
| Topology 2 | 85.56 | 68.73 |
| Topology 3 | 79.44 | 60.23 |
| Topology 4 | 87.50 | 69.69 |
| Topology 5 | 84.03 | 60.00 |
| Topology 6 | 87.64 | 73.48 |
| Topology 7 | 79.44 | 66.03 |
| Topology 8 | 86.39 | 65.58 |
| Fusion 1–4 | 92.78 | 75.09 |
| Fusion 1–6 | 91.53 | 74.45 |
| Fusion 1–8 | 91.81 | 74.98 |
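The "Fusion" rows combine the classifiers built on several topologies. Assuming they use the sum rule mentioned in the abstract, the fusion step can be sketched as follows: the per-class scores of each classifier are added and the class with the largest total is chosen. The score values below are hypothetical and serve only to illustrate the rule; they are not taken from the experiments.

```python
from typing import Sequence


def sum_rule(score_sets: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the largest summed score."""
    n_classes = len(score_sets[0])
    if any(len(scores) != n_classes for scores in score_sets):
        raise ValueError("every classifier must score the same number of classes")
    totals = [sum(scores[c] for scores in score_sets) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


# Hypothetical per-class scores from four classifiers (one per topology)
# for a single test pattern with three candidate classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
]

predicted = sum_rule(scores)
print(predicted)  # 1
```

In this toy case two classifiers individually prefer class 0 and two prefer class 1, and the summed scores settle the tie in favor of class 1.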
Table 3. Accuracy obtained with different standard CNNs and other Siamese approaches (xxx * indicates that the method did not converge; boldface represents the best performance).
| [12] | 74.86 | xxx * |
| Fusion 1–4 | 92.78 | 75.09 |
| Fusion 1–6 | 91.53 | 74.45 |
| Fusion 1–8 | 91.81 | 74.98 |
| Fusion 1–4 + eCNN | 94.44 | 86.84 |
| Fusion 1–6 + eCNN | 94.44 | 86.84 |
| Fusion 1–8 + eCNN | 94.31 | 86.84 |
Table 4. Computation time for training a single Siamese network. Each column reports the computation time of a given network topology, numbered 1–8 (topologies are described in [14] and reprinted in Appendix A).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nanni, L.; Minchio, G.; Brahnam, S.; Sarraggiotto, D.; Lumini, A. Closing the Performance Gap between Siamese Networks for Dissimilarity Image Classification and Convolutional Neural Networks. Sensors 2021, 21, 5809.
