Hybrid-Margin Softmax for the Detection of Trademark Image Similarity

Abstract: The detection of image similarity is critical to trademark (TM) legal registration and to court judgments on infringement cases. When deep learning is introduced into the task, however, two challenges arise: the annotation of similar pairs and model generalization on rapidly growing data. Metric learning is naturally suited to the task because it learns the similarity of inputs rather than a classification, but current methods are not targeted at the task and need to be adapted. To address these issues, loss-driven model training is introduced, and a hybrid-margin softmax (HMS) is proposed based on the peculiarity of TM images. Two additive penalty margins are attached to the softmax to expand the decision boundary and to develop greater tolerance for slight differences between similar TM images. With the HMS, a Siamese neural network (SNN) used as the feature extractor is further penalized and its discrimination ability is improved. Experiments demonstrate that the detection model trained on HMS makes full use of a small amount of training data and retains strong discrimination ability on a larger amount of test data. Meanwhile, the model reaches high performance with a shallower SNN. Further experiments indicate that the HMS-driven model trained entirely on TM data generalizes well to the face recognition (FR) task, which involves another type of image data.


Introduction
Trademarks (TMs) are distinctive designations registered to identify products and sources. The exclusivity of a TM provides rules for orderly marketing [1]. However, the high incidence of TM misappropriation causes substantial revenue and reputation loss to legitimate owners. Consumers can be misled into purchasing counterfeit products, especially when the infringing TM image is similar to a legal one [2]. Meanwhile, the rapidly growing TM image database is itself massive, which puts great pressure on the governing body.
What further complicates the situation is that there are no defined criteria for conducting the 'likelihood of confusion' test [3]. The test is a critical part of the procedure for determining whether a disputed trademark is similar to another one. Thus, courts of different levels or districts may declare inconsistent judgments.
Generally, appearance, characters, and sound are taken into consideration during the test [4]. The repetition rate of characters is convenient to assess, and a TM can be pronounced differently among regions. Appearance, by contrast, is the most common and important form of a TM; it is more consistent across regions and more controversial in judgment.
The feature extraction of TM images is crucial to the above issues. Conventional feature engineering involves manually designed descriptors to detect and match features, e.g., SIFT [5] and ORB [6]. SIFT is a local invariant feature descriptor based on keypoints and local image gradient directions. ORB is a fast binary descriptor based on the FAST keypoint detector and the binary BRIEF descriptor. These extraction methods focus on specific image features such as points and edges [7], which makes it expensive to detect TM image similarity comprehensively with several manual descriptors. Deep learning has improved greatly: features that human beings might overlook can be extracted efficiently by convolutional kernels, which makes introducing computer 'opinions' into the procedure of human judgment on TM image similarity a convincing prospect.
Building training data is a great challenge when deep learning is introduced into the detection of TM image similarity (TMISD). The performance of the detection model is highly correlated with the supervised information, while the annotation of similar TM image pairs takes intensive work by skilled labor. Furthermore, model generalization on millions of new TM designs imposes a higher requirement on training data preparation, since a wide range of TM images has to be covered. Metric learning is naturally suitable for solving the problem of limited training samples [8]. A metric function of similarity can be learned to detect whether inputs are similar, instead of classifying the input samples.
Siamese neural networks (SNNs) are widely used in metric learning to extract features from pairs of input images [9,10], and metric functions, e.g., Euclidean distance, Manhattan distance, and cosine similarity, are used to compare the embedded feature vectors [11,12]. Usually, contrastive loss is used in an SNN to minimize the distance between feature vectors of samples in the same class and to maximize the distance between samples in different classes. A hyper-parameter in the contrastive loss controls the distance threshold. A data-driven triplet network was proposed on the basis of an SNN with an additional CNN branch [13]. The triplet loss function is used to decrease the feature vector distance between the anchor sample and the positive sample and, at the same time, to increase the distance between the anchor sample and the negative sample. The discrimination ability of the triplet network is improved, but the training cost increases greatly because of the combinatorial explosion in building triplets, i.e., the input data of the triplet network. In this way, the more elaborate network architecture transfers the pressure on training data quantity to the cost of mining the existing annotated data.
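The two data-driven losses described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the metric function and uses one common form of each loss; the function names and the default margin values are hypothetical and are not taken from the cited works.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs apart
    until they are at least `margin` away (assumed default threshold)."""
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor-negative distance to exceed the anchor-positive
    distance by at least `margin` (assumed default)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

The `margin` argument of `contrastive_loss` plays the role of the distance-threshold hyper-parameter mentioned above. The triplet loss needs three inputs per training step, which is where the combinatorial cost of building triplets comes from.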
Another research idea in metric learning is loss-driven training, as in recent works on face recognition (FR) [14,15]. Instead of building a large-scale dataset for training, these metric learning methods transform the softmax function to impose a margin penalty on the decision boundary, aiming to develop the discrimination ability of SNNs. The typical SphereFace [16], CosFace [17,18], and ArcFace [19] are all designed to compress the intraclass space and enlarge the interclass space. This approach is motivated partly by the fact that the performance of a data-driven model relies excessively on the quality of the information contained in the training data, and manual annotation is a major expenditure of human effort. Furthermore, an SNN trained on closed-set data generalizes poorly to open-set data [19].
More specifically, for the TMISD task, Setchi proposed a TM similarity analysis system to conduct the 'likelihood of confusion' test with three models [20]. Global and local shape feature descriptors, i.e., the Zernike moment and an edge gradient co-occurrence matrix, are used to extract TM image features, and Euclidean distance is used to compute similarity. On this basis, Trappey introduced SNNs into the feature extraction of TM images, using VGG16 to build the SNN [21]. Alshowaish used pre-trained CNNs, including VGG16 and ResNet50, to build an SNN [22]. Most of these works focused on data-driven metric learning methods. However, the training database faces the great challenge of annotation and of covering the rapid growth in new TM designs.
We choose to study the TMISD task from the perspective of a loss-driven metric learning method. The starting point is the softmax loss, the loss function most frequently used in classification. For a single sample, it is expressed as follows:

$$L = -\log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{N} e^{W_j^{T} x_i + b_j}}$$

where $N$ is the number of classes, $W$ and $b$ are the weight and bias terms, $x_i$ is the embedded feature vector belonging to the $y_i$-th class, and $W_j$ is the $j$-th column of the weight $W$.
Then, by fixing the weight $\|W_j\| = 1$ and the feature $\|x_i\| = 1$ through $\ell_2$ normalization, and by fixing the bias $b_j = 0$, the decision boundary is transferred to the angular space.
The transformed softmax function is as follows:

$$L = -\log \frac{e^{\cos\theta_{y_i}}}{\sum_{j=1}^{N} e^{\cos\theta_j}}$$

where $\theta_j$ is the included angle between the normalized weight $W_j$ and the normalized feature vector $x_i$. The prediction then depends only on the angle, and the decision boundary can be optimized by margin penalties.
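The effect of the normalization can be checked numerically. The following is a minimal sketch, assuming plain Python lists as vectors; the helper names are hypothetical. After $\ell_2$ normalization with zero bias, every logit equals the cosine of the angle between a class weight and the feature, so rescaling the feature leaves the loss unchanged.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax_loss(logits, target):
    """Negative log of the softmax probability assigned to class `target`."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def angular_logits(weights, x):
    """Logits with unit-norm class weights, a unit-norm feature, and zero
    bias: each entry is cos(theta_j)."""
    x_hat = l2_normalize(x)
    return [dot(l2_normalize(w), x_hat) for w in weights]


weights = [[1.0, 0.0], [0.0, 2.0]]           # one weight vector per class
short = angular_logits(weights, [3.0, 3.0])
long = angular_logits(weights, [30.0, 30.0])  # same direction, 10x longer
```

Here `short` and `long` agree up to rounding, which illustrates that only the angle matters once magnitudes are removed.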
The main contributions of this study are as follows:
(1) We studied the TMISD task with prevalent metric learning methods, both data-driven and loss-driven. The performance of these methods was investigated from several evaluation aspects relevant to the TMISD task, including accuracy, F1 score, training cost, and generalization ability.
(2) According to the peculiarity of TM images, a hybrid-margin softmax (HMS) is proposed. Two additive margins are attached to the cosine term and the angular term of softmax, respectively, to expand the decision boundary in the angular space. The magnitudes of the weight and feature vectors are preserved to retain as much of the input information as possible. The metric function used to calculate the similarity is replaced by a classifier, i.e., a fully connected layer.
(3) Experiments indicate that the detection model penalized by HMS can be trained on a small amount of annotated data and reaches high detection accuracy with fewer SNN layers. Furthermore, the HMS detection model trained entirely on TM data generalizes well to the face recognition (FR) task, which indicates that the model trained on HMS has strong discrimination ability on input images.

Hybrid-Margin Softmax
The peculiarity of TM images is crucial to the TMISD task. We compared the FR task and the TMISD task to obtain a better view of the latter:
(1) The composition of images in an FR task is constant. The principal parts of an input pair of samples are human faces that come either from the same person or from different people. The features extracted from the input are generally fixed, such as the shapes of faces, eyes, and noses. In addition, external interfering terms have to be considered, including gestures, illumination, age, and image noise.
(2) The TMISD task is aimed at detecting the similarity of TM images. Generally, a TM design consists of a single element or several elements. The elements of the disputed TM image will not be identical to those of the legal one but partly similar in contours, colors, and textures, as shown in Figure 1. It is common for two TM images in a disputed case to contain both similar and different elements. It should be noted that new outlines can be formed by varying the placement of elements. Furthermore, the interfering terms mentioned for the FR task no longer need to be considered, since TM images are artificially designed in most cases.
To sum up, compared to the FR task, more margin penalties are needed on the decision boundary to tolerate the wide variety of element design changes in pairs of similar TM images. Meanwhile, the detection model should extract more information from the input images to fully assess the degree of similarity and avoid false alarms. Therefore, the detection model should be further penalized, and the learnable parameters should be preserved as much as possible.
Given the characteristics of the TMISD task, a hybrid-margin softmax (HMS) is proposed as follows: where $s$ is a global scale factor, $W_{y_i}$ are the weights of the fully connected layer, and $x_i$ is the feature vector of the $i$-th sample extracted by the SNN. The weights and feature vectors are not normalized, and the biases are set to zero. The additive margins $d_1$ and $d_2$ are attached to the angle term and the cosine term, respectively. The decision boundary of the HMS loss is as follows: The decision boundary can still be considered as lying in the angular space, with a varying amplitude of the cosine curve, as shown in Figure 2a.
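The description above can be made concrete with a small sketch of a target-class logit. The exact placement of the margins is an assumption inferred from the prose, not a confirmed formula: the sketch takes the target logit to be $s \, \|W_{y_i}\| \, \|x_i\| \, (\cos(\theta_{y_i} + d_1) - d_2)$, with magnitudes preserved and zero bias. The function names are hypothetical, and the default values of $s$, $d_1$, and $d_2$ are the ones reported in the experimental setup.

```python
import math


def hms_target_logit(w, x, s=90.0, d1=0.003, d2=0.006):
    """Assumed HMS target-class logit: s * |w| * |x| * (cos(theta + d1) - d2).
    Vector magnitudes are kept (no normalization) and the bias is zero."""
    w_norm = math.sqrt(sum(a * a for a in w))
    x_norm = math.sqrt(sum(a * a for a in x))
    cos_theta = sum(a * b for a, b in zip(w, x)) / (w_norm * x_norm)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
    theta = math.acos(cos_theta)
    return s * w_norm * x_norm * (math.cos(theta + d1) - d2)


def plain_logit(w, x, s=90.0):
    """Unpenalized scaled inner product, as used for non-target classes."""
    return s * sum(a * b for a, b in zip(w, x))
```

Under this assumption, both margins lower the target logit relative to the plain inner product, so the feature has to move closer to its class center to reach the same score. Setting `d1 = d2 = 0` recovers the plain logit.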

Interpretation of HMS
Several margins are attached to the inner product form of the logit to tolerate small, unpredictable changes between similar elements of TM images. The magnitudes of the extracted feature vectors and of the weight vectors, which are learnable parameters, are preserved to retain as much information from the input as possible. The model can benefit from this information when pairs of dissimilar TM images contain similar elements. In addition, normalization changes the magnitude and direction of the logit, as shown in Figure 2b, so the SNN has to adapt constantly to these changes during training.
Consider a group of TM images consisting of an anchor sample, a similar sample, and a dissimilar sample. Suppose the class center of the anchor in the feature space is $W_2$, the class center of the dissimilar sample is $W_1$, and the feature vector of the similar sample is $x$, as shown in Figure 2c. The model can make the right prediction by the following calculation: where $\theta_i$ ($i = 1, 2$) denotes the angle between the feature $x$ and the class center $W_i$.
The value of the cosine term is decreased by the two margins $d_1$ and $d_2$. The model is penalized further to improve its discrimination ability. The magnitude of the weight vector can be scaled for better prediction during model training.
A toy experiment, shown in Figure 3, illustrates the distribution of features extracted by the SNN trained on different transformations of the softmax function. These features are sent to the classifier to give a prediction, so the plot describes visually how well the SNN discriminates TM images. Red and blue spots are the visualized features of two input TM images. Spots in the first row are from dissimilar TM images, and spots in the second row are from similar TM images. In the first row, the first four groups of feature spots are loose and chaotic, which does not give the classifier a solid basis for judging that the images are dissimilar. The last group in the first row, extracted by the SNN trained on HMS, is tightly oriented and separable at the same time. The spot distributions in the second row also indicate that the features learned by the SNN with HMS are compact and adequate for making a judgment.

Results
We conducted two branches of experiments. The comparison of loss-driven methods includes detection models based on an SNN trained on different transformations of softmax. The comparison of data-driven methods includes detection models based on the typical SNN (trained by contrastive loss), the triplet network, and the fine-tuning method.

Datasets
There were two types of image data involved in the experiments: TM images and human faces. The face data were used to train the feature extractor in the fine-tuning method and were used for testing in the loss-driven methods. The TM image training data were compiled from real-world trademarks and annotated, consisting of 300 pairs of similar samples. The TM test data were collected from real court-disputed cases and cleaned manually, consisting of 1000 pairs of similar samples. The dissimilar TM image pairs were randomly selected and paired, in numbers equal to those of the similar pairs. The public LFW dataset was used as the human face data, consisting of 3000 pairs of positive samples and 3000 pairs of negative samples [23,24].
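The random pairing of dissimilar samples can be sketched as follows. This is a minimal illustration under stated assumptions: images are represented by hypothetical string identifiers, the pool of candidate images is taken to be the images that appear in the similar pairs, and a randomly drawn pair is assumed to be dissimilar as long as it is not an annotated similar pair. How dissimilarity was actually verified is not specified here.

```python
import random


def sample_dissimilar_pairs(similar_pairs, seed=0):
    """Draw as many random image pairs as there are similar pairs,
    skipping any pair already annotated as similar."""
    rng = random.Random(seed)
    images = sorted({img for pair in similar_pairs for img in pair})
    known = {tuple(sorted(pair)) for pair in similar_pairs}
    n = len(images)
    # Make sure enough distinct unannotated pairs exist before looping.
    if n * (n - 1) // 2 - len(known) < len(similar_pairs):
        raise ValueError("not enough images to draw that many new pairs")
    dissimilar = set()
    while len(dissimilar) < len(similar_pairs):
        candidate = tuple(sorted(rng.sample(images, 2)))
        if candidate not in known:
            dissimilar.add(candidate)
    return sorted(dissimilar)


similar = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
negatives = sample_dissimilar_pairs(similar)
```

The result contains the same number of pairs as `similar`, which gives a balanced set of positive and negative samples.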
Data preprocessing consisted of cropping the input images to a fixed size while preserving their colors. The TM training data were normalized with their own statistics. The TM test data and the LFW data were normalized with mean value X = 0.5 and variance σ = 0.5.
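As a small worked example of the normalization of the test data, the sketch below assumes that pixel intensities have first been scaled to the range [0, 1] and that σ is used as the divisor, as is conventional in image pipelines; both points are assumptions, and the function names are hypothetical.

```python
def normalize_pixel(value, mean=0.5, sigma=0.5):
    """Shift and scale one pixel intensity assumed to lie in [0, 1]."""
    return (value - mean) / sigma


def normalize_image(pixels, mean=0.5, sigma=0.5):
    """Apply the same normalization to a flat list of pixel intensities."""
    return [normalize_pixel(p, mean, sigma) for p in pixels]


scaled = normalize_image([0.0, 0.5, 1.0])
```

Under these assumptions, the darkest and brightest intensities map to the two ends of a symmetric range around zero.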

Experimental Setup
The details of the detection process are given in this section. Detection based on the loss-driven methods consists of feature extraction and classification.
The process of feature extraction is as follows: the backbone of the feature extractor is an SNN consisting of two identical CNNs that share the same structure and weights, as shown in Figure 4. Images passed through the SNN are encoded as vectors in the same feature space. Several CNN structures are implemented, including a simple self-defined six-layer CNN and ResNet18, 34, 50, 101, and 152 [25]. The fully connected layer in ResNet is removed. The six-layer CNN has a structure similar to that of ResNet, including a batchnorm (BN) layer and a pooling layer, as shown in Figure 5, but it has no residual module.
The process of classification is as follows: the output vectors of the feature extractor are concatenated, activated by the transformation of softmax, and then sent to the fully connected layer, i.e., the classifier with the sigmoid activation function. The output judgment of similarity is one-hot encoded, in the same way as the input label.
The detection processes of the two data-driven methods, the fine-tuning method and the triplet model, are described in items (1) and (2) that follow the caption of Figure 5. The six-layer CNN is excluded from the data-driven methods, since a CNN of such shallow depth cannot meet the fitting demands of the triplet network model and the fine-tuning method.
The other experimental settings are as follows: the scale factor $s$ in softmax was set to 90; the angular margin in SphereFace was 4; the additive margin of the cosine term in CosFace was 0.006; the additive margin of the angular term in ArcFace was 0.003; and the additive margins of the cosine term and the angular term in HMS were 0.006 and 0.003, respectively. All experiments were conducted in the PyTorch framework, and CUDA was used to accelerate training. The learning rate was set to 0.001 and the batch size was 16. The optimizer in the triplet network was Adam, and the optimizer in the other methods was SGD (with a momentum of 0.9).
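To give a feel for the size of these margins, the sketch below evaluates the penalized cosine term of the three baseline losses at one example angle, using the margin values listed above. It is an illustration under simplifying assumptions: the SphereFace term is written as the plain $\cos(m\theta)$, which is only monotonic for $\theta \in [0, \pi/m]$, and the same scale factor is applied to all three. The function names and the example angle are hypothetical.

```python
import math

S = 90.0  # scale factor from the setup above


def sphereface_cos(theta, m=4):
    """Multiplicative angular margin cos(m * theta); this simple form is
    only monotonic for theta in [0, pi / m]."""
    return math.cos(m * theta)


def cosface_cos(theta, m=0.006):
    """Additive margin subtracted from the cosine term."""
    return math.cos(theta) - m


def arcface_cos(theta, m=0.003):
    """Additive margin added to the angle before taking the cosine."""
    return math.cos(theta + m)


theta = math.pi / 6  # example angle between a feature and its class center
for name, fn in [("SphereFace", sphereface_cos),
                 ("CosFace", cosface_cos),
                 ("ArcFace", arcface_cos)]:
    print(name, S * fn(theta), "vs. unpenalized", S * math.cos(theta))
```

With these values, the multiplicative margin of SphereFace changes the target term far more strongly than the two small additive margins, which each shift it only slightly.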

Loss-Driven Method Experiments
To verify that HMS enables the SNN to learn sufficiently separable features of TM images, we compared the detection models based on different transformations of softmax under the same experimental conditions. The accuracy and F1 score of detecting similar and dissimilar TM image pairs are shown in Tables 1 and 2. We also tested the detection model on the LFW dataset while the SNN was still trained only with TM image data. The results are shown in Tables 3 and 4.
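For reference, the two reported metrics can be computed from the counts of a binary confusion matrix, treating similar pairs as the positive class. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the tables.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Accuracy and F1 score from binary confusion-matrix counts, with
    'similar pair' as the positive class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts for a balanced test set of 200 pairs.
acc, f1 = accuracy_and_f1(tp=90, fp=10, tn=90, fn=10)
```

Because the test sets contain equal numbers of similar and dissimilar pairs, accuracy and F1 tend to move together, but F1 is the more sensitive of the two when a model misses similar pairs.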
For the TMISD task, SphereFace achieves up to 96.39% accuracy, which greatly outperforms CosFace and ArcFace. However, when the depth of the SNN increases, the model overfits severely. The accuracy of HMS with normalized feature and weight vectors is slightly better than that of CosFace and ArcFace. HMS achieves the best performance, with up to 98.97% accuracy, which is an improvement of 2.58 percentage points over SphereFace. It is also notable that a simple six-layer SNN trained on HMS works well on the TMISD task, with 97.45% accuracy and an F1 score of 0.9516.
For the FR task, the detection model trained on SphereFace with TM data is overfitted. The performance of HMS (normalized) is again better than that of CosFace and ArcFace. The model trained on HMS generalizes well to the LFW dataset, with up to 90.57% accuracy and an F1 score of 0.9002. A simple six-layer SNN trained on HMS reaches 80.45% accuracy and an F1 score of 0.8173.
The performance of HMS with different SNN depths was tested, as shown in Tables 1-4. ResNet18 was adequate for meeting the demand of the TMISD task and for generalizing to the FR task. This also indicates that an SNN penalized by HMS can learn sufficient and critical information with fewer network layers, which reduces the training expense.
The detection accuracy of the model (ResNet18) trained on HMS with different hyper-parameters is shown in Figure 6. The accuracy fluctuation caused by the scale factor $s$ is larger than that caused by the margins $d_1$ and $d_2$.

Data-Driven Method Experiments
The performance of the detection models based on the typical SNN, the triplet network, and the fine-tuning method is shown in Table 5. The triplet network achieved 92.27% accuracy with an F1 score of 0.9282 on the TMISD task, which is a significant improvement over the typical SNN. However, this gain came with greatly increased training costs in terms of memory and time. The performance of the fine-tuning method on the TMISD task was not satisfactory considering the training cost of the transferred knowledge. When a model for a similar task is readily available, however, the fine-tuning method is a good choice, with minimal training cost and fine performance.
For the detection models based on the triplet method and the fine-tuning method, the performance changes rapidly with the depth of the network. These data-driven methods obtain a large gain in performance only when a CNN of suitable depth is employed for the task.

Discussion
Metric learning is an appropriate research idea for the TMISD task, since the requirement for annotated data during training is reduced. With the same number of similar TM pairs, the triplet network and fine-tuning data-driven methods can improve performance greatly compared to a simple SNN model. The triplet model enhances the discrimination ability with an additional input during training. The fine-tuning method transfers the information learned from other tasks and alleviates the pressure of data annotation.
The advantage of the detection models based on the data-driven methods is not prominent compared to the typical loss-driven models, since their training is complicated and expensive. The performance gaps between the SphereFace model and the other two models, CosFace and ArcFace, are large, but the performance of SphereFace cannot be sustained when the SNN depth is increased or when a new type of image is input for detection. The SphereFace model can be degraded by the diversity of TM images.
The HMS model outperformed the other methods in the following aspects: (1) the features of similar TM pairs become clearly more compact; (2) the discrimination ability for another type of image, i.e., face data, is improved, which indicates that the model trained on HMS is robust; (3) the training cost is reduced because the requirements for annotated data and for SNN depth are relaxed.
In general, the introduction of the loss-driven model training idea is meaningful to the TMISD task. The challenges of building training data and of generalizing to new data are dealt with in a low-cost way.

Conclusions
The detection of TM image similarity (TMISD) is an essential procedure for court judgments on TM infringement cases and for TM legal registration, while building training data of similar TM pairs and generalizing to fast-growing numbers of new TM designs are huge challenges for the task. To address these issues, similarity detection models based on loss-driven metric learning methods were studied. Compared to data-driven methods, including the triplet network model and the fine-tuning method, the optimization of the softmax loss function yielded a larger gain in performance, with less data preparation and training cost.
A hybrid-margin softmax (HMS) is proposed based on the peculiarity of TM images. Additive margins are attached to the cosine and angular terms of softmax in the angular space to tolerate the slight differences between the similar parts of similar TM image pairs. The weights of the classifier and the extracted feature vectors in the softmax are not normalized, so as to preserve as much of the information of the input images as possible.
The detection model trained on HMS is further penalized to improve its discrimination ability on TM images. The model can be trained on a small amount of TM training data. Experiments indicate that the model trained on HMS achieves the best performance on the TMISD task, with up to 98.97% accuracy and an F1 score of 0.9746, compared to other transformations of softmax. The model can also achieve high performance with fewer SNN layers. Furthermore, the HMS-driven model trained entirely on TM image data generalized well to the FR task, with up to 90.57% accuracy and an F1 score of 0.9002.

Figure 1. Some examples of TM images: (a) Similar pairs, similar in element shapes and general color. (b) Similar pairs, similar in contour and some elements. (c) Dissimilar pairs, similar in partial contour but dissimilar in other factors.

Figure 4. The procedure of TM image similarity detection.

Figure 5. The structure of the 6-layer CNN.
(1) Fine-tuning method: An SNN to be transferred is trained on the LFW training dataset. The backbone of the SNN is one of the original ResNet series. When the SNN reaches 95% or more accuracy on the LFW test dataset, the fully connected layer is removed and the rest of the weights are frozen. The trained and frozen SNN and a new fully connected layer compose the TM detection model. Then, the model is further trained on the TM training dataset and tested on the TM test dataset.
(2) Triplet model: Each input consists of two similar TM images, a dissimilar one, and the corresponding labels. The triplet is built from the TM training dataset by attaching a random TM image to each pair of similar samples.
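The triplet construction in item (2) can be sketched as follows. This is a minimal illustration, assuming images are referred to by hypothetical string identifiers and that a randomly drawn image other than the two pair members can serve as the dissimilar sample; the function name and the example data are not from the original experiments.

```python
import random


def build_triplets(similar_pairs, image_pool, seed=0):
    """Attach one randomly drawn image to every similar pair, giving
    (anchor, positive, negative) triplets."""
    rng = random.Random(seed)
    triplets = []
    for anchor, positive in similar_pairs:
        # Exclude the pair itself so the negative is a different image.
        candidates = [img for img in image_pool
                      if img not in (anchor, positive)]
        negative = rng.choice(candidates)
        triplets.append((anchor, positive, negative))
    return triplets


pairs = [("a1", "a2"), ("b1", "b2")]
pool = ["a1", "a2", "b1", "b2", "c1"]
triplets = build_triplets(pairs, pool)
```

Drawing a single negative per pair keeps the number of triplets equal to the number of similar pairs. Drawing several negatives per pair would enlarge the training set but also the training cost.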

Figure 6. The accuracy of the detection model (ResNet18) trained with different hyper-parameters in HMS.

Table 1. The accuracy of the TMISD task.

Table 2. The F1 score of the TMISD task.

Table 3. The accuracy of the FR task.

Table 4. The F1 score of the FR task.

Table 5. The performance of data-driven methods on the TMISD task.