A Photo Identification Framework to Prevent Copyright Infringement with Manipulations

Abstract: In recent years, copyright infringement has been one of the most serious problems hampering the development of the culture and arts industry. Due to the limitations of existing image search services, these infringements have not been properly identified, and their number has been increasing continuously. To uncover these infringements and handle the big data extracted from copyright photos, we propose a photo copyright identification framework that accurately handles manipulations of stolen photos. From a collage of cropped photos, regions of interest (RoIs) are detected by Image RoI Detection to reduce the influence of cropping and identify each photo. Binary descriptors for quick database search are generated from the RoIs by Image Hashing, which is robust to geometric and color manipulations. The matching results of Image Hashing are verified by measuring their similarity with the proposed Image Verification to reduce false positives. Experimental results demonstrate that the proposed framework outperforms other image retrieval methods in identification accuracy and significantly reduces the false positive rate by 2.8%. This framework is expected to identify copyright infringements in practical situations and have a positive effect on the copyright market.


Introduction
Copyright is one of the legal protections used to encourage creative activity, granting creators the exclusive right to use their content or to allow others to use it. Despite these legal protections, numerous copyright violations are still committed worldwide by pirates. According to Copytrack [1], approximately 65M images in 115 countries suffer from copyright infringement, and the damage from stolen images is estimated at approximately USD 63M. If an illegal use is detected, public institutions such as the Copyright Commission can issue corrective recommendations to compensate copyright holders. In particular, as shown in Figure 1, the number of corrective recommendations for copyright infringement in South Korea is increasing. In 2019, more than 670,000 corrective recommendations were made, and the total amount of compensation was approximately USD 28M [2].
More than 90% of the violations involve images and videos, and manually identifying and uncovering hundreds of thousands of infringements is very costly. Considering the large number of photos and videos that have not yet been detected on the web, it is very inefficient to capture all of the infringements with human labor. An automated copyright photo identification method is required to efficiently capture these copyright infringements [3][4][5][6] by finding the original photo that is identical to a query image. When a subject is photographed from different angles at the same time, the resulting photos are not identical but very similar, and distinguishing them is a very challenging task considering the trade-off between speed and accuracy. Web image search services [7][8][9] are automated solutions that can perform the aforementioned functions. For an image query, they return not only visually similar images but also identical images. However, existing web image search services such as Google, TinEye, Yandex, and Bing do not accurately find the original photo under manipulations such as image cropping or image collage, which frequently occur when photos are stolen. For example, as shown in Figure 2, if a shopping website steals photos to promote sunglasses, it crops only the person wearing sunglasses from the background (image cropping) and combines several images of sunglasses (image collage). In addition, image manipulations such as resizing, color distortion, flipping, and compression can occur during image editing, and these can also degrade identification performance. Image cropping and image collages cause, respectively, loss of information from original photos and misidentification of multiple photo copyrights. These problems are difficult to solve with existing web image search services.

Figure 2. Example of a photo collage with two cropped photos. In these photos, the sunglasses are the target object for unauthorized use. The source of these photos is https://unsplash.com (accessed on 27 September 2021), which provides copyright-free images.
As shown in Figure 3, since existing image search services do not consider the characteristics of copyright infringements, they respond poorly to the aforementioned manipulations. Figure 3a,b show that Google image search and TinEye matched only one of the results correctly, and Yandex also matched only one photo, as shown in Figure 3c. The limitations of existing image search services emphasize the need for a new photo copyright identification method and database. In addition to technical limitations, the absence of a service capable of providing copyright information also makes it difficult to handle infringements. The service in Figure 4 can facilitate the legal use of copyrighted photos by connecting users and copyright holders, and it helps to find copyright infringements based on the photo copyright identification framework.

Figure 3. Search results for Figure 2a by Google [7], TinEye [9], and Yandex [8]. We additionally tried Bing, but did not obtain any results.

Figure 4. Service block diagram for the photo copyright search. The yellow box on the right side is the proposed photo copyright identification framework in this paper.

Photo Copyright Search
In this paper, we propose a photo copyright identification framework that can handle the manipulations shown in Figure 2a. The proposed framework in Figure 4 is composed of three modules, Image RoI Detection, Image Hashing, and Image Verification, through which the original photo and the manipulated photo are matched accurately. Image RoI Detection divides a collage of cropped images into RoIs that represent the characteristics of each image well. This module reduces the influence of geometric manipulations so that RoIs are detected consistently from original photos and manipulated photos. An image collage dataset is constructed to train optimized detectors for Image RoI Detection, and the most efficient detector is selected by benchmarking state-of-the-art object detectors on this dataset.
Image Hashing generates a binary descriptor from each detected RoI to search the photo copyright DB. By training the image hashing network with image augmentation covering diverse manipulations, the generated binary descriptor is made robust to manipulations such as color transformation, blurring, and cropping that Image RoI Detection does not handle. However, severe manipulations make it very difficult to identify photos correctly with binary descriptors alone, leading to false positives. Since it is not easy to check whether identification results are correct in a field test, many false positives may be wrongly accepted as copyright matches.
To tackle this problem, we propose Image Verification to confirm whether the identification results of Image Hashing are correct. This module geometrically aligns a query image and its matched photo and obtains the overlapping region between them for pixel-to-pixel comparison. The image similarity Siamese network calculates the similarity of the overlapping region and verifies whether the images are identical.
This stable framework, composed of the series of three modules, improves identification accuracy and minimizes the false positive rate for manipulated photos. Through exhaustive evaluations, we verify the benefits of each module in the proposed framework for photo copyright identification. Furthermore, it is demonstrated that the proposed framework significantly improves identification accuracy and decreases the false positive rate, outperforming conventional and state-of-the-art retrieval methods [10,11].

Contributions
The contributions of this paper are summarized as follows. • To accomplish Image Verification, we introduce a geometric alignment method to accurately align a query image with its match in the database obtained by the Image Hashing method, and we propose a similarity measurement network to determine, robustly to manipulations, whether the aligned image infringes copyright. • By applying the aforementioned three steps, we improve the identification accuracy and significantly reduce the false positive rate for manipulated photos. The experimental results demonstrate the advantages of each module in the proposed framework and show notable performance improvements over the state-of-the-art image retrieval method [10].

Efficient Object Detection
Object detectors [12][13][14][15][16][17] have become more accurate and precise with the development of deep learning techniques such as backbone networks, multi-scale methods, and prediction heads. However, the state-of-the-art detector [18] requires a huge network size and expensive computation, making it difficult to apply to real-world applications such as mobile devices, robotics, and self-driving cars. As demand for real-time detectors has increased, one-stage detectors [12,16,17] have achieved sufficient performance for real-time applications. Compared to two-stage detectors [13][14][15] with two separate steps, localization and classification, one-stage detectors run very fast because they produce a single tensor output for these multiple tasks [16,17]. However, the localization precision of one-stage detectors is lower than that of two-stage detectors due to information abstraction caused by the pooling operations of convolutional neural networks. The feature pyramid network (FPN) [19] aggregates multi-scale features and uses them in the final output tensor computation. This method leverages receptive fields of different sizes to effectively reflect multi-scale object information and improve performance. In recent years, the state-of-the-art one-stage detector, EfficientDet [12], has utilized a bi-directional FPN to share information between low-level and high-level layers. In addition, compound scaling, which considers various aspects of CNN architectures, was proposed to design the efficient object detectors EfficientDet-D0 to D7. We deploy this model, as our benchmark of object detectors shows it is suitable for object and image detection.

Image Hashing
To generate a binary descriptor reflecting image semantics, supervised hashing methods [20,21] utilize label information, but most collected images do not have labels since labeling is expensive. On the other hand, unsupervised hashing methods [10,22,23] use image content only to generate binary descriptors. Although it is difficult to extract highly refined semantics from images without label information, unsupervised hashing has been studied due to its utility. BinGAN [22] applied regularization to a GAN (Generative Adversarial Network) [24] for image hashing, and BinaryGAN [10] also utilized an adversarial loss along with an autoencoder architecture to learn the image distribution. However, since existing image hashing methods have not considered image manipulations such as cropping, color distortion, and blur, these manipulations are likely to degrade their accuracy. We propose an image hashing scheme to handle the various manipulations that degrade the performance of photo copyright identification.

Image Classification
Image classification has driven the development of fundamental techniques for deep learning. Backbone networks such as AlexNet [25], VGGNet [26], GoogleNet [27], and ResNet [28] have been studied and have improved the performance of image classification, object detection, and image segmentation. Moreover, many image datasets [29][30][31][32] have been introduced. ImageNet [25] is a large-scale dataset containing more than 1.2 million images, which has greatly influenced many image classification tasks through the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The aforementioned image classification networks also utilized ImageNet, which provides the most generalized feature representation due to its overwhelming scale. The representation power of ImageNet pretraining yields better performance on a variety of tasks, networks, and other datasets. We utilize ImageNet pretraining for the image hashing network to generate binary descriptors.

Self-Supervised Learning
Many researchers have proposed self-supervised learning methods [33][34][35][36] to learn the semantics of images without labels. Self-supervised learning methods have developed to a level comparable to supervised learning methods and have received growing attention due to their transferability to other tasks. SimCLR [34] learns from a large number of negative samples with the NT-Xent loss, and SimSiam [35] learns common features of an image pair through a stop-gradient operation without negative samples. We employ a self-supervised learning scheme to train the image hashing network.

Siamese Network
In recent years, one-shot learning, which learns to generate discriminative features for a given task, has been successfully used for image similarity assessment. It is applicable not only to wide variations [37] but also to subtle variations [38]. The core network of one-shot learning is the Siamese structure [39]. When the number of training samples is very small or additional learning, such as in an online system, is difficult, a Siamese network can learn discriminative features and perform k-nearest neighbor classification. In addition, applying multimodal data to a Siamese network allows similar feature representations to be learned from different domains. Koch et al. [39] performed one-shot learning on the Omniglot dataset [38] with a small number of samples, achieving near-human-level accuracy. We propose a variant of the Siamese network to predict the similarity of image pairs for image verification.

Image Verification Metric
The image verification metric helps to determine whether two images match exactly. Naive approaches such as mean squared error (MSE), mean absolute error (MAE), and peak signal-to-noise ratio (PSNR) do not reflect the structural difference and correlation of images because they use simple arithmetic operations on local pixel dissimilarity. To complement this, Wang et al. [40] proposed the Structural Similarity Index (SSIM) based on the Human Visual System (HVS). However, although SSIM captures local structural dissimilarities in each patch, the local differences tend to be ignored when local similarities are averaged. Moreover, because the luminance and contrast terms of SSIM are vulnerable to color distortions such as tone variation and JPEG compression, SSIM is not suitable as an image verification metric. To overcome these limitations, DeepQA [41], DeepVQA [42], and DRF-IQA [43] have been proposed in a data-driven scheme using deep neural networks. For image similarity measurement, we propose an image verification metric that captures both global and local dissimilarities.
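To illustrate why simple pixel-wise metrics fall short, consider the following minimal numpy sketch (the function names and toy images are ours, not from the paper): a global brightness shift and a localized structural edit can produce exactly the same MSE, so MSE and PSNR cannot distinguish them.

```python
import numpy as np

def mse(a, b):
    """Mean squared error: averages squared local pixel differences."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)

base = np.zeros((8, 8))
brightness_shift = base + 10.0      # global tone change, structure preserved
structural_edit = base.copy()
structural_edit[:4, :4] = 20.0      # one quadrant altered, structure broken

# Both edits yield MSE = 100 (and hence identical PSNR), even though only
# the second one changes the image structure.
```

This is exactly the failure mode the structure-aware and learned metrics above are designed to avoid.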

Overall Framework
As shown in Figure 5, the copyright photo identification framework searches photo assets in the database with an image query through three steps: Image RoI Detection, Image Hashing, and Image Verification. The first step, Image RoI Detection, detects image regions and object regions from an input image. The image detector separates an image collage into individual images, and the object detector recognizes object regions to reduce the effect of image cropping in backgrounds. The largest object is selected as an RoI to represent each separated image. If the object size is less than a certain percentage of the separated image, or if no object is detected, the whole separated image is selected as the RoI. Through this module, a query image is refined to be similar to the RoI of the original photo.
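The RoI selection rule above can be sketched as follows; `select_roi` and the `min_ratio` threshold are illustrative names of ours, since the paper does not specify the exact percentage:

```python
def select_roi(image_box, object_boxes, min_ratio=0.3):
    """Pick the RoI for one separated image: the largest detected object,
    falling back to the whole image when no object is large enough.
    Boxes are (x, y, w, h); min_ratio is an assumed area threshold."""
    _, _, iw, ih = image_box
    largest = max(object_boxes, key=lambda b: b[2] * b[3], default=None)
    if largest is None or largest[2] * largest[3] < min_ratio * iw * ih:
        return image_box   # no object detected, or object too small
    return largest
```

The fallback to the whole separated image guarantees that every collage member contributes exactly one RoI to the later hashing step.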

Figure 5. Overall framework for copyright photo identification. The source of these photos is ImageNet [29].
In Image Hashing, binary descriptors are generated from the RoIs selected in the previous step. The backbone network of Image Hashing is pretrained on ImageNet to utilize its representation power, and SimSiam [35] is employed as the head network for fine-tuning on copyright photos. The output vector of the head network is used as a binary descriptor, which, together with the information of the input photo, constructs the photo copyright database. The binary descriptor allows the most similar photo to be found quickly using the Hamming distance.
The photo matched by the binary descriptor is verified against the query image by measuring their similarity in Image Verification. The query image and the matched photo are aligned geometrically by an affine matrix, and the overlapping region is derived from their geometric relationship, as shown in step 3-(2) of Figure 5. The similarities of patch pairs sampled from the overlapping region are calculated by the proposed Siamese similarity measurement network and aggregated into an image similarity by a multi-layer perceptron (MLP). This image similarity determines whether two photos are identical based on a threshold value. Details of each step are described in the following sections.

Image RoI Detection
In this step, the image collage is divided into a set of single images, and then the corresponding image RoI representing a salient object is determined for every single image. To detect the single images in an image collage and their objects, we adopt two separate deep detector networks. The network architecture is determined by benchmarking three state-of-the-art object detection networks: Faster R-CNN [14], Yolo v3 [17], and EfficientDet [12]. Based on the benchmark presented in Table 1, our framework uses the EfficientDet (D1) architecture for each detector, which outperforms the other detectors in predicting the RoIs from geometrically manipulated images. We trained these two networks with classification and localization losses. For a class t and corresponding network prediction p_t ∈ [0, 1] with ground truth p*, the classification loss is defined by the alpha-balanced focal loss [44], a weighted binary cross-entropy:

L_cls = −α (1 − p_t)^γ log(p_t),

where γ ∈ [0, 5] and α ∈ [0, 1] are the focusing weight for modulating easy/hard examples and the balancing weight for positive/negative examples, respectively. The localization loss is the smooth L1 loss [14] between the predicted bounding box t_u = {t_x, t_y, t_w, t_h} and the ground truth bounding box t*:

L_loc = Σ_{u ∈ {x, y, w, h}} smooth_L1(t_u − t*_u),

where smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise. The total loss function is expressed as a weighted summation of the classification loss and the localization loss. Since image detection is less dependent on the detection label, we put a larger weight on the localization loss for the image detector, while the key object detector is trained with evenly weighted losses.
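For concreteness, here is a minimal numpy sketch of the two training losses (our own illustrative implementation, with the α and γ defaults taken from the focal loss paper [44]):

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for a single binary prediction p in (0, 1):
    easy examples (p_t near 1) are down-weighted by (1 - p_t)^gamma."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def smooth_l1(x):
    """Smooth L1: 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def localization_loss(t, t_star):
    """Smooth L1 summed over the four box offsets (x, y, w, h)."""
    return float(np.sum(smooth_l1(np.asarray(t, float) - np.asarray(t_star, float))))
```

The quadratic region of smooth L1 keeps gradients small near a correct box, while the linear tails prevent outlier boxes from dominating the loss.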
Since there are many publicly available datasets for object detection, such as MS-COCO [45], the object detector can be trained on a public dataset. However, to the best of our knowledge, there is no dataset for detecting images in an image collage. Thus, we develop a data generation method that creates collage images using pristine images from a public image dataset. Starting from a rectangular frame with a predefined resolution, the proposed method recursively partitions the frame and inserts a pristine image into each subframe. To avoid unnatural image resizing, the image to be inserted into a subframe is selected based on the similarity of their aspect ratios. Before insertion, the selected image is center-cropped to fit the subframe. From an empty frame with a predefined resolution, the recursive partitioning is performed along the horizontal or vertical axis.
Algorithm 1: Generating the template by partitioning the frame.
Input: Subframe T = {x, y, w, h}, current partition level L, partition probability p_div, and partition center parameter α
Output: Frame partitioning information T_out ∈ R^(K×4), where K is the number of total subframes

Function PartitionSubFrame(T, L, p_div, α):
    P_div("True") = p_div, P_div("False") = 1 − p_div
    x ← SampleFromDistribution(P_div)
    if x = "True" and L > 0 then
        Axis ← SampleFromDistribution(P_axis)  // axis of partition
        if Axis = "V" then µ = h/2 else µ = w/2
        σ = µ / (4 √(2 log α))  // from FWTM
        c ← SampleFromDistribution(N(µ, σ²))
        if Axis = "V" then
            t_node1 = {x, y, w, c}  // top
            t_node2 = {x, y + c, w, h − c}  // bottom
        else
            t_node1 = {x, y, c, h}  // left
            t_node2 = {x + c, y, w − c, h}  // right
        end
        T_out ← Insert(PartitionSubFrame(t_node1, L − 1, p_div², α))
        T_out ← Insert(PartitionSubFrame(t_node2, L − 1, p_div², α))
    else
        T_out ← Insert(T)  // leaf node
    end
    return T_out

For each partitioning step, the occurrence of a partition, the partitioning axis, and its center are randomly determined with appropriate probability distributions. Specifically, we sample the partition center from a Gaussian distribution centered on the mid-location of the partition axis to make the subframe sizes variable. Furthermore, to prevent the partition center from being set too far from the mid-location, the variance of the Gaussian distribution is computed based on the full width at tenth maximum (FWTM) with α = 10. Details of the recursive frame partitioning algorithm are shown in Algorithm 1, and an example of recursive frame partitioning is described in Figure 6. For an image collage produced by the proposed method, the bounding box of each image in the collage, labeled 'image', is marked as the ground-truth annotation.
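A runnable Python sketch of Algorithm 1 follows. Two choices here are our interpretation rather than the paper's specification: the partition condition requires both a positive sample from P_div and a remaining level, and the sampled center is clamped inside the subframe.

```python
import math
import random

def partition_subframe(frame, level, p_div, alpha=10.0, rng=random):
    """Recursively split frame = (x, y, w, h) into subframes: partition
    with probability p_div along a random axis, at a center sampled from
    a Gaussian around the midpoint whose spread is derived from the FWTM."""
    x, y, w, h = frame
    if rng.random() < p_div and level > 0:
        axis = rng.choice("VH")
        span = h if axis == "V" else w
        mu = span / 2.0
        sigma = mu / (4.0 * math.sqrt(2.0 * math.log(alpha)))  # FWTM-based spread
        c = min(max(rng.gauss(mu, sigma), 1.0), span - 1.0)    # clamp inside frame
        if axis == "V":
            children = [(x, y, w, c), (x, y + c, w, h - c)]      # top, bottom
        else:
            children = [(x, y, c, h), (x + c, y, w - c, h)]      # left, right
        out = []
        for child in children:
            out += partition_subframe(child, level - 1, p_div ** 2, alpha, rng)
        return out
    return [frame]  # leaf node: one subframe, one inserted image
```

Squaring p_div at each recursion makes deep splits increasingly unlikely, so most generated collages contain only a handful of subframes.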
In our framework, using the trained detectors, the image RoIs of the query image are determined based on the images detected in an image collage and their objects. Note that the number of RoIs equals the number of images used in the image collage. Since the main purpose of stealing copyrighted photos is to use a salient object, the image RoI is chosen to be the bounding box of the conspicuous object. If a single image contains an object whose bounding box has a larger IoU (Intersection over Union) than the predefined IoU threshold, that bounding box is determined as the image RoI; otherwise, the whole image is used as the image RoI. By using image RoIs instead of general objects or whole image regions, our framework is robust to geometric manipulations.
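The IoU test used in this decision can be sketched as follows (an illustrative implementation of ours, with boxes given as (x, y, w, h)):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (zero if the boxes are disjoint).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

An object box is then promoted to the image RoI only when its IoU with the candidate region exceeds the predefined threshold.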

Image Hashing
SimSiam [35], a self-supervised scheme for Siamese networks, is employed for Image Hashing. The image hashing network is composed of a Backbone, a Projection MLP, and a Prediction MLP, as shown in Figure 7. The original photo I_o is augmented randomly and sampled into I_1 and I_2. This augmentation includes geometric manipulations such as random aspect ratio and random cropping, and color manipulations such as color distortion, grayscale, and Gaussian blur. The augmentation helps the network learn a common feature representation between the augmented inputs by generating diverse pixel distributions for the same image; without it, similar pixel distributions can cause overfitting and degrade the performance of the network. The backbone network is pretrained on ImageNet to utilize generalized features, which helps to prevent overfitting when dataset images are scarce. The weights of the Projection MLP and the Prediction MLP are initialized randomly and trained to minimize the negative cosine similarity loss by fitting the copyright photos. The negative cosine similarity D is defined as:

D(p, z) = −(p / ||p||₂) · (z / ||z||₂),

where p and z are the outputs of the Prediction MLP and the Projection MLP, respectively. Since there is no negative sample, the image hashing network collapses under D alone and all the outputs converge to the same value. To prevent the collapse, the stopGrad operation is applied to z, and D is redefined as D(p_1, stopGrad(z_2)). The stopGrad operation prevents gradients from propagating through z, so the network is trained to move p closer to the constant tensor stopGrad(z). For symmetry, the negative cosine similarity loss L_ncs is defined as:

L_ncs = (1/2) D(p_1, stopGrad(z_2)) + (1/2) D(p_2, stopGrad(z_1)).

To generate a binary descriptor, the sign function is applied to the output of the Prediction MLP, and the binary descriptor b is defined as:

b = sign(p).

The binary descriptor enables very fast database search using bit operations.
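The loss terms above can be sketched in numpy; this is an illustrative forward computation only, since the stop-gradient matters during backpropagation, which plain numpy arrays do not model:

```python
import numpy as np

def neg_cos_sim(p, z):
    """Negative cosine similarity D(p, z) between two feature vectors."""
    p = np.asarray(p, float)
    z = np.asarray(z, float)
    return -float(np.dot(p, z) / (np.linalg.norm(p) * np.linalg.norm(z)))

def l_ncs(p1, z1, p2, z2):
    """Symmetric loss L_ncs = D(p1, stopGrad(z2))/2 + D(p2, stopGrad(z1))/2;
    stopGrad is implicit here because plain arrays carry no gradients."""
    return 0.5 * neg_cos_sim(p1, z2) + 0.5 * neg_cos_sim(p2, z1)

def binarize(p):
    """Binary descriptor b = sign(p), stored as {0, 1} bits."""
    return (np.asarray(p, float) > 0).astype(np.uint8)
```

Note that L_ncs reaches its minimum of −1 when the predicted and projected features of the two augmented views point in the same direction.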
The index of the closest photo to the query is defined as:

i* = argmin_i ||b_Q ⊕ b_i^DB||₁,

where b_Q, b_i^DB, and ⊕ are the binary descriptor of the query, the i-th binary descriptor of the database, and the XOR operation, respectively; the L1 norm counts the number of differing bits, i.e., the Hamming distance. To minimize the information loss caused by the dimension reduction in the output features from the backbone network and by the binarization, the similarity preservation loss L_sp is defined as:

L_sp = || cos(F_1, F_2) − (1/K) b_1ᵀ b_2 ||²,

where F_1, F_2, b_1, and b_2 are the output features from the backbone network and the binary descriptors, respectively, and K is the descriptor length. The final loss L is defined as:

L = L_ncs + L_sp.

The image hashing network is trained to minimize the final loss L.
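The nearest-descriptor search reduces to an XOR followed by a popcount; a minimal sketch with descriptors packed into Python integers (the packing scheme is ours for illustration):

```python
def hamming(a, b):
    """Hamming distance via XOR and popcount on packed bit strings."""
    return bin(a ^ b).count("1")

def closest_index(b_query, db):
    """argmin over i of the Hamming distance between the query and descriptor i."""
    return min(range(len(db)), key=lambda i: hamming(b_query, db[i]))
```

Because XOR and popcount are single machine instructions on packed words, this linear scan remains fast even for large descriptor databases.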

Image Alignment
In the image verification step, the matched image pair from the image hashing step in Section 3.3 is verified based on the measured similarity between the two images. Because the query image may have undergone geometric manipulation such as cropping or resizing, it is necessary to geometrically align the query and its matched image. In our framework, since photos taken from different views should be predicted as different, the transform between the query and the matched image is assumed to lie in the affine space rather than the projective space. Using matched image features, the alignment process computes a transform matrix and then extracts an overlapping region based on it. Consequently, the image structure, including edge contrast, becomes similar between the two images, which is necessary for the image similarity measurement in the next subsection.
As shown in Figure 8, we use Accelerated-KAZE (AKAZE) [46] features for image feature matching. AKAZE is computationally more efficient than KAZE; it computes robust features within milliseconds while achieving feature matching accuracy comparable to well-known feature extractors such as SIFT [47] and SURF [48]. From the feature descriptors computed on each image, the affine transform matrix M is estimated using RANSAC (RANdom SAmple Consensus) [49], which computes a solution robust to outlier matches. The detailed procedure for estimating the affine transform matrix from AKAZE feature matches is described in Algorithm 2.
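A self-contained numpy sketch of the RANSAC loop for affine estimation is shown below. AKAZE matching itself is omitted, and the sample size, iteration count, and inlier threshold are illustrative defaults of ours, not the paper's settings:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine M such that dst ≈ [src, 1] @ M.T."""
    A = np.hstack([src, np.ones((len(src), 1))])   # N x 3 homogeneous points
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # 3 x 2 solution
    return M.T                                     # 2 x 3 affine matrix

def ransac_affine(src, dst, iters=200, thresh=2.0, seed=0):
    """RANSAC: repeatedly fit an affine to 3 random matches and keep the
    model with the most inliers (reprojection error below thresh)."""
    rng = np.random.default_rng(seed)
    A = np.hstack([src, np.ones((len(src), 1))])
    best_M, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)  # minimal sample
        M = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(A @ M.T - dst, axis=1)        # reprojection error
        n_inliers = int(np.sum(err < thresh))
        if n_inliers > best_inliers:
            best_M, best_inliers = M, n_inliers
    return best_M, best_inliers
```

Three point correspondences are the minimal sample for an affine transform (six unknowns), which is why each RANSAC iteration draws exactly three matches.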

Image Similarity Siamese Network
As mentioned in the previous section, the input image pair is assumed to have a similar geometric structure. Even so, directly computing the image similarity with a general pixel-wise metric would result in an inappropriate similarity score due to differences in color or other noise factors. Furthermore, possible noise from misalignment makes similarity measurement between images inaccurate. For instance, in Figure 8, the mean structure similarity (SSIM-S), the structure term of SSIM, has the same high value (0.96) for the different images in (b) and the identical images in (c). To address this problem, we propose a robust image similarity metric based on a deep Siamese CNN model. In addition, to make the proposed model capture small structural differences between images, we design the input to our CNN model not as the image pair itself, but as patches sampled around the local minima of the SSIM-S map. The process outline of the proposed patch-based Siamese CNN model is shown in Figure 9. From the geometrically aligned image pair, the image similarity Siamese network predicts an image similarity score P_image. If P_image is less than the pre-defined threshold τ_v, the framework decides that the two input images are different. Let the patch pair extracted from the two images be (x, y). Each patch pair is extracted at a local minimum of the SSIM-S map, which is defined as the structure part of SSIM [40]. The loss function is defined as the sum of the patch similarity loss L_P on patch pairs (x, y) and the image similarity loss L_I on the aggregated feature F (i.e., L = L_P + L_I). The patch similarity loss is a binary cross-entropy:

L_P = −[P_label(x, y) log P(x, y) + (1 − P_label(x, y)) log(1 − P(x, y))], with P(x, y) = FC_P(CONV(x), CONV(y)),

where P_label(x, y), CONV, and FC_P are the ground truth label, which is 1 for patches extracted from identical images and 0 otherwise, the convolutional neural network computing patch-wise features, and the patch similarity network composed of fully-connected layers, respectively.
The last layer of the patch similarity network FC_P is a sigmoid, which keeps the output value in [0, 1]. This loss term is introduced so that the network captures significant local dissimilarity within patch pairs. Meanwhile, the image similarity loss measures the loss over the entire image with the feature F aggregated from intermediate feature vectors of FC_P:

L_I = −[P_label(X, Y) log FC_A(F) + (1 − P_label(X, Y)) log(1 − FC_A(F))],

where P_label(X, Y) and FC_A are the ground truth label, which is 1 for identical images, and the patch-aggregated similarity network with fully-connected layers, respectively. To make the image similarity score lie in [0, 1], the patch-aggregated similarity network FC_A also has a sigmoid output layer. In Figure 8a,b, the similarity scores measured with the proposed model are 0.0025 and 0.69, which are more desirable than the mean structure similarity of 0.96. Training the network solely on different images with simple augmentations may make the proposed model less sensitive to fine structural differences between images. Therefore, we sample image pairs with barely visible differences from the YouTube-8M dataset [50]. For each video, frames whose SSIM-S values with adjacent frames are larger than a specific value are extracted and labeled as hard negative examples. From 150 pristine videos in the YouTube-8M dataset, we extract 200,000 frames based on the SSIM-S difference. To simulate various image manipulations other than cropping, these frames are augmented using color clipping, JPEG compression, and chromaticity adjustment. In training, we use both totally distinct image pairs and the frame pairs extracted by the above procedure.
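The SSIM-S value used for patch selection can be sketched as follows; this is a direct transcription of the SSIM structure component [40], with the constant following the usual c3 = c2/2 convention for 8-bit images:

```python
import numpy as np

def ssim_structure(x, y, c3=((0.03 * 255) ** 2) / 2):
    """Structure term of SSIM for one patch pair:
    s(x, y) = (cov(x, y) + c3) / (std(x) * std(y) + c3)."""
    x = np.asarray(x, float).ravel()
    y = np.asarray(y, float).ravel()
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float((cov + c3) / (x.std() * y.std() + c3))
```

Because the means are removed, the structure term is invariant to brightness shifts, which is precisely why patches are sampled at its local minima: those are the locations where the structural evidence against a match is strongest.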

Photo Copyright Dataset
To develop the photo copyright identification method, we used 35,824 copyrighted photos provided by copyright holders for research purposes only. To leverage diverse images, images from ImageNet and MS-COCO were also randomly sampled to compose the dataset. In total, the dataset was composed of 289,111 images. We used this dataset for Image RoI Detection and Image Hashing. Separately, as mentioned in Section 3.4.2, 200,000 frames were sampled from the YouTube-8M dataset [50] to train the image similarity Siamese network of Image Verification.

Image RoI Detection
EfficientDet-D1 [12] was deployed to train the object detector and the image detector on MS-COCO [45] and the image collage dataset, respectively. We generated 300,000 image collages from the photo copyright dataset using the image collage generation method in Section 3.2, as shown in Figure 10; 270,000 and 30,000 image collages were used for training and testing, respectively. In object detection, the objectness score S_obj was used to select reliable objects among the object predictions. We set the thresholds to Th_obj = 0.7 for the object detector and Th_img = 0.9 for the image detector; note that the larger Th_img yields tighter bounding boxes for the image detector. Furthermore, to filter out small objects, we set the IoU (Intersection over Union) threshold between an image region and an object region to Th_iou = 0.8.

Image Hashing
SimSiam [35] was employed as the image hashing network to generate binary descriptors. ResNet-50 [28], pretrained on ImageNet [29], was used as the backbone network. The hidden unit size of the Projection MLP and the Prediction MLP was set equal to the binary descriptor size, i.e., 128, 256, or 512.

Image Verification
For AKAZE [46] feature extraction, the threshold and the number of octaves were empirically set to 10^-4 and 3, respectively, and the reprojection threshold e_th in Algorithm 2 was set to 0.2.
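The role of the reprojection threshold e_th can be sketched as below. Since Algorithm 2 is not reproduced here, this is an assumed reading: matched keypoints are mapped through an estimated homography H, and matches whose reprojection error exceeds e_th are rejected as outliers.

```python
import numpy as np

def reprojection_errors(H, src_pts, dst_pts):
    """Map src points through homography H and return the Euclidean
    distances to the corresponding dst points."""
    src_h = np.hstack([src_pts, np.ones((len(src_pts), 1))])  # homogeneous coords
    proj = src_h @ H.T
    proj = proj[:, :2] / proj[:, 2:3]                          # dehomogenize
    return np.linalg.norm(proj - dst_pts, axis=1)

def inlier_mask(H, src_pts, dst_pts, e_th=0.2):
    """Keep matches whose reprojection error is below e_th."""
    return reprojection_errors(H, src_pts, dst_pts) < e_th
```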

Image Detector with Image Collage Dataset
For image detection from image collages, we benchmarked Yolo v3 [17], Faster R-CNN [14], and EfficientDet [12]. Yolo v3 and EfficientDet are one-stage detectors, whereas Faster R-CNN is a two-stage detector. Table 1 shows the image detection results of the three detectors trained on the image collage dataset, where Avg. IoU and Avg. time denote the average IoU between predicted and ground-truth boxes and the average time spent per box prediction. The results of Yolo v3 and Faster R-CNN illustrate the trade-off between one-stage and two-stage detectors: Faster R-CNN with ResNet-101 and without a Feature Pyramid Network (FPN) [19] achieves better average IoUs than Yolo v3, but its average time is more than ten times longer. EfficientDet yields much better average IoUs than both Yolo v3 and Faster R-CNN, and its inference is fast owing to its one-stage design. EfficientDet-D1, whose average IoU is saturated, is employed in this framework.

Image Hashing
To verify the performance of the image hashing network, identification accuracies are measured under geometric and color manipulations. To construct the photo copyright database, binary descriptors are generated from the RoIs of the original photos. For testing, random cropping is applied as the geometric manipulation, and color distortion in HSV space, brightness change, and Gaussian blur are applied as the color manipulations. Histogram of Oriented Gradients (HoG) [11] and BinaryGAN [10] are used for comparison across various binary descriptor sizes.

Geometric Manipulation
As shown in Table 2, identification accuracies are measured according to the IoUs between the original photos and randomly cropped images with various aspect ratios and sizes. The accuracies of all methods increase as the IoU and the descriptor size increase. The proposed image hashing achieves better accuracies than HoG and BinaryGAN. These results show that the proposed image hashing network, which learns various geometric manipulations, is effective. HoG, a low-level handcrafted feature, shows a rapid decrease in accuracy as the IoU decreases, and BinaryGAN, based on an autoencoder network, struggles to handle geometric manipulation compared to the proposed image hashing.

Color Manipulation
Table 3 shows the identification accuracies of the three methods, which are similar across the color manipulations. HoG, although vulnerable to geometric manipulations, shows slightly higher accuracies than BinaryGAN and the proposed image hashing for color distortion; since HoG describes the input image with its gradients in detail, it achieves good accuracies when there is no geometric manipulation. BinaryGAN has the lowest accuracies for all color manipulations, and the proposed image hashing shows the highest accuracies for brightness change and Gaussian blur. These results show that the proposed method handles color manipulations effectively by learning from manipulated photos.

Image Verification
In Image Verification, the performance of the image alignment and the image similarity Siamese network is measured after applying random clipping (10-30%).

Image Alignment
We verify whether AKAZE [46] is precise enough to perform the image alignment; the results are shown in Table 4. After the image alignment, significantly high accuracies with small deviations are obtained. These average IoUs are not visually distinguishable, so AKAZE is sufficient for Image Verification.

Figure 11 shows the receiver operating characteristic (ROC) curves of the image similarity measures. The proposed similarity measure outperforms MSE and SSIM. Since handcrafted image quality metrics such as MSE and SSIM average local similarities into their final scores, they are insensitive to local differences. Specifically, SSIM is robust to linear transforms such as contrast enhancement and mean shift, but its similarity score drops greatly when nonlinear transforms such as JPEG compression or blurring occur. On the other hand, since the proposed network is designed from a cognitive perspective on geometric structures, it handles color manipulations better than the other similarity metrics. Table 5 also provides the F1-scores and AUC (Area Under the Curve) values for Figure 11, confirming the strong performance of the proposed similarity measure.

Figure 11. Receiver operating characteristic curves of similarity metrics with geometrically coincident image pairs or patch pairs. Patch means that a similarity metric uses multiple patches sampled from an image as inputs. The pink curve (ROC of mse) is the same as the cyan curve (ROC of mse_patch).

In addition, to show the efficiency of the similarity measure based on significantly different patches, the results of patch-based MSE and SSIM are also provided. The MSE of sampled patches is identical to the MSE of the whole image owing to its consistent similarity statistics. The patch-based SSIM reflects only local patches with large structural differences and is very sensitive to subtle differences. However, due to this sensitivity to nonlinear transforms, the results of SSIM are worse than those of the proposed similarity measure.
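The dilution effect of averaging local similarities can be demonstrated with a toy example (the image sizes and edit magnitude below are arbitrary): a manipulation confined to one patch barely moves the global MSE, while the worst per-patch MSE exposes it clearly.

```python
import numpy as np

def patch_mses(a, b, patch=8):
    """MSE computed per non-overlapping patch."""
    h, w = a.shape
    return [np.mean((a[i:i+patch, j:j+patch] - b[i:i+patch, j:j+patch]) ** 2)
            for i in range(0, h, patch) for j in range(0, w, patch)]

a = np.zeros((32, 32))
b = a.copy()
b[:8, :8] = 100.0  # local manipulation confined to one corner patch

global_mse = np.mean((a - b) ** 2)    # diluted: the untouched area dominates
worst_patch = max(patch_mses(a, b))   # exposes the local difference
```

Here the tampered patch is 1/16 of the image, so the global score is 16 times smaller than the local one; this is why the patch-aggregated network weights the most dissimilar patches instead of averaging everything.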

Overall Framework Test for Photo Copyright Identification
To demonstrate the effectiveness of each module in the proposed framework, as shown in Table 6, Image RoI Detection and Image Verification are progressively applied to Image Hashing. There are three identification results: identified, unidentified, and misidentified; unidentified is decided only by Image Verification.

Table 6. Ablation tests of the photo copyright identification framework using HoG [11], BinaryGAN [10], and the proposed image hashing with a binary descriptor size of 512. The identification results are divided into identified, unidentified, and misidentified. Identified means correct identification, unidentified means the identification is deferred, and misidentified is an identification error (a false positive), which should be reduced.

(1) Image hashing only shows the identification results without Image RoI Detection. Due to randomness, most image collages containing a single image are correctly identified, whereas collages with two or more images are prone to misidentification. (2) Image RoI detection + Image hashing separates images well from image collages and achieves an identified rate of more than 80%, but a misidentified rate of at least 11% still remains. The last configuration, (3) Image RoI detection + Image hashing + Image verification, can distinguish unidentified from misidentified. By applying Image Verification, the proposed framework significantly reduces the number of misidentified collages, and consequently the false positive rate is reduced to 2.8%. Figure 12 shows the identification results of HoG, BinaryGAN, and the proposed framework. The number of results equals the number of images in a collage, meaning that the image detector separates images well from an image collage. The performance of the proposed framework is especially remarkable in the second results of Figure 12: while the other methods return incorrect images that resemble the originals in shape or color, the proposed framework correctly matches images by distinguishing image details. Lastly, only one image is determined to be misidentified, reducing false positives.

Figure 12. Identification results of HoG [11], BinaryGAN [10], and the proposed hashing method, respectively. Green, red, and blue boxes indicate identified, misidentified, and unidentified results, respectively.
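The three-stage decision logic evaluated in the ablation can be summarized in a short sketch. All callables (detect_rois, hash_lookup, verify) and the similarity threshold are hypothetical stand-ins for the framework's actual components, injected here as arguments.

```python
def identify_collage(collage, detect_rois, hash_lookup, verify, sim_th=0.5):
    """Pipeline sketch: detect RoIs in a collage, look up each RoI's binary
    descriptor in the database, then verify the candidate match with the
    similarity network to suppress false positives."""
    results = []
    for roi in detect_rois(collage):
        candidate = hash_lookup(roi)  # nearest-descriptor candidate, or None
        if candidate is None:
            results.append(("unidentified", None))
        elif verify(roi, candidate) >= sim_th:
            results.append(("identified", candidate))
        else:
            # Defer instead of reporting a wrong match: this is the step
            # that converts would-be misidentified results to unidentified.
            results.append(("unidentified", candidate))
    return results
```

The key design point is the final branch: a low verification score downgrades a hash match to unidentified rather than letting it through as a false positive.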

Conclusions
We proposed a photo identification framework to prevent copyright infringement. To handle manipulations by pirates, we developed preprocessing steps to identify all copyrighted photos in images. First, we detected image RoIs from an image collage to distinguish multiple copyrighted photos. To train the image RoI detector, we collected an image collage dataset and proposed a recursive partitioning method to build image collage frames. Subsequently, we identified each RoI using the proposed image hashing method, which handles image manipulations; by augmenting images, we generated similar binary descriptors from an original image and its manipulated version. Finally, we verified the identification results by measuring image similarities, which left ambiguous results as unidentified and thereby reduced false positives. We experimentally showed the effectiveness of each step of the proposed photo copyright identification framework in comparison with other methods.
The identification framework is expected to automatically monitor the copyrights of infringed photos on websites, vitalize photo copyright search services, and protect copyright holders from copyright infringement. Beyond identification, the image descriptor can be developed for image retrieval and utilized in various web image search services. Users will also be able to use legally copyrighted photos by paying royalties and gain access to many available copyrighted photos. This virtuous cycle (creation of copyrighted works → use of creations and payment of royalties → re-creation of copyrighted works) will make it possible to build an ecosystem for copyrights.