Video Searching and Fingerprint Detection by Using the Image Query and PlaceNet-Based Shot Boundary Detection Method

This work presents a novel shot boundary detection (SBD) method based on the Place-centric deep network (PlaceNet), with the aim of using video shots and image queries for video searching (VS) and fingerprint detection. The SBD method has three stages. In the first stage, we employ Local Binary Pattern-Singular Value Decomposition (LBP-SVD) features to select candidate shot boundaries. In the second stage, we use PlaceNet to select shot boundaries by their semantic labels. In the third stage, we use the Scale-Invariant Feature Transform (SIFT) descriptor to eliminate falsely detected boundaries. The experimental results show that our SBD method is effective on a series of SBD datasets. In addition, video searching experiments are conducted by using a single query image instead of a video sequence. The results obtained with shot fingerprints under several image transformations show good precision.


Introduction
With videos becoming more popular, important, and pervasive, video tasks such as searching, retrieval, tracking, summarization, object detection, and copy detection are becoming more challenging. Video searching (VS) has been a challenging research topic since the mid-1990s, and research on video copy detection (VCD) also started at that time [1]. Video fingerprinting is widely employed in VS and VCD, and VCD research has tended to focus on the extraction of robust fingerprints. The state-of-the-art VCD methods are mostly based on video sequences, and image-query-based VS/VCD technology is still immature. Therefore, developing robust fingerprints for VS/VCD using image queries is of great importance.
Video content analysis is the most basic process for different video applications. Table 1 lists some common video features, describes their characteristics, and lists their robustness and frailty for video fingerprint detection.
Most of the features are visual content features. Color-based, gradient-based, and transform-coefficients-based features are global features; they share a low extraction complexity and a weakness under local operations. Local features are local descriptors, such as SIFT (Scale-Invariant Feature Transform) [2], SURF (Speeded-Up Robust Features) [3], and ORB (Oriented FAST and Rotated BRIEF) [4]. Local features capture abrupt changes in intensity values and their relationships with neighboring pixels. Motion-based features represent the temporal relations in video sequences, but in videos that have few camera activity changes, they cannot represent high-level semantic information better than other features.
Type | Feature | Description | Robustness | Frailty
 | | A set of spectral dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the spatial structure of a scene | Scaling, compression | Post-production or editing, cropping
Transform-coefficients-based | DWT [10] | Discrete Wavelet Transform (DWT) coefficients by using a mean value, an STD (standard deviation) value, and SVD (Singular Value Decomposition) | Compression, flip | Post-production or editing, blur
Transform-coefficients-based | DCT [11] | Discrete Cosine Transform (DCT) coefficients | Scaling | Post-production or editing
Motion-based | [12] | Object motion and camera motion obtained by a block-matching algorithm over consecutive frame blocks | |

The state-of-the-art VS and VCD methods are mainly based on shot boundary detection (SBD) approaches because SBD allows videos to be processed more efficiently. However, image-query-based VCD and deep-learning-based SBD technologies still have room for improvement. Therefore, in this paper, we aim to develop a video searching/fingerprint detection system based on deep-learning networks and image queries. Figure 1 shows the overview of the image-query-based video searching and copy detection method. The first step is to build a new places-centric dataset and train a pre-trained PlaceNet for image classification. The places dataset should include both natural and computer-generated places-centric images because the places shown in many science fiction films are uncommon. The second step is to segment the videos into shots. The SBD method combines global, local, and deep-learning-based features. Then, shot-based fingerprints are built instead of using all frames or keyframes. Finally, several signal processing and geometric transformations are applied to build image queries for video searching and copy detection. Through these steps, the system determines whether the query image appears in the video dataset and, if it does, where in the video it is located.
Our contributions include:
1. Introducing a new deep-learning-based SBD method. The method has three stages: candidate segment selection, place-semantics-based segmentation, and segment verification. The network is places-centric instead of object-centric.
2. Developing a novel image-query-based video searching/fingerprint detection system.
The paper is organized as follows. Section 2 overviews the previous VCD and SBD methods. Section 3 introduces our SBD method and image-query-based video fingerprint method. Section 4 introduces the evaluation method and presents the experimental results on several well-known datasets. Section 5 offers our conclusions.

VCD Approaches
A well-known related effort is the TRECVID [13] VCD track. TRECVID is an international conference on benchmarking technology for content-based video indexing and retrieval. The VCD task ran for four years, from 2008 to 2011. At that time, deep learning networks had not yet been widely adopted, so the methods are all based on low-level content features. As shown in Table 2, the methods can be divided into keyframe-based and shot-based methods. Keyframe-based methods are the most common, but the use of shot-based methods is increasing. Furthermore, the fusion of fingerprints continues to show an upward tendency. Some methods fuse global content, local content, and deep-learning-based features, and some fuse video and audio features.
Table 2. Description of the video copy detection (VCD) approaches.

Presenters/Year | Methods | Characteristics | Based Types
INRIA-LEAR [14]/08 | The method uses the SIFT features representing the uniform sampled query frames, uses K-means to … | Using keyframe classifiers for video processing to … | Keyframes

SBD Approaches
A video contains a great amount of information at different levels in terms of scenes, shots, and frames. SBD is the first process to simplify applications, such as video indexing, retrieval, and fingerprinting. A shot is a set of continuous frames recorded by a single camera. SBD approaches can be divided into two categories: the compressed domain and the uncompressed domain. Some SBD algorithms are in the compressed domain and they are faster than those methods in the uncompressed domain. However, the uncompressed domain presents more challenges because of the vast amount of visual information in the video frames. Consequently, research on SBD is focused on the uncompressed domain rather than the compressed domain.
In recent years, the SBD approach has made rapid progress. Table 3 lists some representative types of SBD approaches.
The approaches mainly fall into the following types: pixel-based [30-32], histogram-based [33-35], edge-based [36,37], transform-based [38-40], motion-based [41], and statistical-based [42,43] approaches. Compared with the pixel-based methods, the histogram-based methods can be regarded as invariant to local motion or small global motion. The edge-based methods are simple to conduct. Transform-based approaches usually transform a signal (frame) from the time (spatial) domain into the transform domain. For motion-based approaches, the motion vectors are computed by block matching consecutive frame blocks, so that transitions and camera operations, such as zoom or pan, can be differentiated. For statistical-based approaches, properties such as the mean, median, and standard deviation are often used. There are some other SBD approaches, such as the temporal slice coherency [44], fuzzy rules [45], and two-phased [46] approaches. As deep learning has become a hot research topic, deep-learning-based approaches have been increasingly applied to SBD [47-50]. Those methods have shown better performance than conventional methods.
Table 3. Description of the SBD approaches.

Type | Presenters | Methods | Characteristics
Pixel-based | Kikukawa [30] | The method uses the sum of the absolute differences of the total pixel with a threshold to locate the cut shot. | Easy and fast, but cannot give a satisfactory result.
Pixel-based | Zhang [31] | The method uses the preprocessing method of average filtering before detecting shots. | False detection is the main problem.
Pixel-based | Shahraray [32] | The method divides the frame into regions, matches the regions between the current frame and its next frame, then chooses the best matches. | Real-time processing.
Histogram-based | Küçüktunç [33] | The method uses fuzzy logic to generate a color histogram for SBD in the L*ab color space. | Robust to illumination changes and quantization errors, so it performs better than conventional color histogram methods.
Histogram-based | Janwe [34] | The method uses the just-noticeable difference (JND) to map the RGB color space into three orthogonal axes JR, JG, and JB, and the sliding window-based adaptive threshold is used. | Highly depends on the size of the sliding window and parameter values.
Histogram-based | Li [35] | The method uses a three-stage approach: the candidate shots are detected by two thresholds based on the sliding window at first; then, the local maximum difference of the color histogram is used to eliminate disturbances; finally, the HSV color space and Euclidean distance are employed. | Can deal with a gradual change and a cut change in the same way.
Edge-based | Zheng [36] | The method uses the Robert edge detector for gradual shot detecting; the fixed threshold is to determine the total number of edges that appear. | Fast but the performance for gradual transition detection is not good.
Edge-based | Adjeroh [37] | The method uses locally adaptive edge maps for feature extraction and uses three-level adaptive thresholds for video sequencing and shot detection. | Fast, uses an adaptive threshold, and is slightly superior to the color-based and histogram-based methods.
Transform-based | Cooper [38] | The method computes the self-similarity between the features of each frame and uses the DCT to generate low-order coefficients of each frame color channel for a similarity matrix. | Competitive with seemingly simpler approaches, such as histogram differences.
Transform-based | Priya [39] | The method uses the Walsh-Hadamard operation to extract the edge strength frames. | Simple but the performance should be improved.
Motion-based | Porter [40] | The method uses camera and object motion to detect transitions and uses the average inter-frame correlation coefficient and block-based motion estimation to track image blocks. | High computational cost and has dissolve detection as a weakness.
Motion-based | Bouthemy [41] | The method estimates the dominant motion in an image represented by a two-dimensional (2D) affine model. | Applicable to MPEG videos.
Statistical-based | Ribnick [42] | The method uses the mean, standard deviation, and skew of the color moments. | Simple but has dissolve detection as a weakness.
Statistical-based | Bendraou [43] | The method uses SVD updating and pattern matching for gradual transitions. | Reduces the process and has good performance in both cut and gradual transition detection.
Temporal Slice Coherency | Ngo [44] | The method constructs a spatial-temporal slice of the video and analyzes its temporal coherency. Slice coherency is defined as the common rhythm shared by all frames within a shot. | Capable of detecting various wipe patterns.
Fuzzy-rule-based | Dadashi [45] | The method calculates a localized fuzzy color histogram for each frame and constructs a feature vector using fuzzy color histogram distances in a temporal window. Finally, it uses a fuzzy inference system to classify the transitions. | Threshold-independent and has a weakness against gradual transition types.
Two-phased | Bhaumik [46] | The first phase detects candidate dissolves by identifying parabolic patterns in the mean fuzzy entropy of the frames. The second phase uses a filter to eliminate candidates based on thresholds set for each of the four stages of filtration. | Has good performance for detecting dissolve transitions, uses lots of sub-stages, and is threshold-dependent.
Deep learning based | Xu [47] | The method implements three steps and uses CNN to extract the features of frames. The final decision is based on the cosine distance. | Suitable for the detection of both cut and gradual transitions boundaries; is threshold-dependent.
Deep learning based | Baraldi [48] | The method uses Siamese networks to exploit the visual and textual features from the transcript at first and then uses the clustering algorithm to segment the video. | Uses CNN-based features and can achieve good results.
Deep learning based | Hassanien [49] | The method uses a support vector machine (SVM) to merge the outputs of a three-dimensional (3D) CNN with the same labeling and uses a histogram-driven temporal differential measurement to reduce false alarms of gradual transitions. | Exploits big data to achieve high detection performance.
Deep learning based | Liang [50] | The method extracts the features using the AlexNet and ResNet-152 model. The method uses local frame similarity and dual-threshold sliding window similarity. | Threshold-dependent.

Candidate Segment Selection
Generally, a video sequence has many non-boundary frames and only a few boundary frames. To reduce the computational complexity, selecting the candidate segments is the first step in our scheme. Here, we compare nine different features for candidate segment selection. We use a short video, "Leon", as a test to find out the differences between them. "Leon" has 3920 frames with a frame rate of 24 fps. In the experiment, we took the first frame of each interval of a quarter of the frame rate (every 6 frames at 24 fps) as a keyframe, and then used each feature to compute the difference between consecutive keyframes. The feature values of each keyframe are normalized to sum to 1. The dimensionality of each feature and the total computing time for "Leon" are listed in Table 4. The keyframe differences of each feature are represented by stem plots, in which the red star marks label the reference shot locations. Comparing the computing complexity of these methods, the Red, Green, Blue (RGB) histogram method is the fastest and the GIST method is the slowest. The Singular Value Decomposition (SVD)-based HOG and LBP methods are faster than the Hue, Saturation, Value (HSV) histogram method. The Dual-Tree Complex Wavelet Transform (DTCWT) [51] is slower than DWT. With the edge method, some of the larger differences are not marked, so a lower threshold is needed. HOG-SVD and GIST show better discriminability than the transform-coefficients-based methods. Among all of them, C-RGB and Local Binary Pattern (LBP)-SVD are superior to the others not only in computing time but also in discriminating the differences. However, the C-RGB method has its discriminatory power reduced for black and white documentary programs. Therefore, LBP-SVD has been chosen for candidate segment selection.
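The keyframe-difference computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not specify how the LBP-SVD feature is built, so the sketch assumes a basic 8-neighbour LBP code map whose leading singular values (a hypothetical `k = 16`) are normalized to sum to 1, and it assumes the L1 distance as the difference between consecutive keyframes. The function names are illustrative.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code for every interior pixel of a 2-D array."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= center).astype(np.int64) << bit
    return codes

def lbp_svd_feature(gray, k=16):
    """Assumed LBP-SVD feature: top-k singular values of the LBP map, sum-normalized."""
    s = np.linalg.svd(lbp_image(gray).astype(np.float64), compute_uv=False)[:k]
    total = s.sum()
    return s / total if total > 0 else s

def keyframe_differences(frames, fps=24):
    """Take one keyframe per quarter of the frame rate and compare neighbours."""
    step = max(1, fps // 4)          # 6 frames at 24 fps
    feats = [lbp_svd_feature(f) for f in frames[::step]]
    # L1 distance between consecutive keyframe features (an assumption).
    return np.array([np.abs(a - b).sum() for a, b in zip(feats, feats[1:])])
```

With a list of grayscale frames, `keyframe_differences(frames, fps=24)` returns one difference value per pair of consecutive keyframes; large values indicate candidate boundaries.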
The threshold for candidate segment selection should be adaptive because suitable threshold levels differ between videos. A lower threshold leads to a higher recall and a lower precision. Because falsely selected candidates can still be rejected in the later stages, the threshold should be low enough to recall all shots. Here, we use several types of videos with different ratios of cuts and gradual transitions for training to identify a suitable threshold. The training videos' information is listed in Table 5 and sample video frames are shown in Figure 2.
"Scent of a Woman" is a clip of a dance scene and "Men's Basketball" is a basketball broadcast that contains a game clip, so they have many gradual transitions. "LANEIGE" and "MISSHA" are advertisement videos. "Edward Snowden" includes an interview clip that has many camera switches between the interviewer and interviewee.
Here, we take three videos as examples and set three different thresholds.
Let M_K be the mean of all keyframe differences and S_K be their standard deviation. The thresholds are set as: T1 = M_K, marked by the red line; T2 = M_K + 0.5 × S_K, marked by the green line; and T3 = M_K + S_K, marked by the blue line.
As seen in Figure 3, a video that has a higher cut shot rate shows better discrimination. As shown in Figure 3c, the annotated shot values are higher than all three thresholds and much higher than those of the other keyframes. The video "LANEIGE" has a lower cut shot rate than the video "The Sound of Music", so, as shown in Figure 3a, most of the annotated shot values are lower than T2, although all of them are higher than T1. Figure 3b shows that some shot values are lower than T3 and most shot values are higher than T2. Therefore, the threshold should be much lower than T2 to ensure a high recall rate. In this paper, to reduce the miss rate, we set the threshold lower than T1 to select the candidate shot boundaries. After the candidate selection, the total computing time is greatly reduced.
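The three thresholds and the candidate selection can be illustrated with a short sketch. The text only states that the working threshold is set lower than T1; the factor `scale = 0.8` below is a hypothetical value chosen for illustration, and the function names are not from the paper.

```python
import numpy as np

def candidate_thresholds(diffs):
    """Return T1, T2, T3 computed from the keyframe differences."""
    d = np.asarray(diffs, dtype=np.float64)
    m_k = d.mean()   # M_K: mean of all keyframe differences
    s_k = d.std()    # S_K: standard deviation of all keyframe differences
    return {"T1": m_k, "T2": m_k + 0.5 * s_k, "T3": m_k + s_k}

def select_candidates(diffs, scale=0.8):
    """Indices whose difference exceeds a threshold set below T1.

    `scale` is an assumed factor; the paper only says "lower than T1".
    """
    threshold = scale * candidate_thresholds(diffs)["T1"]
    return np.flatnonzero(np.asarray(diffs, dtype=np.float64) > threshold)
```

Because the threshold adapts to each video's own mean difference, the same code can be applied to videos with very different cut rates; lowering `scale` trades precision for recall.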

Shot Detection Network
Most of the existing deep-learning-based SBD methods use deep neural networks to extract the features of each frame and then measure the similarity between two contiguous frames. Some researchers use video segments to train shot transition categories. However, these networks are trained to recognize objects, and the object categories used for training generally cannot represent most of the objects that appear in real life and in videos. Therefore, place categories are used for training instead of object categories.

Dataset Construction
The datasets are mainly based on the Places database [52]. The Places website [53] provides a demo for testing the Places365 dataset. The training network is based on the PyTorch model. Here, we choose two categories, "forest" and "fields", for the test. The images are downloaded from the website by using query words. The query words of the images, from left to right and from top to bottom, are "forest", "forest cartoon", "forest 3D model", "fields", "fields cartoon", and "fields 3D model", respectively. Their classified top-1 categories are shown in Figure 4.
Figure 4 shows that places classification with the pre-trained Places365 demo still has some errors, especially for similar categories. Moreover, the precision for computer-generated images, such as cartoons and three-dimensional (3D) models, is not as good as that for natural images. Because videos vary massively, the training dataset should include as many types of images as possible in order to obtain good places classification results. Therefore, we make the following changes to the Places365 dataset.
1. We add computer-generated images, such as cartoons and 3D model images, to each category.
2. We merge the categories that have common features into one category. For example, the categories "forest_broadleaf", "forest_needleleaf", "forest_path", and "forest_road" can be merged into the category "forest".
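The category merging in the second change can be pictured as a simple label mapping. The sketch below is only an illustration: the map contains just the four "forest" categories named in the text, and the names `MERGE_MAP` and `merge_label` are hypothetical; the full list of merged categories is not given here.

```python
from collections import Counter

# Only the example from the text; any further entries would be assumptions.
MERGE_MAP = {
    "forest_broadleaf": "forest",
    "forest_needleleaf": "forest",
    "forest_path": "forest",
    "forest_road": "forest",
}

def merge_label(label):
    """Map a fine-grained place label to its merged category (identity if unmapped)."""
    return MERGE_MAP.get(label, label)

def merged_counts(labels):
    """Count images per merged category."""
    return Counter(merge_label(label) for label in labels)
```

Applying `merge_label` to every annotation before training reduces confusion between near-identical categories, at the cost of coarser labels.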

Network Architecture
Popular CNN architectures proposed in recent years, such as AlexNet [27], ZFNet [54], GoogLeNet [28], VGG [29], ResNet [55], and InceptionResNet-v2 [56], have been widely used in image and video applications. Among them, InceptionResNet-v2 achieves a top-1 error of 19.6% and a top-5 error of 4.7% on the ILSVRC 2012 image classification challenge dataset [57]. The ILSVRC dataset contains nearly all object classes, including rare ones, and is uniquely linked to all concrete nouns in WordNet. In Table 6, some deep neural networks are briefly described. Considering the performance and the computing complexity, the GoogLeNet model and the ResNet-50 model are more efficient. Therefore, the pre-trained GoogLeNet and ResNet-50 support packages from the MATLAB Neural Network Toolbox were selected for training. In the experiments, we took about 540 images (500 natural images and 40 computer-generated images) from each category for the training set, used 60 images (50 natural images and 10 computer-generated images) for the validation set, and used 100 images (80 natural images and 20 computer-generated images) for the test set. For some categories, the number of available images is less than the preset numbers for training, validation, and testing; the images of these categories are distributed into the three parts by percentage. The classification accuracy on our Places dataset is listed in Table 7.
Table 7 shows that the accuracies are lower than those on the Places365 dataset. This is because the training dataset has fewer images than Places365 and the number of training iterations is also not large enough. However, the precision of places classification in the videos is acceptable.
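The per-category split can be sketched as below. The preset sizes (540/60/100) come from the text; the percentages used for small categories are not stated, so this sketch assumes they are proportional to the preset sizes. The function name `split_category` is illustrative, and the separate handling of natural and computer-generated images is omitted for brevity.

```python
def split_category(images, preset=(540, 60, 100)):
    """Split one category's image list into train/validation/test parts.

    If the category has fewer images than the preset total, the images are
    distributed in the same proportions (an assumption; the paper only says
    "in percentages").
    """
    n_train, n_val, n_test = preset
    total = sum(preset)
    if len(images) < total:
        n_train = len(images) * preset[0] // total
        n_val = len(images) * preset[1] // total
        n_test = len(images) - n_train - n_val
    train = images[:n_train]
    val = images[n_train:n_train + n_val]
    test = images[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

In practice the image list would be shuffled before splitting so that the three parts are drawn from the same distribution.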

Shot Boundary Verification
Using only the pre-trained PlaceNet for SBD is insufficient under common conditions. The classified places inevitably contain some errors compared to the ground-truth data. Figure 5 lists the categories classified for several frames that actually belong to a single shot. In terms of consistency, both Places365 and the object-centric GoogLeNet perform better than PlaceNet here, even though they assign the wrong category. PlaceNet assigns the right category to frame 2071, frame 2176, and frame 2206; however, the intermediate frame 2191 is assigned a different category. In this situation, the shot boundary should be verified.
At this stage, we use SIFT matching for verification to reduce false detections of shot boundaries. This process is carried out after the pre-trained PlaceNet has assigned a place category to each candidate segment boundary. Here, the threshold of the shot boundary decision is converted into a threshold on the number of matches. If the number of matches is less than the threshold, the two adjacent candidate boundaries, which are assigned to different place categories, are truly different. If the number of matches is larger than the threshold, PlaceNet has wrongly classified the places, so the current candidate is rejected and the next candidate boundary is examined.
Usually, matches between image pairs whose contents have little relation to each other are rare, no matter how high the match threshold is. As shown in Figure 6, the number of matches between the two frames is greater than the threshold value. Thus, PlaceNet made a false detection at the second stage, and SIFT matching eliminates this false detection through feature matching.

Image-Query-Based Video Searching
The related approaches have been described in Section 2.1. In this section, we study the method for video searching and fingerprint detection. Generally, a fingerprint should have the following main properties:
1. Robustness. It should be invariant to common video distortions.
2. Discriminability. The features of different video contents should be distinctively different.
3. Compactness. The feature size should be small, yet large enough to retain the robustness.
4. Complexity. The computing complexity should be low enough.
Local features are the first choice when generating video fingerprints since they can be used directly and can also be quantized by applying a quantization method. Here, we employ 12 different local features for image matching. Some of them are available in VLFeat [58] and OpenCV. Each image pair comprises an original image and its corresponding distorted image. The local descriptors are listed in Table 8 [59][60][61][62][63][64][65][66][67]. The image matching results are shown in Figure 7, where the numbers marked in the images are the IDs of the descriptors. In the experiments, the image pairs are resized to 256 × 256 pixels and the match threshold is set to 2.0. Figure 7 shows that noise, contrast changes, cropping, rotation, and brightness changes have less of an effect than the horizontal flip transformation. The BRISK, FREAK, MSER, and LATCH features perform poorly under the flip and projection transformations. Overall, SIFT, SURF, DAISY, and KAZE are invariant to those transformations. Considering computing time and storage, many researchers choose SURF as their first choice and some choose DAISY or KAZE, but most have chosen SIFT due to its invariance properties. Therefore, in this paper, we use SIFT features to represent the fingerprints of the video shots and query images.
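To make the role of the match threshold concrete, the following sketch counts matches with a nearest-neighbor ratio test of the kind used by descriptor matchers such as VLFeat's `vl_ubcmatch`. It is an illustration under assumptions: the function names are hypothetical, the use of squared Euclidean distances and a strict inequality is assumed rather than taken from the paper, and the two-dimensional descriptors are toy values (real SIFT descriptors have 128 dimensions).

```python
from typing import Sequence

Descriptor = Sequence[float]


def squared_distance(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def count_ratio_matches(
    query: Sequence[Descriptor],
    reference: Sequence[Descriptor],
    threshold: float = 2.0,
) -> int:
    """Count query descriptors whose nearest reference descriptor is
    unambiguous: its distance times `threshold` must stay below the
    distance to the second-nearest reference descriptor."""
    if len(reference) < 2:
        return 0  # the ratio test needs a second-nearest neighbor
    matches = 0
    for q in query:
        distances = sorted(squared_distance(q, r) for r in reference)
        best, second = distances[0], distances[1]
        if best * threshold < second:
            matches += 1
    return matches


reference = [[0.0, 1.0], [5.0, 5.0], [10.0, 10.0]]
print(count_ratio_matches([[0.0, 0.0], [10.0, 10.0]], reference))  # 2
print(count_ratio_matches([[2.5, 2.5]], [[0.0, 0.0], [5.0, 5.0]]))  # 0
```

A larger threshold demands a clearer gap between the best and second-best candidates, so it yields fewer but more reliable matches.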


Evaluation Methods
We use precision, recall, and the Fβ score to evaluate our method's performance for shot boundary detection. Recall is the ratio of correctly identified shot boundaries to the number of ground-truth shot boundaries. Precision is the ratio of correctly identified shot boundaries to the total number of detected shot boundaries. The Fβ score balances precision and recall; with the definition below, a higher β value gives more weight to recall. In this paper, the F1-score (β = 1) is used. The Fβ score is defined as follows:
$$F_\beta = \frac{(\beta^2 + 1)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}$$
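As a small numerical check of these definitions, the sketch below computes precision, recall, and the Fβ score from raw counts. The function names and the example counts are illustrative assumptions, not values from the experiments.

```python
from typing import Tuple


def precision_recall(correct: int, detected: int, ground_truth: int) -> Tuple[float, float]:
    """Precision = correct / detected; recall = correct / ground truth."""
    precision = correct / detected if detected else 0.0
    recall = correct / ground_truth if ground_truth else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# Hypothetical counts: 8 correct out of 10 detected, 16 true boundaries.
p, r = precision_recall(correct=8, detected=10, ground_truth=16)
print(p, r)                       # 0.8 0.5
print(round(f_beta(p, r), 4))     # 0.6154
print(round(f_beta(p, r, 2), 4))  # 0.5405
```

For this example (precision 0.8, recall 0.5), raising β from 1 to 2 moves the score from about 0.615 toward the recall value of 0.5.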


Open Video Scene Detection (OVSD) Dataset
Here, we use the videos in the OVSD dataset [49] to compare our proposed method with the Filmora software [68]. The OVSD dataset is presented for the evaluation of scene detection algorithms and its shot boundary annotations are also given. The ground-truth scene annotations are provided by using a movie script. It consists of five short videos and a full-length film. Recently, a dataset extension, including 15 new full-length videos, has also been uploaded, but it only provides the scene annotations. Information on the OVSD dataset is listed in Table 9. Among the data, "Bunny" has a more vivid color than the other animated movies.
Sample frames from the videos "Big Buck Bunny", "Cosmos Laundromat", and "Sintel" are shown in Figure 8.
Table 10 lists the test video names, the numbers of manually annotated shots, the numbers of shots detected by the proposed method, the numbers of shots detected by the Filmora software, and the total frame numbers. A comparison of the results for the test videos is also displayed in Table 10.
The video editor Filmora offers an advanced feature that can automatically split a film into its basic temporal segments by detecting the transitions between shots in a video. Table 10 shows that our proposed SBD method performs similarly to Filmora and has a slightly lower miss rate. The distance between Filmora-detected boundaries is not less than the frame rate of the video, whereas our proposed method uses a step that is a quarter of the frame rate. Since the shot annotations of the videos "Big Buck Bunny" and "Cosmos Laundromat" are created from a script, the annotated shot locations differ somewhat from the real frames. Therefore, we use a manual annotation of "Big Buck Bunny" (marked as 91*) instead of the annotation in the OVSD dataset. The video "Leon" has many more cut shots than gradual shots, and the distance between them is larger than the frame rate, so the shot boundaries are clear.
The video "Gangnam Style" is a music video with many cut shots whose distance is less than half the frame rate, so neither our method nor Filmora can detect those shot changes.
Next, we use the SBD results on those videos to evaluate the performance of the proposed method. Because the video "Valkaama" is too long, we only chose its second 10 min for the test. A boundary detected by the algorithm is counted as correct if it lies within a quarter of the frame rate (in frames) of a boundary listed in the baseline. Strictly speaking, this acceptance deviation is somewhat generous for cut shots, which happen between two neighboring frames. However, considering the longer duration of gradual shots, such as dissolves, fade-ins, fade-outs, and wipes, and given that the smallest step in the proposed method is not less than a quarter of the frame rate, the deviation for acceptance should be at least that value. The results of SBD for the OVSD dataset are listed in Table 11. Table 11 shows that the shot boundaries of the videos "Bunny", "Cosmos", and "Valkaama" can be extracted because they have many cut shots. The video "Elephants Dream" is a 3D computer-generated imagery (CGI) animated science fiction video, so most of its scenes are hard to match with a venue from a natural environment. This shows a weakness of our place network. The video "Sintel" is also a computer-animated film, but it has lots of action. Therefore, the long gradual shots caused by the abundance of activity can make the shot boundaries hard to detect.
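The acceptance rule described above (a detection counts as correct when it falls within a quarter of the frame rate of a baseline boundary) can be sketched as follows. The function name, the one-to-one matching of detections to baseline boundaries, and the example frame numbers are assumptions made for illustration; the text does not say how multiple detections near one boundary are handled.

```python
from typing import List, Sequence, Tuple


def evaluate_boundaries(
    detected: Sequence[int],
    ground_truth: Sequence[int],
    fps: float,
    tolerance_fraction: float = 0.25,
) -> Tuple[int, float, float]:
    """Return (correct, precision, recall) with a tolerance of
    `tolerance_fraction * fps` frames around each baseline boundary."""
    tolerance = fps * tolerance_fraction
    unmatched: List[int] = sorted(ground_truth)
    correct = 0
    for d in sorted(detected):
        # Nearest still-unmatched baseline boundary within the tolerance.
        in_range = [g for g in unmatched if abs(g - d) <= tolerance]
        if in_range:
            nearest = min(in_range, key=lambda g: abs(g - d))
            unmatched.remove(nearest)  # each baseline boundary is used once
            correct += 1
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return correct, precision, recall


# Hypothetical 24 fps video: the tolerance is 6 frames.
correct, precision, recall = evaluate_boundaries(
    detected=[100, 205, 400], ground_truth=[103, 200, 300, 410], fps=24
)
print(correct, round(precision, 3), recall)  # 2 0.667 0.5
```

In the example, the detection at frame 400 is 10 frames away from the nearest baseline boundary and is therefore counted as a false detection.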

BBC Planet Earth Dataset
The dataset for the BBC's educational TV series Planet Earth [48] has 11 videos. Sample images are shown in Figure 9. All of the videos are approximately 50 min in duration. Their information is listed in Table 12. Here, we select the first 10 min of each for the experiments. The results are also listed in Table 12.

TRECVID 2001 Dataset
The TRECVID 2001 Dataset [70] is mostly used for shot boundary detection. The reference data of the transitions are assigned to four different categories: cut, dissolve, fade-in/out, and other. In the test dataset, their percentages are 65%, 30.7%, 1.7%, and 2.6%, respectively [71]. Here, we take some videos from the dataset for comparison experiments. Information on the videos and the SBD results of the proposed method are listed in Table 13. Samples of the video frames are shown in Figure 10.
Here, to demonstrate the accuracy of our scheme, we also conduct comparison experiments. The results of the comparison are listed in Tables 14 and 15. Table 14 shows that the proposed method has better performance than the correlation-based algorithm, the keener-correlation-based algorithm, and the edge-oriented-based algorithm. Table 15 shows that the proposed method is better than the methods that do not use deep-learning features. Additionally, our place-centric network-based SBD method has similar performance to the compared method that uses an object-centric network. For example, our method has a higher F1-score than the method that uses an object-centric network in cut shot detection in the anni005 video and in gradual shot detection in the anni009 video.

ReTRiEVED Dataset
The ReTRiEVED [76] Dataset was created to evaluate methods that require video quality assessment in transmissions. The ReTRiEVED dataset contains 176 test videos obtained from 8 source videos by applying the transmission parameters listed in Table 16. Samples of the source videos and test videos are shown in Figure 11. Here, we used this small dataset to assess the robustness of features we used to guard against possible video distortions.

Here, we compared the SIFT features against some related methods for video retrieval: CST-SURF [77], CC [78], and {Th; CC; ORB} [24]. Table 17 presents the average detection F1-scores. The experimental results for the methods CST-SURF, CC, and {Th; CC; ORB} are taken from [24]. The CST-SURF method uses the difference of the SURF key point numbers in stable frame pairs to generate normalized differences as features. The CC method uses color correlation in the divided non-overlapping blocks of each frame. In the {Th; CC; ORB} method, "Th" represents "Thumbnail", which is designed as a global feature to resist flipping transformations.
In the experiment, since the test videos are short, we used the combined features of the selected frames to represent the features of each video. The videos' keyframes are selected at a step of half of the frame rate and are downsized to 64 × 64 pixels. In the retrieval process, we used UBCMATCH to match the features of the tested videos and the source videos. The number of matches is regarded as the similarity value. Since the match numbers differ greatly under different thresholds, we use 12 threshold steps to find better conditions for SIFT. Figure 12 shows that the Positive Predicted Value (PPV) on ReTRiEVED using the SIFT descriptor is lower under the throughput transmission parameter than under the other transmission parameters.
The test videos are down-sampled to 10 fps and have 72,000 frames in total. To search for the image in the video, we first use the proposed SBD to segment the videos into shots, and then we use the feature matching method to match the query photo features against the shot feature dataset. The retrieved numbers of video frames for each query image are shown in Figure 14. The retrieval accuracy depends on the accuracy of the shot boundaries. To reduce the miss rate, the skip step should be set to a low value in the experiment. We use a sliding step of five frames; after three-stage shot boundary detection, we obtain a total of 2864 shots and then match the shot features against the query features. About 2-3 frames will be missed due to gradual transitions.
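The search step described above, in which the query image's features are matched against every shot's features and the number of matches serves as the similarity value, can be sketched as follows. All names are hypothetical, the `min_matches` rejection rule is an assumed parameter for illustration, and the toy match counter (a set intersection over integer "features") merely stands in for real descriptor matching.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


def search_shots(
    query_features: Sequence,
    shot_fingerprints: Dict[str, Sequence],
    count_matches: Callable[[Sequence, Sequence], int],
    min_matches: int,
) -> Tuple[Optional[str], int]:
    """Return the shot with the most matches to the query, or (None, score)
    when even the best shot has fewer than `min_matches` matches."""
    best_id: Optional[str] = None
    best_score = 0
    for shot_id, features in shot_fingerprints.items():
        score = count_matches(query_features, features)
        if score > best_score:  # on ties, the first shot seen is kept
            best_id, best_score = shot_id, score
    if best_score < min_matches:
        return None, best_score  # treat the query as not found
    return best_id, best_score


# Toy stand-in for descriptor matching: shared integer "features".
def toy_count(query: Sequence, shot: Sequence) -> int:
    return len(set(query) & set(shot))


fingerprints = {"shot_a": [1, 2, 3], "shot_b": [3, 4, 5, 6]}
print(search_shots([3, 4, 5], fingerprints, toy_count, min_matches=2))  # ('shot_b', 3)
print(search_shots([9], fingerprints, toy_count, min_matches=2))        # (None, 0)
```

Because each query is compared with shot-level fingerprints rather than with every frame, the quality of the shot boundaries directly affects which frames can be retrieved.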

Experiment on Our Video Searching and Fingerprint Detection Dataset
The source video dataset for video searching and fingerprint detection is composed of 110 videos, including 4 videos from the OVSD dataset; the other videos are mostly music clips. Information on the video data is shown in Figure 15.
Figure 15. Information on the size, frame rate, duration, and detected shots in our test dataset.
Video distortions can be produced by signal processing, geometric transformations, and desynchronization. To search for and detect the fingerprint of a video with high accuracy by using the features of a single image or shot sequence, the proposed method must withstand most of those distortions. Because our method is based on a query image, the common attacks on an image are considered. Attacks on audio and video desynchronization are not included. Common transformations are listed in Table 18. To test the performance of our image-to-video method, combinations of transformations are used. The processing orders and samples of the attacked images of each attack type are shown in Table 19, together with the processing parameters of each attack.
For the query images, we took 200 random frames from each video from the OVSD dataset. For the other videos, because their durations are less than 3 min, we take 20 random frames from each of them; in addition, we take 250 poster images from the website as false examples. So, there are a total of 3170 original query images. Then, we apply the transforms listed in Table 18 to make the source video search task difficult. Finally, 34,870 query images are generated. We used two thresholds in the feature matching and decision stage. The first is the UBCMATCH threshold; the other is the threshold on the number of matches for the fingerprint detection decision. The parameters in the experiment are listed in Table 20.
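The counts quoted above can be cross-checked with a few lines of arithmetic. The figures of 110 source videos and 4 OVSD videos come from the dataset description; the function name is hypothetical, and reading the final count as 11 versions of each original query (for example, the original plus transformed variants) is an inference from 34,870 / 3170 = 11 rather than something stated in the text.

```python
def original_query_count(
    total_videos: int,
    ovsd_videos: int,
    frames_per_ovsd_video: int,
    frames_per_other_video: int,
    poster_images: int,
) -> int:
    """Number of original query images before any transformation."""
    other_videos = total_videos - ovsd_videos
    return (
        ovsd_videos * frames_per_ovsd_video
        + other_videos * frames_per_other_video
        + poster_images
    )


originals = original_query_count(
    total_videos=110,
    ovsd_videos=4,
    frames_per_ovsd_video=200,
    frames_per_other_video=20,
    poster_images=250,
)
# 4 * 200 + 106 * 20 + 250 = 800 + 2120 + 250
versions_per_query, remainder = divmod(34870, originals)
print(originals, versions_per_query, remainder)  # 3170 11 0
```

The reported totals are therefore internally consistent: 3170 originals and an exact multiple of 11 for the generated query set.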

Results of Image-Query-Based Video Searching Experiments without Transformations
Here, the experiments are conducted under the parameters listed in Table 20. Image queries taken from the source videos without transformations are used for the experiments. The goal is to study the effects of those parameters on image-query-based video searching.
To simplify the comparison experiments, a control group that uses a 128-pixel image size, the step skip frames method, and the SIFT descriptor serves as the baseline against which the results for different factors are compared. The results of video searching and shot searching are shown in Figure 16. Shot searching must locate a shot within a particular video, so its accuracies are lower than those of video searching: compared to video searching, the shot searching accuracies decrease by at least 0.1. One reason is that a video may include many similar shots, especially in interview videos. When the UBCMATCH threshold is lower than 1.6, there are more false alarms; as a result, the searching results are much lower. When the UBCMATCH threshold is higher than 3, recall drops; therefore, the searching results show a declining tendency. The highest video searching accuracy is 0.94 and the highest shot searching accuracy is 0.82. However, under most conditions, the shot searching accuracies are less than 0.8.
The first comparison experiment was conducted by using a bigger image size: 256 × 256. A larger image conveys more content information; in theory, video searching with features extracted from larger images should give better results. The only difference from the control group is the image size. The results of video searching and shot searching are shown in Figure 17. Compared to Figure 16, the video searching accuracies and shot searching accuracies both increase, and the shot searching accuracies rise to a greater extent than the video searching accuracies. According to the results, using a larger image for feature extraction improves the performance by 0.01~0.03 on average compared to using a smaller image. However, the computing time with a larger image increases considerably, not only for feature extraction but also for video searching.

AT7 VT7
AT3 VT1→VT2→VT3
AT8 VT7→VT4→VT5
AT4 VT4→VT5
AT9 V10→VT8
AT5 VT1→VT4→VT5
AT10 VT9→V10

Results of Image-Query-Based Video Searching Experiments without Transformations
Here, the experiments are conducted under the parameters listed in Table 20, using image queries taken from the source videos without transformations. The goal is to study the effects of those parameters on image-query-based video searching. To simplify the comparison, a control group (a 128-pixel image size, the step skip frames method, and the SIFT descriptor) serves as the baseline against which each factor is varied. The results of video searching and shot searching for the control group are shown in Figure 16. Shot searching must locate the matching shot within a particular video, so its accuracies are lower than those of video searching, by at least 0.1. One reason is that a video may include many similar shots, especially in interview videos. When the UBCMATCH threshold is lower than 1.6, there are more false alarms and the searching accuracies are much lower. When the UBCMATCH threshold is higher than 3, recall drops and the accuracies show a declining trend. The highest video searching accuracy is 0.94 and the highest shot searching accuracy is 0.82. However, under most conditions, the shot searching accuracies are less than 0.8.
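The effect of the UBCMATCH threshold can be illustrated with a small ratio-test matcher. This is only a sketch, assuming that UBCMATCH refers to a ratio test of the kind applied by VLFeat's `vl_ubcmatch`: a query descriptor is matched to its nearest candidate only if the threshold times the nearest distance is still smaller than the second-nearest distance. The function name `ratio_test_matches`, the use of squared Euclidean distance, and the toy 2-D descriptors are assumptions for illustration; real SIFT descriptors have 128 dimensions.

```python
def ratio_test_matches(query_desc, shot_desc, threshold1=1.6):
    """Return (query_index, shot_index) pairs that pass the ratio test.

    Assumption: distances are squared Euclidean, and a match is kept only
    if threshold1 * d_best < d_second_best.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        best, second, best_j = float("inf"), float("inf"), -1
        for sj, s in enumerate(shot_desc):
            d = sum((a - b) ** 2 for a, b in zip(q, s))
            if d < best:
                # The old best becomes the new second-best.
                best, second, best_j = d, best, sj
            elif d < second:
                second = d
        if best_j >= 0 and threshold1 * best < second:
            matches.append((qi, best_j))
    return matches


# Toy data: the first query point has one clear nearest neighbour,
# the second has two equally close candidates and is rejected.
query = [[0, 0], [5, 5]]
shot = [[0, 1], [10, 10], [5, 6], [6, 5]]
matches = ratio_test_matches(query, shot, threshold1=1.6)
```

A lower threshold accepts ambiguous matches more easily, which is consistent with the additional false alarms reported below 1.6, while a higher threshold rejects more true matches and lowers recall.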
The first comparison experiment used a larger image size: 256 × 256. A larger image conveys more content information, so in theory the features extracted from it should give better video searching results. The only difference from the control group is the image size. The results of video searching and shot searching are shown in Figure 17. Compared to Figure 16, both the video searching and shot searching accuracies increase, and the shot searching accuracies rise more than the video searching accuracies. On average, using the larger image for feature extraction improves the accuracy by 0.01~0.03 compared to the smaller image. However, the computing time increases considerably, for both feature extraction and video searching.
The second comparison experiment used a different frame selection method to choose the frames in a shot for feature extraction. The only difference from the control group is the use of the three frames method instead of the step skip frames method. The results of video searching and shot searching are shown in Figure 18. Compared to Figure 16, three-frames-based feature selection for a shot performs much worse than step-skip-frames-based feature selection. The highest video searching accuracy of the three frames method is lower than 0.9, and its accuracy curves decline faster. The video searching accuracies decrease by at least 0.05, and the highest value drops by 0.07. The shot searching accuracies also go down; the highest is less than 0.8. Although the three frames method is simpler and has a shorter computation time, its results are not satisfactory, especially for video searching. Consequently, the step skip frames method is preferred over the three frames method in further experiments.
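To make the trade-off between the two frame selection methods concrete, the following sketch contrasts them on a single shot. The definitions used here are assumptions, not taken from this section: step skip frames is assumed to sample every `step`-th frame of the shot, and the three frames method is assumed to take the first, middle, and last frame. The function names and the `step` parameter are hypothetical.

```python
def step_skip_frames(start, end, step):
    """Assumed definition: every `step`-th frame index in [start, end]."""
    return list(range(start, end + 1, step))


def three_frames(start, end):
    """Assumed definition: first, middle, and last frame of the shot.

    Duplicates are removed so that very short shots are handled.
    """
    middle = (start + end) // 2
    return sorted({start, middle, end})


# A shot covering frames 100..160: the step skip method keeps seven
# frames, the three frames method keeps three.
dense = step_skip_frames(100, 160, 10)
sparse = three_frames(100, 160)
```

Under these assumptions, the step skip method yields more frames per shot and therefore more descriptors to store and match, which fits the reported pattern of higher accuracy at a higher computing cost.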
The third comparison experiment used a different feature descriptor. Since the SURF descriptor is faster and works well in many applications, it was tested in place of SIFT. The results of video searching and shot searching are shown in Figure 19. For clarity, the results under the lower UBCMATCH thresholds (1.2 and 1.4) are not shown. Compared to Figure 16, SIFT performs far better than SURF: both the video searching and shot searching accuracies of SURF drop by at least 10%. The highest video searching accuracy with SURF is less than 0.85 and the highest shot searching accuracy is less than 0.7. Consequently, although the SURF descriptor is faster, the SIFT descriptor should be adopted to achieve better performance.
In conclusion, considering both accuracy and computing time, the control group configuration is the best of the four for video searching and shot searching.
The two thresholds also have a large effect on the final performance. When threshold1 is lower, threshold2 should be correspondingly higher. As shown in Figures 16–19, the best thresholds for shot searching differ from the best thresholds for video searching, and they also depend on the feature descriptor and the video type. With the SIFT descriptor, a threshold1 value of 1.6 combined with a threshold2 value of 8 or 9 gives video searching and shot searching accuracies higher than under most other conditions.
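The interplay of the two thresholds can be sketched as a two-step decision. This sketch assumes that threshold1 is the UBCMATCH ratio threshold used to produce keypoint matches and that threshold2 is the minimum number of matched keypoints a shot needs before it is reported; the second interpretation is an assumption, and `search_video` and its input format are hypothetical.

```python
def search_video(match_counts, threshold2):
    """Pick the shot with the most keypoint matches, if it has enough.

    match_counts maps (video_id, shot_id) to the number of matches that
    survived the ratio test at a given threshold1. Returns the best
    (video_id, shot_id), or None when no shot reaches threshold2.
    """
    if not match_counts:
        return None
    best_key = max(match_counts, key=match_counts.get)
    if match_counts[best_key] >= threshold2:
        return best_key
    return None


# Hypothetical match counts for one query image.
counts = {("v1", 3): 12, ("v1", 7): 9, ("v2", 1): 4}
hit = search_video(counts, threshold2=8)
miss = search_video(counts, threshold2=13)
```

A strict threshold1 leaves fewer matches per shot, so threshold2 has to be lowered to avoid rejecting true hits, and a lenient threshold1 calls for a higher threshold2. In the toy data, two shots of the same video both collect many matches, which shows how a query can find the right video while the shot location remains ambiguous.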

Results of Image-Query-Based Video Searching Experiments with Transformations
From the above experiments, threshold1 values below 1.6 perform poorly, and when threshold1 is larger than 3.0, the plotted accuracies begin to decline. Under image distortion, higher thresholds may not return any matches, but the threshold range in the experiments should still be wide enough. Therefore, we vary threshold1 with a step size of 0.4 and enlarge the range of threshold2 to 2 through 16. During the shot feature generation stage, the frames are resized to 128 × 128 and the step skip frames method is used. Transformations such as flip, contrast change, noise addition, brightness change, cropping, rotation, and geometric projection are commonly used for video copy detection. In addition, picture in a picture, fusion of pictures, and pattern insertion are also used. To show the experimental results under transformations, we select five representative transformations: AT1, AT2, AT4, AT7, and AT9. AT1 applies flip. AT2 applies contrast change and noise addition. AT4 applies brightness change and cropping. AT7 applies geometric projection. AT9 applies picture in a picture and pattern insertion. Figure 20 shows the results of video searching by using one query image under these transformations.
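As a rough illustration of some of the simpler transformations named above, the sketch below applies flip, contrast change, brightness change, and cropping to a tiny grayscale image stored as nested lists. The parameter values, the function names, and the contrast formula (scaling around mid-gray 128) are assumptions; the settings actually used for AT1, AT2, and AT4 are not given in this section.

```python
def clamp(value):
    """Clip a pixel value to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def change_contrast(img, gain):
    """Scale pixel values around mid-gray (128); gain > 1 raises contrast."""
    return [[clamp(128 + gain * (p - 128)) for p in row] for row in img]


def change_brightness(img, offset):
    """Add a constant offset to every pixel."""
    return [[clamp(p + offset) for p in row] for row in img]


def crop(img, top, left, height, width):
    """Keep a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


image = [[0, 64, 128],
         [192, 255, 32]]
flipped = flip_horizontal(image)
brighter = change_brightness(image, 100)
```

Applying such functions to a query frame before feature extraction gives a simple way to check how many keypoint matches survive each kind of distortion.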

As seen in Figure 20, different kinds of attacks lead to different results. Compared to Figure 16a, the accuracies under the flip transform and the geometric projection transform drop severely: the results of AT1 and AT7 decrease by nearly 0.1 and 0.2, respectively. Noise addition, contrast change, picture in a picture, and pattern insertion do not alter the image seriously, as the video searching accuracies drop by no more than 0.06. In addition, the performance under the cropping and brightness change transformations decreases only slightly.
However, for query-image-based shot location detection, the accuracies all drop considerably. Without transformations, the shot searching accuracy can reach 0.82; under the transformations, the accuracies drop by at least 0.1. As shown in Figure 21, the shot searching results are much lower than the corresponding video searching results. Compared to Figure 16b, the results of AT1 and AT4 drop by 0.17, and AT7 decreases by 0.3. Even AT2 drops by 0.1. Although the shot searching results are lower, the video searching results under several transformations are acceptable for video searching and video copy detection.
As seen from Figures 20 and 21, threshold1 performs better at the values 1.6, 2.0, and 2.4. When threshold1 is 1.6, the better choices of threshold2 are 7, 8, and 9. When threshold1 is 2.0, the better choices of threshold2 are 4, 5, and 6. When threshold1 is 2.4, the better choices of threshold2 are 3 and 4.
However, for query-image-based shot location detection, the accuracies have all dropped very seriously. Without transformations, the shot searching accuracy can reach 0.82. Under the transformations, the accuracies have dropped by at least 0.1. As shown in Figure 21, the results of shot searching are much lower than corresponding results for video searching. Compared to Figure 16b, the results of AT1 and AT4 have dropped by 0.17, and AT7 has decreased by 0.3. Even AT2 has dropped by 0.1. Although the shot searching results are lower, the video searching results under several transformations can be accepted for video searching and video copy detection. As seen in Figure 20, the different kinds of attacks lead to different results. Compared to Figure  16a, the accuracies of the flip transform and the geometric projection transform have dropped severely. The results of AT1 and AT7 are decreased by nearly 0.1 and 0.2, respectively. The attacks noise addition, contrast change, picture in a picture, and pattern insertion have not changed the image seriously as the video searching accuracies have dropped by no more than 0.06. In addition, the performances of the cropping and bright change transformations have decreased slightly.

Conclusions
In this paper, we have proposed a new video searching and fingerprint detection method that uses an image query and a PlaceNet-based SBD method. We used a places-centric dataset for PlaceNet training and combined it with an object-centric network for shot boundary detection. We presented a three-stage SBD method, in which several visual content features are used for candidate segment selection and SIFT is used for shot boundary verification. For video searching and fingerprint detection, we tested several features and studied the thresholds. We evaluated our proposed method on several datasets, and the results showed the effectiveness of the SBD method. However, the required feature storage is larger than that of the other studied methods, and the computational complexity still needs to be reduced. In future research, we will evaluate our proposed method on larger datasets and reduce the processing complexity.
