Video Content Plagiarism Detection Using Region-Based Feature Learning
Abstract
1. Introduction
2. Related Work
3. The Proposed Video Feature Learning Method
3.1. Overview of the Proposed Framework
3.2. Region Feature Extraction
3.3. Linear Encoder Combined with Temporal Positional Encoding
3.4. Similarity Calculation and Loss Function Design
4. Experimental Results and Analysis
4.1. Experimental Environment and Dataset
4.2. Analysis of Training and Similarity Calculation
4.3. Comparative Experiment
4.4. Ablation Experiment
4.5. Video Copyright Detection Experiment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Tancik, M.; Mildenhall, B.; Ng, R. StegaStamp: Invisible hyperlinks in physical photographs. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2114–2123. [Google Scholar]
- Liu, Y.; Guo, M.; Zhang, J.; Zhu, Y.; Xie, X. A novel two-stage separable deep learning framework for practical blind watermarking. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1509–1517. [Google Scholar]
- Luo, X.; Zhan, R.; Chang, H.; Yang, F.; Milanfar, P. Distortion agnostic deep watermarking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13548–13557. [Google Scholar]
- Ahmadi, M.; Norouzi, A.; Karimi, N.; Samavi, S.; Emami, A. ReDMark: Framework for residual diffusion watermarking based on deep networks. Expert Syst. Appl. 2020, 146, 113157. [Google Scholar] [CrossRef]
- Gao, Y.; Kang, X.; Chen, Y. A robust video zero-watermarking based on deep convolutional neural network and self-organizing map in polar complex exponential transform domain. Multimed. Tools Appl. 2021, 80, 6019–6039. [Google Scholar] [CrossRef]
- Luo, X.; Li, Y.; Chang, H.; Liu, C.; Milanfar, P.; Yang, F. DVMark: A deep multiscale framework for video watermarking. IEEE Trans. Image Process. 2023, 34, 4371–4385. [Google Scholar] [CrossRef] [PubMed]
- Hasan, H.R.; Salah, K. Combating deepfake videos using blockchain and smart contracts. IEEE Access 2019, 7, 41596–41606. [Google Scholar] [CrossRef]
- Wang, L.; Bao, Y.; Li, H.; Fan, X.; Luo, Z. Compact CNN based video representation for efficient video copy detection. In Proceedings of the 2016 International Conference on Multimedia Modeling, Miami, FL, USA, 4–6 January 2016; pp. 576–587. [Google Scholar]
- Nie, X.; Jing, W.; Ma, L.Y.; Cui, C.; Yin, Y. Two-layer video fingerprinting strategy for near-duplicate video detection. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 555–560. [Google Scholar]
- Han, Z.; He, X.; Tang, M.; Lv, Y. Video similarity and alignment learning on partial video copy detection. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 4165–4173. [Google Scholar]
- Kordopatis-Zilos, G.; Papadopoulos, S.; Patras, I.; Kompatsiaris, I. Visil: Fine-grained spatio-temporal video similarity learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6351–6360. [Google Scholar]
- Jo, W.; Lim, G.; Hwang, Y.; Lee, G.; Kim, J.; Yun, J.; Jung, J.; Choi, Y. Simultaneous video retrieval and alignment. IEEE Access 2023, 11, 28466–28478. [Google Scholar] [CrossRef]
- He, X.; Pan, Y.; Tang, M.; Lv, Y. Self-supervised video retrieval transformer network. arXiv 2021, arXiv:2104.07993. [Google Scholar]
- Kordopatis-Zilos, G.; Papadopoulos, S.; Patras, I.; Kompatsiaris, Y. Near-duplicate video retrieval with deep metric learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 347–356. [Google Scholar]
- Tolias, G.; Sicre, R.; Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. arXiv 2016, arXiv:1511.05879. [Google Scholar]
- Kordopatis-Zilos, G.; Tzelepis, C.; Papadopoulos, S.; Kompatsiaris, I.; Patras, I. DnS: Distill-and-select for efficient and accurate video indexing and retrieval. Int. J. Comput. Vis. 2022, 130, 2385–2407. [Google Scholar] [CrossRef]
- Kordopatis-Zilos, G.; Tolias, G.; Tzelepis, C.; Kompatsiaris, I.; Patras, I.; Papadopoulos, S. Self-supervised video similarity learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4756–4766. [Google Scholar]
- Shao, J.; Wen, X.; Zhao, B.; Xue, X. Temporal context aggregation for video retrieval with contrastive learning. In Proceedings of the 2021 IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3268–3278. [Google Scholar]
- He, S.; He, Y.; Lu, M.; Jiang, C.; Yang, X.; Qian, F.; Zhang, X.; Yang, L.; Zhang, J. TransVCL: Attention-enhanced video copy localization network with flexible supervision. In Proceedings of the 2023 AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 799–807. [Google Scholar]
- Chou, C.-L.; Chen, H.-T.; Lee, S.-Y. Pattern-based near-duplicate video retrieval and localization on web-scale videos. IEEE Trans. Multimed. 2015, 17, 382–395. [Google Scholar] [CrossRef]
- Tan, H.-K.; Ngo, C.-W.; Hong, R.; Chua, T.-S. Scalable detection of partial near-duplicate videos by visual-temporal consistency. In Proceedings of the 17th ACM International Conference on Multimedia, Beijing, China, 19–24 October 2009; pp. 145–154. [Google Scholar]
- Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
- Jégou, H.; Chum, O. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In Proceedings of the 2012 European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 774–787. [Google Scholar]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–25 June 2021; pp. 8922–8931. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2024, arXiv:2304.03198. [Google Scholar]
- Luo, P.; Ren, J.; Peng, Z.; Zhang, R.; Li, J. Differentiable learning-to-normalize via switchable normalization. arXiv 2018, arXiv:1806.10779. [Google Scholar]
- Wu, X.; Hauptmann, A.G.; Ngo, C.-W. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 25–29 September 2007; pp. 218–227. [Google Scholar]
- Revaud, J.; Douze, M.; Schmid, C.; Jégou, H. Event retrieval in large video collections with circulant temporal encoding. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2459–2466. [Google Scholar]
- Jiang, Y.-G.; Jiang, Y.; Wang, J. VCDB: A large-scale database for partial copy detection in videos. In Proceedings of the 2014 European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 357–371. [Google Scholar]
- Kordopatis-Zilos, G.; Papadopoulos, S.; Patras, I.; Kompatsiaris, I. FIVR: Fine-grained incident video retrieval. IEEE Trans. Multimed. 2019, 21, 2638–2652. [Google Scholar] [CrossRef]
- Jiang, Q.-Y.; He, Y.; Li, G.; Lin, J.; Li, L.; Li, W.-J. SVD: A large-scale short video dataset for near-duplicate video retrieval. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5281–5289. [Google Scholar]
Dataset | CC_WEB_VIDEO [28] | EVVE [29] | VCDB [30] | FIVR-200K [31] | SVD [32] | DnS-100K [16]
---|---|---|---|---|---|---
Queries | 24 | 620 | 528 | 100 | 1206 | 21,997
Videos | 12,790 | 102,375 | 100,528 | 225,960 | 562,013 | 115,792
Average Duration (s) | 151.02 | 194.67 | 72.77 | 113.12 | 17.33 | /
Parameter | Symbol | Value
---|---|---
Batch Size | B | 32
Frame Size | H/W | 224
Frame Number | N | 300
Learning Rate | learning rate | 0.0001
Optimizer | optimizer | Adam
Weight Decay | decay | 0.0001
Loss Function Parameter | | 1
Loss Function Weight | | 0.1
Data Augmentation Parameter | | 0.5
Similarity Threshold | | 0.5
Training Rounds | epoch | 300
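Read together with the environment table below, these settings map onto a routine PyTorch training setup. The following is only a minimal sketch of that mapping, with a placeholder module standing in for the paper's region-feature encoder and loss; the hyperparameter values come from the table above, everything else is assumed.

```python
import torch

# Placeholder network: the actual region-feature encoder and loss are the ones
# defined in Section 3; only the hyperparameters below are taken from the table.
model = torch.nn.Linear(512, 512)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,            # Learning Rate = 0.0001
    weight_decay=1e-4,  # Weight Decay = 0.0001
)

BATCH_SIZE = 32   # B
FRAME_SIZE = 224  # H = W
NUM_FRAMES = 300  # N frames sampled per video
EPOCHS = 300      # Training Rounds

for epoch in range(EPOCHS):
    # iterate over batches of B videos, each N frames at FRAME_SIZE x FRAME_SIZE
    ...
```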
Configuration | Specification
---|---
Operating System Kernel | Ubuntu 5.15.0-52-generic
Processor | 12th Gen Intel(R) Core(TM) i9-12900KF (Intel, Santa Clara, CA, USA)
Memory | 32 GB
Graphics Card | NVIDIA A6000 48 GB (NVIDIA, Santa Clara, CA, USA)
Development Platform | Miniconda3
Programming Language | Python 3.10
Deep Learning Framework | PyTorch 1.13.1
Method | FIVR-5K DSVR | FIVR-5K CSVR | FIVR-5K ISVR | CC_WEB_VIDEO | EVVE | SVD
---|---|---|---|---|---|---
DML [14] | 0.391 | 0.399 | 0.380 | 0.979 | 0.531 | 0.785
TCA [18] | 0.609 | 0.617 | 0.578 | 0.983 | 0.598 | /
DnS [16] | 0.634 | 0.647 | 0.608 | 0.980 | 0.636 | 0.868
SRA [12] | 0.700 | 0.719 | 0.705 | / | / | /
S2VS [17] | 0.881 | 0.875 | 0.786 | 0.996 | 0.659 | 0.892
TCA [18] | 0.726 | 0.735 | 0.701 | / | / | /
TCA [18] | 0.844 | 0.834 | 0.763 | 0.994 | 0.603 | /
DnS [16] | 0.891 | 0.880 | 0.802 | 0.995 | 0.651 | 0.902
ViSiL [11] | 0.856 | 0.848 | 0.768 | 0.993 | 0.589 | 0.878
ViSiL [11] | 0.880 | 0.869 | 0.777 | 0.996 | 0.658 | 0.881
Ours | 0.889 | 0.878 | 0.801 | 0.995 | 0.656 | 0.907
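Scores on FIVR-5K (DSVR, CSVR, ISVR), CC_WEB_VIDEO, EVVE, and SVD are conventionally reported as mean average precision (mAP) over each query's ranked retrieval list; the three FIVR tasks differ only in which database videos count as relevant. The paper's evaluation code is not reproduced here, so the sketch below is just the standard metric under that assumption.

```python
from typing import Iterable, Sequence, Set, Tuple

def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """AP for one query: mean of precision@k over ranks k where a relevant video appears."""
    hits, precision_sum = 0, 0.0
    for k, video_id in enumerate(ranking, start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """mAP over (ranking, relevant-set) pairs, one pair per query."""
    aps = [average_precision(ranking, relevant) for ranking, relevant in queries]
    return sum(aps) / len(aps)
```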
Ablation Object (√ = used) | 1 | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---|---
Backbone | √ | √ | √ | √ | √ | √ | √
FT | | √ | √ | | √ | √ | √
TE | | | √ | | √ | √ | √
CANN | | | | √ | √ | √ | √
GF | | | | | | √ | √
SM | | | | | | | √
DSVR | 0.765 | 0.778 | 0.789 | 0.771 | 0.799 | 0.808 | 0.889
Ablation Object (√ = used) | 1 | 2 | 3 | 4
---|---|---|---|---
 | √ | √ | √ | √
 | √ | √ | √ |
 | √ | √ | √ |
 | √ | √ | |
DSVR | 0.799 | 0.808 | 0.889 | 0.861
Dataset | Precision | Recall | F1 | Accuracy
---|---|---|---|---
CC_WEB_VIDEO | 0.879 | 0.861 | 0.875 | 0.931
FIVR | 0.710 | 0.728 | 0.718 | 0.774
EVVE | 0.582 | 0.724 | 0.652 | 0.729
SVD | 0.765 | 0.890 | 0.824 | 0.889
VCDB | 0.751 | 0.816 | 0.781 | 0.847
Average | 0.737 | 0.804 | 0.770 | 0.834
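Precision, recall, F1, and accuracy here follow their standard definitions over per-pair copy/no-copy decisions, with a pair predicted as a copy when its similarity exceeds the 0.5 threshold listed in the parameter table. A minimal sketch under those assumptions (binary ground-truth labels per video pair):

```python
def detection_metrics(scores, labels, threshold=0.5):
    """Precision, recall, F1, and accuracy for thresholded similarity scores.

    scores: similarity per video pair; labels: 1 = copy, 0 = non-copy.
    The 0.5 default matches the similarity threshold in the parameter table.
    """
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(preds, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 1)
    tn = sum(1 for p, l in zip(preds, labels) if p == 0 and l == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, recall, f1, accuracy
```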