CLIP-RL: Closed-Loop Video Inpainting with Detection-Guided Reinforcement Learning
Abstract
1. Introduction
- We introduce CLIP-RL, a reinforcement learning-based video inpainting framework. To our knowledge, this is one of the first attempts to integrate reinforcement learning and inpainting detection in a closed loop for video inpainting. The detection module generates predicted masks that identify low-quality inpainted regions, providing real-time feedback that guides the agent in dynamically adjusting its inpainting strategy (see the sketch after this list).
- We design a multi-dimensional action space that (i) adjusts the number of Transformer attention heads to balance accuracy and efficiency, (ii) optimizes the fusion strength between inpainted and original frames for smooth boundary transitions, and (iii) dynamically tunes the reward weights so that the optimization objective adapts to scene complexity.
- We design a Weighted Temporal Alignment Loss. By leveraging latent space alignment and feedback from the inpainting detection module, this loss mitigates the accumulation of optical flow errors in fast-motion scenes. Combined with the reinforcement learning framework, it enhances temporal consistency and inpainting quality in complex dynamic scenarios.
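To make the closed loop concrete, below is a minimal sketch of one feedback iteration. All names, signatures, and shapes here are illustrative assumptions rather than the paper's released code: the real system uses a Transformer inpainting backbone, a learned inpainting detector, and a policy network optimized over the action space above, each replaced by a tiny stand-in so the control flow is runnable.

```python
# Minimal, self-contained sketch of one CLIP-RL closed-loop iteration.
# Assumption: every function below is a hypothetical stand-in for the
# corresponding learned module described in Sections 3.1.1-3.1.3.
import numpy as np

rng = np.random.default_rng(0)

def inpaint(frames, masks, num_heads, alpha):
    """Stand-in for the Transformer inpainting backbone. `num_heads`
    (attention heads) and `alpha` (feature-fusion strength) are two of
    the agent's action dimensions."""
    completed = frames.copy()
    completed[masks] = alpha * frames[masks]  # placeholder fill, not a real model
    return completed

def detect_low_quality(completed):
    """Stand-in for the inpainting detection module: the predicted mask
    of low-quality regions is the closed-loop feedback signal."""
    return rng.random(completed.shape) > 0.9

def temporal_alignment_loss(completed):
    """Weighted temporal alignment term, approximated here as the mean
    frame-to-frame difference instead of latent-space drift."""
    return float(np.mean(np.abs(np.diff(completed, axis=0))))

def reward(completed, frames, pred_mask, w):
    """Reward mixing reconstruction quality, detector feedback, and the
    temporal term; the weights `w` are themselves tuned by the agent."""
    recon = -float(np.mean((completed - frames) ** 2))
    feedback = -float(pred_mask.mean())  # fewer flagged pixels is better
    temporal = -temporal_alignment_loss(completed)
    return w[0] * recon + w[1] * feedback + w[2] * temporal

T, H, W = 5, 32, 32
frames = rng.random((T, H, W)).astype(np.float32)
masks = rng.random((T, H, W)) > 0.8            # regions to inpaint

num_heads, alpha = 8, 0.5                      # actions: heads, fusion strength
w = np.array([1.0, 0.5, 0.5])                  # actions: reward weights

for it in range(3):                            # closed-loop feedback iterations
    completed = inpaint(frames, masks, num_heads, alpha)
    pred_mask = detect_low_quality(completed)  # detector feedback
    r = reward(completed, frames, pred_mask, w)
    # Placeholder policy step: a real agent would update a policy network
    # from `r`; here the actions are only perturbed for illustration.
    num_heads = int(np.clip(num_heads + rng.integers(-1, 2), 4, 12))
    alpha = float(np.clip(alpha + 0.1 * rng.standard_normal(), 0.0, 1.0))
    print(f"iter {it}: reward={r:.4f} heads={num_heads} alpha={alpha:.2f}")
```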
2. Related Work
2.1. Existing Video Inpainting Methods
2.2. Dynamic Optimization and Feedback
3. The Proposed CLIP-RL Approach
3.1. Reinforcement Learning-Driven Closed-Loop Inpainting Framework
3.1.1. Interaction Between Inpainting and Inpainting Detection Modules
3.1.2. Reward Function Design and Policy Network Optimization
1. Reward Function Design with Weighted Temporal Alignment Loss
2. Policy Network Optimization
3.1.3. Adaptive Action-Space Design
1. Dynamic Adjustment of Attention Heads
2. Dynamic Control of Feature-Fusion Strength
3. Dynamic Balancing of Reward Weights
4. Results and Discussion
4.1. Experimental Setup
4.2. Quantitative Evaluation
4.3. Qualitative Comparisons
4.4. Ablation Evaluations
4.4.1. Effectiveness of Inpainting Detection Module
4.4.2. Effectiveness of the Reinforcement Learning Module
4.5. Analysis of Reinforcement Learning Actions and Loss Optimization
4.5.1. Analysis of Dynamic Attention Head Optimization
4.5.2. Analysis of Dynamic Fusion Strategy Optimization
4.5.3. Analysis of Reward Weight Evolution and Mechanism Effectiveness
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ebdelli, M.; Le Meur, O.; Guillemot, C. Video inpainting with short-term windows: Application to object removal and error concealment. IEEE Trans. Image Process. 2015, 24, 3034–3047.
- Xu, R.; Li, X.; Zhou, B.; Loy, C.C. Deep flow-guided video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3723–3732.
- Tang, N.C.; Hsu, C.T.; Su, C.W.; Shih, T.K.; Liao, H.Y.M. Video inpainting on digitized vintage films via maintaining spatiotemporal continuity. IEEE Trans. Multimed. 2011, 13, 602–614.
- Lee, S.; Oh, S.W.; Won, D.; Kim, S.J. Copy-and-paste networks for deep video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4413–4421.
- Li, Z.; Lu, C.Z.; Qin, J.; Guo, C.L.; Cheng, M.M. Towards an end-to-end framework for flow-guided video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17562–17571.
- Ji, Z.; Hou, J.; Su, Y.; Pang, Y.; Li, X. G2LP-Net: Global to local progressive video inpainting network. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1082–1092.
- Gao, C.; Saraf, A.; Huang, J.B.; Kopf, J. Flow-edge guided video completion. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XII; Springer: Berlin/Heidelberg, Germany, 2020; pp. 713–729.
- Zhang, K.; Fu, J.; Liu, D. Flow-guided transformer for video inpainting. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 74–90.
- Zhang, K.; Fu, J.; Liu, D. Inertia-guided flow completion and style fusion for video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5982–5991.
- Liu, R.; Deng, H.; Huang, Y.; Shi, X.; Lu, L.; Sun, W.; Wang, X.; Dai, J.; Li, H.F. Fusing fine-grained information in transformers for video inpainting. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021.
- Zhou, S.; Li, C.; Chan, K.C.; Loy, C.C. ProPainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10477–10486.
- Zhang, K.; Peng, J.; Fu, J.; Liu, D. Exploiting optical flow guidance for transformer-based video inpainting. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4977–4992.
- Ukwuoma, C.C.; Heyat, M.B.B.; Masadeh, M.; Akhtar, F.; Zhiguang, Q.; Bondzie-Selby, E.; AlShorman, O.; Alkahtani, F. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism. In Proceedings of the 2021 International Conference on Microelectronics (ICM), New Cairo City, Egypt, 19–22 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 96–101.
- Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514.
- Xu, N.; Yang, L.; Fan, Y.; Yang, J.; Yue, D.; Liang, Y.; Price, B.; Cohen, S.; Huang, T. YouTube-VOS: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 585–601.
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 724–732.
- Kim, D.; Woo, S.; Lee, J.Y.; Kweon, I.S. Deep video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5792–5801.
- Li, A.; Zhao, S.; Ma, X.; Gong, M.; Qi, J.; Zhang, R.; Tao, D.; Kotagiri, R. Short-term and long-term context aggregation network for video inpainting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part IV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 728–743.
- Zou, X.; Yang, L.; Liu, D.; Lee, Y.J. Progressive temporal feature alignment network for video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16448–16457.
- Wang, J.; Yang, Z.; Huo, Z.; Chen, W. Local and nonlocal flow-guided video inpainting. Multimed. Tools Appl. 2024, 83, 10321–10340.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Ren, J.; Zheng, Q.; Zhao, Y.; Xu, X.; Li, C. DLFormer: Discrete latent transformer for video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3511–3520.
- Pirinen, A.; Sminchisescu, C. Deep reinforcement learning of region proposal networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6945–6954.
- Ren, Z.; Wang, X.; Zhang, N.; Lv, X.; Li, L.J. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 290–298.
- Wu, H.; Chen, Y.; Zhou, J. Rethinking image forgery detection via contrastive learning and unsupervised clustering. arXiv 2023, arXiv:2308.09307.
- Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231.
- Wu, H.; Zhou, J. GIID-Net: Generalizable image inpainting detection network. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3867–3871.
- Zeng, Y.; Fu, J.; Chao, H. Learning joint spatial-temporal transformations for video inpainting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XVI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 528–543.
- Oh, S.W.; Lee, S.; Lee, J.Y.; Kim, S.J. Onion-peel networks for deep video completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4403–4412.
- Chang, Y.L.; Liu, Z.Y.; Lee, K.Y.; Hsu, W. Free-form video inpainting with 3D gated convolution and temporal PatchGAN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9066–9075.
- Yuan, D.; Chang, X.; Huang, P.Y.; Liu, Q.; He, Z. Self-supervised deep correlation tracking. IEEE Trans. Image Process. 2020, 30, 976–985.
- Geng, G.; Zhou, S.; Tang, J.; Zhang, X.; Liu, Q.; Yuan, D. Self-supervised visual tracking via image synthesis and domain adversarial learning. Sensors 2025, 25, 4621.
- Li, Q.; Tan, K.; Yuan, D.; Liu, Q. Progressive domain adaptation for thermal infrared tracking. Electronics 2025, 14, 162.

Table 1. Quantitative comparison on YouTube-VOS, DAVIS Square, and DAVIS Object (PSNR ↑ / SSIM ↑ / LPIPS ↓).

| Method | Year | YouTube-VOS PSNR ↑ | YouTube-VOS SSIM ↑ | YouTube-VOS LPIPS ↓ | DAVIS Square PSNR ↑ | DAVIS Square SSIM ↑ | DAVIS Square LPIPS ↓ | DAVIS Object PSNR ↑ | DAVIS Object SSIM ↑ | DAVIS Object LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| VINet | 2019 | 29.83 | 0.955 | 0.047 | 28.32 | 0.943 | 0.049 | 28.47 | 0.922 | 0.083 |
| DFGVI | 2019 | 32.05 | 0.965 | 0.038 | 29.75 | 0.959 | 0.037 | 30.28 | 0.925 | 0.052 |
| CPN | 2019 | 32.17 | 0.963 | 0.040 | 30.20 | 0.953 | 0.049 | 31.59 | 0.933 | 0.058 |
| OPN | 2019 | 32.66 | 0.965 | 0.039 | 31.15 | 0.958 | 0.044 | 32.40 | 0.944 | 0.041 |
| 3DGC | 2019 | 30.22 | 0.961 | 0.041 | 28.19 | 0.944 | 0.049 | 31.69 | 0.940 | 0.054 |
| STTN | 2020 | 32.49 | 0.964 | 0.040 | 30.54 | 0.954 | 0.047 | 32.83 | 0.943 | 0.052 |
| FGVC | 2020 | 33.94 | 0.972 | 0.026 | 32.14 | 0.967 | 0.030 | 33.91 | 0.955 | 0.036 |
| TSAM | 2021 | 31.62 | 0.962 | 0.031 | 29.73 | 0.951 | 0.036 | 31.50 | 0.934 | 0.048 |
| FFM | 2021 | 33.73 | 0.970 | 0.030 | 31.87 | 0.965 | 0.034 | 34.19 | 0.951 | 0.045 |
| FGT | 2022 | 32.17 | 0.960 | 0.028 | 32.60 | 0.965 | 0.032 | 34.30 | 0.953 | 0.040 |
| ProPainter | 2023 | 34.43 | 0.974 | 0.033 | 34.47 | 0.978 | 0.035 | 34.45 | 0.975 | 0.036 |
| LNFVI | 2024 | 30.80 | 0.970 | 0.025 | 31.35 | 0.957 | 0.027 | – | – | – |
| Ours | 2025 | 34.67 | 0.986 | 0.031 | 34.51 | 0.977 | 0.030 | 34.50 | 0.979 | 0.033 |

Table 2. Ablation of the inpainting detection module (Section 4.4.1).

| Configuration | PSNR ↑ | SSIM ↑ | Inference Time (s/frame) |
|---|---|---|---|
| Without detection module | 32.30 | 0.955 | 0.083 |
| With detection module (1 feedback) | 34.16 | 0.971 | 0.088 |
| With detection module (2 feedbacks) | 34.57 | 0.978 | 0.091 |

Table 3. Effect of the reinforcement learning module across closed-loop feedback iterations (Section 4.4.2).

| Iterations | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| RL Module | — | ✓ | ✓ | ✓ | ✓ | ✓ |
| GFLOPs | 808 | 956 | 1078 | 1207 | 1339 | 1473 |
| PSNR ↑ | 34.43 | 34.51 | 34.58 | 34.67 | 34.69 | 34.70 |
| SSIM ↑ | 0.974 | 0.975 | 0.977 | 0.986 | 0.987 | 0.988 |

Table 4. Dynamic attention-head optimization versus fixed head counts (Section 4.5.1).

| Scenario | Method | Max Heads | Avg. Active Heads | PSNR ↑ | SSIM ↑ | Inference Time (s/frame) |
|---|---|---|---|---|---|---|
| Easy inpainting | Fixed | 4 | 4 | 34.46 | 0.968 | 0.091 |
| Easy inpainting | Fixed | 8 | 8 | 34.51 | 0.974 | 0.094 |
| Easy inpainting | Ours | 12 | 5.88 | 34.49 | 0.972 | 0.091 |
| Hard inpainting | Fixed | 4 | 4 | 32.03 | 0.947 | 0.091 |
| Hard inpainting | Fixed | 8 | 8 | 34.07 | 0.966 | 0.092 |
| Hard inpainting | Ours | 12 | 9.72 | 34.33 | 0.971 | 0.094 |

Table 5. Comparison of feature-fusion strategies (Section 4.5.2).

| Scenario | Method | PSNR ↑ | SSIM ↑ | GMS | ES | Inference Time (s/frame) |
|---|---|---|---|---|---|---|
| Fast Background Change | No Fusion | 34.10 | 0.968 | 0.84 | 0.79 | 0.091 |
| Fast Background Change | Fixed Fusion | 34.35 | 0.973 | 0.89 | 0.84 | 0.091 |
| Fast Background Change | Dynamic Fusion (Ours) | 34.50 | 0.978 | 0.94 | 0.89 | 0.101 |
| Large-Area Missing | No Fusion | 34.05 | 0.965 | 0.83 | 0.78 | 0.091 |
| Large-Area Missing | Fixed Fusion | 34.39 | 0.970 | 0.88 | 0.83 | 0.091 |
| Large-Area Missing | Dynamic Fusion (Ours) | 34.51 | 0.978 | 0.93 | 0.88 | 0.101 |

Table 6. Comparison of reward-weight strategies (Section 4.5.3).

| Strategy | Configuration | PSNR ↑ | SSIM ↑ | ↓ |
|---|---|---|---|---|
| Fixed-Bias | Fixed at bounds | 33.92 | 0.965 | 1.048 |
| Fixed-Average | Fixed at mean | 34.45 | 0.975 | 1.012 |
| Dynamic (Ours) | Adaptive adjustment | 34.51 | 0.977 | 1.002 |