The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual
Abstract
1. Introduction
2. Related Research
2.1. Video Understanding
2.2. CNN-Based Approaches
2.3. Attention-Based Approaches
2.4. Transformer-Based Approaches
2.5. Multimodal-Based Approach
3. SPOT Model
3.1. Structure of SPOT Models
Algorithm 1 TimeSformer

Input: video. Output: Tf_prediction.

    def timesformer(video):
        split video into video_frames
        for frame in video_frames:
            split frame into frame_patch
            flatten and embed frame_patch into patch_tokens
            Spatial_token ← Spatial_encoder(patch_tokens)
            add Spatial_token into Spatial_tokens
        Temporal_tokens ← Temporal_encoder(Spatial_tokens)
        Tf_prediction ← Prediction_head(Temporal_tokens)
        return Tf_prediction
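To make the flow above concrete, the following is a minimal PyTorch sketch of the divided space-time attention pipeline that Algorithm 1 describes (patch embedding, per-frame spatial attention, cross-frame temporal attention, prediction head). The embedding width, patch size, head count, and single-block depth are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One transformer block with factorized spatial and temporal attention."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        # Spatial attention: patch tokens attend to each other within a frame.
        xs = self.norm1(x).reshape(b * t, p, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, p, d)
        # Temporal attention: each patch position attends across frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return x

class TimeSformerSketch(nn.Module):
    def __init__(self, dim=768, patch=16):
        super().__init__()
        # Patch embedding: split each frame into patches and project them to tokens.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.block = DividedSpaceTimeBlock(dim)
        self.head = nn.Linear(dim, 1)  # Prediction_head -> one score per clip

    def forward(self, video):
        # video: (batch, frames, 3, H, W)
        b, t, c, h, w = video.shape
        tokens = self.embed(video.reshape(b * t, c, h, w))   # (b*t, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)           # (b*t, patches, dim)
        tokens = tokens.reshape(b, t, -1, tokens.shape[-1])  # (b, frames, patches, dim)
        tokens = self.block(tokens)                          # spatial + temporal attention
        return self.head(tokens.mean(dim=(1, 2)))            # Tf_prediction

# Example: scores for a batch of two 8-frame clips.
# scores = TimeSformerSketch()(torch.randn(2, 8, 3, 224, 224))
```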
Algorithm 2 SPOT

Input: video. Output: SPOT_prediction.

    def SPOT(video):
        split video into video_frames
        for frame in video_frames:
            split frame into frame_patch
            flatten and embed frame_patch into patch_tokens
            GL_feature ← Spatial_encoder(patch_tokens)
            LC_feature ← CNN_encoder(frame)
            Fusion_feature ← CAFGL(GL_feature, LC_feature)
            add Fusion_feature into Fusion_features
        Temporal_tokens ← Temporal_encoder(Fusion_features)
        SPOT_prediction ← Prediction_head(Temporal_tokens)
        return SPOT_prediction
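Algorithm 2 differs from Algorithm 1 mainly in the per-frame fusion step, where the CNN encoder's local feature and the spatial encoder's global feature are combined by the CAFGL block before the temporal encoder. The sketch below shows one plausible way to wire that cross-attention fusion; the attention direction, projection widths, and residual/normalization choices are assumptions made for illustration, not the paper's exact CAFGL design.

```python
import torch
import torch.nn as nn

class CAFGLSketch(nn.Module):
    """Cross-attention fusion of global (transformer) and local (CNN) features."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gl_feature, lc_feature):
        # gl_feature: (batch, patches, dim) global tokens from the spatial encoder
        # lc_feature: (batch, patches, dim) local CNN features projected to the same width
        # Global tokens act as queries over the local features, so the fused token
        # keeps global context while attending to local texture cues.
        attended, _ = self.cross_attn(query=gl_feature, key=lc_feature, value=lc_feature)
        return self.norm(gl_feature + attended)   # Fusion_feature for one frame

# Example with illustrative shapes: a 224x224 frame with 16x16 patches -> 196 tokens.
gl = torch.randn(2, 196, 768)            # GL_feature
lc = torch.randn(2, 196, 768)            # LC_feature
fusion_feature = CAFGLSketch()(gl, lc)   # appended to Fusion_features frame by frame
```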
3.2. Mathematical Definition of SPOT
3.2.1. Input Clip
3.2.2. Decomposition into Patches
3.2.3. Local Feature Extraction (CNN Encoder)
3.2.4. Global Feature Extraction (Spatial Attention Encoder)
3.2.5. Feature Fusion Block (CAFGL)
3.2.6. Temporal Encoder
4. Experimental Results
4.1. Dataset for SPOT Models
4.2. Comparison with Baselines
4.3. Ablation Study
5. Conclusions and Future Studies
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Variables used in Algorithm 1 (TimeSformer):

Variable Name | Description
---|---
video | Input video sequence
Tf_prediction | Output of Prediction_head, i.e., the final prediction of the TimeSformer model
video_frames | List of T frames obtained by splitting the video
frame | A single frame within video_frames
frame_patch | Result of dividing a frame into small patches (e.g., 16 × 16)
patch_tokens | Token vectors obtained by flattening and linearly projecting each frame_patch
Spatial_token | Output of the spatial encoder (the set of tokens for one frame)
Spatial_tokens | Sequence of the Spatial_token outputs of all frames, stacked along the time axis
Temporal_tokens | Output of the temporal encoder (final token sequence with temporal context added)
Variables used in Algorithm 2 (SPOT):

Variable Name | Description
---|---
video | Input video sequence
SPOT_prediction | Final prediction value (e.g., highlight score)
video_frames | List of frames obtained by splitting the video
frame | A single frame within video_frames
frame_patch | Result of dividing a frame into small patches
patch_tokens | Token vectors obtained by embedding the frame patches
GL_feature | Global feature extracted by the spatial encoder
LC_feature | Local feature extracted by the CNN encoder
CAFGL | Module that dynamically fuses local and global features via cross-attention
Fusion_feature | Fused feature obtained by integrating GL_feature and LC_feature through the CAFGL block
Fusion_features | Sequence of fusion features from all frames, arranged in temporal order
Temporal_tokens | Temporal encoder output (tokens with temporal context incorporated)
Dataset metadata columns:

Column Name | Description
---|---
video_id | Unique identifier assigned to distinguish each video within the dataset
yt8m_file | File name of the corresponding video in the YouTube-8M dataset
random_id | Randomly generated ID created during the data randomization process
youtube_id | The actual YouTube video ID (https://www.youtube.com/watch?v='youtube_id', accessed on 8 September 2025)
duration | Total playback duration of the video
views | Number of views of the video
entry 2 | Set of topics or categories associated with the video
Per-segment highlight-intensity fields:

Name | Description | Type |
---|---|---|
startMillis | Start time in milliseconds | String |
durationMillis | Duration in milliseconds | String |
intensityScoreNormalized | Normalized intensity score | Float |
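The fields above describe replay-intensity segments with millisecond start times and durations. As a hedged illustration of how such records could be turned into a second-level highlight signal for training and evaluation, the helper below is an assumption about preprocessing, not the dataset's official loader (the function name and the max-pooling choice for overlapping segments are invented for this sketch).

```python
def to_per_second_scores(segments, video_seconds):
    """Convert segment records (startMillis, durationMillis, intensityScoreNormalized)
    into a per-second intensity signal; overlapping segments keep the higher score."""
    scores = [0.0] * video_seconds
    for seg in segments:
        start = int(seg["startMillis"]) // 1000
        end = (int(seg["startMillis"]) + int(seg["durationMillis"])) // 1000
        for s in range(start, min(end, video_seconds)):
            scores[s] = max(scores[s], float(seg["intensityScoreNormalized"]))
    return scores

# Example: a 10-second clip with one strongly replayed 3-second segment.
segs = [{"startMillis": "2000", "durationMillis": "3000",
         "intensityScoreNormalized": 0.92}]
print(to_per_second_scores(segs, 10))
# -> [0.0, 0.0, 0.92, 0.92, 0.92, 0.0, 0.0, 0.0, 0.0, 0.0]
```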
Evaluation metrics:

Metric | Description | Significance
---|---|---
MSE | Mean squared error between predicted and ground-truth highlight scores | Evaluates prediction accuracy, including sensitivity to outliers
mAP | Mean average precision | Measures the ability to detect highlights that match the ground truth
Coverage Ratio | Proportion of predicted highlight segments that contain the actual (ground-truth) highlights | Evaluates how well the predicted start and end positions of highlights match the ground truth
F1-Score | Harmonic mean of precision and recall | Evaluates both the accuracy of the detected highlights and the ability to avoid missing relevant highlights
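For reference, the sketch below shows one way the four metrics could be computed from per-frame prediction scores and ground-truth scores. The frame-level formulation, the 0.5 binarization threshold, and the use of scikit-learn are assumptions for illustration; the paper's exact matching protocol (e.g., segment-level start/end rules) may differ.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def evaluate(pred_scores, gt_scores, threshold=0.5):
    pred_scores = np.asarray(pred_scores, dtype=float)
    gt_scores = np.asarray(gt_scores, dtype=float)
    gt_binary = (gt_scores >= threshold).astype(int)
    pred_binary = (pred_scores >= threshold).astype(int)

    mse = float(np.mean((pred_scores - gt_scores) ** 2))
    ap = float(average_precision_score(gt_binary, pred_scores))  # AP for one video (mAP = mean over videos)
    # Frame-level stand-in for the coverage ratio: share of predicted highlight
    # frames that overlap ground-truth highlight frames.
    coverage = float((pred_binary & gt_binary).sum() / max(pred_binary.sum(), 1))
    f1 = float(f1_score(gt_binary, pred_binary))
    return {"MSE": mse, "mAP": ap, "Coverage": coverage, "F1": f1}

# Example with ten frames: the model over-predicts one non-highlight frame (index 8).
print(evaluate([0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0, 0.3, 0.6, 0.2],
               [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```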
Comparison of SPOT with baseline models across video-complexity levels:

Model | Params (M) | MACs (G) | Complexity | MSE | mAP | Coverage | F1-Score
---|---|---|---|---|---|---|---
SPOT | ~94.5 | ~825 | Low | 0.065 | 0.504 | 0.78 | 0.85
SPOT | ~94.5 | ~825 | Medium | 0.070 | 0.662 | 0.80 | 0.87
SPOT | ~94.5 | ~825 | High | 0.080 | 0.562 | 0.85 | 0.92
TimeSformer | ~89.1 | ~409.1 | Low | 0.060 | 0.452 | 0.78 | 0.85
TimeSformer | ~89.1 | ~409.1 | Medium | 0.071 | 0.561 | 0.79 | 0.86
TimeSformer | ~89.1 | ~409.1 | High | 0.090 | 0.503 | 0.78 | 0.85
3D-CNN | ~33 | ~60 | Low | 0.087 | 0.434 | 0.834 | 0.44
3D-CNN | ~33 | ~60 | Medium | 0.084 | 0.482 | 0.745 | 0.36
3D-CNN | ~33 | ~60 | High | 0.056 | 0.502 | 0.603 | 0.33
MH-DETR | ~37 | ~95 | Low | 0.078 | 0.52 | 0.45 | 0.60
MH-DETR | ~37 | ~95 | Medium | 0.073 | 0.55 | 0.48 | 0.80
MH-DETR | ~37 | ~95 | High | 0.068 | 0.54 | 0.46 | 0.80
Ablation study results (√ = module included):

Test Name | CAFGL Block | CNN Encoder | Spatial Encoder | Temporal Encoder | MSE | mAP | Coverage | F1-Score
---|---|---|---|---|---|---|---|---
SPOT | √ | √ | √ | √ | 0.065 | 0.576 | 0.810 | 0.880
Ablation 1 (sum) | | √ | √ | √ | 0.072 | 0.094 | 0.264 | 0.223
Ablation 1 (avg) | | √ | √ | √ | 0.071 | 0.096 | 0.268 | 0.226
Ablation 1 (mul) | | √ | √ | √ | 0.080 | 0.084 | 0.242 | 0.198
Ablation 2 | | | √ | √ | 0.0737 | 0.505 | 0.783 | 0.853
Ablation 3 | | √ | | √ | 0.073 | 0.141 | 0.315 | 0.278
Ablation 4 | √ | √ | √ | | 0.071 | 0.174 | 0.174 | 0.227