The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual

Kwak, Sungshin; Lee, Jaedong; Park, Sohyun

doi:10.3390/electronics14183640

Open AccessArticle

The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual

by

Sungshin Kwak

¹,

Jaedong Lee

^2,* and

Sohyun Park

^2,*

¹

Department of Artificial Intelligence Convergence, Graduate School, Dankook University, Yongin 16890, Republic of Korea

²

Department of Software Science, SW Convergence College, Dankook University, Yongin 16890, Republic of Korea

^*

Authors to whom correspondence should be addressed.

Electronics 2025, 14(18), 3640; https://doi.org/10.3390/electronics14183640

Submission received: 12 August 2025 / Revised: 9 September 2025 / Accepted: 10 September 2025 / Published: 15 September 2025

(This article belongs to the Special Issue Image Processing Based on Convolution Neural Network: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

With the rapid growth of video platforms such as YouTube, Bilibili, and Dailymotion, an enormous amount of video content is being shared worldwide. In this environment, content providers are increasingly adopting methods that restructure videos around highlight scenes and distribute them in short-form formats to encourage more efficient content consumption by viewers. As a result of this trend, the importance of highlight extraction technologies capable of automatically identifying key scenes from large-scale video datasets has been steadily increasing. To address this need, this study proposes SPOT (Spatial Perceptual Optimized TimeSformer), a highlight extraction model. The proposed model enhances spatial perceptual capability by integrating a CNN encoder into the internal structure of the existing Transformer-based TimeSformer, enabling simultaneous learning of both the local and global features of a video. The experiments were conducted using Google’s YT-8M video dataset along with the MR.Hisum dataset, which provides organized highlight information. The SPOT model adopts a regression-based highlight prediction framework. Experimental results on video datasets of varying complexity showed that, in the high-complexity group, the SPOT model achieved a reduction in mean squared error (MSE) of approximately 0.01 (from 0.090 to 0.080) compared to the original TimeSformer. Furthermore, the model outperformed the baseline across all complexity groups in terms of mAP, Coverage, and F1-Score metrics. These results suggest that the proposed model holds strong potential for diverse multimodal applications such as video summarization, content recommendation, and automated video editing. Moreover, it is expected to serve as a foundational technology for advancing video-based artificial intelligence systems in the future.

Keywords:

highlight extraction; video understanding; Transformer; CNN; short-form video

1. Introduction

With the rapid proliferation of video platforms such as YouTube, Bilibili, and Dailymotion, an enormous amount of video content is being shared online [1,2]. In this environment, content providers have actively adopted strategies that extract only the highlight scenes and reorganize them into short-form videos to enable viewers to consume content more quickly and easily [3,4].

This approach serves as an effective means of rapidly attracting viewer interest and delivering key information in a compressed format, thereby increasing viewer engagement and driving higher viewership [5,6]. Alongside this shift in content consumption patterns, the short-form video market has also experienced rapid growth, attracting significant attention. Platforms such as TikTok, YouTube Shorts, and Instagram Reels have gained explosive popularity among users worldwide [7,8], and the practice of delivering key information in a short time span is becoming increasingly common [9,10].

Recent studies have analyzed this trend, noting that “short-form videos have revolutionized digital consumption experiences due to fast consumption habits and mobile-friendly designs” [11]. In particular, the proliferation of short-form content is closely tied to strategies aimed at encouraging users to watch full-length videos by effectively exposing them to key scenes. As a result, the importance of highlight detection technologies capable of automatically extracting key moments from long videos has become increasingly emphasized [12].

However, the creation of such short-form content still largely depends on manual editing, which is both time-consuming and inefficient, given the exponential increase in video data. Consequently, there is a growing need for automated highlight detection methods that can efficiently and accurately identify the most engaging moments in long videos. Nevertheless, automatic highlight extraction remains a significant challenge, as it requires effectively selecting moments of interest from among the vast number of scenes within long videos [7,8].

Consequently, the growing demand for more sophisticated and efficient highlight extraction techniques has actively driven research efforts. Notable examples include MH-DETR [13], LD-DETR [14], SC-HVPPNet [15], and other recent approaches [16,17], which aim to improve highlight detection performance in complex video data. These highlight extraction techniques have evolved based on video understanding technologies, which are designed to comprehend both visual and contextual information within videos. Video understanding is a technology that goes beyond the analysis of individual frames to comprehensively interpret events, activities, object relationships, and contextual information across sequences [18,19].

In this process, the complementary learning of local and global information is crucial [20,21], as highlights are determined not only by subtle changes in objects but also by the overall contextual flow of the video [22]. Local information refers to fine-grained patterns and object-level changes occurring within limited regions of a frame, such as variations in facial expressions or hand gestures. In contrast, global information encompasses relationships and contextual flows across entire frames or sequences of frames, serving as key elements for understanding narrative development, scene transitions, and the overall atmosphere of the video.

To effectively leverage both local and global information, prior research on highlight extraction has largely evolved in two directions. The first approach is based on 2D/3D CNN models [23,24], which excel at detecting local patterns but face limitations in capturing relationships between segments and understanding broader video context [25]. The second approach employs Transformer-based models [26], which leverage self-attention to learn long-range spatiotemporal dependencies and global contextual information. However, because they process inputs by dividing them into patches, these models often struggle to capture subtle, fine-grained local variations within frames [27,28].

While both CNN- and Transformer-based approaches possess unique strengths, they exhibit structural limitations when applied independently—often failing to jointly model fine-grained local variations and long-range contextual dependencies. This structural gap, as discussed in prior studies [29,30], underscores the need for a unified model capable of integrating both aspects effectively. However, despite these advances, existing approaches remain limited in their ability to jointly capture both fine-grained local variations and long-range contextual dependencies. Most prior studies have focused on either local or global information in isolation, leaving a gap in effectively integrating the two for robust highlight detection.

To address these limitations, we propose SPOT, which combines a CNN with the Transformer-based TimeSformer [31]. The SPOT architecture extends TimeSformer by incorporating a CNN branch in parallel with its Spatial Encoder, enabling the model to simultaneously learn local and global features and effectively fuse them. This integrated design is particularly beneficial in videos with high visual complexity or frequent local variations, where TimeSformer alone often struggles. Under such conditions, SPOT demonstrates improved highlight prediction accuracy compared to the baseline TimeSformer.

In this study, video complexity was calculated by analyzing the degree of change between consecutive frames. Recent studies have emphasized the importance of video complexity analysis for adaptive encoding and streaming. For instance, the Video Complexity Analyzer (VCA) [32] estimates spatial and temporal complexity by leveraging block-based DCT energy and its temporal variations. Inspired by such approaches, we adopt a computationally simple calculation based on inter-frame variations to categorize videos into different complexity levels for model evaluation.

Using this metric, the dataset was categorized into different complexity levels to compare the performance of the proposed SPOT model with that of TimeSformer. Experimental results revealed that SPOT consistently outperformed TimeSformer across key evaluation metrics such as Coverage Ratio [33] and mean average precision (mAP) [34] as video complexity increased. Notably, in the high-complexity group, SPOT achieved up to a 0.01 reduction in mean squared error (MSE) [35] compared to TimeSformer. This demonstrates that SPOT is more sensitive to local visual variations within videos, thereby exhibiting superior highlight detection capabilities even in highly complex scenes.

Furthermore, to analyze in greater depth the impact of visual complexity on model performance, we included MH-DETR, a representative state-of-the-art (SOTA) model in the field of highlight detection, as an additional comparison baseline. Through this, we aimed to experimentally verify that SPOT demonstrates competitive performance not only compared to pure Transformer-based architectures but also against the latest designs that leverage cross-modal attention for integrating spatiotemporal features.

These results demonstrate that SPOT maintains high performance even in dynamic video content, supporting its applicability to real-world tasks such as automated editing and video summarization. While this study focuses on the visual modality for highlight extraction, the use of multimodal datasets [36] suggests the potential for future expansion to additional modalities such as audio and subtitles.

2. Related Research

Research on highlight extraction has evolved within the broader field of video understanding and can be broadly categorized into CNN-based, attention-based, Transformer-based, and multimodal methods. Each of these approaches has contributed to important advances: CNNs are effective for learning local spatial features, attention-based models improve the focus on salient regions, and Transformers provide strong global context modeling, and multimodal methods enrich visual information with complementary cues such as audio or text. Nevertheless, several limitations remain. CNNs often fail to capture long-range dependencies, attention mechanisms may struggle with fine-grained spatiotemporal patterns, and Transformers are computationally expensive and relatively insensitive to subtle local variations, and multimodal methods introduce challenges of modality alignment and increased complexity. To address these issues in a unified manner, this study proposes SPOT, a model designed to integrate the complementary strengths of these approaches while mitigating their respective limitations.

2.1. Video Understanding

Video understanding is a technology that interprets visual data in videos to recognize objects, scenes, actions, and contextual information, and it has recently expanded into diverse applications such as highlight extraction, action recognition, and video summarization. Early studies in video understanding primarily leveraged convolutional neural networks (CNNs) [37], which had demonstrated strong performance in image recognition, to learn local patterns at the frame level or employed 3D CNNs [38] to capture spatiotemporal features. Subsequently, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks [39] were introduced to model temporal dependencies. The integration of attention mechanisms further enhanced the ability to focus on important frames and regions. More recently, Transformer-based models [40] have activated research that emphasizes understanding global context, offering improved performance in various video understanding tasks. In parallel, matching-based methods have been proposed to address few-shot and fine-grained recognition challenges. For instance, M³Net [41] introduces a multi-view encoding, matching, and fusion framework that leverages intra-frame, intra-video, and intra-episode relationships to improve recognition performance in low-data regimes. In addition, the incorporation of multimodal techniques [42] and large language models (LLMs) [43] has enabled more advanced forms of video understanding, such as video captioning, question answering, and streaming summarization.

2.2. CNN-Based Approaches

Early studies on highlight extraction were largely developed based on CNN [44] models, which had been successfully applied in image recognition. These models typically extracted spatial features from individual frames and then connected them along the temporal axis to detect or summarize highlight scenes within videos. While 3D CNNs demonstrated strong capabilities in capturing local visual information, they faced limitations in modeling sequential changes between scenes or detecting events likely to capture viewer interest. In particular, their architectures were often inadequate for handling patterns involving rapid visual changes or scene transitions within short time spans.

2.3. Attention-Based Approaches

Attention mechanisms enable models to focus on the most relevant parts of a video by dynamically attending to spatial and temporal features, thereby allowing them to effectively capture key scenes and fine-grained details. Attention-based models compute weights that represent the relevance of each element in an input sequence to all others, enabling the efficient extraction of important information even from long sequences. This approach addresses some of the limitations of CNNs by allowing models to respond more sensitively to event sequences and transitions between scenes.

2.4. Transformer-Based Approaches

Transformer architectures, built upon attention-based neural networks, extend the self-attention mechanism and have marked a turning point in video understanding. Notable examples include MH-DETR and LD-DETR, which utilize Transformer-based designs to improve highlight detection and video moment localization. However, while Transformer-based models excel at learning global context, prior studies have noted that they can be relatively less effective at capturing fine-grained local patterns. This tendency may affect tasks such as highlight detection, where sensitivity to subtle visual changes is important.

2.5. Multimodal-Based Approach

Recent studies have increasingly focused on integrating multimodal information to further enhance highlight extraction performance. These approaches utilize not only visual information from videos but also incorporate audio, subtitles, and metadata to more accurately identify key moments. For example, highlights may be detected by considering the co-occurrence of specific sound events and visual scenes or by analyzing dialog content alongside visual context to select highlight candidates.

3. SPOT Model

3.1. Structure of SPOT Models

Previous studies have primarily focused on effectively learning global context through Transformer-based architectures; however, viewers’ actual interests are often based on subtle object changes or localized movements within videos. This study distinguishes itself by combining the global representation learning capabilities of Transformers with the local pattern extraction strengths of convolutional neural networks (CNNs), thereby enabling the capture of fine-grained spatial information that conventional models often overlook and achieving more precise highlight prediction. The overall architecture of the proposed model is illustrated in Figure 1.

As illustrated in Figure 1, the proposed SPOT model follows a network architecture that combines a CNN-based encoder and the Transformer-based TimeSformer in parallel to simultaneously capture both local and global features in videos. Specifically, the architecture operates through the following five stages.

First, the input frame sequence is divided into patches, and each patch is transformed into an input token through a linear embedding process. The resulting tokens are then processed by the Spatial Encoder in the Frame’s Global Feature Extraction Branch, which is responsible for learning the global visual information of each frame.

Second, each frame is fed into a pre-trained CNN encoder, such as ResNet-18, to extract local features. This CNN encoder constitutes the Frame’s Local Feature Extraction Branch and effectively encodes fine-grained patterns within each frame, including localized objects, boundaries, and movements (e.g., the motion of a ball in sports videos) that are likely to attract viewer attention.

Third, the frames extracted from the two branches are fused through the CAFGL (Cross-Attention Fusion of Global and Local features) block, as illustrated in Figure 2, by using the local features as the Value (V) and the global features as the Query (Q) and Key (K). The CAFGL block performs two successive cross-attention operations to iteratively update the local and global features. In each stage, the softmax function converts the Q–K similarity into a probability distribution, which is then used to compute a weighted sum of V, thereby enabling the two feature types to exchange information and achieve effective fusion.

Fourth, the Spatial feature map is processed by the Temporal Encoder to learn temporal relationships and sequential dynamics along the time axis. This stage plays a critical role in capturing the temporal context of the video.

Finally, a fully connected regression head predicts an intensity score for each video segment, representing the level of viewer engagement. This prediction is formulated as a regression problem and is trained to minimize the mean squared error (MSE). The model’s performance is further evaluated using metrics such as mean average precision (mAP).

The baseline model used for comparison in this study follows the TimeSformer-based processing pipeline, as outlined in Algorithm 1.

Algorithm 1 TimeSformer

Input                       video
Output                    Tf_prediction
1                               def timesformer(video):
2                                  split video to video_frames
3                                   For frame in video_frames:
4                                       split frame to frame_patch
5                                        Flatten and embed frame_patch to patch_tokens
6                                        Spatial_token ← Spatial_encoder(patch_tokens)
7                                        Add Spatial_token into Spatial_tokens
8                                   Temporal_tokens ← Temporal_encoder(Spatial_tokens)
9                                   Tf_prediction ← Prediction_head(Temporal_tokens)
10                                 Return Tf_prediction

In Algorithm 1, the TimeSformer receives a video as its input. From lines 2 to 5, the input video is divided into fixed-size patches for each frame, which are then flattened and embedded to generate Patch_tokens. In line 6, the generated Patch_tokens are processed by the Spatial Encoder to extract spatial features. Subsequently, in line 7, the extracted Spatial_token from each frame is aggregated sequentially to form Spatial_tokens. In line 8, the Temporal Encoder learns the visual relationships among the Spatial_tokens, thereby capturing contextual information across the temporal dimension. Finally, in line 9, the Prediction Head produces the output Tf_prediction, representing the final prediction. The parameters used in Algorithm 1 are presented in Table 1.

In addition, the proposed SPOT model is presented in Algorithm 2. This model adopts the TimeSformer framework illustrated in Algorithm 1 as its backbone. However, unlike the original architecture, a CNN encoder is integrated during the frame feature extraction stage to enhance the features extracted from each frame. Specifically, the local features (LC_feature) obtained from the CNN encoder and the global features (GL_feature) extracted by the Spatial Encoder are fused through the CAFGL block before being fed into the Temporal Encoder.

Algorithm 2 SPOT

Input                       video
Output                    SPOT_prediction
1                               def SPOT(video):
2                                  split video to video_frames
3                                   For frame in video_frames:
4                                       split frame to frame_patch
5                                        Flatten and embed frame_patch to patch_tokens
6                                        GL_feature ← Spatial_encoder(patch_tokens)
7                                        LC_feature ← CNN_encoder(frame)
8                                        Fusion_feature ← CAFGL(GL_feature, LC_feature)
9                                        Add Fusion_feature into Fusion_features
10                                   Temporal_tokens ← Temporal_encoder(Fusion_features)
11                                   SPOT_prediction ← Prediction_head(Temporal_tokens)
12                                   Return SPOT_prediction

The parameters used in Algorithm 2 are summarized in Table 2.

The baseline SPOT model in Algorithm 2 employs ResNet-18 as the CNN encoder. By introducing this architectural modification, the structural differences between the processing flow of the original TimeSformer (Algorithm 1) and that of the proposed SPOT model (Algorithm 2) enable performance improvements to be achieved.

3.2. Mathematical Definition of SPOT

3.2.1. Input Clip

The SPOT model takes video clips as input, represented as a tensor

{X \in R}^{H \times W \times C \times F}

. Here, C denotes the number of channels in each frame (

C = 3

for RGB), and

F

represents the number of RGB frames sampled from the original video. Each frame has a spatial resolution of

H \times W

.

3.2.2. Decomposition into Patches

Since the proposed SPOT model is based on TimeSformer, it decomposes the input video clips into patches in the same manner as the original model within the Spatial block. Each frame is divided into

N

non-overlapping patches of size

P \times P

. This process ensures that all

N

patches completely cover the entire frame, where

N = H W / P^{2}

. Here, p = 1, …, N denotes the spatial locations.

In addition, the CNN block (e.g., ResNet-18), which operates in parallel with the Spatial block, decomposes each frame into overlapping regions (overlapping patches). The CNN applies a kernel of size

K \times K

, stride

S

, and padding

P

to the input image X∈

R^{H \times W \times C}

to extract patches. Each location

(i, j)

in the resulting feature map corresponds to a region of the input image

X [i \cdot S : i \cdot S + K, j \cdot S : j \cdot S + K]

. When

S < K

, the patches overlap, resulting in overlapping regions across the feature map.

3.2.3. Local Feature Extraction (CNN Encoder)

In the CNN encoder is designed to capture local features, each convolutional layer performs the operation described in Equation (1) on the input feature map

F^{(l - 1)}

.

F_{i, j, c}^{(l)} = \sum_{u = 1}^{K} \sum_{v = 1}^{K} \sum_{d = 1}^{C} W_{u, v, d, c}^{(l)} \cdot F_{i + u, j + v, d}^{(l - 1)} + b_{c}^{(l)}

(1)

In Equation (1),

H_{l}, W_{l},

and

C_{l}

represent the height, width, and number of channels of the output feature map, respectively. After the convolution operation defined in Equation (1), a nonlinear activation function

σ (\cdot)

, such as ReLU, is applied to introduce nonlinearity, as expressed in Equation (2).

F_{i, j, c}^{(l)} = σ (F_{i, j, c}^{(l)})

(2)

After repeating the operation in Equation (2) across multiple layers, the final feature map

F_{C N N}^{(t)} \in R^{H' \times W' \times C'}

is obtained from the last layer. This feature map has a downsampled spatial resolution (H′, W′) and a final channel size C′. Here, t = 1, …, F denotes the frame index in the video sequence.

3.2.4. Global Feature Extraction (Spatial Attention Encoder)

Each patch is then linearly mapped into a D-dimensional embedding vector

z_{(p, t)}^{(0)}

using a learnable projection matrix E∈

R^{D \times 3 P^{2}}

, as follows.

z_{(p, t)}^{(0)} = {E x}_{(p, t)} + e_{(p, t)}^{p o s}

(3)

In Equation (3),

e_{(p, t)}^{p o s}

represents the positional embedding that encodes the spatiotemporal locations of each patch. The resulting embedding sequence

z_{(p, t)}^{(0)}

is then used as the input to the Transformer.

The Transformer consists of L encoding blocks, where each block

l

computes the Query, Key, and Value vectors from the output of the previous block

z_{(p, t)}^{(l - 1)}

, as defined in Equation (4).

q_{(p, t)}^{(l, a)} = W_{Q}^{(l, a)} \cdot L N (z_{(p, t)}^{(l - 1)}) k_{(p, t)}^{(l, a)} = W_{K}^{(l, a)} \cdot L N (z_{(p, t)}^{(l - 1)}) v_{(p, t)}^{(l, a)} = W_{V}^{(l, a)} \cdot L N (z_{(p, t)}^{(l - 1)})

(4)

In Equation (4), LN (

\cdot

) denotes the LayerNorm operation, a = 1, …, A represents the attention head index, and D_h = D/A indicates the dimensionality of the latent representation for each head. TimeSformer adopts a Pre-LN architecture, in which LayerNorm is applied prior to the computation of Query, Key, and Value (Q/K/V) vectors. This design helps to stabilize the self-attention distribution when modeling the spatiotemporal characteristics of videos.

The self-attention weights are computed using the dot-product and the SoftMax (SM) function, as defined in Equation (5).

a_{(p, t)}^{(l, a)} = S M (\frac{q \cdot k^{T}}{\sqrt{D_{h}}})

(5)

Using the weights computed in Equation (5), a weighted sum of the Value vectors is calculated to obtain the final encoded representation for each patch. This process is defined in Equation (6).

s_{(p, t)}^{(l, a)} = a_{(p, t) . (0,0)}^{(l, a)} v_{(0,0)}^{(l, a)} + \sum_{p' = 1}^{N} \sum_{t' = 1}^{F} a_{(p, t) . (p^{'}, t^{'})}^{(l, a)} v_{(p', t')}^{(l, a)}

(6)

The outputs

s_{(p, t)}^{(l, a)}

from all attention heads are concatenated and then passed through a linear projection and a multi-layer perceptron (MLP). The resulting representation is updated via residual connections, as defined in Equations (7) and (8).

z_{(p, t)}^{(l)'} = W_{0} [s_{(p, t)}^{(l, 1)}, \dots, s_{(p, t)}^{(l, A)}] + z_{(p, t)}^{(l - 1)}

(7)

z_{(p, t)}^{(l)} = {M L P (L N (z}_{(p, t)}^{(l)'})) + z_{(p, t)}^{(l)'}

(8)

3.2.5. Feature Fusion Block (CAFGL)

To learn richer representations by fusing the local features from the CNN encoder and the global features from the Spatial encoder, the CAFGL (Cross-Attention Fusion for Global and Local) module was employed. In this process, the output of the CNN encoder

F_{i, j, c}^{(l)}

is used as the Query, while the final output of the Spatial encoder

z_{(p, t)}^{(l)}

serves as the Key and Value. These are mapped to the Q/K/V representations, and the cross-attention is computed as defined in Equation (9).

Q_{(i, j)} = W_{Q} \cdot F_{(p, t)}^{(i, j)} K_{(p, t)} = W_{K} \cdot z_{(p, t)}^{(l)} V_{(p, t)} = W_{V} \cdot z_{(p, t)}^{(l)}

(9)

In Equation (9),

W_{Q}, W_{K}

, and

W_{V}

denote learnable weight matrices. The CAFGL module computes the scaled dot-product between the Query and Key, followed by the application of a SM function to obtain the attention weights

a_{(i, j) (p, t)}

, as defined in Equation (10).

a_{(i, j) (p, t)} = S M (\frac{Q \cdot K^{T}}{\sqrt{D_{h}}})

(10)

The attention weights computed in Equation (10) are then used to calculate the final fused features by performing a weighted sum of the Value vectors, as defined in Equation (11).

S_{(i, j)} = \sum_{p = 1}^{N} \sum_{t = 1}^{F} a_{(i, j) (p, t)} V_{(p, t)}

(11)

Finally, the local and global features are combined, as defined in Equation (12).

H_{f u s i o n}^{(i, j)} = M L P ([S_{(i, j)}; F_{C N N}^{(i, j)}])

(12)

3.2.6. Temporal Encoder

The output of the CAFGL module,

H_{f u s i o n}^{(i, j)}

, is passed to the Temporal Encoder to learn temporal continuity. Since

H_{f u s i o n}^{(i, j)}

is a spatial feature map, it is partitioned into non-overlapping patches, following the same procedure used in the Spatial Encoder, and subsequently transformed into

{z_{t i m e}}_{(p, t)}

, as defined in Equation (13).

{z_{t i m e}}_{(p, t)} = P (H_{f u s i o n})

(13)

In Equation (13), P( ) denotes the operation of partitioning

H_{f u s i o n}^{(i, j)}

into

P \times P

patches. Using the resulting sequence

{z_{t i m e}}_{(p, t)}

, each block

l

of the Temporal Encoder computes the Query, Key, and Value vectors as defined in Equation (14).

{q_{t i m e}}_{(p, t)}^{(l, a)} = W_{Q t i m e}^{(l, a)} \cdot LN (H_{f u s i o n}^{(i, j)}) {k_{t i m e}}_{(p, t)}^{(l, a)} = W_{K t i m e}^{(l, a)} \cdot LN (H_{f u s i o n}^{(i, j)}) {v_{t i m e}}_{(p, t)}^{(l, a)} = W_{V t i m e}^{(l, a)} \cdot LN (H_{f u s i o n}^{(i, j)})

(14)

Temporal attention learns the relationships between the Query and all Key vectors across different time frames at the same spatial location p. The attention score

{a_{t i m e}}_{(p, t)}^{(l, a)}

is computed as defined in Equation (15).

{a_{t i m e}}_{(p, t)}^{(l, a)} = S M (\frac{q_{t i m e} {\cdot k_{t i m e}}^{T}}{\sqrt{D_{h}}})

(15)

Subsequently, the attention scores

{a_{t i m e}}_{(p, t)}^{(l, a)}

from Equation (15) are used to compute the weighted sum of the Value vectors, as defined in Equation (16).

{s_{t i m e}}_{(p, t)}^{(l, a)} = \sum_{t' = 1}^{F} {a_{t i m e}}_{(p, t) . t'}^{(l, a)} \cdot v_{(p, t')}^{(l, a)}

(16)

The outputs from all attention heads are concatenated and then passed through a linear projection and a multi-layer perceptron (MLP), as defined in Equation (17), to obtain the final representation.

z_{(p, t)}^{(l)} = M L P (L N (W_{0} [{s_{t i m e}}_{(p, t)}^{(l, 1)}, \dots, {s_{t i m e}}_{(p, t)}^{(l, A)}])) + W_{0} [{s_{t i m e}}_{(p, t)}^{(l, 1)}, \dots, {s_{t i m e}}_{(p, t)}^{(l, A)}]

(17)

The final representation is processed through a downstream regression head to generate an intensity score for each segment of the video, representing the level of viewer interest. Subsequently, the scipy.signal.find_peaks algorithm is applied to the resulting intensity score sequence to automatically detect highlight segments based on local maxima. In this context, highlight detection is defined as the set of points that satisfy the condition expressed in Equation (18).

P (y) = {t_{p} | y (t_{p}) > y (t_{p} - 1), y (t_{p}) > y (t_{p} - 1), y (t_{p}) > T_{h e i g h t}, ∆ T_{d i s t a n c e}}

(18)

This process effectively identifies sections of successive frames with soaring interest and is used to determine the start and end points of each highlight.

4. Experimental Results

4.1. Dataset for SPOT Models

In this study, we collected video data from the YouTube-8M (YT8M) dataset using the metadata employed in MR.Hisum [45], a previous study on highlight dataset creation. This metadata contains the information summarized in Table 3.

Using the youtube_id from the metadata, the YouTube-8M (YT-8M) dataset provided by the YouTube platform is collected, following the sequence illustrated in Figure 3.

Specifically, it includes video (visual), audio, and subtitle information, providing a rich dataset that allows the analysis of not only visual elements but also auditory and linguistic contexts.

In this study, we focused primarily on the visual modality for highlight extraction. We selectively utilized videos for which highlight information was provided by YouTube’s “Most Replayed” feature. “Most Replayed” appears as a graph on the playback bar, indicating sections of the video that users have frequently rewatched.

The “Most Replayed” feature, as shown in Figure 4, is displayed above the YouTube playback bar and visualizes the frequency with which specific segments are repeatedly viewed by users. This information can be collected from the HTML source of the video page, and the replay frequency data for the entire video is stored in the macroMarkersListEntity field in the format shown in Table 4.

The segment with the highest replay frequency among these is labeled as the “Most Replayed.” This process leverages automatically generated segment replay information from YouTube, which is based on user viewing patterns, enabling reliable highlight labeling that reflects the actual audience response for each video. In this study, to utilize this feature, we collected the HTML data of each video and extracted the highlight segments using the macroMarkersListEntity field, storing the results in JSON format.

In addition, video complexity was calculated by analyzing inter-frame pixel differences, after a light, standardized preprocessing (all frames resized to 224 × 224, sampled at 24 FPS, converted to the luminance channel with pixel values normalized to [0, 1]). Let

{I_{t}}_{t = 1}^{T}

denote the resulting frame sequence.

D_{t} = \frac{1}{H W} \sum_{x = 1}^{H} \sum_{y = 1}^{W} | I_{t} (x, y) - I_{t - 1} (x, y)

(19)

As shown in Equation (19), we first computed the mean absolute pixel-wise difference

D_{t}

between consecutive frames, normalized by image size (H,W). This simple calculation captures overall changes in intensity and texture between frames.

To aggregate over time, we computed the mean variation score:

C = \frac{1}{T - 1} \sum_{t = 2}^{T} D_{t}

(20)

Finally, I min–max normalizes C across the dataset to obtain

\hat{C}

∈ [0, 1], which enables stratification into three complexity regimes using empirical quantiles: Low (

\hat{C}

≤ q0.33), Medium (q0.33 <

\hat{C}

≤ q0.66), and High (

\hat{C}

> q0.66). For each regime, the dataset was further divided into training, validation, and test sets in a ratio of 7:1.5:1.5 to ensure balanced evaluation across complexity levels.

4.2. Comparison with Baselines

In this study, to compare the highlight prediction performance of the proposed SPOT model with that of the conventional TimeSformer model, experiments were conducted based on the mean squared error (MSE) across different levels of video complexity. The performance metrics used in the evaluation are shown in Table 5.

The results of the experiments using the performance indicators in Table 5 are shown in Table 6.

Experimental results demonstrated that the proposed SPOT model outperformed the baseline TimeSformer model in terms of mAP and F1-Score, thereby validating its improved highlight detection capability. For MSE, TimeSformer showed a slight advantage in the low-complexity group, while both models exhibited similar performance in the medium-complexity group. In contrast, in the high-complexity group, SPOT achieved an MSE of 0.080, representing a clear improvement over TimeSformer’s 0.090. This indicates that SPOT can perform more precise and stable highlight predictions in scenarios with high visual complexity.

In terms of computational cost, SPOT requires more parameters (~94.5M vs. ~89.1M) and higher FLOPs (~825G vs. ~409G MACs) than TimeSformer under the same tokenization setting, due to the additional CNN branch and the increased token sequence length. Nevertheless, these results indicate that the extra cost is justified, as SPOT provides consistently better accuracy across key metrics, particularly in high-complexity scenarios.

Furthermore, when compared to a model trained solely with a 3D CNN on the same dataset, SPOT achieved superior performance across all evaluation metrics (MSE, mAP, Coverage, and F1-Score). This supports the effectiveness of the hybrid architecture that combines local feature extraction via CNN with global relationship modeling using a Transformer, compared to a single 3D CNN-based approach. While a pure 3D CNN captures spatial-temporal features within limited receptive fields, it struggles to model long-range dependencies and contextual relationships. By contrast, SPOT leverages the CNN to encode fine-grained local patterns and the Transformer to capture global temporal dynamics, leading to more accurate highlight detection across diverse video types.

In addition, a performance comparison with the state-of-the-art model MH-DETR on our dataset showed that, while SPOT achieved overall performance comparable to MH-DETR, it outperformed MH-DETR in most cases within the high-complexity group. It should be noted, however, that MH-DETR processes refined individual frames as input, whereas SPOT takes the raw video itself as input—a difference that is likely to have influenced the results to some extent. These findings suggest that SPOT is particularly well-suited for analyzing videos with frequent scene transitions and significant changes in key objects or actions.

Figure 5 illustrates the trend that the performance of the SPOT model improves as the number of parameters in the CNN backbone increases.

The graph in the upper-left corner illustrates the relationship between the number of backbone parameters and model performance, showing a gradual improvement as the parameter count increases in the order of MobileNetV3-Small, ResNet18, and ResNet50. These results experimentally demonstrate that the parallel integration of CNNs and Transformer effectively overcomes the limitations of Transformer-only architectures in learning local visual information, thereby functioning as a highlight inference model robust to variations in video complexity.

4.3. Ablation Study

In this study, to analyze the performance contribution of each module in the SPOT model to highlight extraction, a series of ablation experiments were conducted, as shown in Table 7, in which specific modules were removed or replaced with alternative designs.

The dataset used in these experiments was not divided based on complexity. The reason is that, while the primary experiments in this paper focused on performance differences according to complexity levels (low, medium, high)—as emphasized throughout the study—the purpose of the ablation experiments is to examine how the presence or structural variation in each block affects performance. Therefore, dividing the dataset by complexity was not necessary in this context.

First, to assess the contribution of the Cross-Attention Fusion of Global and Local features (CAFGL) module, the module was removed, and experiments were conducted using only a simple fusion of the local and global features, as described in Equations (21)–(23). This is because the limitations of such simple fusion methods are evident.

F_{s u m} = L o c a l f e a t u r e + G l o b a l f e a t u r e

(21)

F_{a v g} = \frac{1}{2} (L o c a l f e a t u r e + G l o b a l f e a t u r e)

(22)

F_{m u l} = L o c a l f e a t u r e ⊙ G l o b a l f e a t u r e

(23)

First, as shown in Equation (21), the addition method assigns equal weights to features with different semantic properties, which can lead to conflicts in importance or the cancelation of information. The study Attentional Feature Fusion [46] pointed out that such simple summation can cause an information bottleneck when merging features with semantic discrepancies, thereby failing to ensure sufficient representational capacity. This effect is also reflected in our experiments, as evidenced by the results of Ablation 1 (sum) in Table 7.

Similarly, Equation (22) computes a simple average, which suffers from problems analogous to those in Equation (21). Averaging treats the information from the two domains equally without reflecting the independent significance of each feature, ultimately weakening the representational power of the fused information. A recent study proposing an Adaptive Feature Fusion technique [47] mentioned that traditional averaging-based fusion could degrade a model’s generalization performance and dilute critical information depending on the context. This effect is also reflected in our experiments, as evidenced by the results of Ablation 1 (avg) in Table 7.

Finally, Equation (23) fuses features through multiplication. This approach only activates when both input features are strongly expressed; if one feature is weak, the overall fusion result may collapse to an insignificant value. The study Attentional Feature Fusion [45] reported that multiplication-based fusion can cause information loss when there is an imbalance in the importance of the features, directly affecting visual recognition performance. This effect is also reflected in our experiments, as evidenced by the results of Ablation 1 (mul) in Table 7.

Considering these limitations of non-CAFGL fusion methods, the CAFGL used in the SPOT model enables sophisticated information integration by reflecting the relative importance of each feature while also accounting for their mutual context. This capability significantly contributes to improving highlight prediction performance in SPOT, where the fusion of local and global features is essential.

Next, to evaluate the impact of the CNN Encoder, we compared the performance using the original TimeSformer architecture. The results revealed an overall decrease in performance, with a particularly pronounced drop in prediction accuracy for high-complexity videos.

To assess the influence of the Spatial Encoder, we removed this module and performed predictions using only spatial features at the single-frame level. This resulted in a noticeable degradation in overall performance, as the model failed to sufficiently capture important objects or events. In particular, ignoring spatial-level relationships and the distribution of subjects often led to missing frames that should have been identified as highlights. This demonstrates that the SPOT model relies not only on simple visual features but also on its ability to capture the spatial composition and interactions within a scene.

Finally, to verify the role of the Temporal Encoder, we removed this module and replaced it with a structure that simply averages frame-level features before prediction. This modification led to a significant drop in performance across all datasets, with particularly severe degradation in videos where the temporal progression of events was critical. This result underscores the decisive importance of learning and preserving temporal contextual information for highlight prediction performance.

Overall, these findings confirm that the superior performance of the SPOT model is achieved through the CAFGL module, which integrates CNN and Transformer branches, and the Temporal Encoder, which captures temporal dependencies. Notably, SPOT demonstrates more stable and robust performance in high-complexity video environments compared to pure Transformer-based models such as TimeSformer.

5. Conclusions and Future Studies

In this study, we proposed the SPOT model, which combines CNN and TimeSformer to improve the precision of video highlight prediction in response to the growing demand for short-form and highlight videos. This hybrid architecture was designed to learn both local and global features simultaneously. In particular, by scaling the CNN backbone from MobileNetV3-Small to ResNet18 and ResNet50, the number of parameters increased, enabling the extraction of richer local features, which, when fused with the Transformer, led to overall performance improvements. As a result, SPOT achieved superior highlight prediction performance in complex scenes (with up to a 0.01 reduction in MSE), allowing for the automatic extraction of short-form content from large-scale video collections—a format that aligns with the rapidly growing viewing trend. Moreover, SPOT outperformed the SOTA benchmark MH-DETR in complex scenes.

The proposed SPOT model was designed to enhance the global spatio-temporal representation learning capability of TimeSformer by incorporating a CNN-based encoder that effectively captures local visual patterns. This hybrid structure contributed to overall performance gains, particularly in improving highlight prediction precision in complex scenes. However, the addition of the CNN module increased the number of parameters and computational cost, resulting in higher resource consumption during both training and inference. SPOT tended to have longer training times compared to TimeSformer, which could pose a significant limitation for large-scale dataset processing or real-time applications such as mobile environments; therefore, future work will focus on improving computational efficiency through model compression (e.g., pruning, quantization) and optimization techniques.

The video dataset used in this study is a typical multimodal dataset composed of temporal information, visual information (frames, thumbnails, etc.), and audio information. However, the highlight extraction of SPOT relies solely on temporal and visual frame information, which limits its ability to capture semantic cues from the audio track. We sincerely appreciate the reviewer’s suggestion in this regard, as audio signals can indeed provide important cues that effectively reveal highlight moments. For example, audio features can be extracted using MFCC (Mel-Frequency Cepstral Coefficients) [48] and then fed into a CNN-based encoder for learning. Such an approach would allow the integration of complementary semantic information with the visual context of video scenes. In future work, we plan to extend our system by incorporating not only audio but also other multimodal sources such as subtitles and visual metadata (e.g., thumbnails), thereby leveraging richer contextual information for video highlight extraction. These advancements are expected to contribute to a wide range of real-world video platform applications, including automatic editing and summarization.

Author Contributions

Conceptualization, S.K.; methodology, S.K.; formal analysis, S.K.; software, S.K.; validation, S.K.; investigation, J.L. and S.P.; resources, S.K.; supervision, J.L. and S.P.; project administration, S.K.; writing—original draft preparation, S.K.; writing—review and editing, S.K., J.L. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science, ICT), Korea, under the Global Research Support Program in the Digital Field program (RS-2024-00428758) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on YouTube-8M at https://research.google.com/youtube8m/ (accessed on 8 September 2025).

Acknowledgments

During the preparation of this manuscript, Korean text was translated into English. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Lin, D.C.-E.; Caba Heilbron, F.; Lee, J.-Y.; Wang, O.; Martelaro, N. Videogenic: Identifying Highlight Moments in Videos with Professional Photographs as a Prior. In Proceedings of the 16th Conference on Creativity & Cognition, Chicago, IL, USA, 23–26 June 2024; pp. 328–346. [Google Scholar]
Vora, D.; Kadam, P.; Mohite, D.D.; Kumar, N.; Kumar, N.; Radhakrishnan, P.; Bhagwat, S. AI-driven video summarization for optimizing content retrieval and management through deep learning techniques. Sci. Rep. 2025, 15, 4058. [Google Scholar] [CrossRef] [PubMed]
Xiong, B.; Kalantidis, Y.; Ghadiyaram, D.; Grauman, K. Less is More: Learning Highlight Detection from Video Duration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1258–1267. [Google Scholar]
Zeng, C. Unveiling the impact of short videos: Consumption behavior and decision-making in the digital age. Highlights Bus. Econ. Manag. 2023, 21, 469–474. [Google Scholar] [CrossRef]
Arslan, S.; Tanberk, S. Key frame extraction with attention based deep neural networks. arXiv 2023, arXiv:2306.13176. [Google Scholar] [CrossRef]
Violot, C.; Elmas, T.; Bilogrevic, I.; Humbert, M. Shorts vs. Regular Videos on YouTube: A Comparative Analysis of User Engagement and Content Creation Trends. In Proceedings of the 16th ACM Web Science Conference, New York, NY, USA, 21–24 May 2024; pp. 213–223. [Google Scholar]
Zannettou, S.; Nemes-Nemeth, O.; Ayalon, O.; Goetzen, A.; Gummadi, K.P.; Redmiles, E.M.; Roesner, F. Analyzing User Engagement with TikTok’s Short Format Video Recommendations Using Data Donations. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–16. [Google Scholar]
Chen, Z.; Liu, P.; Piao, J.; Xu, F.; Li, Y. Shorter is different: Characterizing the dynamics of short-form video platforms. arXiv 2024, arXiv:2410.16058. [Google Scholar] [CrossRef]
Van Daele, T.; Iyer, A.; Zhang, Y.; Derry, J.C.; Huh, M.; Pavel, A. Making Short-Form Videos Accessible with Hierarchical Video Summaries. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–17. [Google Scholar]
Peronikolis, M.; Panagiotakis, C. Personalized Video Summarization: A comprehensive survey of methods and datasets. Appl. Sci. 2024, 14, 4400. [Google Scholar] [CrossRef]
Manic, M. Short-form video content and consumer engagement in digital landscapes. Bull. Transilv. Univ. Brasov. Ser. V Econ. Sci. 2024, 17, 45–52. [Google Scholar] [CrossRef]
Islam, Z.; Paul, S.; Rochan, M. Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; pp. 8702–8711. [Google Scholar]
Xu, Y.; Sun, Y.; Zhai, B.; Jia, Y.; Du, S. MH-DETR: Video Moment and Highlight Detection with Cross-Modal Transformer. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar]
Zhao, P.; He, Z.; Zhang, F.; Lin, S.; Zhou, F. Ld-detr: Loop decoder detection transformer for video moment retrieval and highlight detection. arXiv 2025, arXiv:2501.10787. [Google Scholar]
Zhang, T.; Cui, W.; Liu, S.; Jiang, F. SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and Transformer. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
Xiong, T.; Wei, W.; Xu, K.; Chen, D. SA-DETR: Span Aware Detection Transformer for Moment Retrieval. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 7634–7647. [Google Scholar]
Sun, H.; Zhou, M.; Chen, W.; Xie, W. Tr-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2024; pp. 4998–5007. [Google Scholar]
Arnab, A.; Sun, C.; Schmid, C. Unified Graph Structured Models for Video Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 8117–8126. [Google Scholar]
Tang, Y.; Bi, J.; Xu, S.; Song, L.; Liang, S.; Wang, T.; Zhang, D.; An, J.; Lin, J.; Zhu, R. Video understanding with large language models: A survey. arXiv 2023, arXiv:2312.17432. [Google Scholar] [CrossRef]
Zeng, Z.; McDuff, D.; Song, Y. Contrastive learning of global and local video representations. Adv. Neural Inf. Process. Syst. 2021, 34, 7025–7040. [Google Scholar]
Sun, G.; Liu, Y.; Ding, H.; Wu, M.; Van Gool, L. Learning local and global temporal contexts for video semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6919–6934. [Google Scholar] [CrossRef]
Wei, F.; Wang, B.; Ge, T.; Jiang, Y.; Li, W.; Duan, L. Learning Pixel-Level Distinctions for Video Highlight Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3073–3082. [Google Scholar]
Badamdorj, T.; Rochan, M.; Wang, Y.; Cheng, L. Contrastive Learning for Unsupervised Video Highlight Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14042–14052. [Google Scholar]
Xu, J.; Liu, S.; Chen, G.; Liu, Q. Highlight Detection and Removal Method Based on Bifurcated-CNN. In Proceedings of the International Conference on Intelligent Robotics and Applications, Harbin, China, 1–3 August 2022; pp. 307–318. [Google Scholar]
Al-Saad, M.; Ramaswamy, L.; Bhandarkar, S. F4D: Factorized 4D Convolutional Neural Network for Efficient Video-level Representation Learning. arXiv 2023, arXiv:2401.08609. [Google Scholar]
Paul, D.; Parvez, M.R.; Mohammed, N.; Rahman, S. VideoLights: A Cross-Modal Cross-Task Transformer Model for Joint Video Highlight Detection and Moment Retrieval. arXiv 2024, arXiv:2412.01558. [Google Scholar]
Azad, R.; Kazerouni, A.; Azad, B.; Khodapanah Aghdam, E.; Velichko, Y.; Bagci, U.; Merhof, D. Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 736–746. [Google Scholar]
Pereira, G.A.; Hussain, M. A review of transformer-based models for computer vision tasks: Capturing global context and spatial relationships. arXiv 2024, arXiv:2408.15178. [Google Scholar] [CrossRef]
Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef] [PubMed]
Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local Features Coupling Global Representations for Visual Recognition. In Proceedings of the IEEE/CVF international conference on computer vision, Montreal, QC, Canada, 10–17 October 2021; pp. 367–376. [Google Scholar]
Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention all You Need for Video Understanding? In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; p. 4. [Google Scholar]
Menon, V.V.; Feldmann, C.; Amirpour, H.; Ghanbari, M.; Timmerer, C. VCA: Video Complexity Analyzer. In Proceedings of the 13th ACM Multimedia Systems Conference, Athlone, Ireland, 14–17 June 2022; pp. 259–264. [Google Scholar]
Alaa, T.; Mongy, A.; Bakr, A.; Diab, M.; Gomaa, W. Video Summarization Techniques: A Comprehensive Review. arXiv 2024, arXiv:2410.04449. [Google Scholar] [CrossRef]
Li, T.; Sun, Z.; Xiao, X. Unsupervised modality-transferable video highlight detection with representation activation sequence learning. IEEE Trans. Image Process. 2024, 33, 1911–1922. [Google Scholar] [CrossRef]
Lee, M.J.; Gong, D.; Cho, M. Video Summarization with Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 18981–18991. [Google Scholar]
Tang, K.-S.; So, H.-J.; Rappa, N. Examining the Multimodal Design of Explainer Videos: A multimodal Content Analysis of Khan Academy Online Resources. SSRN Electron. J. 2023. [CrossRef]
Aminbeidokhti, M.; Pedersoli, M.; Cardinal, P.; Granger, E. Emotion Recognition with Spatial Attention and Temporal Softmax Pooling. In Proceedings of the International Conference on Image Analysis and Recognition, Waterloo, ON, Canada, 27–29 August 2019; pp. 323–331. [Google Scholar]
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3d Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Hassan, E. Learning video actions in two stream recurrent neural network. Pattern Recognit. Lett. 2021, 151, 200–208. [Google Scholar] [CrossRef]
Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A Video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
Chen, X.; Shi, S.; Ma, T.; Zhou, J.; See, S.; Cheung, K.C.; Li, H. M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving. arXiv 2025, arXiv:2503.18100. [Google Scholar] [CrossRef]
Chen, J.; Ho, C.M. Mm-vit: Multi-Modal Video Transformer for Compressed Video Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1910–1921. [Google Scholar]
Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv 2023, arXiv:2303.04671. [Google Scholar] [CrossRef]
Xiao, B.; Yin, X.; Kang, S.-C. Vision-based method of automatically detecting construction video highlights by integrating machine tracking and CNN feature extraction. Autom. Constr. 2021, 129, 103817. [Google Scholar] [CrossRef]
Sul, J.; Han, J.; Lee, J. Hisum: A large-scale dataset for video highlight detection and summarization. Adv. Neural Inf. Process. Syst. 2023, 36, 40542–40555. [Google Scholar]
Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional Feature Fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar]
Mungoli, N. Adaptive feature fusion: Enhancing generalization in deep learning models. arXiv 2023, arXiv:2304.03290. [Google Scholar] [CrossRef]
Kim, D.-H.; Son, W.-H.; Kwak, S.-S.; Yun, T.-H.; Park, J.-H.; Lee, J.-D. A hybrid deep learning emotion classification system using multimodal data. Sensors 2023, 23, 9333. [Google Scholar] [CrossRef] [PubMed]

Figure 1. SPOT model architecture.

Figure 2. CAFGL Block.

Figure 3. Download flowchart of dataset.

Figure 4. YouTube’s “Most Replayed”.

Figure 5. SPOT performance changes as the CNN encoder changes: (a) MSE versus parameters; (b) F1-score versus parameters.

Table 1. Parameters of Algorithm 1.

Variable Name	Description
Video	Input video sequence
Tf_prediction	The output of Prediction_head, i.e., the final predicted value of the TimeSformer model
video_frames	List/arrangement of frames divided by video (T frames)
frame	Single frame in video_frames
frame_patch	As a result of dividing the frame into smaller patches (for example, 16 × 16)
patch_tokens	Token vector embedded with frame_patch unfolded (linear projection)
Spatial_token	Output of spatial encoder (set of tokens in one frame)
Spatial_tokens	Sequence that gathers the spatial_token of all frames (stacked by the time axis)
Temporal_tokens	Temporary encoder results (final token sequence with added time context)

Table 2. Parameters of Algorithm 2.

Variable Name	Description
Video	Input video sequence
SPOT_prediction	Final prediction value (e.g., highlight score)
Video_frame	List/array of frames obtained from video segmentation
frame	A single frame within video_frames
frame_patch	Result of dividing a frame into small patches
patch_tokens	Token vector obtained by embedding a frame patch
GL_feature	Global feature extracted from the Spatial Encoder
LC_feature	Local feature extracted from the CNN Encoder
CAFGL	Module for dynamically fusing local and global features based on cross-attention
Fusion_feature	Fused feature obtained by integrating GL_feature and LC_feature through the CAFGL block
Fusion_features	Sequence of fusion features from all frames arranged in temporal order
Temporal_tokens	Temporal Encoder output (tokens with temporal context incorporated)

Table 3. Columns in the metadata.

Column Name	Description
video_id	A unique identifier assigned to distinguish each video within the dataset
yt8m_file	The file name of the corresponding video in the YouTube-8M dataset
random_id	A randomly generated ID created during the data randomization process
youtube_id	The actual YouTube video ID (https://www.youtube.com/ watch?v=%E2%80%98youtube_id%E2%80%99 (accessed on 8 September 2025))
duration	The total playback duration of the video
views	The number of views for the video
entry 2	The set of topics or categories associated with the video

Table 4. macroMarkersListEntity’s data.

Name	Description	Type
startMillis	Start time in milliseconds	String
durationMillis	Duration in milliseconds	String
intensityScoreNormalized	Normalized intensity score	Float

Table 5. Performance metrics.

Metric	Description	Significance
MSE	Mean Squared Error	Evaluates prediction accuracy including outliers
mAP	Mean Average Precision	Measures the ability to detect highlights that match the ground truth
Coverage Ratio	The proportion of predicted highlight segments that contain the actual highlights (ground truth).	Evaluation of the ability to correctly match the start and end positions of highlights.
F1-Score	$\frac{2 \cdot P r e c i s i o n \cdot R e c a l l}{P r e c i s i o n + R e c a l l}$	Evaluates both the accuracy of the highlights detected by the model and its ability to avoid missing relevant highlights.

Table 6. Comparison of average performance by complexity group.

Model	Params (M)	MACs (G)	Complexity	MSE	mAP	Coverage	F1-Score
SPOT	~94.5	~825	Low	0.065	0.504	0.78	0.85
			Medium	0.070	0.662	0.80	0.87
			High	0.080	0.562	0.85	0.92
TimeSformer	~89.1	~409.1	Low	0.060	0.452	0.78	0.85
			Medium	0.071	0.561	0.79	0.86
			High	0.090	0.503	0.78	0.85
3D-CNNN	~33	~60	Low	0.087	0.434	0.834	0.44
			Medium	0.084	0.482	0.745	0.36
			High	0.056	0.502	0.603	0.33
MH-DETR	~37	~95	Low	0.078	0.52	0.45	0.60
			Medium	0.073	0.55	0.48	0.80
			High	0.068	0.54	0.46	0.80

Table 7. Ablation study: “√” indicates that the corresponding module/component is included in the model.

Test Name	Module				MSE	mAP	Coverage	F1-Score
Test Name	CFAGL Block	CNN Encoder	Spatial Encoder	Temporal Encoder	MSE	mAP	Coverage	F1-Score
SPOT	√	√	√	√	0.065	0.576	0.810	0.880
Ablation 1 (sum)		√	√	√	0.072	0.094	0.264	0.223
Ablation 1 (avg)		√	√	√	0.071	0.096	0.268	0.226
Ablation 1 (mul)		√	√	√	0.080	0.084	0.242	0.198
Ablation 2			√	√	0.0737	0.505	0.783	0.853
Ablation 3		√		√	0.073	0.141	0.315	0.278
Ablation 4	√	√	√		0.071	0.174	0.174	0.227

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kwak, S.; Lee, J.; Park, S. The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual. Electronics 2025, 14, 3640. https://doi.org/10.3390/electronics14183640

AMA Style

Kwak S, Lee J, Park S. The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual. Electronics. 2025; 14(18):3640. https://doi.org/10.3390/electronics14183640

Chicago/Turabian Style

Kwak, Sungshin, Jaedong Lee, and Sohyun Park. 2025. "The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual" Electronics 14, no. 18: 3640. https://doi.org/10.3390/electronics14183640

APA Style

Kwak, S., Lee, J., & Park, S. (2025). The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual. Electronics, 14(18), 3640. https://doi.org/10.3390/electronics14183640

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

The Effective Highlight-Detection Model for Video Clips Using Spatial—Perceptual

Abstract

1. Introduction

2. Related Research

2.1. Video Understanding

2.2. CNN-Based Approaches

2.3. Attention-Based Approaches

2.4. Transformer-Based Approaches

2.5. Multimodal-Based Approach

3. SPOT Model

3.1. Structure of SPOT Models

3.2. Mathematical Definition of SPOT

3.2.1. Input Clip

3.2.2. Decomposition into Patches

3.2.3. Local Feature Extraction (CNN Encoder)

3.2.4. Global Feature Extraction (Spatial Attention Encoder)

3.2.5. Feature Fusion Block (CAFGL)

3.2.6. Temporal Encoder

4. Experimental Results

4.1. Dataset for SPOT Models

4.2. Comparison with Baselines

4.3. Ablation Study

5. Conclusions and Future Studies

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI