Article

AutoStageMix: Fully Automated Stage Cross-Editing System Utilizing Facial Features

by Minjun Oh, Howon Jang and Daeho Lee *
Department of Software Convergence, Kyung Hee University, Yongin 17104, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7613; https://doi.org/10.3390/app15137613
Submission received: 8 June 2025 / Revised: 27 June 2025 / Accepted: 5 July 2025 / Published: 7 July 2025

Abstract

StageMix is a video compilation of multiple stage performances of the same song, edited seamlessly together using appropriate editing points. However, generating a StageMix requires specialized editing techniques and is a considerably time-consuming process. To address this challenge, we introduce AutoStageMix, an automated StageMix generation system designed to perform all processes automatically. The system is structured into five principal stages: preprocessing, feature extraction, transition point identification, editing path determination, and StageMix generation. The initial stage of the process involves audio analysis to synchronize the sequences across all input videos, followed by frame extraction. After that, facial features are extracted from each video frame. Next, transition points are identified, which form the basis for face-based transitions, inter-stage cuts, and intra-stage cuts. Subsequently, a cost function is defined to facilitate the creation of cross-edited sequences. The optimal editing path is computed using Dijkstra’s algorithm to minimize the total cost of editing. Finally, the StageMix is generated by applying appropriate editing effects tailored to each transition type, aiming to maximize visual appeal. Experimental results suggest that our method generally achieves lower NME scores than existing StageMix generation approaches across multiple test songs. In a user study with 21 participants, AutoStageMix achieved viewer satisfaction comparable to that of professionally edited StageMixes, with no statistically significant difference between the two. AutoStageMix enables users to produce StageMixes effortlessly and efficiently by eliminating the need for manual editing.

1. Introduction

StageMix is a video that cross-edits multiple stage performances of the same song. It has gained significant popularity, particularly in the context of K-pop idol performances. The visual satisfaction and immersion of a StageMix largely depend on how natural the scene transitions are, as well as their frequency and timing. Therefore, creating a StageMix requires a skilled editor who aligns the input videos to a consistent sequence, identifies appropriate editing points, and utilizes techniques such as hard cut editing and face-based editing to ensure seamless transitions between different performance videos. Moreover, producing a single StageMix requires approximately 10 h of manual work by a professional editor. These constraints make creating cross-edited videos a challenging and time-consuming task. For fans who want to create their own StageMix, the task can be especially challenging.
To address the challenges of manual editing, prior works have proposed automated approaches for StageMix generation. Jung et al. [1] employed facial and body keypoints to guide transitions, but their method lacked robustness due to limited frame comparison and simplistic heuristics. Lee et al. [2] improved visual continuity using eye landmarks and dynamic programming, yet their approach relied on commercial tools for shot boundary detection and provided a limited explanation of the preprocessing step.
To address the limitations of these previous works, we present AutoStageMix, a system designed to automatically generate a StageMix from multiple stage performances. AutoStageMix introduces several key improvements over prior approaches. First, it incorporates head pose features to assess morphological similarity more comprehensively and enhance the accuracy of face-based transitions. Second, it removes the dependency on external software by fully automating the entire process. The system consists of five key stages: preprocessing of the input video, feature extraction from the preprocessed data, transition point identification, editing path determination, and StageMix generation.
The first stage involves analyzing the audio of the input videos to synchronize sequences and extract frames. The second stage focuses on extracting facial features from the extracted frames. The third stage involves identifying potential transition points based on facial features. In the fourth stage, an optimal editing path is determined with the goal of maximizing visual satisfaction and immersion, based on the identified transition points. The final stage generates the StageMix by applying suitable transition effects tailored to each transition type, based on the determined editing path. These five steps enable the creation of an optimal StageMix output from the input footage.
We conducted both a quantitative evaluation and a user study to demonstrate the effectiveness of AutoStageMix. The quantitative evaluation assessed the accuracy of face matching, which is a critical aspect of face-based transitions. In the user study, participants were presented with StageMixes created by AutoStageMix and rated their level of satisfaction. These evaluations confirmed the practicality and efficiency of AutoStageMix, thereby demonstrating its potential to greatly benefit both K-pop fans and general users.

2. Related Works

2.1. Cross-Editing Video Generation

The generation of StageMix has been the subject of various studies. Jung et al. [1] proposed a novel approach that uses the extraction of facial or body keypoints from multiple performance videos to automatically generate cross-edited footage. To prevent the inclusion of different scenes around transition points where similar faces appear, groups of five consecutive frames were compared to each other. However, this approach does not compare a sufficient number of frames to fully ensure visual immersion. Moreover, in their pose-based editing method, the researchers used the center coordinates of performers’ bounding boxes to assess pose similarity within frames for cross-editing. However, this approach may lack precision.
Lee et al. proposed a novel method for generating StageMix using eye landmark features [2]. Their approach used dynamic programming to identify the optimal editing path and improved the visual continuity between frames by silhouette alignment and virtual camera view optimization. However, their study has certain limitations, including an insufficient explanation of the preprocessing steps, such as audio synchronization, and a reliance on commercial software for shot boundary detection. Furthermore, relying solely on eye landmarks to determine facial similarity could disrupt viewer immersion, particularly when face-based cross-editing involves different individuals or faces of different genders.
Shrestha et al. proposed a method for creating a mashup video by combining footage captured from multiple cameras filming a concert [3]. They synchronized the videos using audio fingerprints and analyzed both audio and video features to identify the most suitable segments for combination using a greedy algorithm. This method can be applied to create StageMix by combining multiple videos.
Recently, Achary et al. introduced Real Time GAZED, a real-time video editing system that stabilizes camera trajectories and assembles aesthetically pleasing shots on-the-fly [4]. While the system emphasizes dynamic camera control over face-based transitions, it shares AutoStageMix’s goal of automating professional-quality editing and delivering an engaging viewing experience in real time. In addition, Girmaji et al. proposed EditIQ, an automated editing framework that generates cinematically coherent videos from static, wide-angle footage [5]. The system simulates multiple virtual camera feeds and employs large language model (LLM)-based dialogue analysis along with visual saliency prediction to guide shot selection and transitions. EditIQ’s approach to automated editing based on contextual and visual cues is relevant to broader research in cross-editing.
In summary, previous studies have explored various approaches to automatically identifying transition points and assembling edited videos from multiple sources. These include face-based or pose-based alignment for performance videos, dialogue and saliency-driven editing for cinematic footage, and real-time systems for dynamic content creation. Building upon these findings of previous studies, our research aims to propose a fully automated StageMix generation system that focuses on improving transition accuracy through facial feature matching, while enhancing usability and viewer engagement.

2.2. Feature Extraction Techniques

A common transition technique in StageMix is cross-editing multiple videos based on scenes with similar faces. In this process, the facial landmarks extracted from the main figure are used to align faces appearing in different videos. Kazemi and Sullivan developed a technique that uses an ensemble of regression trees for fast and accurate facial landmark detection [6]. Taigman et al. introduced Deepface, a model that uses deep neural networks to achieve high accuracy in face recognition and robust face verification across various environments [7]. Zhang et al. proposed the MTCNN (Multi-task Cascaded Convolutional Networks) model, which efficiently detects faces and facial landmarks at high speed and with high accuracy [8].
Building on these advancements, Deng et al. proposed RetinaFace, a method that achieves high accuracy in face detection across diverse real-world environments [9]. RetinaFace employs multi-task learning, which includes face classification, face bounding box regression, facial landmarks regression, and dense regression to predict facial position, landmarks, and three-dimensional poses. Due to its high accuracy in complex environments, we used RetinaFace for face detection in our research.
Previous studies have primarily used facial landmarks to match similar faces. In this paper, we used head pose estimation to achieve higher accuracy in matching faces. Ruiz et al. proposed a method for estimating head pose from image intensity, without using keypoints [10]. Li et al. presented an approach combining image rectification with a lightweight CNN, reducing perspective distortion and achieving high accuracy and speed in head pose estimation [11]. Prados-Torreblanca et al. proposed SPIGA, which uses Graph Attention Networks (GAT) to extract facial landmarks and accurately estimate head poses [12], and we used this method for its high accuracy in head pose estimation. To address the challenge of aligning faces with large pose variations, Zhu et al. developed the 3D Dense Face Alignment (3DDFA) model, which incorporates a dense 3D face model to handle extreme angles, thereby improving alignment robustness for profile views [13].
To prevent transitions between different individuals, a verification step is used to ensure continuity of the same individual. Parkhi et al. developed the VGGFace model, which achieved high accuracy in face recognition by utilizing a deep convolutional neural network (CNN) to extract facial features effectively, demonstrating robust performance across various environments [14]. Schroff et al. introduced FaceNet, which maps facial images into a high-dimensional embedding space to measure facial similarity for face recognition and clustering tasks [15]. FaceNet was trained using triplet loss and achieved high accuracy in face verification. In our research, we used FaceNet to ensure the continuity of the main figure during scene transitions in StageMix generation.

2.3. Video Editing Techniques

Several factors must be considered to generate seamless cross-edited transitions using facial features. Smith investigated the impact of continuous editing in film and video on audience attention [16]. He described how continuous editing techniques, such as eye-level alignment and consistent screen orientation, help maintain logical consistency between scenes while keeping the audience engaged with the narrative. Magliano and Zacks further emphasized that continuity editing supports audience comprehension by reinforcing spatial and temporal coherence, helping viewers maintain a mental model of events and minimizing cognitive disruption during transitions [17]. Ardizzone et al. explored the automatic generation of subject-based transitions in slideshows and demonstrated through user evaluations that subject-focused transition techniques were preferred over simple crossfades [18]. They also highlighted that linear easing—where the change proceeds at a constant rate—is less aesthetically pleasing and recommended using non-linear functions such as ease-in, ease-out, and ease-in-out for the blending process.
In face-based transitions, where dozens of frames are utilized near the transition point, it is crucial to accurately identify shot boundaries to prevent a disruption in visual immersion. Bendraou et al. proposed a high-accuracy method for detecting shot boundaries using histogram difference and the SURF algorithm [19]. However, this approach requires high computing power. Radwan et al. proposed a method for detecting scene changes based on a histogram correlation between consecutive frames [20]. Gygli presented a fast, fully convolutional network for shot boundary detection, achieving high speed but limited performance in long dissolves and fast-motion scenes [21]. Thakar et al. introduced a block-based χ 2 histogram method to detect abrupt transitions, though it requires manual threshold adjustments for optimal performance across different videos [22].
In this paper, we apply editing effects based on the findings of the studies above, thereby ensuring smooth scene transitions. Additionally, we propose a shot boundary detection method using histogram correlation and the Intersection over Union (IoU) of bounding boxes obtained during feature extraction. This method is easy to implement while maintaining high performance.

3. AutoStageMix

Figure 1 illustrates the five steps of AutoStageMix: preprocessing, feature extraction, identifying a transition point, finding an editing path, and StageMix generation. The input consists of multiple performance videos featuring the same singer performing the same song. The output is a StageMix produced by seamlessly editing these input videos into what appears to be a single-stage performance.
In the preprocessing step, audio is extracted from the input videos, and all videos are time-synchronized by calculating the cross-correlation of their waveforms. Furthermore, frames are extracted at 29.97 frames per second, which is the standard frame rate used in Korean broadcasting. Subsequently, in the feature extraction step, the faces in the frame images are detected and their head pose features are extracted. In the transition point identification step, we define the types of edits to be applied in creating the StageMix, and all possible editing points are calculated based on the extracted frame images. In the step for determining an editing path, Dijkstra’s algorithm [23] is employed to identify the most efficient route from the initial frame to the final frame of the StageMix. For this, each frame is represented in a graph structure that includes information on the type of transition to the next frame and the transition cost. Finally, the video is created using the appropriate editing effects corresponding to the editing path in the StageMix generation step.

3.1. Preprocessing

The input videos often contain non-performance segments, such as stage introduction before the performance or audience cheers afterward, which lead to variations in video length. To ensure temporal alignment and extract an identical number of image frames from each video, it is essential to synchronize their timestamps. Since all input videos contain the same audio track, we utilize audio waveforms for synchronization. Specifically, we used cross-correlation to calculate the time delay and achieve synchronization of audio signals as proposed by Knapp and Carter [24].
The videos are arranged in ascending order of length, with the shortest video serving as the reference. The baseline audio is extracted from the reference video, while the target audio is extracted from the other input videos. Cross-correlation between these audio waves is then calculated.
Let the signal from the base audio be $f[t]$ and the signal from the target audio be $g[t]$; the cross-correlation function $R_{xy}[t]$ is then calculated as follows:
$$ R_{xy}[t] = \sum_{m=0}^{N-1} f[m]\, g[m+t], $$
where $N$ is the length of $f[t]$.
To calculate the time delay between the two signals, the index of the peak value of the cross-correlation function, $t_p$, must be identified. The lag, representing the delay in samples between the two signals, is calculated as follows:
$$ t_p = \arg\max_{t} R_{xy}[t], $$
$$ \mathrm{lag} = t_p - (N - 1). $$
Figure 2 shows the cross-correlation results of two audio signals, illustrating their similarity. The x-axis represents the lag time, while the y-axis represents the correlation coefficient. The center index indicates the point at which the base and target audio are precisely aligned, and the peak index indicates the point of maximum correlation. The time delay is calculated by dividing the lag between these two indices by the sampling rate.
After time synchronization, the lengths of each video are aligned to match that of the reference video. Then, frames are extracted at the same frame rate (29.97 fps) as the original input video, with a resolution of FHD (1920 × 1080).
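As an illustration of this synchronization step, the following minimal Python sketch estimates the sample lag between two mono waveforms using full cross-correlation, assuming the audio has already been extracted and resampled to a common rate; the function name and the toy waveforms are our own, not the authors' implementation.

```python
import numpy as np

def estimate_lag(base: np.ndarray, target: np.ndarray) -> int:
    """Return the lag (in samples) of `target` relative to `base`
    using the full cross-correlation R_xy and lag = t_p - (N - 1)."""
    r = np.correlate(target, base, mode="full")   # R_xy[t], length 2N - 1
    t_p = int(np.argmax(r))                       # index of the peak
    return t_p - (len(base) - 1)                  # lag in samples

# Toy example: the target is the base signal delayed by 0.5 s.
sr = 8000                                         # assumed sampling rate
base = np.random.randn(2 * sr).astype(np.float32)
target = np.concatenate([np.zeros(sr // 2, np.float32), base])[: 2 * sr]
print(estimate_lag(base, target) / sr)            # ~0.5 s
```

In practice, the estimated lag divided by the sampling rate gives the time offset used to trim the longer videos to the reference.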

3.2. Feature Extraction

In our work, we determine the type of transition between frames by analyzing the facial features of individuals appearing in each frame. The face bounding boxes in each frame were obtained using RetinaFace [9]. Subsequently, SPIGA [12] was used to extract the head poses and facial landmarks from the detected face regions. Figure 3 illustrates the head pose in 3D space, defined by yaw, pitch, and roll.

3.3. Identifying a Transition Point

3.3.1. Defining Transition Types

To create StageMix, we defined and utilized four primary types of transitions between frames: Face-Based Transition (FBT), Inter-Stage Cut (INTER-C), Intra-Stage Cut (INTRA-C), and None. Each transition type is categorized according to whether it involves a switch to a frame from a different video or remains within the same scene sequence. Figure 4 illustrates each transition type.
The first type of transition is FBT, which involves identifying morphologically similar faces in frames from different videos and applying face alignment and alpha blending to achieve a seamless transition. FBT is the most crucial transition type for enhancing viewer satisfaction.
The second type of transition is INTER-C, a hard cut from one video to another. This transition is used to avoid having a single video dominate the scene for too long.
The third type of transition is INTRA-C, which refers to a hard cut within a single video. These transitions are typically pre-existing cuts in the original stage footage. By identifying the locations of INTRA-C transitions (i.e., shot boundaries), the system can avoid FBT or INTER-C transitions near these points.
The final transition type, None, refers to a simple progression from one frame to the next within the same scene, without any transition to a different scene. Unlike the first three types, which involve transitions to different scenes, the None type does not result in a scene change.

3.3.2. Determining the Main Figure in the Frame

FBT is based on a single face, even when multiple faces are detected within a frame. Jung et al. [1] proposed a method for selecting the most appropriate face for FBT, where they identified the main figure based on the horizontal length of the bounding box of the detected faces. Lee et al. [2] suggested an alternative approach that utilizes the length of the line segment connecting the two eyes and the coordinates of the center point between the eyes.
In our study, the location and size of the bounding boxes of the detected faces were used to determine the main figure. The face location was determined by identifying the coordinates of the center of the bounding box, while the size was determined by calculating the area of the bounding box. For an input image with a resolution of 1920 × 1080, the largest face located within 300 pixels of the image center and with an area greater than 30,000 pixels was selected as the main figure for FBT in that frame. If no face satisfies these conditions, FBT is not applied, and the system considers alternative transitions, such as INTER-C, INTRA-C, or None.
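For illustration, the following sketch implements the main-figure selection rule with the stated thresholds (a center distance of at most 300 pixels, interpreted here as a Euclidean distance, and a minimum area of 30,000 pixels); the box format and function name are assumptions, not the authors' code.

```python
import math

def select_main_figure(boxes, frame_w=1920, frame_h=1080,
                       max_center_dist=300, min_area=30_000):
    """Return the largest face box near the frame center, or None."""
    cx, cy = frame_w / 2, frame_h / 2
    candidates = []
    for (x1, y1, x2, y2) in boxes:
        bx, by = (x1 + x2) / 2, (y1 + y2) / 2            # box center
        area = (x2 - x1) * (y2 - y1)                      # box area
        if math.hypot(bx - cx, by - cy) <= max_center_dist and area >= min_area:
            candidates.append((area, (x1, y1, x2, y2)))
    return max(candidates)[1] if candidates else None     # largest qualifying face

# Example: two detections; only the large, centered one qualifies.
print(select_main_figure([(100, 100, 220, 240), (760, 340, 1160, 740)]))
```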

3.3.3. FBT

In Section 3.3.2, we defined the main figure used for FBT. For all frames in which a main figure is present, the yaw, pitch, and roll values of the main figure were extracted and vectorized using head pose estimation. Subsequently, cosine similarity (sim) was computed to assess the similarity between the main figures’ faces in time-synchronized frames from different videos. Let the head pose vector of the main figure in one frame be denoted as $\mathbf{h}_1$ and that of the main figure in the other frame as $\mathbf{h}_2$. The face similarity is then calculated as follows:
$$ \mathbf{h}_1 = \begin{bmatrix} yaw_1 & pitch_1 & roll_1 \end{bmatrix}^{\mathsf T}, \quad \mathbf{h}_2 = \begin{bmatrix} yaw_2 & pitch_2 & roll_2 \end{bmatrix}^{\mathsf T}, \quad \mathrm{sim} = \frac{\mathbf{h}_1 \cdot \mathbf{h}_2}{\lVert \mathbf{h}_1 \rVert \, \lVert \mathbf{h}_2 \rVert}. $$
Zhu et al. demonstrated that excessive yaw angles reduce the accuracy of facial landmark extraction, potentially compromising the reliability of face alignment based on these landmarks [13]. Therefore, in our study, face similarity between main figures was computed only when the yaw value was within the range of $[-\pi/6, \pi/6]$. A frame was considered suitable for FBT if the calculated face similarity of the main figure was 0.9 or higher.
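A minimal sketch of this FBT candidate test, assuming the yaw, pitch, and roll angles are given in radians, is shown below; the function name and the small epsilon guard are our own.

```python
import numpy as np

def is_fbt_candidate(pose_a, pose_b, yaw_limit=np.pi / 6, sim_thresh=0.9):
    """pose_a, pose_b: (yaw, pitch, roll) of the two main figures."""
    h1, h2 = np.asarray(pose_a, float), np.asarray(pose_b, float)
    if abs(h1[0]) > yaw_limit or abs(h2[0]) > yaw_limit:   # yaw gate
        return False, 0.0
    sim = float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-8))
    return sim >= sim_thresh, sim

# Example: two nearly identical poses pass; a strong yaw is rejected.
print(is_fbt_candidate((0.10, -0.05, 0.02), (0.12, -0.04, 0.03)))
print(is_fbt_candidate((0.80, 0.00, 0.00), (0.10, 0.00, 0.00)))
```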

3.3.4. INTER-C

INTER-C represents a simple hard cut between frames from different videos. Such transitions can occur at any frame except the final frame of the footage. According to Smith [16], for a scene transition to be perceived as continuous, a noticeable difference must exist between the scenes before and after the transition. Based on this, we hypothesized that a greater difference in the number of faces between frames from different videos would make an INTER-C more likely to appear smooth to the viewer.

3.3.5. INTRA-C and None

Identifying INTRA-C in the original footage is equivalent to detecting shot boundaries. If an FBT or INTER-C occurs near an INTRA-C, the necessary temporal gap between transitions may not be maintained, potentially reducing viewer satisfaction. To prevent this, it is essential to detect INTRA-C in the original videos and avoid placing other transitions nearby.
Jung et al. examined whether a non-similar scene appears within five frames surrounding the scene where FBT occurs [1]. However, this approach does not guarantee a sufficient time interval between transitions. Lee et al. used Adobe’s commercial software, Premiere Pro, to detect shot boundaries [2], but relying on commercial tools contradicts the objective of this research, which aims to generate cross-edited videos automatically. Bendraou et al. initially detected shot boundaries by comparing histogram differences between consecutive frames and subsequently filtered out false positives using the SURF algorithm [19]. However, SURF [25] is known to be computationally expensive.
Therefore, we propose a shot boundary detection method that leverages the intensity histograms of each frame along with the bounding box data obtained in Section 3.2. Specifically, histogram correlations were computed for each RGB channel between consecutive frames. If the correlation for any channel fell below 0.7, the frame was flagged as a potential boundary. Next, the intersection over union (IoU) was calculated between the largest bounding boxes in the frames before and after the potential boundary. If the IoU was less than 0.5, the boundary was confirmed as an actual shot boundary.
For each channel $c \in \{R, G, B\}$ in frame $k$ and frame $k+1$, the histogram correlation coefficient $r_c$ is computed as follows:
$$ r_c = \frac{\sum_{i=0}^{L-1} \big(h_{c,i}(k) - \mu_c(k)\big)\big(h_{c,i}(k+1) - \mu_c(k+1)\big)}{\sqrt{\sum_{i=0}^{L-1} \big(h_{c,i}(k) - \mu_c(k)\big)^2} \, \sqrt{\sum_{i=0}^{L-1} \big(h_{c,i}(k+1) - \mu_c(k+1)\big)^2}}, $$
where $h_{c,i}(k)$ denotes the histogram value of the $i$-th bin in channel $c$ of frame $k$, and $\mu_c(k)$ is the average of the histogram values in channel $c$ of frame $k$. $L$ is the number of histogram bins; for an 8-bit RGB image, $L = 256$.
Let $S$ be the minimum correlation coefficient across the three color channels:
$$ S = \min(r_R, r_G, r_B). $$
If $S < 0.7$, frame $k$ is considered a potential shot boundary.
Next, if this condition is met, let the largest bounding boxes in frame $k$ and frame $k+1$ be $b_k$ and $b_{k+1}$. The IoU of these two bounding boxes is calculated as follows:
$$ \mathrm{IoU}(b_k, b_{k+1}) = \frac{|b_k \cap b_{k+1}|}{|b_k \cup b_{k+1}|}. $$
If $\mathrm{IoU}(b_k, b_{k+1}) < 0.5$, the frame is finally determined to be a shot boundary.
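The following sketch shows one possible implementation of this two-stage test with OpenCV histograms, assuming face boxes in (x1, y1, x2, y2) format; it follows the 0.7 and 0.5 thresholds above but is not the authors' exact code.

```python
import cv2
import numpy as np

def min_channel_correlation(frame_k, frame_k1, bins=256):
    scores = []
    for c in range(3):                                   # B, G, R channels
        h1 = cv2.calcHist([frame_k],  [c], None, [bins], [0, 256])
        h2 = cv2.calcHist([frame_k1], [c], None, [bins], [0, 256])
        scores.append(cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL))
    return min(scores)                                   # S = min(r_R, r_G, r_B)

def iou(b1, b2):
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-8)

def is_shot_boundary(frame_k, frame_k1, box_k, box_k1):
    if min_channel_correlation(frame_k, frame_k1) >= 0.7:
        return False                                     # colors too similar
    return iou(box_k, box_k1) < 0.5                      # faces moved/changed

# Example on synthetic frames: a dark and a bright frame with displaced boxes.
dark = np.zeros((1080, 1920, 3), np.uint8)
bright = np.full((1080, 1920, 3), 200, np.uint8)
print(is_shot_boundary(dark, bright, (100, 100, 400, 400), (1200, 600, 1500, 900)))
```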

3.4. Editing Path Determination

3.4.1. Representing Frames and Transitions as a Graph Structure

To compute the editing path for creating a StageMix from the input footage, we represent each frame as a node in a graph structure. Arev et al. demonstrated that editing paths can be determined using dynamic programming after representing each frame as a node in a graph [26]. Similarly, in our work, each frame is treated as a node, with the transition type and transition cost serving as the associated node values.
Each node is defined by a key consisting of (stage video number, frame number), and its value includes (stage video number, following frame number, transition type, transition cost). As detailed in Section 3.3.1, there are four transition types, and for each frame, we calculated all possible transitions along with their associated costs.
If a frame was identified as suitable for FBT, the FBT transition cost (FBT-C) was computed as follows:
$$ \text{FBT-C} = 1 - \mathrm{sim} \times \mathrm{brisq}. $$
The Brisque score (brisq), based on the image quality assessment method proposed by Mittal et al. [27], assigns values close to 1 for clear and sharp images, and values near 0 for blurred ones. Since facial blurring can occur when the main figure is in motion—potentially reducing visual satisfaction during FBT—the brisq was used to favor frames with clearer facial details in the FBT-C calculation.
As discussed in Section 3.3.4, a larger disparity in the number of faces between two frames (NoF) leads to a more seamless INTER-C, thereby lowering the associated transition cost. The INTER-C cost (INTER-CC) is computed as follows:
$$ \text{INTER-CC} = c \times \frac{1}{\mathrm{NoF} + 1}, $$
where the constant $c$ is introduced to ensure that the INTER-CC is greater than the FBT-C. In our implementation, we set $c$ to $e^5$.
Finally, since INTRA-C and None do not involve specific factors that affect transition cost, unlike FBT or INTER-C, their costs were set manually. The INTRA-C cost (INTRA-CC) was set to 500, and the None cost (None-C) to 1000. This ensures that the transition costs are arranged in ascending order: FBT-C, INTER-CC, INTRA-CC, and None-C.
$$ \text{INTRA-CC} = 500, \quad \text{None-C} = 1000. $$
In our study, a total of 39 frames were used for a smooth FBT, consisting of the FBT point itself along with the 19 frames preceding and the 19 frames following it. To maintain editing consistency, no INTRA-C transitions were allowed within this 39-frame window centered on the FBT point. Therefore, after computing the transition types and costs for each frame, any FBT candidates and their associated costs within the 19 frames before and after an INTRA-C occurrence were removed.
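For illustration, the cost terms above can be computed as in the following sketch; `sim`, `brisq`, and `nof_diff` are assumed to be supplied by the earlier stages, and the helper names are ours.

```python
import math

C_INTER = math.exp(5)          # c = e^5, ensures INTER-CC > FBT-C (FBT-C <= 1)
INTRA_CC = 500.0               # fixed INTRA-C cost
NONE_C = 1000.0                # fixed None cost

def fbt_cost(sim: float, brisq: float) -> float:
    """FBT-C = 1 - sim * brisq (lower for sharp, well-matched faces)."""
    return 1.0 - sim * brisq

def inter_cost(nof_diff: int) -> float:
    """INTER-CC = c / (NoF + 1), cheaper when face counts differ more."""
    return C_INTER / (nof_diff + 1)

# Example: a sharp, well-matched face pair is far cheaper than any hard cut.
print(fbt_cost(sim=0.95, brisq=0.9))   # ~0.145
print(inter_cost(nof_diff=3))          # ~37.1
print(INTRA_CC, NONE_C)
```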

3.4.2. Calculating the Editing Path

Figure 5 presents an example of the editing path. Based on the graph constructed in Section 3.4.1, Dijkstra’s algorithm is employed to compute the optimal editing path for StageMix generation. Since all transition costs at each node are non-negative, Dijkstra’s algorithm is suitable for finding the least-cost path. To avoid overly frequent transitions, a minimum frame interval was enforced between successive transitions. Specifically, a minimum of 45 frames was required after an FBT, and 90 frames after an INTER-C or INTRA-C. This constraint ensures that a None spans at least 45 frames after an FBT, and at least 90 frames after an INTER-C or INTRA-C. The shorter minimum interval for FBT was intended to allow more frequent transitions of that type.
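A simplified sketch of this path search is given below: frames are nodes keyed by (video index, frame index), edges carry a transition type and cost, and Dijkstra's algorithm (via a binary heap) returns the least-cost path to any final frame. The minimum-interval constraints are omitted here for brevity, and the toy graph is purely illustrative.

```python
import heapq

def shortest_edit_path(graph, start, goals):
    """graph: {node: [(next_node, transition_type, cost), ...]}"""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in goals:                               # reached a final frame
            path, n = [], node
            while n in prev:
                path.append((n, prev[n][1]))            # (node, transition used)
                n = prev[n][0]
            return list(reversed(path)), d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, ttype, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, (node, ttype)
                heapq.heappush(heap, (nd, nxt))
    return None, float("inf")

# Toy example with two videos: None costs 1000, an FBT from (0, 1) to (1, 2) costs 0.2.
g = {
    (0, 0): [((0, 1), "None", 1000)],
    (0, 1): [((0, 2), "None", 1000), ((1, 2), "FBT", 0.2)],
    (1, 2): [],
    (0, 2): [],
}
print(shortest_edit_path(g, (0, 0), goals={(0, 2), (1, 2)}))
```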

3.5. StageMix Generation

The generation of StageMix is based on the editing path calculated in Section 3.4.2. A simple cut-based editing approach was employed for transition types categorized as INTER-C, INTRA-C, or None, with no transition effects applied between frames. However, transition effects were applied for FBT to enhance visual immersion and improve user satisfaction.
Figure 6 illustrates the FBT process, which consists of three phases: pre-blending, blending, and post-blending. During the blending phase, the faces in both videos—designated as Video A and Video B—are aligned and overlaid using alpha blending. To ensure a smooth transition into the blending phase, Video A undergoes incremental scaling and parallel shifting during the pre-blending phase, aligning its faces with the initial frame of the blending phase. Likewise, in the post-blending phase, Video B is gradually resized and parallel-shifted from the final frame of the blending phase back to its original state.
The terms and concepts of pre-blending, blending, and post-blending are adapted from Lee et al. [2]. In our work, the durations of these phases were set to 14 frames, 11 frames, and 14 frames, respectively. The video editing process for FBT involves the following five steps, which are detailed in Section 3.5.1 through Section 3.5.5.

3.5.1. Extracting Landmarks and Aligning Faces

In the blending phase, the faces are aligned using facial landmarks and head pose features to ensure a smooth transition from Video A to Video B. Figure 7 illustrates an example of face alignment using eye landmarks and head pose features. The landmarks of the two eyes are extracted and used to represent the main figure in each frame within the blending phase. Additionally, the rotation angle, position, and size of the faces in Videos A and B are aligned by averaging these attributes between the two faces. The rotation angle is adjusted using the mean roll value derived from the head pose features. The position is determined by averaging the center coordinates of the two eye landmarks from both videos. Similarly, the size of the face is calculated by averaging the distances between the center coordinates of the two eye landmarks in Video A and Video B.
To minimize the appearance of black pixels caused by scaling and shifting during alignment, the faces in each frame pair (Video A and Video B) are aligned to the position and size of the larger face—that is, the face with the greater distance between the eyes. Regarding rotation, the faces in the aligned video are gradually rotated throughout the transition. Specifically, from the first frame of the aligned video to the FBT point, the roll value is incrementally adjusted to match the average roll value of the two faces. Likewise, from the last frame of the aligned video to the FBT point, the roll value is also gradually adjusted toward the same average roll value.

3.5.2. Cropping and Resizing Without Black Pixels

During the face alignment process, rotations, scaling, and parallel translations can result in the appearance of black pixels in the aligned video frames. To prevent these black regions from being visible in the StageMix, they must be cropped out and the frames resized to their original dimensions. To minimize the amount of cropping required, a valid region is defined for each frame of the aligned video as the overlapping area between Video A and Video B, as illustrated in Figure 8.
Let the frame of Video A be $I_A$ and the frame of Video B be $I_B$, and let the masks for each frame be $M_A$ and $M_B$. The valid region $M_v$ is calculated as follows:
$$ M_v = M_A \cap M_B. $$
$R$ is defined as the largest rectangle that does not contain any black pixels. To calculate $R$ based on the center of gravity of $M_v$, we first calculate the center of gravity $M_c$ of $M_v$ as follows:
$$ M_c = \left( \frac{1}{N} \sum_{i=1}^{N} x_i,\ \frac{1}{N} \sum_{i=1}^{N} y_i \right), $$
where $N$ is the number of pixels in $M_v$ and $(x_i, y_i)$ are the coordinates of those pixels.
Then, $R$ is sized to $(R_w, R_h)$, centered on $M_c$, as the largest $16{:}9$ rectangle whose pixels all lie within $M_v$:
$$ (R_w, R_h) = \arg\max_{k} \left\{ (16k,\, 9k) \;\middle|\; \forall (x, y) \in \mathrm{rectangle}\big((16k, 9k)\big):\, (x, y) \in M_v,\ \mathrm{rectangle}_c = M_c \right\}. $$
Let the frame that is cropped from the aligned frame by the area $R$ be denoted as $F_c$. The frame resized to the original size, $F_r$, is calculated as follows using $F_c$, $scale_x$, and $scale_y$:
$$ scale_x = \frac{1920}{R_w}, \quad scale_y = \frac{1080}{R_h}, $$
$$ F_r = F_c \times \begin{bmatrix} scale_x & 0 \\ 0 & scale_y \end{bmatrix}. $$
To summarize, the above process removes black pixels from each frame and resizes the frames to their original dimensions. As a result, the frames in the blending phase are free from black regions, ensuring high visual quality in the StageMix.
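The following sketch illustrates this cropping idea: it forms the valid mask as the intersection of the non-black regions of the two warped frames, grows the largest 16:9 rectangle centered on the mask centroid that stays inside the mask, and then crops and resizes back to 1920 × 1080. It is a simplified rendering of the procedure, not the authors' implementation.

```python
import cv2
import numpy as np

def crop_valid_region(frame_a, frame_b, blended):
    mask_a = np.any(frame_a > 0, axis=2)            # non-black pixels of A
    mask_b = np.any(frame_b > 0, axis=2)
    valid = mask_a & mask_b                          # M_v = M_A ∩ M_B
    ys, xs = np.nonzero(valid)
    cx, cy = int(xs.mean()), int(ys.mean())          # centroid M_c
    h, w = valid.shape
    best = None
    for k in range(1, min(w // 16, h // 9) + 1):     # grow a 16k x 9k rectangle
        x1, y1 = cx - 8 * k, cy - (9 * k) // 2
        x2, y2 = x1 + 16 * k, y1 + 9 * k
        if x1 < 0 or y1 < 0 or x2 > w or y2 > h:
            break
        if valid[y1:y2, x1:x2].all():                # rectangle fully inside M_v
            best = (x1, y1, x2, y2)
    if best is None:
        return cv2.resize(blended, (1920, 1080))
    x1, y1, x2, y2 = best
    return cv2.resize(blended[y1:y2, x1:x2], (1920, 1080))  # scale back to FHD

# Example with synthetic frames: B is shifted, so the crop excludes the border.
a = np.full((1080, 1920, 3), 120, np.uint8)
b = np.zeros_like(a); b[:, 200:] = 90
print(crop_valid_region(a, b, cv2.addWeighted(a, 0.5, b, 0.5, 0)).shape)
```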

3.5.3. Alpha Blending

Transitions in the blending phase are performed seamlessly through alpha blending of the two images, A and B. At the start of the blending phase, the alpha value of Video A is set to 1.0 (fully opaque), while that of Video B is set to 0.0 (fully transparent). By the final frame of the blending phase, these values are reversed, with Video A at 0.0 and Video B at 1.0. Rather than applying a linear change in transparency (Figure 9a), a non-linear approach (Figure 9b) was adopted to achieve a smoother and more visually appealing transition. This effect was implemented using a sigmoid function to introduce an ease-in-out effect. Figure 9c displays the result of FBT using linear easing, whereas Figure 9d shows the result with non-linear easing.
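As an illustration, the sketch below applies a sigmoid ease-in-out alpha curve over an 11-frame blending phase; the steepness constant is an assumed value chosen for demonstration.

```python
import cv2
import numpy as np

def sigmoid_alpha(i: int, n_frames: int, steepness: float = 10.0) -> float:
    """Alpha of Video B at frame i of an n_frames-long blending phase."""
    x = i / (n_frames - 1)                       # progress in [0, 1]
    return float(1.0 / (1.0 + np.exp(-steepness * (x - 0.5))))

def blend_phase(frames_a, frames_b):
    out = []
    for i, (fa, fb) in enumerate(zip(frames_a, frames_b)):
        alpha = sigmoid_alpha(i, len(frames_a))  # ~0 -> show A, ~1 -> show B
        out.append(cv2.addWeighted(fa, 1.0 - alpha, fb, alpha, 0))
    return out

# Example: an 11-frame blending phase between a dark and a bright frame.
a = [np.full((360, 640, 3), 30, np.uint8)] * 11
b = [np.full((360, 640, 3), 220, np.uint8)] * 11
mix = blend_phase(a, b)
print([int(m.mean()) for m in mix])              # eases from ~30 to ~220
```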

3.5.4. Pre-Blending and Post-Blending

It is important to establish a seamless connection between the frames in the blending phase and those in the original footage. To achieve this, the pre-blending and post-blending phases gradually enlarge or reduce the size of the main figure’s face in the original frames to match the face size at the FBT point.
In Video A, the facial landmarks of the main figure were extracted from the final frame of the pre-blending segment. These landmarks were then used to compute the transformation matrix required for transitioning to the first frame of the blending segment. Linear interpolation was subsequently performed between the identity matrix and the transformation matrix over the duration of the pre-blending segment. The resulting interpolated transformations were applied to each frame within this segment.
Similarly, in the post-blending segment, the transformation matrix was obtained by linearly interpolating between the transformation matrix of the final frame of the blending segment and the identity matrix corresponding to the first frame of the post-blending segment. The interpolated transformation was then applied to each frame within the post-blending segment.
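A sketch of this interpolation is shown below: the target affine transform is estimated from a few corresponding landmark points (here, hypothetical eye coordinates), and each pre-blending frame is warped with a matrix linearly interpolated between the identity and that target. The use of OpenCV's estimateAffinePartial2D for the similarity estimate is our choice for illustration, not necessarily the authors' method.

```python
import cv2
import numpy as np

def interpolated_warps(src_pts, dst_pts, n_frames):
    """Return one 2x3 affine matrix per pre-blending frame."""
    target, _ = cv2.estimateAffinePartial2D(
        np.float32(src_pts), np.float32(dst_pts))         # similarity transform
    identity = np.float32([[1, 0, 0], [0, 1, 0]])
    mats = []
    for i in range(n_frames):
        t = i / (n_frames - 1)                             # 0 -> identity, 1 -> target
        mats.append((1 - t) * identity + t * target.astype(np.float32))
    return mats

# Example: warp 14 synthetic frames toward a slightly scaled/shifted face.
frame = np.full((1080, 1920, 3), 80, np.uint8)
src = [(900, 500), (1020, 500), (960, 620)]                # hypothetical landmarks
dst = [(880, 480), (1040, 480), (960, 640)]
for M in interpolated_warps(src, dst, n_frames=14):
    warped = cv2.warpAffine(frame, M, (1920, 1080))
print(warped.shape)
```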

3.5.5. Cases of Restricting FBT

In cases where implementing FBT could disrupt the viewer’s immersion, we restricted its use and substituted it with INTER-C.
Figure 10 illustrates scenarios in which FBT may interfere with visual immersion. The first scenario, shown in Figure 10a, occurs when FBT is applied between different individuals. Since such transitions can break immersion, we employed FaceNet [15] for face verification to ensure that FBT is performed only when the faces belong to the same person.
The second scenario, depicted in Figure 10b, involves inaccurate landmark detection caused by a sharp head turn. The FBT decision process restricts the yaw value only at the FBT point, so the yaw values in other frames within the blending phase may become excessively large. This can lead to face alignment issues, potentially resulting in visual artifacts. To address this, we examined the yaw values of all faces within the blending phase and restricted the FBT when any yaw value fell outside the range of −30 to 30 degrees.
The last scenario, illustrated in Figure 10c, occurs when the face of the main figure is excessively enlarged during the transition. This typically happens when there is a significant size difference between the faces in the two images within the blending phase. After the faces are aligned, the black regions are cropped to ensure they are not visible, and the image is resized back to its original dimensions. However, in this process, the face can become abruptly enlarged in consecutive frames. To avoid creating an unnatural visual effect, FBT was restricted when the face was enlarged by more than twice its original size.
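The three restrictions can be combined into a single gate, as in the sketch below; the FaceNet embedding-distance threshold is an assumed value, while the yaw range and the 2× enlargement cap follow the text.

```python
import numpy as np

def fbt_allowed(emb_a, emb_b, yaws_blend, scale_factor,
                emb_thresh=1.0, yaw_limit=np.deg2rad(30), max_scale=2.0):
    """Return True only if all three FBT restrictions are satisfied."""
    same_person = np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b)) < emb_thresh
    yaw_ok = all(abs(y) <= yaw_limit for y in yaws_blend)   # every blending frame
    scale_ok = scale_factor <= max_scale                    # at most 2x enlargement
    return same_person and yaw_ok and scale_ok

# Example: matching embeddings, moderate yaws, 1.4x enlargement -> allowed.
rng = np.random.default_rng(0)
e = rng.normal(size=128)                                    # FaceNet-style embedding
print(fbt_allowed(e, e + 0.05, yaws_blend=[0.1, -0.2, 0.15], scale_factor=1.4))
```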

4. Experimental Results

AutoStageMix enables the automated generation of StageMix without requiring any special pre-processing or manual editing by the user. The resulting cross-edited videos seamlessly integrate FBT, INTER-C, and INTRA-C, thereby producing smooth and visually appealing outputs.

4.1. Quantitative Evaluations

4.1.1. Shot Detection Results

The shot boundary detection method described in Section 3.3.5 was applied to performance videos of SEVENTEEN’s songs “Darl+ing” [28] and “Ready to love” [29]. Specifically, three videos of “Darl+ing” [28] and two videos of “Ready to love” [29] were used in the analysis. The results are presented in Table 1, with precision, recall, and F1-score serving as the evaluation metrics.
The method achieved a mean precision of 0.890, a mean recall of 0.834, and a mean F1-score of 0.850, indicating that the proposed approach is capable of accurately detecting shot boundaries.
Overall, the method demonstrated high precision and recall, validating its effectiveness. However, in certain cases, such as in Video #4, rapid changes in background color or lighting led to false positives and lower precision. In Video #5, the histogram correlation frequently exceeded the threshold (0.7), resulting in missed boundaries and a lower recall. These findings suggest that further refinement is needed to ensure consistent performance across a wider variety of video types.

4.1.2. Normalized Mean Error (NME)

At the FBT point, an alignment operation is performed to match the size and position of the faces of the main figures appearing in different frames. The more accurately the shapes of the two faces are aligned, the greater the visual satisfaction derived from the FBT. In our work, we compared our proposed method for identifying similar faces across frames with those of Jung et al. [1] and Lee et al. [2]. The morphological congruence of the detected face pairs was evaluated using the NME, a commonly used metric for assessing the accuracy of facial landmark alignment, as shown in previous studies by Zhu and Ramanan [30] and Sagonas et al. [31]. Specifically, 68 landmarks were extracted using SPIGA [12], and the two sets of landmarks were aligned using Procrustes analysis [32], after which the NME was computed.
Let the two landmark sets be denoted as $A$ and $B$. After applying Procrustes analysis, we obtain aligned landmark sets $A'$ and $B'$, from which the Euclidean distance $d_i$ between corresponding landmarks is calculated. The Mean Error (ME) is defined as the average of these Euclidean distances. The NME is then computed by normalizing the ME by $d_e$, the inter-ocular distance, defined as the distance between the centers of the two eyes. The distance $d_e$ is calculated from the coordinates of the left eye center $(l_x, l_y)$ and the right eye center $(r_x, r_y)$.
$$ (A', B') = \mathrm{procrustes}(A, B), $$
$$ d_i = \sqrt{(a'_{i,x} - b'_{i,x})^2 + (a'_{i,y} - b'_{i,y})^2}, $$
$$ \text{ME} = \frac{1}{N} \sum_{i=1}^{N} d_i, $$
$$ d_e = \sqrt{(l_x - r_x)^2 + (l_y - r_y)^2}, $$
$$ \text{NME} = \frac{\text{ME}}{d_e}. $$
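For reference, the sketch below computes the NME from two 68-point landmark sets using SciPy's Procrustes alignment; the eye-center indices follow the common 68-point convention, and the synthetic landmarks are for demonstration only.

```python
import numpy as np
from scipy.spatial import procrustes

def nme(landmarks_a, landmarks_b):
    """landmarks_a, landmarks_b: (68, 2) arrays of facial landmarks."""
    a, b, _ = procrustes(np.asarray(landmarks_a, float),
                         np.asarray(landmarks_b, float))   # aligned sets A', B'
    me = np.mean(np.linalg.norm(a - b, axis=1))             # mean error
    eye_1 = a[36:42].mean(axis=0)                            # 68-point eye indices
    eye_2 = a[42:48].mean(axis=0)
    d_e = np.linalg.norm(eye_1 - eye_2)                      # inter-ocular distance
    return me / d_e

# Example: a landmark set compared with a slightly jittered copy.
rng = np.random.default_rng(1)
base = rng.uniform(0, 200, size=(68, 2))
print(round(nme(base, base + rng.normal(scale=1.0, size=(68, 2))), 4))
```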
For the NME comparison, we used three videos of Seventeen’s “Darl+ing” [28] (4384 frames), two videos of “Ready to love” [29] (5676 frames), and three videos of “God of Music” [33] (6078 frames), all with a resolution of 1920 × 1080. Analogous face pairs suitable for FBT were detected according to the procedures described by Jung et al. [1] and Lee et al. [2], as well as our proposed method. The number of similar face pairs identified using each method for the three ‘Darl+ing’ videos is presented in Table 2.
As shown in Table 2, the number of potential FBT frame pairs detected varies by method. Jung et al.’s method identified the fewest frame pairs, as it only allowed transitions when the angle between the line connecting the eyes and the horizontal axis was 7 degrees or less [1]. In contrast, Lee et al.’s method identified the most frame pairs due to its looser constraints [2].
For each detected frame pair, 68 facial landmarks were extracted and aligned using Procrustes analysis, after which the NME was computed. To determine whether significant differences existed in morphological consistency among the three methods, a non-parametric Kruskal–Wallis H test [34] was first performed. If the global test was significant, indicating that at least one group differed, we conducted pairwise post hoc comparisons using the Mann–Whitney U test [35]. To control for the increased risk of Type I error due to multiple comparisons, Bonferroni correction [36] was applied to the resulting p-values. The detailed results are presented in Table 3 and Table 4.
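The testing pipeline can be reproduced with SciPy as sketched below, assuming the per-pair NME values of the three methods are available as arrays; the sample data here are placeholders.

```python
from itertools import combinations
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(2)
nme_by_method = {                                 # placeholder NME samples
    "Jung et al.": rng.normal(0.30, 0.05, 200),
    "Lee et al.":  rng.normal(0.27, 0.05, 300),
    "Ours":        rng.normal(0.22, 0.05, 250),
}

h_stat, p_global = kruskal(*nme_by_method.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_global:.4g}")

if p_global < 0.05:                               # post hoc pairwise comparisons
    alpha_adj = 0.05 / 3                          # Bonferroni correction
    for (name_a, a), (name_b, b) in combinations(nme_by_method.items(), 2):
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        print(f"{name_a} vs {name_b}: p={p:.4g}, significant={p < alpha_adj}")
```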
The NME reflects subtle positional differences between facial landmarks, with lower values indicating greater morphological similarity between faces. As shown in Table 3, across all the analyzed videos, the NME consistently followed the pattern: Jung et al. [1] > Lee et al. [2] > our method, suggesting that our method leads to higher morphological similarity.
As shown in Table 4, the results of the Kruskal–Wallis H tests [34] indicated statistically significant differences in NME values among the three methods for all three songs (p < 0.05). This justified the use of post hoc pairwise comparisons using the Mann–Whitney U test [35] with Bonferroni correction [36] (adjusted α = 0.05/3 ≈ 0.0167).
In the case of “Darl+ing” [28], our method showed significantly lower NME than both Jung et al. [1] and Lee et al. [2] (p < 0.001), whereas the difference between Jung et al. [1] and Lee et al. [2] was not significant (p = 0.1720).
For “Ready to love” [29], our method performed significantly better than Jung et al. [1] (p = 0.0118), while the comparison with Lee et al. [2] did not reach the Bonferroni-adjusted significance threshold (p = 0.2762). This lack of significance may be attributed to the smaller number of frame pairs identified in the “Ready to love” video, as shown in Table 2, which likely reduced the statistical power. The comparison between Jung et al. [1] and Lee et al. [2] was also statistically significant (p = 0.0165).
In “God of Music” [33], our method again showed significantly lower NME than both Jung et al. [1] and Lee et al. [2] (both p < 0.001), and the difference between Jung et al. [1] and Lee et al. [2] was also significant (p = 0.0066), meeting the Bonferroni-adjusted criterion.
Therefore, we conclude that the method proposed in this study—identifying similar faces using head pose—is generally more effective than the methods proposed by Jung et al. [1] and Lee et al. [2], as it consistently yielded lower NME values with statistically significant differences in most cases. The only exception was the comparison between our method and Lee et al. [2] in the “Ready to love” [29] video, which did not reach statistical significance, although it followed the same trend.

4.1.3. Influence of User Parameters on the Number of Transitions

Transitions in StageMix are generated using FBT, INTER-C, and INTRA-C. To control their frequency, two user-defined parameters were introduced to specify the minimum time interval between different types of transitions. The first parameter, $t_{INT}$, defines the minimum number of frames that must elapse between an INTER-C or INTRA-C and the subsequent transition. The second parameter, $t_{FACE}$, sets the minimum number of frames between an FBT and the subsequent transition. For example, setting $t_{INT} = 90$ ensures that, after an INTER-C or INTRA-C, the None type will last for at least 90 frames within the same stage footage. In our implementation, $t_{FACE}$ was set to 45, ensuring that after an FBT, the same stage is maintained for at least 45 frames before the next transition occurs.
Table 5 presents the number of transitions in the three “Darl+ing” [28] videos by Seventeen, using different values of the user-defined parameter $t_{INT}$. As $t_{INT}$ increases, the frequency of INTER-C and INTRA-C decreases, while the frequency of FBT increases. Additionally, a higher $t_{INT}$ value results in a reduced total number of transitions and longer average time intervals between scene changes. This enables users to adjust $t_{INT}$ to control the transition frequency and the number of FBTs in the StageMix according to their preferences.

4.1.4. Computational Efficiency

We assessed the runtime performance of AutoStageMix using a desktop equipped with an Intel Core i7-10700 CPU (2.90 GHz), 32 GB of RAM, and an NVIDIA GeForce RTX 3070 GPU.
The evaluation was conducted on videos of SEVENTEEN’s songs: three videos of “Darl+ing” [28], two of “Ready to love” [29], and three of “God of Music” [33]. Each video had a resolution of 1080p.
Table 6 summarizes the average processing time for each stage in the AutoStageMix pipeline. The Feature Extraction stage, which includes facial region detection and facial feature extraction, accounted for the majority of the total processing time across all videos. Using a lightweight deep learning model in this stage could potentially reduce processing time, although it may lead to a decrease in transition accuracy or the number of viable transition points.

4.2. User Study

A user study involving 21 adult participants was conducted to compare the quality of StageMixes produced by AutoStageMix against those edited by professional editors. Participants rated the videos based on editing naturalness and transition smoothness, excluding factors such as music or video content. A 5-point Likert scale was used, where 1 indicated “very unsatisfied” and 5 indicated “very satisfied.” Three songs were included in the study: IVE’s “HEYA” [37], NewJeans’ “Hype Boy” [38], and aespa’s “Supernova” [39]. For each song, one video was generated using AutoStageMix and two were edited by professional editors, yielding a total of nine videos. The three AutoStageMix videos were grouped as Group 1, and the six professionally edited videos as Group 2.
Table 7 summarizes the overall satisfaction scores. Group 1 achieved an average score of 4.349 (SD = 0.716), while Group 2 received 4.143 (SD = 0.774). To assess variance homogeneity between the two groups, an F-test was performed due to the unequal sample sizes. The F-test result (F = 0.362, p = 0.548) indicated no significant difference in variance, validating the use of a t-test.
The null hypothesis of the t-test assumed no significant difference in satisfaction scores between the two groups. The t-test produced a t-statistic of 1.762 and a p-value of 0.080. Since the p-value exceeds 0.05, the null hypothesis could not be rejected, suggesting no statistically significant difference between AutoStageMix and professional editing in terms of viewer satisfaction.
These results indicate that AutoStageMix performs comparably to professional editors, with no significant differences in satisfaction or variability. This supports the potential of AutoStageMix as a tool for producing high-quality video edits.

5. Conclusions and Discussion

In this work, we propose AutoStageMix, an automated system that generates cross-edited videos from stage performance footage without requiring specialized editing skills. By leveraging deep learning, the entire process is automated—from extracting facial features to identifying transition points and generating the StageMix. Notably, transition points identified using head pose estimation were more effective at selecting scenes with morphologically similar faces compared to those identified using facial landmarks, thereby improving transition quality. However, our work has several limitations.
First, it does not synchronize transitions with the beat of the music, which can result in transitions occurring outside of musical highlights or at unnatural moments. This may make certain scenes feel awkward and reduce viewer immersion. To address this, future development should incorporate music-aware editing that aligns transitions with the flow of the music. Specifically, beat-detection algorithms such as spectral flux onset detection or deep-learning-based tempo tracking could be integrated to identify rhythmically significant moments. These enhancements are expected to improve perceptual continuity and increase viewer satisfaction.
Second, inconsistencies in body movements during FBT can disrupt viewer immersion. This highlights the need for more advanced motion analysis and matching techniques. Incorporating body landmarks into the editing process could help produce smoother and more natural transitions. As FBTs are the most commonly used editing technique in StageMix, we prioritized facial alignment to ensure practical applicability. Incorporating pose similarity in combination with facial features remains a promising direction for future research.
Third, footage taken under varying lighting conditions and camera angles may result in noticeable differences in color and brightness. Professional editors often retouch StageMix to reduce these inconsistencies while maintaining focus on the performers. We initially experimented with automatic color correction using histogram matching but found it difficult to apply effectively due to the dynamic nature of K-pop stage videos, where lighting and background colors can vary significantly within a single video. This aspect could be explored in future work to further improve visual coherence.
Fourth, our method focuses on K-pop videos, particularly broadcast-stage performances commonly used in StageMix editing. This focus reflects the fact that StageMix originated from K-pop fan culture and is rarely applied outside of that context. While we did not evaluate other genres or lighting conditions, we expect AutoStageMix to operate in such settings as long as performer faces are visible.
Lastly, the transition costs were manually set to reflect practical editing priorities in StageMix. While the system showed stability as long as the cost order was preserved, future work may explore more systematic parameter tuning to assess sensitivity and further optimize path quality. We aim to address these limitations and further develop the system to create StageMix content that delivers greater viewer satisfaction.

Author Contributions

Conceptualization, M.O.; Data curation, M.O. and H.J.; Formal analysis, M.O. and H.J.; Investigation, M.O. and H.J.; Methodology, M.O. and H.J.; Project administration, D.L.; Resources, M.O. and H.J.; Software, M.O. and H.J.; Supervision, D.L.; Validation, M.O. and H.J.; Visualization, M.O. and H.J.; Writing—original draft, M.O. and H.J.; Writing—review and editing, M.O., H.J. and D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the metaverse support program, grant funded by the Korea government (MSIT) (IITP-RS-2024(2025)-00425383). This work is also a result of a study on the “Convergence and Open Sharing System” Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (B0080706000707).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to copyright restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jung, M.; Lee, S.; Sim, E.S.; Jo, M.H.; Lee, Y.J.; Choi, H.B.; Kwon, J. Stagemix video generation using face and body keypoints detection. Multimed. Tools Appl. 2022, 81, 38531–38542. [Google Scholar] [CrossRef]
  2. Lee, D.; Yoo, J.E.; Cho, K.; Kim, B.; Im, G.; Noh, J. PopStage: The Generation of Stage Cross-Editing Video Based on Spatio-Temporal Matching. ACM Trans. Graph. 2022, 41, 1–13. [Google Scholar] [CrossRef]
  3. Shrestha, P.; de With, P.H.; Weda, H.; Barbieri, M.; Aarts, E.H. Automatic mashup generation from multiple-camera concert recordings. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010. [Google Scholar] [CrossRef]
  4. Achary, S.; Girmaji, R.; Deshmukh, A.A.; Gandhi, V. Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024. [Google Scholar] [CrossRef]
  5. Girmaji, R.; Beri, B.; Subramanian, R.; Gandhi, V. EditIQ: Automated Cinematic Editing of Static Wide-Angle Videos via Dialogue Interpretation and Saliency Cues. In Proceedings of the 30th International Conference on Intelligent User Interfaces, Cagliari, Italy, 24–27 March 2025. [Google Scholar] [CrossRef]
  6. Kazemi, V.; Sullivan, J. One Millisecond Face Alignment with an Ensemble of Regression Trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar] [CrossRef]
  7. Taigman, Y.; Yang, M.; Ranzato, M.A.; Wolf, L. Deepface: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar] [CrossRef]
  8. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
  9. Deng, J.; Guo, J.; Zhou, Y.; Yu, J.; Kotsia, I.; Zafeiriou, S. Retinaface: Single-stage Dense Face Localisation in the Wild. arXiv 2019, arXiv:1905.00641. [Google Scholar]
  10. Ruiz, N.; Chong, E.; Rehg, J.M. Fine-Grained Head Pose Estimation without Keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  11. Li, X.; Zhang, D.; Li, M.; Lee, D.-J. Accurate Head Pose Estimation Using Image Rectification and a Lightweight Convolutional Neural Network. IEEE Trans. Multimed. 2023, 25, 2239–2251. [Google Scholar] [CrossRef]
  12. Prados-Torreblanca, A.; Buenaposada, J.M.; Baumela, L. Shape preserving facial landmarks with graph attention networks. arXiv 2022, arXiv:2210.07233. [Google Scholar]
  13. Zhu, X.; Lei, Z.; Liu, X.; Shi, H.; Li, S.Z. Face Alignment Across Large Poses: A 3D Solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  14. Parkhi, O.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015. [Google Scholar] [CrossRef]
  15. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  16. Smith, T.J. An Attentional Theory of Continuity Editing. Ph.D. Thesis, University of Edinburgh, Edinburgh, UK, 2006. [Google Scholar]
  17. Magliano, J.P.; Zacks, J.M. The Impact of Continuity Editing in Narrative Film on Event Segmentation. Cogn. Sci. 2011, 35, 1489–1517. [Google Scholar] [CrossRef] [PubMed]
  18. Ardizzone, E.; Gallea, R.; La Cascia, M.; Morana, M. Automatic Generation of Subject-Based Image Transitions. In Image Analysis and Processing—ICIAP 2011; Maino, G., Foresti, G.L., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6978, pp. 237–246. [Google Scholar] [CrossRef]
  19. Bendraou, Y.; Essannouni, F.; Aboutajdine, D.; Salam, A. Video shot boundary detection method using histogram differences and local image descriptor. In Proceedings of the 2014 Second World Conference on Complex Systems (WCCS), Agadir, Morocco, 10–12 November 2014. [Google Scholar] [CrossRef]
  20. Radwan, N.I.; Salem, N.M.; El Adawy, M.I. Histogram Correlation for Video Scene Change Detection. In Advances in Computer Science, Engineering & Applications; Wyld, D., Zizka, J., Nagamalai, D., Eds.; Advances in Intelligent and Soft Computing; Springer: Berlin/Heidelberg, Germany, 2012; Volume 166, pp. 765–773. [Google Scholar] [CrossRef]
  21. Gygli, M. Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. In Proceedings of the 2018 International Conference on Content-Based Multimedia Indexing (CBMI), La Rochelle, France, 4–6 September 2018. [Google Scholar] [CrossRef]
  22. Thakar, N.; Panchal, P.I.; Patel, U.; Chaudhari, K.; Sangada, S. Analysis and Verification of Shot Boundary Detection in Video using Block Based χ2 Histogram Method. In Proceedings of the International Conference on Advanced Computing, Communication and Networks, Chandigarh, India, 3 June 2011. [Google Scholar] [CrossRef]
23. Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271. [Google Scholar] [CrossRef]
  24. Knapp, C.; Carter, G. The Generalized Correlation Method for Estimation of Time Delay. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 320–327. [Google Scholar] [CrossRef]
  25. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Computer Vision—ECCV 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3951, pp. 404–417. [Google Scholar] [CrossRef]
  26. Arev, I.; Park, H.S.; Sheikh, Y.; Hodgins, J.; Shamir, A. Automatic Editing of Footage from Multiple Social Cameras. ACM Trans. Graph. 2014, 33, 81. [Google Scholar] [CrossRef]
  27. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef] [PubMed]
  28. SEVENTEEN. “Darl+ing”. Face the Sun; Pledis Entertainment: Seoul, Republic of Korea, 2022. [Google Scholar]
  29. SEVENTEEN. “Ready to Love”. Your Choice; Pledis Entertainment: Seoul, Republic of Korea, 2021. [Google Scholar]
  30. Zhu, X.; Ramanan, D. Face Detection, Pose Estimation, and Landmark Localization in the Wild. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar] [CrossRef]
  31. Sagonas, C.; Tzimiropoulos, G.; Zafeiriou, S.; Pantic, M. 300 Faces in-the-Wild Challenge: The First facial landmark localization Challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia, 2–8 December 2013. [Google Scholar] [CrossRef]
  32. Gower, J.C. Generalized procrustes analysis. Psychometrika 1975, 40, 33–51. [Google Scholar] [CrossRef]
  33. SEVENTEEN. “God of Music”. Seventeenth Heaven; Pledis Entertainment: Seoul, Republic of Korea, 2023. [Google Scholar]
  34. Kruskal, W.H.; Wallis, W.A. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 1952, 47, 583–621. [Google Scholar] [CrossRef]
  35. Wald, A.; Wolfowitz, J. On a test whether two samples are from the same population. Ann. Math. Stat. 1940, 11, 147–162. [Google Scholar] [CrossRef]
36. Bonferroni, C. Teoria statistica delle classi e calcolo delle probabilità. Pubbl. R. Ist. Super. Sci. Econ. Commer. Firenze 1936, 8, 3–62. [Google Scholar]
  37. IVE. “HEYA”. IVE Switch; Starship Entertainment: Seoul, Republic of Korea, 2024. [Google Scholar]
  38. NewJeans. “Hype Boy”. New Jeans; ADOR: Seoul, Republic of Korea, 2022. [Google Scholar]
  39. AESPA. “Supernova”. Armageddon; SM Entertainment: Seoul, Republic of Korea, 2024. [Google Scholar]
Figure 1. Flowchart of AutoStageMix.
Figure 2. (a) Waveform of Signal 1 (Base Audio), (b) waveform of Signal 2 (Target Audio), (c) cross-correlation plot between Signal 1 and Signal 2.
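The synchronization step illustrated in Figure 2 can be reproduced, in outline, by finding the lag that maximizes the cross-correlation between two audio tracks, in the spirit of generalized cross-correlation [24]. The sketch below is a minimal illustration rather than the system's exact implementation; the sample rate, the noise stand-in for real audio, and the use of scipy.signal.correlate and correlation_lags are assumptions.

import numpy as np
from scipy import signal

def estimate_offset(base, target, sr):
    """Lag (in seconds) that best aligns `target` to `base`.
    Positive means `target` starts later than `base`."""
    corr = signal.correlate(target, base, mode="full")                   # full cross-correlation
    lags = signal.correlation_lags(len(target), len(base), mode="full")  # lag axis for `corr`
    return lags[np.argmax(corr)] / sr                                    # lag of the correlation peak

# Illustrative usage: a noise waveform stands in for the shared song audio.
sr = 22050
rng = np.random.default_rng(0)
base = rng.standard_normal(5 * sr)                  # 5 s "reference" recording
target = np.concatenate([np.zeros(sr), base])       # same content, delayed by 1 s
print(round(estimate_offset(base, target, sr), 3))  # -> 1.0

With this sign convention, a positive offset means the target recording starts later than the base recording, so the target can be trimmed or padded before frame extraction.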
Figure 3. Head pose in 3-dimensional space.
Figure 4. Examples of transition types.
Figure 5. Example of an editing path.
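Figure 5 depicts an editing path through the synchronized stages; in the pipeline, the minimum-cost path is found with Dijkstra's algorithm [23]. The sketch below is a generic shortest-path routine over a hypothetical graph whose nodes are (frame, video) states; the toy edge costs only stand in for the paper's cost function and are not its actual values.

import heapq

def dijkstra(graph, source):
    """Minimum path costs from `source` in a graph given as
    {node: [(neighbor, weight), ...]} with non-negative weights."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, already relaxed
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u  # record improved path
                heapq.heappush(heap, (nd, v))
    return dist, prev                     # `prev` allows path reconstruction

# Toy graph: nodes are (frame index, video id); staying in the same video is
# cheap, while switching videos carries an illustrative transition cost.
graph = {
    (0, "A"): [((1, "A"), 0.1), ((1, "B"), 0.8)],
    (0, "B"): [((1, "B"), 0.1), ((1, "A"), 0.9)],
    (1, "A"): [((2, "A"), 0.1), ((2, "B"), 0.3)],
    (1, "B"): [((2, "B"), 0.1), ((2, "A"), 0.7)],
}
dist, prev = dijkstra(graph, (0, "A"))
print(dist[(2, "B")])  # cheapest cost of a path ending at frame 2 in video B -> 0.4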
Figure 6. Visualization of the FBT (face-based transition) process.
Figure 7. (a) Frames from the first video input (Video A), (b) frames from the second video input (Video B), (c) frames from the aligned video, where Video A and Video B are aligned using eye landmarks and head pose features.
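A plausible reading of the alignment in Figure 7 is a similarity transform (rotation, uniform scale, translation) estimated from matched eye landmarks, so that the face in Video B lands on the face in Video A before blending. The sketch below uses OpenCV's estimateAffinePartial2D for this purpose; the landmark coordinates and frame size are invented for illustration and are not values from the paper.

import cv2
import numpy as np

def align_to_base(frame_b, eyes_b, eyes_a):
    """Warp `frame_b` so that its eye landmarks `eyes_b` map onto `eyes_a`
    (both float32 arrays of (x, y) pixel coordinates)."""
    # Similarity transform: rotation + uniform scale + translation.
    M, _ = cv2.estimateAffinePartial2D(eyes_b, eyes_a)
    h, w = frame_b.shape[:2]
    return cv2.warpAffine(frame_b, M, (w, h))

# Invented eye-corner coordinates for two 1280x720 frames (illustration only).
eyes_a = np.float32([[540, 300], [600, 302], [680, 298], [740, 301]])
eyes_b = np.float32([[500, 330], [556, 333], [630, 329], [686, 332]])
frame_b = np.zeros((720, 1280, 3), dtype=np.uint8)   # placeholder for a real frame
aligned_b = align_to_base(frame_b, eyes_b, eyes_a)   # ready to blend over frame A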
Figure 8. (a) Frames from the Aligned Video from Figure 7c, (b) calculation of the Valid Area, (c) calculation of the Cropped Area based on the Valid Area, (d) results of cropping and resizing the frames from the Aligned Video.
Figure 9. (a) Linear transparency change, (b) non-linear transparency change (ease-in-out using a sigmoid function), (c) FBT using Linear Easing, (d) FBT using Non-linear Easing.
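Figure 9 contrasts a linear cross-fade with a non-linear, ease-in-out transparency curve built from a sigmoid. A minimal sketch of such a blend follows; the steepness constant and the 30-frame duration are arbitrary choices for illustration, not parameters taken from the paper.

import numpy as np

def sigmoid_alpha(num_frames, steepness=10.0):
    """Ease-in-out opacity curve from 0 to 1 over `num_frames` frames."""
    t = np.linspace(0.0, 1.0, num_frames)
    s = 1.0 / (1.0 + np.exp(-steepness * (t - 0.5)))  # logistic ease-in-out
    return (s - s[0]) / (s[-1] - s[0])                # rescale to exactly [0, 1]

def cross_fade(frame_a, frame_b, alpha):
    """Blend two equal-sized frames: alpha = 0 shows A, alpha = 1 shows B."""
    return ((1.0 - alpha) * frame_a + alpha * frame_b).astype(frame_a.dtype)

# Example: a 30-frame transition between two placeholder frames.
alphas = sigmoid_alpha(30)
frame_a = np.full((720, 1280, 3), 40, dtype=np.uint8)
frame_b = np.full((720, 1280, 3), 200, dtype=np.uint8)
faded = [cross_fade(frame_a, frame_b, a) for a in alphas]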
Figure 10. (a) FBT occurs between different individuals, (b) inaccurate landmark detection causes FBT errors, (c) excessive face enlargement occurs during FBT.
Table 1. Shot detection results for five videos.

            Video #1   Video #2   Video #3   Video #4   Video #5   Mean
Precision   1          1          0.828      0.624      1          0.890
Recall      0.857      0.935      0.828      0.869      0.679      0.834
F1-Score    0.923      0.967      0.828      0.726      0.808      0.850
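For reference, the F1-scores in Table 1 are the harmonic mean of the corresponding precision and recall values; for Video #4, for example:

F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.624 \times 0.869}{0.624 + 0.869} \approx 0.726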
Table 2. Number of frame pairs with similar faces.

Video                Jung et al. [1]   Lee et al. [2]   Our Method
Darl+ing [28]901730454
Ready to love [29]285317
God of Music [33]23697156
Table 3. Mean NME by method.

Song                 Method            Mean NME
Darl+ing [28]        Jung et al. [1]   0.1272
                     Lee et al. [2]    0.1214
                     Our method        0.0542
Ready to love [29]   Jung et al. [1]   0.2038
                     Lee et al. [2]    0.1256
                     Our method        0.1087
God of Music [33]    Jung et al. [1]   0.3086
                     Lee et al. [2]    0.1441
                     Our method        0.0880
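Table 3 reports the normalized mean error (NME) between the landmarks of matched face pairs. A standard formulation over K landmark pairs, assuming normalization by a reference distance d such as the interocular distance (the paper's exact normalizer is not restated here), is

\mathrm{NME} = \frac{1}{K}\sum_{k=1}^{K}\frac{\lVert \mathbf{p}_k - \hat{\mathbf{p}}_k \rVert_2}{d},

where p_k and p̂_k are corresponding landmark positions in the two frames of a pair; lower values indicate closer geometric agreement at the transition point.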
Table 4. Statistical test results; p-values below the Bonferroni-adjusted α = 0.0167 are considered significant.

Song                 Kruskal–Wallis p-Value   Jung et al. [1] vs. Lee et al. [2] (p-Value)   Jung et al. [1] vs. Our Method (p-Value)   Lee et al. [2] vs. Our Method (p-Value)
Darl+ing [28]        <0.001                   0.1720                                         <0.001                                     <0.001
Ready to love [29]   0.0136                   0.0165                                         0.0118                                     0.2762
God of Music [33]    <0.001                   0.0066                                         <0.001                                     <0.001
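The analysis summarized in Table 4 combines an omnibus Kruskal–Wallis test [34] with pairwise follow-up comparisons judged at a Bonferroni-adjusted [36] level of α = 0.05/3 ≈ 0.0167. A hedged sketch of this procedure with SciPy follows; the per-method score arrays are placeholders, and the Mann–Whitney U test is used purely as an illustrative stand-in for the pairwise test (the paper also cites the Wald–Wolfowitz test [35]).

from itertools import combinations
import numpy as np
from scipy import stats

# Placeholder per-method NME samples; the real values would come from the
# matched frame pairs behind Table 3.
rng = np.random.default_rng(42)
scores = {
    "Jung et al.": rng.normal(0.13, 0.03, 50),
    "Lee et al.": rng.normal(0.12, 0.03, 50),
    "Ours": rng.normal(0.05, 0.02, 50),
}

# Omnibus comparison across the three methods.
_, p_omnibus = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis p = {p_omnibus:.4g}")

# Pairwise follow-up tests judged at the Bonferroni-adjusted level.
alpha_adj = 0.05 / 3
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs. {name_b}: p = {p:.4g}, significant = {p < alpha_adj}")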
Table 5. Number of Transitions Based on Different t_INT Values.

t_INT (Frames)   FBT   INTER-C   INTRA-C   Total Number of Transitions   Average Time Interval Between Transitions (s)
45               16    24        13        53                            2.19
90               17    12        7         36                            3.23
150              20    9         4         33                            3.52
240              21    5         1         27                            4.31
Table 6. Processing time per pipeline stage (in seconds).

Song                 Preprocessing   Feature Extraction            Identifying a      Editing Path    StageMix     Total
                                     RetinaFace [9]   SPIGA [12]   Transition Point   Determination   Generation   Time (s)
Darl+ing [28]        258.71          1350.37          960.08       260.25             73.35           379.15       3281.91
Ready to love [29]   326.22          1268.96          571.15       218.72             6.13            417.47       2808.65
God of Music [33]    438.65          2769.19          1013.43      378.02             30.29           502.31       5131.89
Table 7. Comparison of overall satisfaction scores between Group 1 and Group 2.

                     Group 1 (AutoStageMix)   Group 2 (Professional Editors)
Mean                 4.349                    4.143
Standard Deviation   0.716                    0.774
