Article

FSTT: Flow-Guided Spatial Temporal Transformer for Deep Video Inpainting

Ruixin Liu and Yuesheng Zhu
Communication and Information Security Laboratory, Shenzhen Graduate School, Peking University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(21), 4452; https://doi.org/10.3390/electronics12214452
Submission received: 4 September 2023 / Revised: 16 October 2023 / Accepted: 20 October 2023 / Published: 29 October 2023
(This article belongs to the Special Issue Application of Machine Learning in Graphics and Images)

Abstract
Video inpainting aims to complete the missing regions with content that is consistent both spatially and temporally. How to effectively utilize the spatio-temporal information in videos is critical for video inpainting. Recent advances in video inpainting methods combine both optical flow and transformers to capture spatio-temporal information. However, these methods fail to fully explore the potential of optical flow within the transformer. Furthermore, the designed transformer block cannot effectively integrate spatio-temporal information across frames. To address the above problems, we propose a novel video inpainting model, named Flow-Guided Spatial Temporal Transformer (FSTT), which effectively establishes correspondences between missing regions and valid regions in both spatial and temporal dimensions under the guidance of completed optical flow. Specifically, a Flow-Guided Fusion Feed-Forward module is developed to enhance features with the assistance of optical flow, mitigating the inaccuracies caused by hole pixels when performing MHSA. Additionally, a decomposed spatio-temporal MHSA module is proposed to effectively capture spatio-temporal dependencies in videos. To improve the efficiency of the model, a Global–Local Temporal MHSA module is further designed based on the window partition strategy. Extensive quantitative and qualitative experiments on the DAVIS and YouTube-VOS datasets demonstrate the superiority of our proposed method.

1. Introduction

Video inpainting aims to fill in missing or corrupted regions of a video with plausible content and has been widely used in various practical applications, including video editing, damaged video restoration, and watermark removal. Compared to image inpainting, video inpainting presents greater challenges due to the additional temporal dimension. In addition to generating visually plausible content for each frame, video inpainting has to maintain temporal coherence in the missing regions. While significant progress has been made in image inpainting, directly processing a video frame by frame with an image inpainting method leads to temporal inconsistencies and severe artifacts due to the complex motion of objects and the camera.
Although the temporal dimension brings challenges to video inpainting, it inherently provides more information for restoring the missing regions. Therefore, effectively utilizing complementary information across frames to synthesize high-quality content is critical for video inpainting. Advances in deep learning and computer vision have enabled the development of deep video inpainting, and a number of deep learning-based video inpainting methods have been proposed [1,2,3,4,5,6,7,8,9,10,11]. These methods can be roughly classified into two categories: pixel-based methods and flow-based methods. The first category generally utilizes 3D convolution [1,2,3], attention mechanisms [4,5,6], or transformers [10,11] to capture the spatio-temporal correlations among video frames. These methods take corrupted frames as input and employ the learned spatio-temporal correlations to directly infer the missing regions without going through complicated transformations. Flow-based methods [7,9] argue that completing optical flow is much easier than completing the pixels in missing regions and formulate video inpainting as a pixel propagation problem. These methods complete the optical flow first and use the synthesized flow to guide pixel propagation from the valid regions to the missing regions. Compared with pixel-based methods, flow-based methods can produce inpainting results with high-frequency details. However, flow-based video inpainting methods depend highly on the accuracy of the completed flow: incorrect optical flow will greatly degrade the final inpainting quality. Furthermore, errors in early stages inevitably propagate to subsequent stages, yielding inconsistent results.
Recently, transformers have drawn great attention. The powerful long-range spatio-temporal modeling capability of video transformers has led to great progress in several video-related tasks, such as video super-resolution [12,13] and video action recognition [14,15]. Not surprisingly, more and more researchers have begun to employ video transformers for deep video inpainting [10,11]. These methods take multiple frames as input and utilize various transformer blocks to establish correspondences between missing-region tokens and valid-region tokens. The correspondences are then used to hallucinate the missing regions and generate the final inpainting results. However, the existence of hole-region tokens easily leads to inaccurate results when estimating correspondences. To achieve better video inpainting results, some researchers have made efforts to integrate optical flow and video transformer techniques, e.g., E2FGVI [16] and FGT [17]. In these methods, the corrupted optical flows are first completed, and content is propagated across frames using the completed flows. The propagated information offers more effective cues for the subsequent transformer-based integration. While some promising inpainting results have been shown, these methods fail to fully explore the guidance of optical flow: they only propagate information in the early stage of the network, without considering the effect of optical flow on subsequent transformer blocks. Moreover, video scenes are variable due to the complex motion of cameras and objects. For example, when the video is almost static, it is difficult for the network to capture valid information from adjacent frames. The transformer blocks proposed in [16,17], which are composed of a temporal transformer block or a separated spatial temporal block, cannot integrate effective spatio-temporal information across frames well.
To address the above problems, in this paper, we propose a novel Flow-Guided Spatial Temporal Transformer (FSTT) architecture for deep video inpainting, which aims to effectively utilize the remarkable spatio-temporal modeling capability of transformers under the guidance of optical flow. More specifically, to mitigate the degradation caused by hole-region pixels when establishing correspondences between missing regions and valid regions, we introduce the completed optical flow into each transformer block and design a Flow-Guided Fusion Feed-Forward (Flow-Guided F3N) module to replace the two-layer MLPs in the conventional transformer architecture. The Flow-Guided F3N module propagates information across video frames along the optical flow trajectory, providing more effective information for the subsequent attention operations. For video inpainting, when the neighboring frames cannot provide sufficient information for the missing regions, the spatial information within the current frame can also be utilized. Based on this observation, a decomposed spatial temporal MHSA (multi-head self-attention) mechanism is proposed, in which a temporal MHSA module captures the temporal information in videos and a spatial MHSA module further integrates the spatial information in each frame. To improve the efficiency of the network, we further design a global–local attention mechanism for the temporal MHSA module, called Global–Local Temporal MHSA. We employ a local temporal attention mechanism within a small window and integrate global temporal information in a coarse-fine-grained way. Through the above designs, our network effectively and efficiently leverages the complementary content across video frames, producing results of high visual quality.
We conduct extensive quantitative and qualitative experiments on two popular video inpainting datasets, DAVIS [18] and YouTube-VOS [19], to validate the effectiveness of the proposed network. The experimental results show the superiority of our method over state-of-the-art methods.
The main contributions of the proposed method are summarized as follows:
  • We propose a novel Flow-Guided Spatial Temporal Transformer (FSTT) architecture for high-quality video inpainting.
  • We propose a Flow-Guided F3N module to alleviate the inaccuracy caused by hole pixels when performing MHSA.
  • We propose a decomposed spatial temporal MHSA to effectively integrate the spatio-temporal information across frames. A global–local temporal attention mechanism is further designed to improve the efficiency of the FSTT.

2. Related Works

In this section, we first review the research progress on image inpainting, the task most relevant to video inpainting. Then, related works on video inpainting are introduced.

2.1. Image Inpainting

Image inpainting is the process of filling in missing or damaged regions of an image with plausible information. Traditional approaches for image inpainting include diffusion-based methods [20,21,22] and patch-based methods [23,24,25]. These approaches typically rely on some form of image modeling or statistical analysis to guide the inpainting process. Early diffusion-based methods propagate local information to the missing region with smoothness constraints. Patch-based methods exploit the redundancy in natural images and sample patches from known regions to complete the missing regions according to patch-level similarities. As one of the representative image inpainting methods, PatchMatch [24] searches for patches outside the hole based on an approximate nearest-neighbor algorithm and is widely used in practical editing tools. While some progress has been made, it is difficult for traditional image inpainting methods to handle complex scenes due to their lack of understanding of image semantics.
In recent years, deep learning-based methods have achieved significant success in image inpainting, especially with the development of generative adversarial networks (GANs). Pathak et al. [26] propose an encoder–decoder architecture with generative adversarial learning for image inpainting. Based on this architecture, numerous improved variants have been developed, such as contextual attention [27,28], global and local discriminators [29], and progressive learning [30]. Moreover, Liu et al. [31] and Yu et al. [27] utilize partial convolution and gated convolution, respectively, to address the limitations of vanilla convolution in image inpainting. To enhance the texture details of inpainted results, some methods propose a two-stage architecture that incorporates auxiliary priors, such as semantic segmentation maps [32], edges [33], and foreground [34], as guidance. With the completed auxiliary priors, the network is able to generate high-quality inpainting results in the second stage. Furthermore, vision transformer technology has also been introduced into image inpainting [35] and has achieved promising results.

2.2. Video Inpainting

Traditional video inpainting methods are usually extended from patch-based image inpainting methods. Patwardhan et al. [36,37] extend image texture synthesis to the temporal domain and employ a greedy algorithm to find non-local patches. However, this approach can only be applied to scenes with static cameras and limited camera motion. To address inpainting under dynamic camera motion, Wexler et al. [38] extend 2D image patches to 3D spatio-temporal video patches and formulate video inpainting as a global optimization problem, alternating between patch searching and reconstruction steps. Newson et al. [39] propose 3D PatchMatch to enforce temporal consistency and accelerate the patch searching process. To cope with more complex scenarios, some methods utilize 2D patches to complete the hole regions and maintain temporal consistency by incorporating explicit flow constraints or homography-based registration. Granados et al. [40] utilize homographies to align video frames and apply optical flow to the inpainted frames to enforce temporal coherence. Huang et al. [41] perform alternating optimization over patch search, optical flow estimation, and color completion. Different from the above methods, Li et al. [42] propose a depth-guided mesh-warping model to collect reference information from local adjacent frames and design a short-long-term propagation-based framework to achieve inpainting.
Deep learning provides a more efficient and effective approach for video inpainting. Deep learning-based video inpainting methods can usually be classified into the following categories: 3D CNN-based [1,2,3,43], flow-guided [7,9], attention-based [4,5,6], and transformer-based [10,11]. The 3D CNN-based methods exploit 3D convolutions to capture temporal information. Wang et al. [1] propose the first deep learning-based video inpainting framework; they adopt a 3D CNN to capture the temporal structure from low-resolution inputs and further utilize a 2D CNN to recover the spatial details. Chang et al. [2] extend 2D gated convolution and PatchGAN to 3D to address the video inpainting problem. Hu et al. [43] combine 3D convolution with a region proposal-based strategy to enhance the quality of the inpainted results. Among flow-guided video inpainting methods, VINet [8] aligns frames with the estimated optical flow and uses a recurrent neural network to integrate temporal information. Xu et al. [7] propose a multi-stage video inpainting network: they first restore the optical flow and then utilize the completed forward–backward flows to propagate information from potentially distant frames; finally, an image inpainting method is employed to fill the remaining missing regions. Gao et al. [9] further introduce an edge prior in the flow map to preserve the sharpness of motion boundaries. Wang et al. [44] utilize local and non-local flow to improve the quality of the completed optical flow. Attention mechanisms have the ability to capture long-range correspondences from distant frames. OPN [5] designs an asymmetric attention block to progressively aggregate temporal information from the hole boundary. Recently, Zeng et al. [10] and Liu et al. [11] designed specific transformer architectures to find correspondences within a considerable temporal receptive field. To reduce the computational cost, Zhang et al. [45] propose a 3D Swin Transformer model for video inpainting. Moreover, some methods combine two or more of the above approaches. For example, Li et al. [16] and Zhang et al. [17] integrate flow-guided and transformer-based methods into a framework that utilizes the guidance of optical flow to enhance the accuracy of the transformer, producing superior inpainting results.

3. Proposed Methods

Given a set of corrupted video frames $X := \{X_1, X_2, \ldots, X_t\}$ of height h, width w, and length t in RGB space, with the corresponding missing-region masks $M := \{M_1, M_2, \ldots, M_t\}$, where each $M_i$ is a binary mask and the value '0' represents known pixels, our network aims to learn a function $\mathcal{F}: X \to \hat{Y}$ that generates the inpainting results $\hat{Y} := \{\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_t\}$. The results $\hat{Y}$ should be consistent both spatially and temporally, and as consistent as possible with the ground-truth video $Y := \{Y_1, Y_2, \ldots, Y_t\}$.
To achieve this goal, we propose a novel Flow-Guided Spatial Temporal Transformer (FSTT) for video inpainting. In the following, an overview of FSTT is first introduced. Then, we present the detailed design of the FSTT. Finally, the loss functions that are utilized to train the network are given.

3.1. Network Overview

The overall framework of the proposed FSTT, illustrated in Figure 1, mainly consists of two stages: optical flow completion and corrupted frame inpainting. These two stages comprise four components: (1) the optical flow completion module, (2) the frame feature encoder module, (3) the flow-guided spatial temporal transformer blocks, and (4) the decoder module.
Specifically, taking a corrupted video sequence $\{X_i \in \mathbb{R}^{h \times w \times 3} \mid i \in [0, T]\}$ and its corresponding binary masks $\{M_i \in \mathbb{R}^{h \times w \times 1} \mid i \in [0, T]\}$ as inputs, FSTT first completes the forward optical flows $\hat{F}_f$ and backward optical flows $\hat{F}_b$ across the input frames at 1/4 resolution through the optical flow completion module $\mathcal{F}$. These completed flows are utilized to guide the restoration of the corrupted frames at the feature level.
In the frame inpainting stage, a convolutional encoder built on stacked 2D convolution layers extracts contextual features from the input frames, producing c-channel feature maps $\{f_i \in \mathbb{R}^{h/4 \times w/4 \times c} \mid i \in [0, T]\}$. Then, valid information is propagated between the feature maps with the help of the completed bidirectional flows $\hat{F}_f$ and $\hat{F}_b$, which provides more effective information for the missing regions. Thirdly, the propagated features are split into smaller patches and flattened into one-dimensional tokens $Z \in \mathbb{R}^{t \times n \times d}$, where t is the frame length, n is the number of tokens in one feature map, and d is the token channel dimension. Next, Z is fed into the core component, the flow-guided spatial temporal transformer blocks, to integrate the spatial temporal information in the video under the guidance of the completed optical flows $\hat{F}_f$ and $\hat{F}_b$, producing refined tokens $\tilde{Z} \in \mathbb{R}^{t \times n \times d}$. The refined tokens are then linearly transformed and reshaped into feature maps $\{d_i \in \mathbb{R}^{h/4 \times w/4 \times c} \mid i \in [0, T]\}$. Finally, similar to the encoder module, a decoder module with a series of deconvolution layers decodes the features back to the completed RGB results $\{\hat{Y}_i \in \mathbb{R}^{h \times w \times 3} \mid i \in [0, T]\}$.
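To make the data flow above concrete, the following PyTorch sketch mirrors the two-stage pipeline with toy stand-ins for every component (a single convolution in place of the flow completion module, generic transformer encoder layers in place of the FSTT blocks); the layer sizes and the omission of flow guidance inside the blocks are simplifications of ours, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FSTTPipelineSketch(nn.Module):
    """Toy stand-in that mirrors the data flow of the two-stage pipeline."""

    def __init__(self, c=128, d=512, num_blocks=8):
        super().__init__()
        # Stage 1: flow completion (a single conv as a stand-in for the SpyNet-like module).
        self.flow_net = nn.Conv2d(6, 4, 3, padding=1)   # predicts fwd + bwd flow (2 + 2 channels)
        # Stage 2: encoder (overall stride 4), transformer blocks, decoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, c, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(c, c, 3, stride=2, padding=1))
        self.to_tokens = nn.Conv2d(c, d, 1)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)  # stand-ins for FSTT blocks
            for _ in range(num_blocks))
        self.from_tokens = nn.Conv2d(d, c, 1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(c, c, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(c, 3, 4, stride=2, padding=1))

    def forward(self, frames, masks):
        # frames: (t, 3, h, w); masks: (t, 1, h, w) with 1 marking the hole.
        t, _, h, w = frames.shape
        small = F.interpolate(frames, scale_factor=0.25)                       # 1/4-resolution inputs
        flows = self.flow_net(torch.cat([small, small.roll(-1, 0)], dim=1))    # completed flows (unused in this sketch)
        feats = self.encoder(torch.cat([frames * (1 - masks), masks], dim=1))  # (t, c, h/4, w/4)
        tokens = self.to_tokens(feats).flatten(2).transpose(1, 2)              # (t, n, d)
        for blk in self.blocks:                                                # flow guidance omitted here
            tokens = blk(tokens)
        feats = self.from_tokens(tokens.transpose(1, 2).reshape(t, -1, h // 4, w // 4))
        return self.decoder(feats)                                             # (t, 3, h, w) completed frames
```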

3.2. Optical Flow Completion Module

The motion information between video frames provides significant assistance in solving various video-related tasks, such as video segmentation and video object detection. Similarly, the motion information is also crucial for the video inpainting task. In this paper, we introduce optical flow into video inpainting to better integrate the spatial temporal information of videos.
To leverage the motion information in the video, we first need to obtain the completed optical flow maps. Considering the efficiency of the proposed network, we exploit a lightweight optical flow estimation network $\mathcal{F}$ to complete the flow. The network $\mathcal{F}$ adopts an architecture similar to SpyNet [46], which is widely used for optical flow-related tasks.
Specifically, we down-sample the input corrupted frames X to 1/4 resolution (denoted as $\tilde{X} \in \mathbb{R}^{h/4 \times w/4 \times 3}$), which matches the spatial resolution of the encoder feature maps. The completed forward optical flow $\hat{F}_{t \to t+1}$ between frames $\tilde{X}_t$ and $\tilde{X}_{t+1}$ is computed by the flow estimation module $\mathcal{F}$ as follows:

$$\hat{F}_{t \to t+1} = \mathcal{F}(\tilde{X}_t, \tilde{X}_{t+1}),$$

and the backward optical flow $\hat{F}_{t+1 \to t}$ is computed in a similar manner.
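Assuming a SpyNet-style module that maps a pair of RGB frames to a 2-channel flow map (the flow_net interface below is hypothetical), the bidirectional completion step can be sketched as:

```python
import torch
import torch.nn.functional as F


def complete_bidirectional_flows(frames, flow_net):
    """frames: (t, 3, h, w) corrupted frames; flow_net: any module mapping a
    pair of RGB frames to a 2-channel flow map (a SpyNet-like stand-in)."""
    small = F.interpolate(frames, scale_factor=0.25, mode='bilinear',
                          align_corners=False)                       # X~ at 1/4 resolution
    fwd = [flow_net(small[i:i + 1], small[i + 1:i + 2])              # forward flow t -> t+1
           for i in range(frames.size(0) - 1)]
    bwd = [flow_net(small[i + 1:i + 2], small[i:i + 1])              # backward flow t+1 -> t
           for i in range(frames.size(0) - 1)]
    return torch.cat(fwd), torch.cat(bwd)
```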

3.3. Flow-Guided Feature Propagation

With the completed optical flow, we can propagate valid information from neighboring frames to the corrupted region in the current frame via the warping operation $\mathcal{W}(\cdot)$:

$$f_{t+1 \to t} = \mathcal{W}(\hat{F}_{t \to t+1}, f_{t+1}).$$
However, directly obtaining accurate optical flow is challenging due to the existence of missing regions. Inaccurate optical flow will lead to irrelevant information propagation, significantly degrading the quality of the inpainting results. To alleviate this problem, inspired by [47], we combine deformable convolution with optical flow propagation to improve the information propagation performance.
Figure 2 shows the improved pipeline for feature propagation from $f_{t+1}$ to $f_t$. The neighboring frame feature $f_{t+1}$ is first aligned with the current frame feature $f_t$ through Equation (2), obtaining the pre-aligned feature map $f_{t+1 \to t}$. Then, we concatenate $f_{t+1 \to t}$ and $f_t$ and estimate the offsets $o_{t+1 \to t}$ and modulation masks $m_{t+1 \to t}$ between them with an offset prediction network:

$$o_{t+1 \to t} = C_o(\mathrm{concat}(f_{t+1 \to t}, f_t)), \qquad m_{t+1 \to t} = \sigma(C_m(\mathrm{concat}(f_{t+1 \to t}, f_t))),$$

where $C_o$ and $C_m$ denote stacks of convolutional layers, and $\sigma$ is the sigmoid function.

The learned offsets $o_{t+1 \to t}$ contain the motion information between frames, which further compensates for inaccurate optical flow. Thus, we add the learned offsets $o_{t+1 \to t}$ to the completed optical flow map $\hat{F}_{t \to t+1}$ to generate the refined offsets $\hat{o}_{t+1 \to t}$. The deformable convolution operation is then applied to $f_{t+1}$ to generate the final propagated feature $\hat{f}_{t+1 \to t}$:

$$\hat{f}_{t+1 \to t} = \mathrm{DCN}(o_{t+1 \to t} + \hat{F}_{t \to t+1},\ m_{t+1 \to t},\ f_{t+1}).$$
Finally, we merge $\hat{f}_{t+1 \to t}$ with the current frame feature $f_t$ using several convolution layers. In our implementation, we perform bidirectional propagation across frame features, and a convolution layer with a 1 × 1 kernel is utilized to fuse the forward and backward propagated features.
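A minimal sketch of this propagation step is given below, using torchvision's deformable convolution; the layer sizes, offset-group count, and flow channel ordering are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d


def flow_warp(feat, flow):
    # feat: (b, c, h, w); flow: (b, 2, h, w) pixel displacements, assumed ordered (x, y).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys)).float().to(feat.device)              # (2, h, w) pixel coordinates
    coords = base.unsqueeze(0) + flow                                 # absolute sampling positions
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,               # normalise x to [-1, 1]
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)      # normalise y to [-1, 1]
    return F.grid_sample(feat, grid, align_corners=True)              # warping operation W in Eq. (2)


class FlowGuidedPropagation(nn.Module):
    def __init__(self, c=128, groups=8, k=3):
        super().__init__()
        self.offset_head = nn.Conv2d(2 * c, 2 * groups * k * k, 3, padding=1)  # C_o: predicts offsets
        self.mask_head = nn.Conv2d(2 * c, groups * k * k, 3, padding=1)        # C_m: predicts modulation masks
        self.weight = nn.Parameter(torch.randn(c, c // groups, k, k) * 1e-2)   # deformable conv kernel

    def forward(self, f_t, f_next, flow):
        # f_t, f_next: (b, c, h, w) frame features; flow: completed flow from t to t+1, (b, 2, h, w).
        pre_aligned = flow_warp(f_next, flow)                         # pre-alignment with the completed flow
        x = torch.cat([pre_aligned, f_t], dim=1)
        offsets = self.offset_head(x)                                 # learned offsets
        masks = torch.sigmoid(self.mask_head(x))                      # modulation masks
        # Refine the offsets by adding the completed flow; the 2-channel flow is
        # repeated for every sampling point of the deformable kernel.
        flow_rep = flow.flip(1).repeat(1, offsets.size(1) // 2, 1, 1)
        propagated = deform_conv2d(f_next, offsets + flow_rep, self.weight,
                                   padding=1, mask=masks)
        # In the full model, forward and backward propagated features are fused
        # with the current feature by a 1x1 convolution (omitted in this sketch).
        return propagated
```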

3.4. Flow-Guided Spatial Temporal Transformer Block

To effectively leverage the spatial temporal information across video frames, we introduce optical flow into the transformer block to guide the information propagation. Furthermore, for video, in addition to the temporal information across frames, the spatial information of the current video frame can also be leveraged to fill in the missing regions. Thus, both spatial and temporal attention are utilized in the transformer block to integrate the complementary content across frames. To this end, a novel Flow-Guided Spatial Temporal Transformer block (FSTT block) is proposed.
The illustration of the FSTT block is shown in Figure 1. The FSTT block mainly consists of three parts: Spatial Multi-Head Self-Attention (Spatial MHSA), Global–Local Temporal Multi-Head Self-Attention (Global–Local Temporal MHSA), and Flow-Guided Fusion Feed-Forward (Flow-Guided F3N). In detail, the input of the FSTT block $Z \in \mathbb{R}^{t \times n \times d}$ is first projected to the query, key, and value features q, k, and v, respectively, as follows:

$$q, k, v = L_q(Z), L_k(Z), L_v(Z),$$

where $L_q$, $L_k$, and $L_v$ denote 1 × 1 linear projection layers. Then, we conduct MHSA on the temporal dimension and the spatial dimension separately. The query, key, and value features are split into different heads along the channel dimension. For temporal MHSA, the attention retrieval is performed on the tokens across all input frames simultaneously, formulated as:

$$\mathrm{MHSA}_T(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d}}\right) \cdot v,$$

where q, k, and v are rearranged into the shape $n \times (t \times h \times w) \times d$ (here n denotes the number of heads, and $d = c / n$). The Spatial MHSA adopts a computational approach similar to the Temporal MHSA, but it only conducts attention within each frame; the shape of q, k, and v is $(n \times t) \times (h \times w) \times d$.
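As a concrete illustration of this decomposition, the sketch below shows how the same attention computation can be applied over all tokens of a clip (temporal MHSA) or restricted to the tokens of each frame (spatial MHSA) purely by reshaping; the head count and tensor layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch


def temporal_mhsa(q, k, v, heads=4):
    # q, k, v: (t, h, w, c) token maps; every head attends over all t*h*w tokens of the clip.
    t, h, w, c = q.shape
    d = c // heads

    def split(x):
        return x.reshape(t * h * w, heads, d).permute(1, 0, 2)        # (heads, t*h*w, d)

    q_, k_, v_ = map(split, (q, k, v))
    attn = torch.softmax(q_ @ k_.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (attn @ v_).permute(1, 0, 2).reshape(t, h, w, c)


def spatial_mhsa(q, k, v, heads=4):
    # Same computation, but attention is restricted to the h*w tokens of each frame.
    t, h, w, c = q.shape
    d = c // heads

    def split(x):
        return x.reshape(t, h * w, heads, d).permute(0, 2, 1, 3)      # (t, heads, h*w, d)

    q_, k_, v_ = map(split, (q, k, v))
    attn = torch.softmax(q_ @ k_.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (attn @ v_).permute(0, 2, 1, 3).reshape(t, h, w, c)
```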
However, calculating temporal attention directly on all input video frames will result in a significant computational cost. To address this issue, inspired by the window partition strategy proposed in the recent transformer method [48], we further design a Global–Local Temporal Multi-Head Self-Attention (Global–Local Temporal MHSA) with a coarse-fine-grained temporal attention calculation, which improves network efficiency while maintaining performance.
After performing temporal and spatial attention, we feed the features into a Flow-Guided Fusion Feed-Forward (Flow-Guided F3N) module. The Flow-Guided F3N utilizes the completed flow to propagate information across frame features, providing more effective information for the subsequent blocks and compensating for the inaccuracy introduced by the missing regions when conducting MHSA. The whole process in the FSTT block is formulated as:

$$Z' = \mathrm{MHSA}_S(\mathrm{MHSA}_T(\mathrm{LN}_1(Z))) + Z, \qquad \tilde{Z} = \mathrm{FGF3N}(\mathrm{LN}_2(Z')) + Z'.$$

3.4.1. Global–Local Temporal MHSA

We introduce a window partition strategy into the temporal MHSA to improve network efficiency. As shown in Figure 3, given the token map $Z \in \mathbb{R}^{t \times h \times w \times c}$, we divide it into several non-overlapping windows of size $s_h \times s_w \times s_t$, where the local tokens in a window are denoted as $Z_L$. For each token in the hole region, the local temporal window provides fine-grained information integration. To obtain global information for the temporal MHSA, a convolution layer $L_g$ with kernel size k and stride s is applied to perform spatial window pooling and generate the global tokens $Z_G = L_g(Z)$.
In order to obtain local and global information simultaneously, we calculate attention with local–global interactions. Specifically, for a query in a local window, we find correspondences not only within the local window but also from the global tokens. We concatenate the local tokens $Z_L$ and the global tokens $Z_G$ and then project them into the key and value. The query q, key k, and value v are generated as follows:

$$q = L_q(Z_L), \quad k = L_k(\mathrm{concat}(Z_L, Z_G)), \quad v = L_v(\mathrm{concat}(Z_L, Z_G)),$$

where $L_q$, $L_k$, and $L_v$ are linear projection layers. The generated q, k, and v are then used for the temporal MHSA computation.
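The sketch below illustrates this query/key/value construction: queries come from the tokens of a local window, while keys and values additionally include spatially pooled global tokens. The window size, pooling kernel, and the choice to treat the whole clip as the temporal window are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn


class GlobalLocalTemporalMHSA(nn.Module):
    """Queries from local windows; keys/values from local windows plus pooled global tokens."""

    def __init__(self, dim=512, heads=4, window=(9, 8), pool_k=4):
        super().__init__()
        self.heads, self.window = heads, window
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.Conv2d(dim, dim, kernel_size=pool_k, stride=pool_k)   # L_g: spatial window pooling

    def forward(self, z):
        # z: (t, h, w, c) with c == dim; h, w assumed divisible by the window and pooling
        # sizes, and the temporal window is taken to be the whole clip for simplicity.
        t, h, w, c = z.shape
        sh, sw = self.window
        local = z.reshape(t, h // sh, sh, w // sw, sw, c)
        local = local.permute(1, 3, 0, 2, 4, 5).reshape(-1, t * sh * sw, c)   # (num_windows, t*sh*sw, c)
        glob = self.pool(z.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)    # (t, n_g, c) global tokens
        glob = glob.reshape(1, -1, c).expand(local.size(0), -1, -1)           # shared by every window
        q = self.to_q(local)
        k, v = self.to_kv(torch.cat([local, glob], dim=1)).chunk(2, dim=-1)
        d = c // self.heads

        def split(x):                                                         # (B, N, c) -> (B, heads, N, d)
            return x.reshape(x.size(0), x.size(1), self.heads, d).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(local.size(0), t * sh * sw, c)
        # Folding the windows back to a (t, h, w, c) token map is omitted in this sketch.
        return out
```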

3.4.2. Flow-Guided Fusion Feed-Forward Module

To enhance the sub-token fusion capability and learn fine-grained features when applying transformers to video inpainting, Soft Split (SS) and Soft Composition (SC) operations were proposed in F3N [11] to replace the two-layer MLPs of the conventional transformer. The SS operation softly splits video frames into patches with overlapping regions, and the SC operation softly composites these overlapping patches back into images, improving the quality of the inpainted results. Our Flow-Guided Fusion Feed-Forward module (Flow-Guided F3N, FGF3N) is built on F3N and inserts flow-guided feature propagation between the SC and SS operations. The processing of FGF3N is illustrated in Figure 4.
Let $tv$ represent the token vectors input to FGF3N, where $tv = \mathrm{LN}_2(Z')$. FGF3N first utilizes an MLP layer to process $tv$, generating tokens $tv_i$. Then, the Soft Composition operation, flow-guided feature propagation, and Soft Split operation are applied to $tv_i$ step by step. The formulation of FGF3N is as follows:

$$tv_i = \mathrm{MLP}(tv), \quad tv_c = \mathrm{SC}(tv_i), \quad tv_f = \mathrm{FGFP}(tv_c, \hat{F}), \quad tv_s = \mathrm{SS}(tv_f), \quad tv' = \mathrm{MLP}(tv_s),$$

where MLP refers to a multi-layer perceptron, and FGFP denotes the flow-guided feature propagation. The Soft Composition operation composes the one-dimensional token vectors $tv_i$ into a 2D feature map, enabling FGF3N to perform flow-guided feature propagation.
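A simple way to realize the SS and SC operations is with unfold/fold, as sketched below; the patch size, stride, and the averaging of overlapping regions are illustrative choices and may differ from the F3N implementation.

```python
import torch
import torch.nn.functional as F


def soft_split(feat, kernel=(7, 7), stride=(3, 3), padding=(3, 3)):
    # feat: (t, c, h, w) -> overlapping patches flattened to tokens, (t, n, c*kh*kw).
    return F.unfold(feat, kernel_size=kernel, stride=stride, padding=padding).transpose(1, 2)


def soft_composition(tokens, out_size, kernel=(7, 7), stride=(3, 3), padding=(3, 3)):
    # tokens: (t, n, c*kh*kw) -> (t, c, h, w); overlapping contributions are
    # averaged here by normalising with the fold of an all-ones tensor.
    patches = tokens.transpose(1, 2)
    summed = F.fold(patches, out_size, kernel_size=kernel, stride=stride, padding=padding)
    counts = F.fold(torch.ones_like(patches), out_size,
                    kernel_size=kernel, stride=stride, padding=padding)
    return summed / counts.clamp(min=1e-6)


# Inside FGF3N, the flow-guided propagation runs on the composed 2D feature map:
#   tv_c = soft_composition(mlp(tv), (h, w)); tv_f = fgfp(tv_c, flows); tv_s = soft_split(tv_f)
```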

3.5. Training Losses

Three loss functions are utilized to train the FSTT: the flow estimation loss $\mathcal{L}_f$, the reconstruction loss $\mathcal{L}_{rec}$, and the adversarial loss $\mathcal{L}_{adv}$.
We use the flow estimation loss to train the optical flow completion network. The flow estimation loss measures the distance between the completed bidirectional flows and ground-truth flows:
$$\mathcal{L}_f = \sum_{t=1}^{T-1} \left\| \hat{F}_{t \to t+1} - F_{t \to t+1} \right\|_1 + \sum_{t=2}^{T} \left\| \hat{F}_{t \to t-1} - F_{t \to t-1} \right\|_1,$$

where $F_{t \to t+1}$ and $F_{t \to t-1}$ represent the ground-truth forward and backward optical flows, respectively. These flows are extracted from the original uncorrupted video frames with a pre-trained flow extraction network.
In addition, we apply a pixel-wise reconstruction loss to both the hole region and the valid region to constrain the inpainted results to approximate the ground-truth frames. This loss is defined as follows:

$$\mathcal{L}_{rec} = \left\| (1 - M_t) \odot (Y_t - \hat{Y}_t) \right\|_1 + \left\| M_t \odot (Y_t - \hat{Y}_t) \right\|_1,$$

where $\odot$ denotes element-wise multiplication.
Inspired by recent video inpainting works [10,16], the T-PatchGAN loss [10] is also leveraged to supervise the training process. T-PatchGAN improves the perceptual quality and spatio-temporal coherence of video inpainting results through adversarial training. The generator loss for FSTT is:

$$\mathcal{L}_{adv} = -\mathbb{E}_{z \sim P_{\hat{Y}}(z)}\left[ D(z) \right].$$
The detailed optimization function of the T-Patch GAN discriminator D is formulated as:
$$\mathcal{L}_D = \mathbb{E}_{x \sim P_{Y}(x)}\left[ \mathrm{ReLU}(1 - D(x)) \right] + \mathbb{E}_{z \sim P_{\hat{Y}}(z)}\left[ \mathrm{ReLU}(1 + D(z)) \right].$$
The overall optimization function is as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{rec} + \lambda_{adv} \cdot \mathcal{L}_{adv} + \lambda_f \cdot \mathcal{L}_f.$$
Following previous work [16], the weights of $\mathcal{L}_{rec}$, $\mathcal{L}_{adv}$, and $\mathcal{L}_f$ are set to 1, 0.01, and 1, respectively. All modules in the proposed FSTT are jointly optimized in an end-to-end manner.
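Putting the objective together, the following sketch computes the three losses and the weighted total as described above; the mean-style normalization of the L1 terms and the hinge form of the discriminator loss are assumptions consistent with T-PatchGAN-based works, not verified details of the released code.

```python
import torch
import torch.nn.functional as F


def flow_loss(pred_fwd, gt_fwd, pred_bwd, gt_bwd):
    # Flow estimation loss L_f: L1 distance between completed and ground-truth bidirectional flows.
    return F.l1_loss(pred_fwd, gt_fwd) + F.l1_loss(pred_bwd, gt_bwd)


def reconstruction_loss(pred, target, mask):
    # Reconstruction loss L_rec: L1 penalty on both the valid region (1 - M) and the hole region M.
    return (torch.abs((1 - mask) * (pred - target)).mean()
            + torch.abs(mask * (pred - target)).mean())


def generator_adv_loss(disc_fake):
    # T-PatchGAN generator term L_adv = -E[D(Y_hat)].
    return -disc_fake.mean()


def discriminator_loss(disc_real, disc_fake):
    # Hinge-style discriminator objective L_D.
    return F.relu(1 - disc_real).mean() + F.relu(1 + disc_fake).mean()


def total_loss(pred, target, mask, flows, disc_fake, lam_adv=0.01, lam_f=1.0):
    # L_total = L_rec + lambda_adv * L_adv + lambda_f * L_f with the weights stated above.
    return (reconstruction_loss(pred, target, mask)
            + lam_adv * generator_adv_loss(disc_fake)
            + lam_f * flow_loss(*flows))
```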

4. Experiments

4.1. Datasets

Two widely-used datasets in video inpainting tasks, DAVIS [18] and YouTube-VOS [19], are adopted to train and evaluate the proposed network.
The YouTube-VOS dataset is a large-scale benchmark dataset for the task of video object segmentation. The videos are sourced from YouTube and cover a wide range of categories, such as animals, sports, music, and news. The dataset contains 4453 high-resolution video sequences, covering a wide range of object types and motion patterns. The training set, validation set, and test set of YouTube-VOS contain 3471, 474, and 508 video sequences, respectively.
The DAVIS dataset consists of a set of videos with varying degrees of complexity, containing multiple objects with different shapes, sizes, and motions. It contains 150 high-quality video sequences, including 90 for testing and 60 for training, and has been used in a wide range of research projects and competitions. A subset of 90 video sequences provides densely annotated, pixel-level segmentation masks for the object of interest in every frame.
Following [16], two types of free-form masks, moving masks and stationary masks, are used to train the network. Moving masks simulate real-world applications such as object removal, while stationary masks correspond to tasks such as watermark removal. We train the proposed model on the training set of the YouTube-VOS dataset and evaluate it on both the DAVIS and YouTube-VOS datasets. For the ablation studies, we conduct experiments on the DAVIS dataset.

4.2. Implementation Details

The channel dimension of the encoder and decoder in our model is set to 128. Eight stacked flow-guided spatial temporal transformer blocks are utilized in FSTT, and the dimension of the tokens is set to 512. We conduct experiments on videos with a resolution of 432 × 256. We also initialize the optical flow estimation network $\mathcal{F}$ with the pre-trained weights of SpyNet to provide prior knowledge about optical flow.
The Adam optimizer [49] with $\beta_1 = 0$ and $\beta_2 = 0.99$ is adopted to train the model. The initial learning rate is set to 0.0001 and decreased by a factor of 10 at 400 K iterations. When the interval between sampled video frames is too large, the object motion becomes larger, which degrades the accuracy of the flow estimation. Thus, during training, we sample five temporally adjacent frames as local frames and randomly sample an additional three frames as non-local frames, which do not undergo the flow-guided feature propagation operations in the model, enabling the model to capture distant temporal information. The model is trained on NVIDIA 2080 Ti GPUs with a batch size of 8. For the ablation studies, we conduct experiments on the DAVIS dataset and train the model for 250 K iterations.
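For reference, the optimizer and learning-rate schedule stated above can be configured as follows; the model here is a trivial stand-in for FSTT, and the MultiStepLR scheduler is stepped once per training iteration since the milestone is given in iterations.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the full FSTT model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.0, 0.99))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[400_000], gamma=0.1)
```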

4.3. Quantitative Evaluation

For quantitative comparison, we utilize PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), and VFID (Video-based Fréchet Inception Distance) [50] as evaluation metrics. PSNR and SSIM are commonly used in image and video processing to evaluate the quality of a compressed or distorted image or video against the original; we calculate these scores frame by frame and report the mean value. To further evaluate the quality and temporal consistency of videos, we also adopt the VFID metric, an extension of the Fréchet Inception Distance (FID) that measures the perceptual similarity between videos. In practice, we utilize a pre-trained I3D video recognition model to calculate the VFID.
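As an illustration of the frame-by-frame protocol, the snippet below computes per-frame PSNR and averages it over the video; SSIM and VFID would typically be computed with existing libraries (e.g., scikit-image and a pre-trained I3D model) and are omitted here.

```python
import torch


def video_psnr(pred, target, max_val=1.0):
    # pred, target: (t, 3, h, w) in [0, max_val]; PSNR is computed per frame
    # and averaged, matching the frame-by-frame protocol described above.
    mse = ((pred - target) ** 2).flatten(1).mean(dim=1)
    psnr = 10 * torch.log10(max_val ** 2 / mse.clamp(min=1e-12))
    return psnr.mean().item()
```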
To evaluate the effectiveness of the proposed model, we compare FSTT with several recent competitive video inpainting methods, including: VINet [8], DFGVI [7], CAP [4], FGVC [9], STTN [10], FuseFormer [11], and E2FGVI [16]. Among them, DFGVI and FGVC are flow-based methods, while STTN and FuseFormer are transformer-based methods. E2FGVI combines flow and transformer to solve the video inpainting problem. We utilize the stationary masks generated by [16] to perform quantitative evaluation.
The quantitative evaluation results on the DAVIS and YouTube-VOS datasets are shown in Table 1. It can be seen that the transformer-based methods [10,11] greatly improve the performance of video inpainting. E2FGVI further introduces optical flow into the transformer, achieving better results. Our method takes full advantage of optical flow guidance in the transformer to alleviate the inaccuracies caused by missing pixels and employs a decomposed spatial temporal MHSA to effectively integrate spatio-temporal information in videos, outperforming all the state-of-the-art video inpainting methods in terms of PSNR, SSIM, and VFID. This demonstrates the superiority of our proposed method.

4.4. Qualitative Evaluation

To assess the visual quality of the inpainting results, we select three representative methods—DFGVI [7], FuseFormer [11], and E2FGVI [16]—for qualitative evaluation. The qualitative results compared with the baselines are shown in Figure 5 and Figure 6, where Figure 5 shows the completion results for static masks and Figure 6 shows the object removal results. From the visual results, we find that our proposed method generates more perceptually pleasing and temporally coherent results than the baselines.
Among the compared methods, the flow-based method (DFGVI) utilizes completed optical flow to propagate information between frames, making it sensitive to the flow quality; inaccurate flow estimation misleads the temporal information propagation and degrades the result quality. As shown in Figure 5 and Figure 6, its inpainted results contain obvious artifacts. The transformer-based method (FuseFormer) has difficulty finding high-quality correspondences when dealing with complex motion, which leads to blurry results. Compared to E2FGVI, our model generates more visually pleasing results in both stationary and motion scenes.
Furthermore, we conduct a user study to assess the visual quality of different methods for a more comprehensive comparison. We choose two state-of-the-art video inpainting methods, FuseFormer [11] and E2FGVI [16], to conduct the user study. In practice, 15 participants are invited, and 20 video sequences under two types of masks are sampled for evaluation. In each trial, participants are shown the inpainting results from different methods and asked to rank the inpainting results. The results of the user study are presented in Figure 7. As we can see, our method achieves better results than other methods in most cases.

4.5. Efficiency Analysis

We use FLOPs and inference time to evaluate the efficiency of each method, comparing our method with the transformer-based methods that have achieved promising results among existing video inpainting methods on the DAVIS dataset. The comparison results are reported in Table 2. Although the FLOPs and running time of FSTT are slightly higher than those of some methods, FSTT achieves better inpainting quality at a comparable computational cost, demonstrating a favorable trade-off between effectiveness and efficiency.

4.6. Ablation Study

In this section, we validate the effectiveness of the designed modules in FSTT. We mainly perform effectiveness studies on the Flow-Guided F3N module and the decomposed spatial temporal MHSA module. Furthermore, we also analyze the temporal consistency of inpainting results and the effectiveness of the optical flow completion network.

4.6.1. Effectiveness of Flow-Guided F3N Module

The Flow-Guided F3N module propagates information between frames based on the completed optical flow. This module provides more effective information for hole regions and mitigates the degradation caused by pixels within the hole region when performing MHSA in the subsequent stage, improving the inpainting quality. To investigate the impact of the Flow-Guided F3N module, we replace it with F3N and take this model as the baseline. We analyze the effectiveness of the Flow-Guided F3N module in detail under four settings: (a) without any feature propagation; (b) only involving deformable convolution to perform feature propagation; (c) only utilizing optical flow; (d) combining deformable convolution with flow guidance. The quantitative comparison is reported in Table 3, and the visual comparison results are shown in Figure 8.
As shown in Table 3, all feature propagation variants improve the quantitative performance compared to the model without any feature propagation, which validates the importance of performing feature propagation between frames. Furthermore, the combination of deformable convolution and flow guidance helps our model propagate more accurate information between frames, achieving the best inpainting quality. In Figure 8, the inpainting results produced by the model with F3N tend to contain discontinuous content. With the assistance of deformable convolution and optical flow guidance, effective information is propagated from adjacent frames to the hole regions, enabling more accurate correspondences to be found; the structure of the inpainted results gradually becomes smoother and more accurate.

4.6.2. Effectiveness of Decomposed Spatial Temporal MHSA Module

To evaluate the effectiveness of the decomposed spatial temporal MHSA, we compare the performance of models with different attention mechanisms, including the model without spatial MHSA, the model with general temporal MHSA, and the model with the proposed global–local temporal MHSA. As shown in Table 4, the introduction of spatial MHSA effectively integrates the spatial information in video frames and improves the performance of the model. The improvement is modest because most scenes in the DAVIS dataset are dynamic; in such cases the temporal MHSA has a greater impact while the spatial MHSA contributes less, which reflects the real-world situation. The general temporal MHSA achieves the best quantitative performance, but it suffers from heavy computation. Our proposed global–local temporal MHSA achieves comparable performance at a reduced computational cost.

4.6.3. Optical Flow Completion

We study the effectiveness of the optical flow completion network by comparing the optical flow completed by our proposed model with that produced by DFGVI. The comparison results are shown in Figure 9. DFGVI employs a multi-stage network for optical flow completion. However, as can be seen from Figure 9, DFGVI fails to inpaint the optical flow well, resulting in severe artifacts in the completed results. Our method trains the network in an end-to-end manner, allowing the model to learn the optical flow adaptively. While our model does not exactly recover the optical flow, it produces similar results that provide valuable information for the subsequent transformer blocks, leading to promising inpainting results.

4.6.4. Temporal Consistency

Furthermore, to show the temporal consistency of the proposed method, we visualize the temporal profiles of the corresponding videos. The results are shown in Figure 10. In the temporal profile, our method produces sharp and smooth edges, indicating that the completed videos contain far fewer flickering artifacts and maintain temporal consistency.

5. Conclusions and Future Work

In this paper, a novel Flow-Guided Spatial Temporal Transformer (FSTT) architecture is proposed for deep video inpainting. The FSTT aims to explore how to effectively utilize the transformer to establish correspondences between missing regions and valid regions in both spatial and temporal dimensions with the guidance of completed optical flow, which captures spatio-temporal information to perform video inpainting. Two elaborately designed modules, the Flow-Guided Fusion Feed-Forward (Flow-Guided F3N) module and the decomposed spatial temporal MHSA module, are utilized to address the problems of previous methods. The Flow-Guided F3N module provides more effective information for the subsequent stages with flow-guided propagation and alleviates the inaccuracy caused by hole pixels when performing transformers. The decomposed spatial temporal MHSA module effectively integrates the spatio-temporal information in videos. Furthermore, a Global–Local Temporal Attention Mechanism based on the window partition strategy is designed to improve the efficiency of the proposed model. The quantitative and qualitative experimental results on DAVIS and YouTube-VOS datasets demonstrate the superiority of the proposed FSTT.
An improved version [51] of [17] also explores how to make full use of optical flow guidance in transformers for video inpainting and achieves promising results. Different from our method, the solution in [51] further introduces the completed optical flow into the temporal MHSA and spatial MHSA to enhance feature integration. Furthermore, reference [51] elaborately designs an individual flow completion network and introduces an edge loss to train it, which improves the quality of the completed flow. The way reference [51] combines optical flow and transformers provides a useful direction for us. However, as mentioned in reference [51], their method also has some significant limitations. First, it depends highly on the quality of the completed flows; incorrect optical flow will greatly degrade the final inpainting quality. Additionally, its computational speed is slow because of operations such as Poisson blending and pixel-level propagation. Our method performs information propagation at the feature level and adaptively learns the optical flow in an end-to-end manner, which improves the efficiency and effectiveness of the model.
We also note that FSTT still has some limitations. When the video is occluded by a large mask, FSTT tends to generate blurry inpainting results, as shown in Figure 11a. We infer that when the missing region is too large, the amount of information in the receptive field is limited; as a result, it becomes difficult to capture enough image patches from the valid regions, leading to blurry results. A possible remedy is a scheme that progressively fills the hole region starting from the hole boundary, which would allow the receptive field to capture more relevant contextual information and improve the quality of the inpainting results. Additionally, as shown in Figure 11b, FSTT fails to generate plausible content when the moving objects have a large number of missing details. Due to the motion inconsistency between foreground and background, completing moving foreground objects with a large number of missing details is very challenging for current video inpainting methods. One promising direction is to separate the foreground and background and inpaint them separately.

Author Contributions

Conceptualization, R.L.; Methodology, R.L.; Software, R.L.; Validation, R.L.; Formal analysis, R.L.; Investigation, R.L.; Data curation, R.L.; Writing—original draft, R.L.; Writing—review & editing, R.L. and Y.Z.; Visualization, R.L.; Supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62006007.

Data Availability Statement

Publicly available datasets were utilized in this study. The datasets DAVIS and YouTube-VOS can be obtained from: https://data.vision.ee.ethz.ch/csergi/share/davis/DAVIS-2017-trainval-480p.zip (accessed on 20 November 2022) and https://codalab.lisn.upsaclay.fr/competitions/7685#participate-get_data (accessed on 20 November 2022), respectively.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, C.; Huang, H.; Han, X.; Wang, J. Video inpainting by jointly learning temporal structure and spatial details. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5232–5239. [Google Scholar]
  2. Chang, Y.; Liu, Z.Y.; Lee, K.; Hsu, W.H. Free-form video inpainting with 3d gated convolution and temporal patchgan. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9066–9075. [Google Scholar]
  3. Liu, R.; Li, B.; Zhu, Y. Temporal Group Fusion Network for Deep Video Inpainting. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 3539–3551. [Google Scholar] [CrossRef]
  4. Lee, S.; Oh, S.W.; Won, D.; Kim, S.J. Copy-and-Paste Networks for Deep Video Inpainting. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4413–4421. [Google Scholar]
  5. Oh, S.W.; Lee, S.; Lee, J.Y.; Kim, S.J. Onion-Peel Networks for Deep Video Completion. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4403–4412. [Google Scholar]
  6. Li, A.; Zhao, S.; Ma, X.; Gong, M.; Qi, J.; Zhang, R.; Tao, D.; Kotagiri, R. Short-Term and Long-Term Context Aggregation Network for Video Inpainting. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 728–743. [Google Scholar]
  7. Xu, R.; Li, X.; Zhou, B.; Loy, C.C. Deep Flow-Guided Video Inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3723–3732. [Google Scholar]
  8. Kim, D.; Woo, S.; Lee, J.Y.; Kweon, I.S. Deep video inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5792–5801. [Google Scholar]
  9. Gao, C.; Saraf, A.; Huang, J.B.; Kopf, J. Flow-edge Guided Video Completion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 713–729. [Google Scholar]
  10. Zeng, Y.; Fu, J.; Chao, H. Learning Joint Spatial-Temporal Transformations for Video Inpainting. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 528–543. [Google Scholar]
  11. Liu, R.; Deng, H.; Huang, Y.; Shi, X.; Lu, L.; Sun, W.; Wang, X.; Dai, J.; Li, H. FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14020–14029. [Google Scholar]
  12. Geng, Z.; Liang, L.; Ding, T.; Zharkov, I. RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17420–17430. [Google Scholar]
  13. Liu, C.; Yang, H.; Fu, J.; Qian, X. Learning Trajectory-Aware Transformer for Video Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5677–5686. [Google Scholar]
  14. Wu, Z.; Ren, Z.; Wu, Y.; Wang, Z.; Hua, G. TxVAD: Improved Video Action Detection by Transformers. In Proceedings of the ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4605–4613. [Google Scholar]
  15. Zhao, J.; Zhang, Y.; Li, X.; Chen, H.; Shuai, B.; Xu, M.; Liu, C.; Kundu, K.; Xiong, Y.; Modolo, D.; et al. TubeR: Tubelet Transformer for Video Action Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13588–13597. [Google Scholar]
  16. Li, Z.; Lu, C.; Qin, J.; Guo, C.; Cheng, M. Towards An End-to-End Framework for Flow-Guided Video Inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17541–17550. [Google Scholar]
  17. Zhang, K.; Fu, J.; Liu, D. Flow-Guided Transformer for Video Inpainting. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 74–90. [Google Scholar]
  18. Caelles, S.; Montes, A.; Maninis, K.; Chen, Y.; Gool, L.V.; Perazzi, F.; Pont-Tuset, J. The 2018 DAVIS Challenge on Video Object Segmentation. arXiv 2018, arXiv:1803.0055. [Google Scholar]
  19. Xu, N.; Yang, L.; Fan, Y.; Yue, D.; Liang, Y.; Yang, J.; Huang, T. Youtube-vos: A large-scale video object segmentation benchmark. arXiv 2018, arXiv:1809.03327. [Google Scholar]
  20. Bertalmío, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  21. Bertalmío, M. Strong-Continuation, Contrast-Invariant Inpainting with a Third-Order Optimal PDE. IEEE Trans. Image Process. 2006, 15, 1934–1938. [Google Scholar] [CrossRef]
  22. Liu, D.; Sun, X.; Wu, F.; Li, S.; Zhang, Y. Image Compression with Edge-Based Inpainting. IEEE Trans. Circuits Syst. Video Technol. 2007, 17, 1273–1287. [Google Scholar]
  23. Criminisi, A.; Pérez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef] [PubMed]
  24. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  25. Jin, D.; Bai, X. Patch-Sparsity-Based Image Inpainting Through a Facet Deduced Directional Derivative. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 1310–1324. [Google Scholar] [CrossRef]
  26. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  27. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5505–5514. [Google Scholar]
  28. Wang, N.; Li, J.; Zhang, L.; Du, B. MUSICAL: Multi-Scale Image Contextual Attention Learning for Inpainting. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3748–3754. [Google Scholar]
  29. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. 2017, 36, 107:1–107:14. [Google Scholar] [CrossRef]
  30. Zhang, H.; Hu, Z.; Luo, C.; Zuo, W.; Wang, M. Semantic Image Inpainting with Progressive Generative Networks. In Proceedings of the ACM Multimedia Conference on Multimedia Conference, Seoul, Republic of Korea, 22–26 October 2018; pp. 1939–1947. [Google Scholar]
  31. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  32. Liao, L.; Xiao, J.; Wang, Z.; Lin, C.; Satoh, S. Image Inpainting Guided by Coherence Priors of Semantics and Textures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 6539–6548. [Google Scholar]
  33. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; Ebrahimi, M. EdgeConnect: Structure Guided Image Inpainting using Edge Prediction. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 3265–3274. [Google Scholar]
  34. Xiong, W.; Yu, J.; Lin, Z.; Yang, J.; Lu, X.; Barnes, C.; Luo, J. Foreground-Aware Image Inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5840–5848. [Google Scholar]
  35. Li, W.; Lin, Z.; Zhou, K.; Qi, L.; Wang, Y.; Jia, J. MAT: Mask-Aware Transformer for Large Hole Image Inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10748–10758. [Google Scholar]
  36. Patwardhan, K.A.; Sapiro, G.; Bertalmio, M. Video inpainting of occluding and occluded objects. In Proceedings of the IEEE International Conference on Image Processing, Genoa, Italy, 11–14 September 2005; pp. 69–72. [Google Scholar]
  37. Patwardhan, K.A.; Sapiro, G.; Bertalmío, M. Video inpainting under constrained camera motion. IEEE Trans. Image Process. 2007, 16, 545–553. [Google Scholar] [CrossRef]
  38. Wexler, Y.; Shechtman, E.; Irani, M. Space-time completion of video. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 463–476. [Google Scholar] [CrossRef]
  39. Newson, A.; Almansa, A.; Fradet, M.; Gousseau, Y.; Pérez, P. Video inpainting of complex scenes. SIAM J. Imaging Sci. 2014, 7, 1993–2019. [Google Scholar] [CrossRef]
  40. Gao, C.; Moore, B.E.; Nadakuditi, R.R. Augmented robust PCA for foreground-background separation on noisy, moving camera video. In Proceedings of the Global Conference on Signal and Information Processing, Montreal, QC, Canada, 14–16 November 2017; pp. 1240–1244. [Google Scholar]
  41. Huang, J.B.; Kang, S.B.; Ahuja, N.; Kopf, J. Temporally coherent completion of dynamic video. ACM Trans. Graph. 2016, 35, 196. [Google Scholar] [CrossRef]
  42. Li, S.; Zhu, S.; Huang, Y.; Liu, S.; Zeng, B.; Imran, M.A.; Abbasi, Q.H.; Cooper, J. Short-Long-Term Propagation-based Video Inpainting. IEEE Multimed. 2023; early access. [Google Scholar] [CrossRef]
  43. Hu, Y.T.; Wang, H.; Ballas, N.; Grauman, K.; Schwing, A.G. Proposal-Based Video Completion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 38–54. [Google Scholar]
  44. Wang, J.; Yang, Z.; Huo, Z.; Chen, W. Local and nonlocal flow-guided video inpainting. Multimed. Tools Appl. 2023. [Google Scholar] [CrossRef]
  45. Zhang, W.; Cao, Y.; Zhai, J. SwinVI:3D Swin Transformer Model with U-net for Video Inpainting. In Proceedings of the International Joint Conference on Neural Networks, Gold Coast, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar]
  46. Ranjan, A.; Black, M.J. Optical Flow Estimation Using a Spatial Pyramid Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2720–2729. [Google Scholar]
  47. Chan, K.C.K.; Zhou, S.; Xu, X.; Loy, C.C. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5962–5971. [Google Scholar]
  48. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3192–3201. [Google Scholar]
  49. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  50. Wang, T.; Liu, M.; Zhu, J.; Yakovenko, N.; Tao, A.; Kautz, J.; Catanzaro, B. Video-to-Video Synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 1152–1164. [Google Scholar]
  51. Zhang, K.; Peng, J.; Fu, J.; Liu, D. Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting. arXiv 2023, arXiv:2301.10048. [Google Scholar]
Figure 1. Illustration of the proposed Flow-Guided Spatial Temporal Transformer (FSTT).
Figure 2. The pipeline of flow-guided feature propagation.
Figure 3. Illustration of the Global–Local Temporal MHSA.
Figure 4. Illustration of the Flow-Guided Fusion Feed-Forward module.
Figure 5. Visual comparison results of static mask.
Figure 6. Visual comparison results of object removal.
Figure 7. User study results.
Figure 8. The ablation studies of the effectiveness of the Flow-Guided F3N module.
Figure 9. The effectiveness of the optical flow completion network. (a) is the ground-truth optical flow extracted by pre-trained SpyNet. (d) is the frame that is utilized to compute optical flow, and the corresponding mask is also presented. (b,e) are the completed flow and frame with DFGVI, respectively. (c,f) are the completed flow and frame with the proposed FSTT, respectively.
Figure 10. The temporal profile of the red scan line.
Figure 11. The failure cases.
Table 1. Quantitative evaluation results on the DAVIS and YouTube-VOS datasets. ↑ denotes that higher is better, while ↓ denotes that lower is better.

| Models | YouTube-VOS PSNR ↑ | YouTube-VOS SSIM ↑ | YouTube-VOS VFID ↓ | DAVIS PSNR ↑ | DAVIS SSIM ↑ | DAVIS VFID ↓ |
|---|---|---|---|---|---|---|
| VINet [8] | 29.20 | 0.9434 | 0.072 | 28.96 | 0.9411 | 0.199 |
| DFGVI [7] | 29.16 | 0.9429 | 0.066 | 28.81 | 0.9404 | 0.187 |
| CAP [4] | 31.58 | 0.9607 | 0.071 | 30.28 | 0.9521 | 0.182 |
| FGVC [9] | 29.67 | 0.9403 | 0.064 | 30.80 | 0.9497 | 0.165 |
| STTN [10] | 32.34 | 0.9655 | 0.053 | 30.67 | 0.9560 | 0.149 |
| FuseFormer [11] | 33.29 | 0.9681 | 0.053 | 32.54 | 0.9700 | 0.138 |
| E2FGVI [16] | 33.71 | 0.9700 | 0.046 | 33.01 | 0.9721 | 0.116 |
| FSTT (Ours) | 34.33 | 0.9731 | 0.044 | 33.77 | 0.9756 | 0.109 |
Table 2. Efficiency analysis between the proposed method and transformer-based video inpainting methods.

| Method | STTN | FuseFormer | E2FGVI | Ours |
|---|---|---|---|---|
| FLOPs | 478 G | 580 G | 493 G | 523 G |
| Runtime (s/frame) | 0.102 | 0.176 | 0.137 | 0.158 |
Table 3. Ablation study of the effectiveness of the Flow-Guided F3N module on DAVIS. F3N represents the model without any feature propagation.

| Method | PSNR ↑ | SSIM ↑ |
|---|---|---|
| F3N | 31.91 | 0.9659 |
| F3N + DCN | 32.29 | 0.9679 |
| F3N + Flow | 32.33 | 0.9683 |
| Flow-Guided F3N | 32.65 | 0.9708 |
Table 4. Ablation study on the effectiveness of the decomposed spatial temporal MHSA module.

| Method | PSNR ↑ | SSIM ↑ | FLOPs ↓ |
|---|---|---|---|
| w/o spatial MHSA | 32.62 | 0.9706 | 506 G |
| w/ temporal MHSA | 32.71 | 0.9710 | 609 G |
| w/ global–local temporal MHSA | 32.65 | 0.9708 | 523 G |

