Joint Subtitle Extraction and Frame Inpainting for Videos with Burned-In Subtitles

Abstract: Subtitles are crucial for video content understanding. However, a large number of videos have only burned-in, hardcoded subtitles that prevent video re-editing, translation, etc. In this paper, we construct a deep-learning-based system for the inverse conversion of a burned-in subtitle video to a subtitle file and an inpainted video, by coupling three deep neural networks (CTPN, CRNN, and EdgeConnect). We evaluated the performance of the proposed method and found that the deep learning method achieved high-precision separation of the subtitles and video frames and significantly improved the video inpainting results compared to the existing methods. This research fills a gap in the application of deep learning to burned-in subtitle video reconstruction and is expected to be widely applied in the reconstruction and re-editing of videos with subtitles, advertisements, logos, and other occlusions.


Introduction
As an important clue to the semantics of a video, subtitles use text to emphasize, supplement, or explain the non-visual content. As video becomes a mainstream medium for information interaction, subtitles play an increasingly important role as they enrich the on-screen information, e.g., subtitles may imply commentaries or thoughts from the creator. In addition, subtitles effectively compensate for simultaneous sound and enhance the understanding of the video for viewers with hearing impairments.
For more convenient transmission, subtitles exist mainly in the form of burned-in video frames, especially in most short and old videos. However, the language-specific burned-in subtitles pose great challenges for the re-editing and communication of the video between different languages [1], e.g., the translation of the video. Hence, subtitle extraction has been gaining attention, and some techniques have emerged for the automatic recognition of subtitles to facilitate the understanding and transcription of videos [2,3]. On the other hand, a video is seriously damaged after the extraction and removal of the burned-in subtitles, while an intact, subtitle-free video is desired, e.g., for re-adding the translated subtitles. Hence, the inpainting of the subtitle-removed frames is of great value for the reuse of the video.
The reconstruction of the burned-in subtitle video is realized by the combined subtitle removal and video restoration, which can be generally divided into two stages: text detection and frame inpainting. Existing video reconstruction techniques are based on traditional text detection and texture reconstruction approaches and have achieved some success. However, there are still problems. Previous video reconstruction methods employed a traditional text detection pipeline [4][5][6][7][8], which consists of a series of steps, such as stroke filtering, positioning, segmentation, and verification.
The performance of the methods heavily relies on character detection, while the complex steps result in the propagation of errors and, hence, poor robustness and reliability [9]. Few works could generate the subtitle files directly. Furthermore, traditional frame inpainting approaches are generally diffusion-based or patch-based [10][11][12][13][14]. However, both the diffusion method based on differential operators and the patching method based on similar source image filling do a poor job of inpainting heavily damaged and complex details [15].
Recently, deep learning methods have achieved remarkable success in text recognition and image inpainting [2,3,9,15,16]. For example, Yan et al. [3] used a residual neural network for subtitle recognition, and Nazeri et al. [15] used a generative adversarial model for image restoration, and both showed excellent performance. Hence, deep learning methods open up wide prospects and provide powerful tools for the reconstruction of burned-in subtitle videos.
However, video reconstruction is a complex task that demands several deep modules. Thus, how to realize the seamless collaboration among the modules becomes a vital issue. For now, there is still a lack of an effective method that employs deep learning approaches to solve the burned-in subtitle video reconstruction challenge.
In this paper, we propose a novel pipeline for burned-in subtitle video reconstruction, based on deep learning. The pipeline unites subtitle extraction and frame inpainting and consists of three stages: (1) text detection; (2) text recognition; and (3) frame inpainting, implemented by three state-of-the-art deep neural networks (CTPN [9], CRNN [16], and EdgeConnect [15], respectively). An intermediate-process and a post-process are designed to couple the models and transform the results. Our contributions are four-fold:
• The inverse conversion of the burned-in subtitle video to an independent subtitle file and subtitle-free video.
• A novel framework for burned-in subtitle video reconstruction based on deep learning.
• The first application of state-of-the-art deep learning techniques to burned-in subtitle video reconstruction, with significantly enhanced subtitle extraction and frame inpainting.
• A general pipeline that can be applied in the reconstruction and re-editing of videos with subtitles, advertisements, logos, and other occlusions.
The rest of the paper is structured as follows. Section 2 introduces the related work. Section 3 describes the framework and methodology for burned-in subtitle video reconstruction in detail. Section 4 presents and discusses the experimental results. Section 5 concludes our work and looks forward to future work.

Related Work
Over the past decade, a few works have addressed the challenge of burned-in subtitle video reconstruction. In 2010, Favorskaya et al. [4] first proposed a hybrid method based on contour and color information from sequential frames for text detection, and reconstructed the texture by statistical analysis in the time-space domain. Then, a priority-based matching algorithm was proposed by Khodadadi et al. [5] for reconstruction in areas with texture variation.
Subsequently, Favorskaya et al. [17] proposed a neural network based on time-space parameters for inpainting small-area damage of videos. In 2016, Vuong et al. [18] proposed a reconstruction system capable of detecting and extracting burned-in subtitles in the form of text, avoiding the waste of the original subtitles.
Previous burned-in subtitle video reconstruction methods were based on traditional text detection and texture reconstruction, which still have many problems despite some success, e.g., poor robustness due to the complex text detection pipeline (see Section 2.1 for details), poor generality due to the lexicon-based text recognition (see Section 2.2 for details), and the loss of high-frequency information for image restoration (see Section 2.3 for details). We summarize the pipeline of burned-in subtitle video reconstruction into three subtasks: (1) text detection, (2) text recognition, and (3) frame inpainting. The following introduces the related work in each of these three subtasks.

Text Detection
Previously, there were two common approaches for text detection in videos or images with complex backgrounds. The primitive methods are based on low-level properties of the frame such as the contour, color, or gradient, including the gradient method, stroke filtering, color threshold segmentation, etc. [4][5][6][7][8]. Text detection is implemented through a series of filtering components, which leads to the transfer and accumulation of errors, resulting in low accuracy and robustness, especially when dealing with complex backgrounds.
With the development of CNN, character-based text detection methods emerged [19][20][21][22], which detect candidate characters by densely moving a multi-scale window through an image. The content in the window is judged by a pre-trained classifier. However, dense window sliding imposes a huge computational overhead, which severely limits the detection speed. In addition, precise text line positioning is difficult for the above methods. The Connectionist Text Proposal Network (CTPN) [9] is a mature text detection framework that combines CNN and Long Short-Term Memory (LSTM) deep networks to greatly improve the localization accuracy through a vertical anchor mechanism, while overcoming the inefficiency of sliding window methods.

Text Recognition
Traditional text recognition is based on character recognition and word recognition. The primitive approaches crop and detect individual characters from a word image by sliding a window, and then recombine all characters into a complete word [23,24]. These approaches require a powerful character detector and strongly rely on a fixed lexicon to synthesize words. Subsequently, word-based approaches emerged [25], which treat text recognition as a word image classification task, assigning a category label to each word.
Despite the impressive results achieved by these methods, they require an ultramulti-classification model, are seriously confined by the number of classes, and have poor generalizability. CNN and RNN are important branches of the deep neural network family, specializing in image feature extraction and sequence analysis, respectively [26][27][28]. Shi et al. proposed a novel network called the Convolutional Recurrent Neural Network (CRNN) [16] that integrates CNN and RNN into the text recognition task, to solve the problems that exist in traditional methods. Compared with previous text recognition systems, CRNN is end-to-end trainable, able to handle sequences of arbitrary length, and not limited by any predefined lexicon. It is also an efficient but small model that is well-suited to real-life scenes.

Image Inpainting
Previous video frame restoration techniques operate in either the spatial or the temporal domain and follow three basic approaches: overlaying (a temporal algorithm), and diffusing and patching (spatial algorithms). The overlaying methods cover the missing texture region in the current frame with the real texture fragment from the previous or next frame, without texture smoothing or compositing [4]. They therefore struggle with small displacements or misalignment of the texture fragments between frames. The diffusion methods propagate local background information to the missing regions [10][11][12].
However, such methods do not take full advantage of the global information and, thus, cannot recover meaningful structures in the missing regions and poorly handle a large missing region. Meanwhile, the diffusion methods require a significant time overhead to reach appreciable inpainting effects, which is unacceptable for the inpainting of videos. With the application of deep learning to image inpainting, the patch-based methods have emerged [13,14], which implement inpainting by copying similar regions from the image set.
Such methods strongly rely on the image set and, thus, are suitable for highly patterned scenes but have difficulty in inpainting unique patterns. Recently, generative adversarial networks (GANs) have achieved impressive performance in inpainting [15,29,30,31]. EdgeConnect [15] is a GAN-based inpainting method inspired by the idea of "lines first, color next"; it achieves coherent inpainted content and refined details through global edge connection, with high time efficiency.

Method
As shown in Figure 1 (model diagram) and Figure 2 (processing flow), the entire pipeline of the proposed method contains three main modules plus an intermediate process and a post process. Given a video frame with burned-in subtitles as input, a text detection network is first adopted to precisely locate the subtitle text region. By taking the text region bounding box, an intermediate process is conducted to separate the processed video frame into two parts.
The cropped subtitle image is fed into a text recognition network to recognize the subtitle character contents, and the video frame together with the subtitle character mask are sent to an image inpainting network to fill up the missing pixels inside the region of subtitle characters. After the subtitle recognition and frame inpainting, a post-process is required to construct the subtitle text file and assemble the inpainted frames into a video file. The following subsections depict the technical details of each module.

Text Detection
The subtitle text region detection module employs the CTPN method [9], which combines a CNN and an RNN to achieve high-accuracy detection of horizontal text in complex scenes. CTPN accepts input video frames of arbitrary size (H × W × 3) for text detection. At the beginning of detection, a CNN based on VGG-16 extracts the deep features of the raw input image. The conv5 feature map, taken from the last convolutional layer of VGG-16, is used, with the total stride and receptive field fixed at 16 and 228 pixels, respectively.
Then, a 3 × 3 window is slid over this feature map with a step size of 1 to obtain 256-D feature vectors. An RNN based on bi-directional LSTMs (BiLSTMs) learns the feature sequences and predicts the position of text from the preceding and following context. The feature vectors corresponding to all windows are fed into a BiLSTM network consisting of a 128-D forward LSTM and a 128-D backward LSTM. The output of the BiLSTM network is then fed into three regression layers through a 512-D fully connected layer.
The three regression layers output 2k vertical coordinates and k side-refinement offsets, which locate the k proposals (fixed-width, slender rectangular boxes), and 2k scores, which indicate whether each proposal is text. Finally, adjacent proposals with scores > 0.7 are merged to obtain the bounding box of the subtitle text region. The network configuration summary of CTPN is detailed in Table A1.
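The final merging step can be illustrated with a minimal sketch. It is not the authors' implementation: the tuple layout, the helper names, the 16-pixel proposal width (taken from the total stride above), and the neighbour criteria `MAX_GAP` and `MIN_V_OVERLAP` are assumptions made for illustration; only the 0.7 score threshold comes from the text.

```python
from typing import List, Tuple

# Hypothetical layout: (x_left, y_top, y_bottom, score).
Proposal = Tuple[float, float, float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

PROPOSAL_WIDTH = 16    # assumed: one proposal per 16-px stride of conv5
SCORE_THRESHOLD = 0.7  # from the text: keep proposals with score > 0.7
MAX_GAP = 50           # assumed maximum horizontal distance between neighbours
MIN_V_OVERLAP = 0.7    # assumed minimum vertical overlap between neighbours


def vertical_overlap(a: Proposal, b: Proposal) -> float:
    """Shared vertical extent divided by the height of the taller proposal."""
    shared = min(a[2], b[2]) - max(a[1], b[1])
    tallest = max(a[2] - a[1], b[2] - b[1])
    return max(shared, 0.0) / tallest if tallest > 0 else 0.0


def to_box(line: List[Proposal]) -> Box:
    """Bounding box enclosing a left-to-right chain of proposals."""
    return (line[0][0],
            min(p[1] for p in line),
            line[-1][0] + PROPOSAL_WIDTH,
            max(p[2] for p in line))


def merge_proposals(proposals: List[Proposal]) -> List[Box]:
    """Greedily chain high-scoring neighbouring proposals into text-line boxes."""
    kept = sorted(p for p in proposals if p[3] > SCORE_THRESHOLD)
    boxes: List[Box] = []
    line: List[Proposal] = []
    for p in kept:
        too_far = bool(line) and p[0] - line[-1][0] > MAX_GAP
        misaligned = bool(line) and vertical_overlap(line[-1], p) < MIN_V_OVERLAP
        if too_far or misaligned:
            boxes.append(to_box(line))  # close the current text line
            line = []
        line.append(p)
    if line:
        boxes.append(to_box(line))
    return boxes
```

A greedy left-to-right chain is the simplest way to express the idea; a production detector would typically also apply the side-refinement offsets to the two ends of each line.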

Text Recognition
The recognition module used to recognize the characters in the video subtitles is mainly based on CRNN [16]. The architecture of CRNN consists of three components from bottom to top: the convolutional layers (CNN, for extracting features), the recurrent layers (RNN, for predicting label distributions), and the transcription layer (CTC, for synthesizing sequences), which together achieve accurate recognition of text sequences of arbitrary length. At the beginning of recognition, the gray-scale image of the subtitle text region is sent to the CRNN, where it is rescaled to a height of 32 pixels (32 × W) and then fed into a CNN based on the VGG network.
After a series of convolution, pooling, and batch normalization operations on the image, the CNN extracts a 512 × 1 × 40 feature map and converts it into 40 × 512-D feature vectors for the prediction in recurrent layers. On top of the convolutional layer, a BiLSTM-based RNN is built, which uses a 256-D BiLSTM network to learn feature vectors and predict the probability distribution of the labels.
At the end of the RNN, the propagated sequence is concatenated again into a map and fed back to the CNN, implementing a custom network layer called "Map-to-Sequence", which serves as a bridge between the CNN and RNN. On top of the recurrent layer, the transcription layer converts the label probability distribution from the RNN into an indefinitely long text sequence by de-duplication and integration, as the final output result. The network configuration summary of CRNN is detailed in Table A2.
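The "de-duplication and integration" performed by the transcription layer corresponds to greedy (best-path) CTC decoding, which the following minimal sketch illustrates. The function name, the use of index 0 for the blank label, and the toy alphabet are assumptions made here, not details given in the paper.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(probs: Sequence[Sequence[float]], alphabet: str) -> str:
    """Collapse per-frame label distributions into a text string.

    probs[t][c] is the probability of class c at time step t. Class 0 is the
    blank; class i (i >= 1) maps to alphabet[i - 1].
    """
    # Best path: the most probable label at every time step.
    best_path: List[int] = [
        max(range(len(frame)), key=frame.__getitem__) for frame in probs
    ]
    chars: List[str] = []
    previous = BLANK
    for label in best_path:
        # Keep a label only if it is not blank and differs from its predecessor.
        if label != BLANK and label != previous:
            chars.append(alphabet[label - 1])
        previous = label
    return "".join(chars)
```

Because repeated labels are collapsed only when no blank separates them, a double letter such as the "ll" in "hello" survives as long as the network emits a blank between the two occurrences.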

Frame Inpainting
The inpainting of subtitle-removed frames is based on an adversarial edge learning image inpainting network named EdgeConnect [15]. EdgeConnect consists of two GAN cascades, including an edge generator and an image completion network, to generate hallucinated edges and inpaint the missing pixels by edge-guiding, via adversarial learning. Each GAN follows the adversarial model, consisting of a generator and discriminator.
For the GAN of EdgeConnect, the generator consists of an encoder, eight residual blocks, and a decoder, and the discriminator consists of five convolution layers. In the generator of the first-stage GAN (edge generator), the gray-scale map of the subtitle-removed frame and subtitle mask are used as pre-conditions to predict the edge map of the masked area. The input image is down-sampled twice by the encoder and fed into the residual blocks for dilated convolutions with a dilation factor of 2, resulting in a receptive field of 205 at the final residual layer. The final map is up-sampled twice by the decoder and resized to its original scale.
Similar to the first stage, the generator of the second-stage GAN (image completion network) takes the RGB map of the subtitle-removed frame and the predicted edge map as pre-conditions to complete the image by combining the background area of the ground truth edges with the predicted edges in the damaged area. For discriminators, a 70 × 70 PatchGAN architecture is used, which determines whether or not overlapping image patches of size 70 × 70 are real.
The discriminator of the edge generator discriminates whether the generated edge map is real with a joint loss as the training goal, including an adversarial loss and feature-matching loss. The discriminator of the image completion network discriminates whether the inpainted color map is real, with a joint loss as the training goal, including an L1 loss, adversarial loss, perceptual loss, and style loss. The network configuration summary of EdgeConnect is detailed in Table A3.
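The way the second stage combines ground-truth and predicted edges can be written compactly. The notation below is an assumption introduced here for illustration: M is the binary subtitle mask (1 inside the damaged area, 0 in the background), E_gt is the edge map of the undamaged background, E_pred is the output of the edge generator, I_gt is the original RGB frame, G_2 is the image completion generator, and ⊙ denotes the element-wise product.

```latex
E_{\mathrm{comp}} = E_{\mathrm{gt}} \odot (1 - M) + E_{\mathrm{pred}} \odot M,
\qquad
I_{\mathrm{pred}} = G_2\bigl(I_{\mathrm{gt}} \odot (1 - M),\; E_{\mathrm{comp}}\bigr)
```

Under this reading, the background edges are never replaced by hallucinated ones; the predicted edges contribute only inside the masked subtitle area.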

Intermediate-Process
As shown in Figure 2, an intermediate-process stage was designed to connect the text region detection stage with the subsequent text recognition and frame inpainting stages. This process takes the bounding box (bbox) of the subtitle text area obtained from CTPN and the original video frame as input, and consists of three main steps. In the first step, the original frame is copied and cropped by the bbox. The cropped subtitle text image is fed into the CRNN, which performs end-to-end recognition of the subtitle.
In the second step, the contour of the subtitle text inside the bbox is extracted from the original frame and expanded to obtain the subtitle mask, which ensures the complete removal of the subtitle text at the cost of minimal information loss. In the third step, the pixels of the original frame covered by the subtitle mask are erased, and the resulting subtitle-removed frame is fed into EdgeConnect along with the subtitle mask. The whole process improves the accuracy of subtitle recognition through the precise segmentation of the subtitle text area, and minimizes the error of frame inpainting through the careful removal of subtitle text.
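The three steps can be sketched in plain Python on a gray-scale frame stored as nested lists. This is an illustrative sketch only: the paper does not specify the contour extraction method, so a simple brightness threshold stands in for it here, and the function names, the `threshold`, `grow`, and `fill` parameters, and the square expansion neighbourhood are all assumptions.

```python
from typing import List, Tuple

Gray = List[List[int]]            # gray-scale frame, values 0..255
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def crop(frame: Gray, bbox: BBox) -> Gray:
    """Step 1: copy the bbox region; this crop would go to the recognizer."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in frame[y0:y1]]


def subtitle_mask(frame: Gray, bbox: BBox, threshold: int = 200, grow: int = 1) -> Gray:
    """Step 2: mark bright pixels inside the bbox, then expand the marked area.

    Thresholding is an assumed stand-in for contour extraction.
    """
    x0, y0, x1, y1 = bbox
    h, w = len(frame), len(frame[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(y0, min(y1, h)):
        for x in range(x0, min(x1, w)):
            if frame[y][x] >= threshold:
                # Expand each text pixel to its (2 * grow + 1)^2 neighbourhood.
                for yy in range(max(0, y - grow), min(h, y + grow + 1)):
                    for xx in range(max(0, x - grow), min(w, x + grow + 1)):
                        mask[yy][xx] = 1
    return mask


def remove_subtitle(frame: Gray, mask: Gray, fill: int = 255) -> Gray:
    """Step 3: blank out masked pixels; frame and mask go to the inpainter."""
    return [[fill if m else v for v, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]
```

Expanding the mask by a pixel or two trades a small loss of background information for a lower risk of leaving anti-aliased subtitle borders behind.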

Post-Process
In order to obtain the final subtitle file and the subtitle-free video file, a post-process stage is needed. The subtitle text sequences are output by CRNN, while the inpainted frames are output by EdgeConnect. Hence, a post-process is required to synthesize the outputs into the subtitle file and video. As shown in Figure 2, the post-process takes all the inpainted frames and subtitle text sequences as input, where each inpainted frame and each sequence has an index corresponding to its position in the original video. The post-process consists of two parallel steps.
In the first step, all subtitle text sequences are sorted by their indices, and the beginning and end of each subtitle in the time domain are calculated according to the original frame rate. Then, each subtitle is time-stamped and written to a subtitle file. In the second step, all the inpainted frames are sorted by their indices and assembled into a video at the original frame rate. The post-process thus completes the pipeline, converting the burned-in subtitle video back into a standalone subtitle file and a subtitle-free video.
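The first post-process step can be sketched as follows. The paper does not name the subtitle file format, so the SubRip (SRT) layout used here is an assumption, as are the function names and the convention that an empty string marks a frame without a subtitle.

```python
from typing import List, Tuple


def to_timestamp(frame_index: int, fps: float) -> str:
    """Convert a frame index to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(frame_index * 1000 / fps)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def group_frames(texts: List[Tuple[int, str]]) -> List[Tuple[int, int, str]]:
    """Merge consecutive frames with identical text into (start, end, text) runs.

    The end index is exclusive, i.e. the first frame after the subtitle.
    """
    runs: List[Tuple[int, int, str]] = []
    for index, text in sorted(texts):
        if runs and runs[-1][2] == text and runs[-1][1] == index:
            runs[-1] = (runs[-1][0], index + 1, text)
        else:
            runs.append((index, index + 1, text))
    return [run for run in runs if run[2]]  # drop frames without a subtitle


def build_srt(texts: List[Tuple[int, str]], fps: float) -> str:
    """Render the grouped subtitles as the body of an SRT file."""
    blocks = []
    for number, (start, end, text) in enumerate(group_frames(texts), start=1):
        blocks.append(
            f"{number}\n{to_timestamp(start, fps)} --> {to_timestamp(end, fps)}\n{text}\n"
        )
    return "\n".join(blocks)
```

In practice, per-frame recognition results for the same subtitle may differ slightly, so a real system would likely compare texts with a similarity threshold rather than exact equality.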

Results and Discussions
The proposed burned-in subtitle video reconstruction algorithm was implemented in Python. The three deep neural networks in the system were each trained on a different training set with a different strategy. CTPN was trained end-to-end on 3000 natural images by standard error back-propagation and stochastic gradient descent (SGD), with a learning rate of $10^{-3}$ for the initial 16K iterations and $10^{-4}$ for the subsequent 4K iterations, using 0.9 momentum and 0.0005 weight decay. CRNN was also trained end-to-end on the Synth dataset [32] by back-propagation and SGD, using ADADELTA [33] to automatically calculate the learning rate for each dimension and iterating until convergence.
EdgeConnect was optimized with the Adam optimizer [34], with β1 = 0 and β2 = 0.9. The generators were trained end-to-end until convergence on the Places2 [35] dataset, with the learning rate lowered gradually from $10^{-4}$ to $10^{-5}$ and then to $10^{-6}$, while the discriminator's learning rate was one-tenth of the generator's. Finally, the network was fine-tuned by removing the discriminator of the first-stage GAN. The entire pipeline of burned-in subtitle video reconstruction was tested on 2186 video frames, and the experimental results of each stage are discussed in detail next.
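The reported optimizer settings can be collected in a small framework-independent sketch. The dictionary keys and function names are hypothetical; only the numeric values come from the text above.

```python
# CTPN: SGD with a step learning-rate schedule (values from the text).
CTPN_SGD = {"momentum": 0.9, "weight_decay": 0.0005, "total_iterations": 20_000}


def ctpn_learning_rate(iteration: int) -> float:
    """1e-3 for the first 16K iterations, 1e-4 for the remaining 4K."""
    return 1e-3 if iteration < 16_000 else 1e-4


# EdgeConnect: Adam with beta1 = 0 and beta2 = 0.9; the generator learning
# rate is lowered in stages, each stage running until convergence.
EDGECONNECT_ADAM = {"beta1": 0.0, "beta2": 0.9}
GENERATOR_LR_STAGES = (1e-4, 1e-5, 1e-6)


def discriminator_learning_rate(generator_lr: float) -> float:
    """The discriminator uses one-tenth of the generator's learning rate."""
    return generator_lr / 10.0
```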

Text Detection
As shown in Figure 3, frames with both Chinese and English subtitles are input to CTPN for text detection. The bboxes of the Chinese and English subtitle text areas are obtained at the output side of CTPN, and the Intersection over Union (IoU) [36] is calculated to measure the accuracy of subtitle detection.

$$\mathrm{IoU} = \frac{\text{overlapping area of the predicted and ground-truth bounding boxes}}{\text{union area of the predicted and ground-truth bounding boxes}}$$

After testing, the IoUs of the output Chinese and English subtitle detection were 91.9% and 91.1%, respectively. As the input-processing layer of the joint deep networks, CTPN achieved the precise positioning of multilingual subtitles in detection, ensuring accurate extraction and removal of the burned-in subtitles.
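For axis-aligned bounding boxes, the IoU metric can be computed as in the following minimal sketch. The `(x_min, y_min, x_max, y_max)` box layout and the function name are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(pred: Box, truth: Box) -> float:
    """Intersection over Union of two axis-aligned bounding boxes."""
    # Width and height of the overlap; clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    intersection = inter_w * inter_h
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (truth[2] - truth[0]) * (truth[3] - truth[1])
             - intersection)
    return intersection / union if union > 0 else 0.0
```

For example, two 2 × 2 boxes offset horizontally by one pixel share an area of 2 out of a union of 6, giving an IoU of 1/3.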

Intermediate-Process
As shown in Figure 4, the bboxes of the Chinese/English subtitles from the CTPN and the original video frame were fed to the intermediate-process pipeline, and two groups of outputs were obtained: (1) images of the text area of the Chinese/English subtitle, and (2) subtitle masks and the subtitle-removed frames. The first group was input to the subtitle recognition network, and the second group was input to the frame inpainting network.
The whole process builds on the precise positioning of the subtitle text area: subtitle area segmentation enables end-to-end recognition, and text contour extraction enables the minimal removal of subtitles. Together, they enhance the recognition accuracy of the subtitle text and the inpainting quality of the subtitle-removed frames.

Text Recognition
The text images of Chinese and English subtitles were input to the CRNN for text recognition. As shown in Figure 5, the CRNN output the recognized texts of the Chinese and English subtitles. Recognition was also performed on entire frames without processing, as a comparison to demonstrate the advantages of the end-to-end recognition strategy. The recognition accuracy was calculated for the numerical evaluation. Table 1 lists the accuracies of entire-frame recognition and end-to-end recognition for Chinese/English subtitles, which indicates that the end-to-end recognition strategy significantly improved the recognition accuracy for both Chinese and English subtitles by minimizing the interference of irrelevant background information. Despite the acceptable result obtained by CRNN, some non-negligible errors remained in the recognition of subtitles, due to the complex video image background. Fortunately, the proposed joint deep networks are modular; hence, an improved text recognition network can replace the existing text recognition part in the future.

$$\mathrm{Accuracy} = \frac{\text{number of words correctly recognized}}{\text{total number of recognized words}}$$

Table 1. Accuracy of Chinese and English subtitle recognition under different strategies.

Frame Inpainting
Subtitle masks and the subtitle-removed frames were input to the EdgeConnect for inpainting, and the inpainted frames were obtained at the output of the EdgeConnect network. A traditional method and a state-of-the-art deep learning method were also tested as a comparison. As a representative diffusion-based inpainting method, the Fast Marching Method (FMM) [37], which utilizes existing domain pixels for gradient estimation to achieve fast marching of missing pixels, is suitable for video processing with high inpainting efficiency among the traditional methods.
As a representative GAN-based inpainting method, Globally and Locally Consistent Image Completion (GLCIC) [31] uses a fully convolutional network as a generator to inpaint pixels in arbitrarily shaped missing regions and discriminates the global and local consistency of the inpainted content by means of two discriminators. Hence, these two methods are used as traditional and state-of-the-art deep learning inpainting strategies, respectively, compared with our strategy.
In order to make an objective comparison between other existing methods and our method in terms of frame inpainting, the image quality metrics: Peak Signal-to-Noise Ratio (PSNR) [38], Structural SIMilarity (SSIM) [39], Normalized Root Mean Square Error (NRMSE), and Fréchet Inception Distance (FID) [40] were calculated for the entire inpainted frames to evaluate the inpainting performance. Figures 6 and 7 show the inpainting effects of the traditional method (FMM), state-of-the-art deep learning method (GLCIC), and our method (EdgeConnect).
It can be seen that the textures inpainted by the FMM method are blurred with insufficient details, while the textures inpainted by the GLCIC method are far from the ground-truth texture, albeit with more details. The inpainted textures of our method are significantly more vivid than those of the other existing methods and closely match the ground-truth frames, with higher fineness and realism. The frames inpainted by EdgeConnect are visually coherent and were produced faster than with the FMM and GLCIC methods, making the video reconstruction system well suited to real-life applications.
Table 2 lists the evaluation metrics of the traditional method (FMM), the state-of-the-art deep learning method (GLCIC), and our method (EdgeConnect). PSNR was used to measure the distortion, SSIM was used to measure the similarity, NRMSE was used to measure the pixel error, and FID was used to measure the feature vector distance between the ground-truth frames and the inpainted frames, using a pre-trained Inception-V3 model. Our method recovered the lost high-frequency information by edge-connecting based on adversarial learning, outperforming other existing methods in all the evaluation metrics; thus, the inpainted frames from our method demonstrated higher realism and more information.
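Two of the pixel-level metrics, PSNR and NRMSE, are simple enough to sketch directly. The sketch below operates on flattened pixel sequences and is illustrative only: the paper does not state which NRMSE normalization was used, so normalizing by the root-mean-square of the ground-truth pixels is an assumption, as are the function names and the 8-bit peak value.

```python
import math
from typing import Sequence


def mse(truth: Sequence[float], restored: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    return sum((t - r) ** 2 for t, r in zip(truth, restored)) / len(truth)


def psnr(truth: Sequence[float], restored: Sequence[float], peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB; infinite for identical inputs."""
    error = mse(truth, restored)
    return math.inf if error == 0 else 10.0 * math.log10(peak ** 2 / error)


def nrmse(truth: Sequence[float], restored: Sequence[float]) -> float:
    """RMSE divided by the root-mean-square of the ground truth (assumed)."""
    reference = math.sqrt(sum(t ** 2 for t in truth) / len(truth))
    return math.sqrt(mse(truth, restored)) / reference
```

Higher PSNR and lower NRMSE both indicate that the inpainted frame is closer to the ground truth; SSIM and FID require windowed statistics and a pre-trained network, respectively, and are therefore omitted from this sketch.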

Post-Process
The outputs of the joint deep networks were fed to a pipeline for post-processing. As shown in Figure 8, the Chinese/English subtitle text sequences from the CRNN were synthesized into Chinese and English subtitle files, while the inpainted frames from the EdgeConnect were assembled into a video, during the post-process. The post-process finally realized the reconstruction of the burned-in subtitle video to the independent Chinese/English subtitle file and subtitle-free video, achieving the completeness of the entire reconstruction pipeline and facilitating users' re-editing.

Conclusions and Future Work
In this paper, we presented a deep-learning-based intelligent reconstruction system for burned-in subtitle videos. The system integrates CTPN, CRNN, and EdgeConnect through a purpose-designed intermediate-process. High-accuracy text extraction and high-quality frame restoration were achieved through the joint deep neural networks. Finally, the system completed the inverse conversion from the burned-in subtitle video to an independent subtitle file and a subtitle-free video by post-processing.
We evaluated the performance of the system, and found that the deep learning approach achieved high accuracy detection and recognition of subtitles and significantly enhanced the video inpainting compared to existing methods. This result is expected to be widely used in the field of reconstruction and re-editing of digital videos with subtitles, advertisements, logos, and other occlusions.
Future work can be continued in two aspects. The first is to improve the sub-net of the joint deep networks, especially for the text recognition network. According to the experimental results, both the text detection network (CTPN) and the frame inpainting network (EdgeConnect) achieved excellent performance; however, the accuracy of the text recognition network (CRNN) was still hindered by the complex video image background. We plan to combine audio recognition or a grammar checking network to enhance the subtitle recognition accuracy.
The second aspect is to polish the intermediate-processing steps for the coupling of deep networks, in particular for contour extraction. We plan to use a more accurate method for contour extraction to achieve the perfect removal of burned-in subtitles with minimal information loss.

Data Availability Statement:
Not applicable; the study does not report any data.

Acknowledgments:
The authors would like to thank all the anonymous reviewers for their valuable suggestions to improve this work.

Conflicts of Interest:
The authors declare no conflict of interest.