1. Introduction
After over a decade of development, the global tally of massive open online courses (MOOCs) has surpassed 10,000 courses, attracting over a billion users. MOOCs have emerged as a significant knowledge source for students worldwide. Nevertheless, a notable limitation is that most MOOC content is recorded in a single language, creating a barrier for non-native speakers who struggle to fully grasp the nuances of many outstanding courses. While modern algorithms facilitate audio and subtitle language translations, the majority of current scene text translation methods overlook the significance of stylized transfer when translating elements like writing on a board or PPT slides in MOOC videos. Consequently, this oversight can result in issues like text overlap, blurriness, and mismatched background colors in the translated text, ultimately impeding the comprehension and effective communication of MOOC content to non-native speakers.
To cater to learners from diverse regions and linguistic backgrounds, the textual content featured in MOOC videos should undergo meticulous language conversion. Cross-lingual refers to the phenomenon or technology where the input text and the output text are in different languages. This intricate process encompasses not just text translation, but also cultural adaptation, aiming to facilitate effective communication and comprehension of the material across a wide array of cultures. Considering the global outreach of MOOC videos, it becomes imperative for MOOC production teams to guarantee that their video content resonates accurately with learners hailing from different linguistic and cultural backgrounds. Achieving this demands not just the expertise of proficient translators adept in various languages for manual translation and proofreading, but also a committed video processing team tasked with seamlessly integrating the translated text at precise timestamps within the video. Consequently, the manual creation of multilingual MOOC videos proves to be a time-consuming, labor-intensive, costly, and often inefficient endeavor. Existing research has consistently demonstrated that manual annotation is prohibitively costly [
1,
2].
Text style transfer strives to substitute the original text in an image with text that bears a particular style. The primary obstacles it encounters encompass diverse elements like language, font, color, direction, stroke size, and spatial perspective. Significantly, cross-language text style transfer poses extraordinary challenges for models in extracting features and producing high-quality images, primarily because of the distinct glyph shapes inherent to different languages, such as Chinese and English. Moreover, directly applying the existing text style transfer methods to MOOC videos poses challenges for several reasons. Firstly, the majority of font generation research concentrates on languages that share similar characteristics, such as Chinese, Japanese, and Korean. Stylized translated text, on the other hand, relies on the mutual conversion between Chinese and English for style migration, despite their vastly different glyph structures. Secondly, while font generation primarily focuses on individual characters, stylized translated text aims to transform entire paragraphs. Lastly, font generation tasks prioritize specific elements like character strokes, font thickness, and glyph structure. To minimize interference from fonts and background colors, datasets predominantly feature high-contrast color schemes, such as white backgrounds with black text, or vice versa.
Major Challenges
The task of text language transfer in MOOC videos, as addressed in this paper, primarily faces three major challenges: language diversity, readability, and complex backgrounds. These challenges specifically impact the design of scene transfer algorithms for MOOC videos in the following three aspects:
Language Diversity: Most existing font generation research has focused on languages with similar characteristics, such as Chinese, Japanese, and Korean. In contrast, MOOC videos often require style transfer between languages with vastly different glyph structures, such as Chinese and English. This disparity complicates feature extraction and style adaptation.
Complex Backgrounds: MOOC videos often feature text embedded in complex backgrounds, such as blackboards or PPT slides. Existing methods, typically trained on high-contrast datasets (e.g., white text on black backgrounds), struggle to maintain text clarity and style consistency in such environments.
Readability: Preserving the original text’s color integrity during style transfer is crucial for visual quality. However, existing methods often fail to maintain color consistency, leading to issues like mismatched background colors or distorted text appearance.
To address these challenges, this paper introduces a character-level style transfer algorithm grounded in an attention mechanism, specifically targeting the prevalent issue of subpar image quality in current cross-language text style transfers. Our algorithm adeptly handles the demanding task of style transfer across diverse languages, a challenge that has proven elusive for many existing algorithms. To bolster the efficacy of the style transfer, we propose an imposed sequential attention (IS-attention) mechanism to pinpoint the precise location of text within the image. These positional data then serve as a guide to supervise the style transfer outcome for each individual character. Furthermore, we devised a novel color loss function that preserves the uniformity of text color throughout the style transfer process. By integrating this loss function, our algorithm ensures a more faithful retention of the original text’s color integrity during image generation, elevating the overall visual quality of the output. Experimental findings underscore the effectiveness of our proposed character-level style transfer algorithm leveraging the attention mechanism: it adeptly mitigates issues like text blurring and color discrepancies, ultimately yielding images of superior quality.
2. Related Work
The text style transfer algorithm, as an offshoot of style transfer algorithms, can be traced back to the inception of its parent field. The origins of style transfer algorithms date to the 1990s, a time when neural networks had yet to gain prominence. In this era, researchers frequently resorted to non-parametric techniques [
3]. They meticulously described various image features by conducting in-depth analyses of images’ textural features and using statistical models and mathematical formulas. Despite achieving certain results, this approach presented significant challenges: (1) slow generation speed and inefficient processing; (2) labor-intensive feature description, limiting practical applications; and (3) extracted features tended to be superficial, inadequately capturing the deeper meanings conveyed by images. Ideally, a robust style transfer algorithm should efficiently extract high-level features from the target image, ensuring a clean separation of image content and style [
4].
The remarkable advancements achieved by deep convolutional neural networks (CNNs) [
5] in the ImageNet large-scale visual recognition challenge paved the way for innovations in the field. Notably, Simonyan et al. introduced the renowned VGG network, which claimed victory in the ILSVRC2014 localization task and secured second place in the classification task. Leveraging the widespread adoption of CNNs, Gatys et al. [
6,
7,
8] built on this foundation and presented a CNN-based image transfer algorithm. This algorithm employs the VGG network to extract multi-layer features from both style and target images. The deeper the network extracts features, the more abstract and semantically rich they become. Gatys et al. emphasized that CNNs can effectively separate and articulate the content and style features extracted from images.
Despite breaking traditional boundaries in image style transfer, Gatys’s algorithm faced limitations due to its reliance on iterating over every pixel in the image, often demanding hundreds of iterations for convergence. Furthermore, the extracted high-dimensional features occasionally lost crucial underlying image information, resulting in content distortion and unpredictable errors in the stylized images. Subsequent scholars [
9,
10,
11] delved deeper into Gatys’s algorithm and proposed various methods to address the convergence issues stemming from pixel-based iteration.
Huang et al. [
12] made a groundbreaking achievement by enabling real-time conversion of arbitrary styles. Through rigorous experimentation, they established that instance normalization (IN) could standardize a style by normalizing feature statistics. They introduced adaptive instance normalization, which adjusts the affine parameters of IN to alter the standardization of feature statistics, thereby generating a variety of styles. Huang’s paper presented comprehensive experimental results, affirming the efficacy of adaptive instance normalization in style transfer and offering a novel approach for multi-style transfer.
The emergence of generative adversarial networks (GANs) [
13] marked a seismic shift in the realm of image generation. GANs consist of two neural networks: a generator that learns the data distribution and produces images, and a discriminator that distinguishes between real and generated images. Through a competitive mechanism, GANs yield more realistic and diverse imagery. The advent of GANs ushered in a new era of image generation, sparking numerous GAN-based style transfer studies [
14,
15,
16,
17] and paving the way for future research.
NVIDIA’s introduction of StyleGAN [
18] represented a significant milestone, achieving unsupervised separation of high-level attributes and utilizing latent space mapping to decouple input variables. Subsequently, StyleGAN2 [
19] considered the “blistering” phenomenon observed in StyleGAN’s output images, revealing disruptions in information flow between feature maps during normalization with adaptive instance normalization. StyleGAN3 [
20] further identified issues related to image positioning and feature adhesion in StyleGAN2 images, introducing an optimization strategy to address aliasing caused by point-wise non-linearity. This innovation allowed image translation, rotation, and other invariances, significantly elevating the quality of the generated images.
As the application of GANs in image style transfer has gradually matured, numerous scholars have shifted their focus to a crucial subtask within this domain: font generation. Font generation tasks typically involve languages with extensive character sets and intricate character structures, such as Chinese and Korean, posing higher demands than standard image style transfer tasks. This is due to the significant structural differences between various Chinese characters, where even minor errors can profoundly impact the overall appearance of the font. The common approach to this task is typically based on the encoder–decoder architecture, with strong constraints imposed on the output characters [
21,
22].
To address the limitations of learning and outputting only a single target font style at a time, as well as the limited migration effects for more stylized fonts, Tian [
23] introduced the zi2zi model. Meanwhile, ZiGAN [
24] further enhanced the zi2zi model by incorporating the CycleGAN framework. This addition of an extra encoder–decoder allows for mapping the generated styled font back to the original standard font, significantly improving the quality of the generated font shapes. Moreover, the discriminator incorporates a CAM (Class Activation Map) to sharpen the model’s focus on local information. FUNIT [
25] employs a network to extract the glyph features of standard characters and another network to capture the unique style of stylized characters. These two types of features are then fused together, enabling the reconstruction of Chinese characters with unchanged glyphs but an entirely new style. SC-Font [
26] incorporates semantic information at the stroke level as auxiliary data, gradually integrating it into the main UNet structure to ensure the stroke stability of the generated images. AGIS-Net [
27] utilizes two independent decoders to process the fused features: one decoder is dedicated to generating the basic glyph, while the other decoder is responsible for further rendering more delicate style details based on the previously generated glyph.
DM-Font [
4] underscores the importance of emphasizing local text features rather than overall morphology for font generation. Through meticulous observations of the glyphic structures of Chinese, Japanese, and Korean characters, the authors discovered that these scripts are governed by radicals and proposed the innovative concept of “component reuse”. They maintain that by meticulously manipulating these components, it is theoretically feasible to synthesize any character. Kong [
28] introduced an attention mechanism to oversee the accuracy of component generation during the text generation process. Meanwhile, Xie [
29] put forth the DG-Font model, arguing that styled fonts can be achieved by applying traditional transformations to standard fonts. This approach offers a novel perspective on font generation, emphasizing the potential of traditional techniques in creating unique and stylish fonts.
However, text style transfer differs from font generation tasks: (1) Font generation tasks typically involve the use of the same or similar languages, such as Chinese, Japanese, and Korean. (2) Font generation tasks aim to generate individual characters, whereas stylized translation text targets a segment of text. (3) Font generation tasks place greater emphasis on the stroke connections, thickness, and glyphic structure of characters. To minimize the influence of font color and background color, datasets often adopt a white background with black text or other color schemes with high contrast. Addressing these issues, Wu L [
30] designed SRNet (Style Retention Network), the first attempt to solve the problem of text editing in natural scene images. The algorithm consists of three modules: a text transformation module, a style background restoration module, and a fusion module. The text transformation module changes the text content, converting the source image to the target text, while preserving the original text. The style background restoration module erases the original text and fills the text area with appropriate textures. The fusion module combines information from the first two modules and generates the edited text image. SRNet’s architecture ensures that the original style of the content image is maintained, while replacing the text content, achieving a visual effect consistent with the original text image. Yang [
31] built upon the SRNet architecture and proposed the CSTN (Content Shape Transformation Network) module to address curved and irregular text in stylized images. This module utilizes N reference points to model the geometric transformation of text. Krishnan [
32] introduced self-supervision to decouple background and foreground text in stylized images, enabling the one-time transfer of a source style to new content.
In MOOC scenarios, scene text editing faces significant challenges, particularly in cross-language text style transfer. Current methods, often based on single large models, struggle with readability and are typically integrated approaches involving background separation, image fusion, and style transfer. While these methods work well for ordinary scenes like street views or menus, MOOC scenarios present unique challenges due to fixed shooting angles and dense text stacking, such as in PPTs, which can lead to text overlaps after translation.
Cross-language text style transfer is further complicated by the structural differences between languages, such as the complex strokes of Chinese characters versus the simpler shapes of English letters. Most existing research has focused on holistic style transfer, which can degrade the quality of individual characters. Additionally, GAN-based methods often impose simplistic style constraints, leading to an overpowered discriminator that hinders the generator’s ability to update parameters effectively. These limitations highlight the need for more robust solutions tailored to the specific demands of MOOC scenarios.
3. Method
The main objective of cross-language text style transfer is to acquire the style embodied by a text in a styled image of one language via convolutional neural networks and transpose it onto text in another language. Nonetheless, several obstacles arise during this endeavor: defining and recognizing style itself proves to be intricate; given the substantial disparities in glyph structures among different languages, preserving a text’s inherent structural traits during style transfer is imperative; translated texts might exhibit varying word lengths, posing a challenge in automatically resizing the text to prevent it from spilling over the image boundaries; and ensuring color parity between the original and output images remains an unresolved concern during style transfer. To tackle these obstacles, this study employs advanced techniques like generative adversarial networks (GANs), attention mechanisms, and latent space mapping, culminating in a network model adept at pinpointing textual positions within images. This model extracts pertinent features from designated locations and processes each character’s traits independently, facilitating meticulous character-level style transfer and color coordination. This methodology preserves the text’s structural coherence, while seamlessly transferring cross-language text styles and maintaining color harmony between the source and output images.
3.1. Text Style Transfer Networks
The cross-language text style transfer network presented in this paper is essentially an image generation GAN. The generator component employs an encoder–decoder framework, which efficiently processes input images and translates them into a sequence of feature representations. During the encoding phase, the model utilizes multiple layers of convolution operations to extract crucial feature information from the image, while during the decoding phase, this information is leveraged to progressively recreate the image, ultimately producing an output image with a distinctive style. The encoder–decoder framework guarantees comprehensive extraction of image features and facilitates the transfer of image styles.
This study incorporates the style module from StyleGAN. By integrating this module, we achieve decoupling and infusion of style features. This module seamlessly blends the characteristics of the style image with the text features, ensuring that the generated image preserves the original text information, while embodying the desired style attributes. To elevate the quality of the generated images even further, we introduce an attention mechanism into the discriminator. This mechanism predicts the location of each text within the generated image. Subsequently, the predicted text positions are employed to extract corresponding style and text features, enabling character-level supervision. Thus, the discriminator, equipped with an attention mechanism, aims to oversee the style and text features of each text, thereby enhancing its discriminatory capabilities. The adversarial training between the generator and discriminator of the GAN then serves to direct the generator towards producing images of superior quality. The specific architectural details of the model are shown in
Figure 1.
It is worth noting that AdaIN components play a crucial role in both the encoding and decoding stages of the network. A style vector is introduced into each upsampling convolution block’s AdaIN component, which then integrates it into every layer of feature maps, facilitating feature fusion. The calculation formula of AdaIN is shown in Equation (1). Distinguishing it from other normalization techniques, the statistics μ and σ are derived from an intermediate vector rather than learned directly. AdaIN receives a combined feature map comprising both style and text features; through analysis of this combined map, the statistics μ and σ are ascertained, and the normalization formula seamlessly integrates the style vector into the feature map. The style encoding and the content encoding y both have a dimensionality of 256.
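As a concrete illustration, the following minimal PyTorch sketch applies the standard AdaIN formulation: the content feature map is normalized with its own channel-wise statistics and then rescaled with statistics predicted from the style code. The linear layer mapping the 256-dimensional style code to per-channel (μ, σ) is an assumption for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization driven by a style vector (cf. Equation (1))."""
    def __init__(self, style_dim=256, num_channels=256, eps=1e-5):
        super().__init__()
        # Assumed mapping from the style code to per-channel statistics (mu_s, sigma_s).
        self.to_stats = nn.Linear(style_dim, num_channels * 2)
        self.eps = eps

    def forward(self, x, style):
        # x: (N, C, H, W) fused text/content features; style: (N, style_dim) style code.
        mu_s, sigma_s = self.to_stats(style).chunk(2, dim=1)
        mu_s = mu_s[:, :, None, None]
        sigma_s = sigma_s[:, :, None, None]
        mu_x = x.mean(dim=(2, 3), keepdim=True)                # per-channel content mean
        sigma_x = x.std(dim=(2, 3), keepdim=True) + self.eps   # per-channel content std
        return sigma_s * (x - mu_x) / sigma_x + mu_s           # normalize, then re-style
```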
3.2. Imposed Sequential Attention
Prior to implementing the attention mechanism, the corresponding data must be prepared. The first step uses an encoder to extract features from the input image and output a feature map of the generated image. Assuming the feature map has C channels and spatial dimensions H × W, a dimension transformation with L = H × W converts the feature map of the generated image into L C-dimensional vectors, denoted as eo. The second step prepares the text sequence vector for the generated image. Assume a text sequence of length n; for each character t in the sequence, its index number in the vocabulary is looked up, yielding the corresponding text vector. Start and end symbols, whose index numbers in the vocabulary are 0 and 1, are then added to this vector, resulting in the text vector denoted as in. The third step uses a vector of all zeros as the final hidden state of the encoder, denoted as ph.
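The sketch below illustrates this preparation under the notation just described (eo, in, ph); the vocabulary dictionary, tensor shapes, and hidden dimension are illustrative assumptions.

```python
import torch

def prepare_attention_inputs(feature_map, text, vocab, hidden_dim=256):
    # feature_map: (C, H, W) encoder output for the generated image.
    C, H, W = feature_map.shape
    eo = feature_map.view(C, H * W).t()                  # L x C vectors, with L = H * W
    token_ids = [0] + [vocab[ch] for ch in text] + [1]   # start (0) + character indices + end (1)
    in_vec = torch.tensor(token_ids, dtype=torch.long)   # text vector "in"
    ph = torch.zeros(1, hidden_dim)                      # all-zero initial hidden state "ph"
    return eo, in_vec, ph
```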
The three prepared inputs, eo, in, and ph, are fed into the attention mechanism to initiate the first time-step computation. The attention mechanism consists of two main steps. The first step calculates weighting parameters to determine which elements of the encoder output eo have the greatest impact on the current word; these weights are generated from a combination of eo and the hidden state ph. With the data preparation complete, the computation of the imposed sequential attention mechanism unfolds as follows. Initially, the attention scores between the hidden-layer variables are computed: the vector embedding operation in Equation (2) performs word embedding on the current token to yield the embedding vector. Next, the embedding vector and the hidden state ph are concatenated, and a tanh function is applied to derive the attention scores, as in Equation (3). The scores and the encoder output eo are then multiplied to generate the attended context, as outlined in Equation (4). In the final step, the context and the embedding are aggregated, and a fully connected layer performs a dimensional transformation to produce the output vector, as defined in Equation (5), where ⊙ indicates element-wise multiplication.
The following is the central procedure of the method introduced in this paper: imposing temporal constraints on the encoder output of the attention mechanism. By doing so, we compel the algorithm to systematically review every component of the text in its writing sequence, encompassing elements such as the radicals of Chinese characters and all the letters of the English alphabet. Our primary approach is to use a GRU network to model the output of each writing time step, effectively constraining and refining both the hidden state and the prediction results. The process begins by taking the hidden state acquired from the preceding step, prevh, as the input hidden state, while the output vector outt from the attention mechanism serves as the temporal input. The GRU network then produces a new output and a new hidden state. Within the two-step operation of the attention mechanism, the input data eo, ph, and in are used to compute the prediction results and the hidden-layer parameters of the attention mechanism through the formulas above. Upon completion of the first time step, ph is updated for the subsequent time step: the hidden state produced in the previous time step is adopted as the new ph, and the same process continues until all time steps have been completed.
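A minimal PyTorch sketch of one IS-attention time step follows. The module names and layer sizes (embed, attn_fc, out_fc, gru) are assumptions that mirror the description of Equations (2)-(5) and the GRU constraint, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ISAttentionStep(nn.Module):
    """One time step of the imposed sequential attention mechanism (sketch)."""
    def __init__(self, vocab_size, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)            # Eq. (2): word embedding
        self.attn_fc = nn.Linear(hidden_dim * 2, feat_dim)           # Eq. (3): scores from [emb; ph]
        self.out_fc = nn.Linear(feat_dim + hidden_dim, hidden_dim)   # Eq. (5): fuse context and embedding
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)                # temporal constraint per time step

    def forward(self, eo, token, ph):
        # eo: (L, C) encoder output, token: scalar LongTensor index, ph: (1, hidden_dim).
        emb = self.embed(token).unsqueeze(0)                              # (1, hidden_dim)
        scores = torch.tanh(self.attn_fc(torch.cat([emb, ph], dim=1)))    # (1, C)
        weights = F.softmax(eo @ scores.t(), dim=0)                       # Eq. (4): weight the L positions
        context = (weights * eo).sum(dim=0, keepdim=True)                 # attended character feature
        out_t = self.out_fc(torch.cat([context, emb], dim=1))             # Eq. (5): output vector out_t
        new_ph = self.gru(out_t, ph)                                      # hidden state carried forward
        return out_t, new_ph, weights
```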
In
Figure 2, the input data are represented by the brown area, the blue area depicts the process of learning parameters, the green area signifies operations that require no parameter learning, and the yellow section displays the output results. In the initial phase of the attention mechanism, a correlation connection is established between the input data to guarantee a precise one-to-one correspondence between the text sequence and its corresponding feature map. This ensures the extraction of feature maps for each individual text within the sequence, facilitating effective transfer and supervision of text structural information. Referring to the discriminator network shown in
Figure 1, subsequent to the extraction of feature maps for each text, this study proceeds to analyze these features to ascertain the accuracy of style transfer and color consistency.
3.3. Loss Function
The classification loss is a loss function designed specifically for the attention mechanism. The algorithm leverages the attention mechanism to extract a feature map for each text element present in the image; crucially, however, the attention mechanism’s output extends beyond the feature map and also provides the probability of the text type predicted at the current time step. A cross-entropy loss effectively quantifies the difference between this predicted result and the actual result. Its calculation is outlined in Equation (7), where CLS represents the cross-entropy loss function applied between the text probabilities predicted by the attention mechanism and the target text sequence.
The color loss is defined for each text element within the image and aims to maintain color consistency throughout the generated image. The attention mechanism generates a corresponding feature map for each textual region; a convolutional neural network is then applied to each text feature map to predict the RGB value of that text region from the extracted features. To guarantee the precision of the prediction, a cross-entropy loss quantifies the discrepancy between the predicted RGB value and the target RGB value, as detailed in Equation (8).
The font loss shares a similar design concept with the color loss. In the experiments detailed in this paper, we curated a comprehensive dataset encompassing 172 distinct font files, which serve as the primary source of textual styling in the style images. Predicting the font file associated with a given text is tantamount to predicting its stylistic attributes. The attention mechanism extracts a feature map for each textual region in the image; these feature maps encapsulate the pivotal characteristics of the text and serve as the basis for font type prediction by a font classifier. Contrasting the predicted font types with the authentic ones yields the font loss, thereby strengthening the constraints on style transfer. The computational formula for this process is outlined in Equation (9).
Finally, a text feature loss calculates the difference between the text features of the generated image and those of the input text image. During style transfer, the text features of the generated image and the text image must remain consistent; this loss therefore ensures that the text features are preserved during the feature fusion process and are not destroyed. The text features of the generated image and of the text image are re-extracted using a text feature extractor, and the L1 loss between the two feature sets is computed. The calculation formula is shown in Equation (10).
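A hedged sketch of these per-character supervision terms is given below. The classifier heads, the feature extractor, and the discretization of RGB targets into class labels are illustrative assumptions that follow the cross-entropy and L1 formulations of Equations (7)-(10).

```python
import torch.nn.functional as F

def character_losses(char_feats, text_targets, color_targets, font_targets,
                     text_head, color_head, font_head):
    """Per-character losses computed on attention-extracted character features."""
    l_cls = F.cross_entropy(text_head(char_feats), text_targets)      # Eq. (7): text type prediction
    l_color = F.cross_entropy(color_head(char_feats), color_targets)  # Eq. (8): RGB treated as class labels
    l_font = F.cross_entropy(font_head(char_feats), font_targets)     # Eq. (9): one of 172 font files
    return l_cls, l_color, l_font

def text_feature_loss(feat_extractor, generated_img, text_img):
    """Eq. (10): L1 distance between re-extracted text features."""
    return F.l1_loss(feat_extractor(generated_img), feat_extractor(text_img))
```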
The adversarial loss aims to assist the generator G in synthesizing realistic images that are indistinguishable from real samples, while rendering the discriminator D unable to distinguish between the generated image and the real sample, as shown in Equation (11).
Lastly, the loss function of the entire network is divided into two parts: the generator G and the discriminator D, as shown in Equations (12) and (13).
Based on the aforementioned model framework and loss function settings, this paper improves upon the classic GAN training method and designs a training procedure specifically tailored for video scene text images. This training procedure ensures the convergence of the generative network’s output while minimizing the loss functions proposed above. The detailed steps of the training process are outlined in Algorithm 1.
Algorithm 1: Pseudo-code for the cross-lingual text style transfer algorithm.
1. Initialize the networks and sample a style image, a content (text) image, and the corresponding target image from the training set M.
2. for t = 1 to T do
3.   for k = 1 to K do
4.     Freeze G.
5.     Compute the intermediate feature representations for the sampled images.
6.     Generate the output image.
7.     Calculate the discriminator loss via Equation (13).
8.     Update D by backpropagation.
9.   end
10.  for k = 1 to K do
11.    Freeze the D parameters (no backpropagation).
12.    Compute the intermediate feature representations for the sampled images.
13.    Generate the output image.
14.    Calculate the generator loss via Equation (12).
15.    Update the generator G via backpropagation.
16.  end
17. end
18. Return G.
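A minimal PyTorch sketch of this alternating schedule follows. G, D, the data loader, and the build_d_loss / build_g_loss helpers (standing in for Equations (13) and (12)) are assumed interfaces, not the paper's exact implementation.

```python
import itertools
import torch

def train(G, D, loader, opt_g, opt_d, build_d_loss, build_g_loss, T, K, device="cuda"):
    """Alternating update schedule of Algorithm 1 (sketch)."""
    batches = itertools.cycle(loader)                  # endless stream of (style, text, target) triples
    for _ in range(T):
        for _ in range(K):                             # discriminator phase: G effectively frozen
            style_img, text_img, target_img = [x.to(device) for x in next(batches)]
            with torch.no_grad():                      # no gradients flow into G
                fake = G(style_img, text_img)
            d_loss = build_d_loss(D, fake, target_img) # Eq. (13)
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        for p in D.parameters():                       # generator phase: D frozen
            p.requires_grad_(False)
        for _ in range(K):
            style_img, text_img, target_img = [x.to(device) for x in next(batches)]
            fake = G(style_img, text_img)
            g_loss = build_g_loss(D, fake, target_img) # Eq. (12)
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        for p in D.parameters():
            p.requires_grad_(True)
    return G
```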
4. Experiments
In this section, the experimental results are showcased, validating the robust scene text editing abilities of our model. Moreover, comparisons between our methodology and alternative approaches are offered to illustrate the superior effectiveness of our method. Furthermore, an ablation study was conducted to further assess our technique. Before detailing the experiment, it is necessary to elucidate the objectives of this study. The method proposed herein aims to address the issues of language diversity, readability, and complex backgrounds present in current MOOC videos. Specifically, regarding language diversity, the output text language differs from the input text language, while retaining similar stylistic characteristics, such as the thickness of strokes and the curvature of edges in handwriting. In terms of background complexity, we hoped for the output text and background colors to align with the input text and background colors. As for readability, we ensure that the output text is devoid of blurring, typos, and other such imperfections. Notably, these aspects can be evaluated using metrics such as FID, SSIM, and LPIPS, which assess style similarity.
4.1. Experimental Settings
Cross-lingual Synthesis Dataset for Chinese and English: A synthetic dataset, produced by the SRNet-Datagen open-source project and provided by Youdao Company, was chosen to create paired data sharing the same style but featuring distinct texts. The image style predominantly relies on 172 font files, incorporating randomized text and background colors; the latter are primarily drawn from a palette of 50 solid colors. Additionally, random blurring was applied to mimic the effects of real-world conditions on image clarity. Consequently, a training dataset comprising 50,000 images and a test dataset comprising 2000 images were generated.
Real-world Dataset: The ICDAR2013 [
33] and ICDAR2019-LSVT [
34] datasets, sourced from street scenes and primarily designed for detecting and recognizing horizontal text in natural environments, were utilized. Each image in these datasets has detailed text labels, with annotations provided as rectangular bounding boxes, and every image contains one or more text boxes. However, because these datasets lack corresponding target images and manually creating them is impractical, a subset of 2000 images was chosen exclusively for testing in this study. The network proposed in this study was trained for 50 epochs on the aforementioned dataset, with mixed training conducted in each epoch. During training, Chinese and English samples were selected alternately: when training for English-to-Chinese translation, English words were sampled from the dataset and the corresponding translated Chinese handwritten characters were set as the output, and vice versa. Training was performed on a hardware configuration comprising an Intel Core i9-13900K processor (manufacturer: Intel, Santa Clara, CA, USA), an NVIDIA RTX 4090 GPU (manufacturer: NVIDIA, Santa Clara, CA, USA) with 24 GB of memory, and 128 GB of DDR5 RAM. The implementation was based on the PyTorch toolkit (version 1.2.0) with Python 3.8. The seed size configured in this study was set to 256. In addition, the Adam optimizer was employed to train our model, with β1 = 0.5 and β2 = 0.999, until the output stabilized throughout the training phase. A fixed learning rate was used, and the batch size was consistently maintained at 8. We primarily employed three metrics to evaluate our approach: IS (Inception Score), LPIPS (Learned Perceptual Image Patch Similarity), and FID (Fréchet Inception Distance). IS assesses the quality and diversity of generated images, though it has limitations in evaluating realism; LPIPS measures the perceptual similarity between generated and target images, making it suitable for assessing details and structural fidelity; FID evaluates the distance between the distributions of generated and real images in feature space, providing a comprehensive assessment of both quality and diversity. The specific parameters and scale details of the neural network used in this project are listed in Table 1 and Table 2.
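A minimal evaluation sketch follows, assuming the torchmetrics and lpips packages are installed and expose the interfaces shown here; the random tensors merely stand in for batches of generated and target images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
import lpips

fake_u8 = torch.randint(0, 255, (8, 3, 256, 256), dtype=torch.uint8)  # generated images (placeholder)
real_u8 = torch.randint(0, 255, (8, 3, 256, 256), dtype=torch.uint8)  # target images (placeholder)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_u8, real=True)
fid.update(fake_u8, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake_u8)
print("IS:", inception.compute())            # returns (mean, std)

lpips_fn = lpips.LPIPS(net="alex")           # perceptual similarity network
fake_f = fake_u8.float() / 127.5 - 1.0       # lpips expects inputs scaled to [-1, 1]
real_f = real_u8.float() / 127.5 - 1.0
print("LPIPS:", lpips_fn(fake_f, real_f).mean().item())
```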
4.2. Results and Analysis
Cross-lingual Style Transfer. To assess the efficacy of our proposed algorithm in achieving cross-lingual style transfer, this section exhibits the experimental outcomes of our cross-lingual text style transfer module. We also benchmark the image generation capabilities of our algorithm against SRNet [
30] as a reference point. Precisely, four distinct language pairs were established for style transfer in our experiments: Chinese to Chinese, English to English, Chinese to English, and English to Chinese. Furthermore, to demonstrate the algorithm’s robustness in handling scenarios where the word counts in the style image and the text image diverge, we intentionally varied the word counts across different languages.
From a qualitative standpoint, the method introduced in this paper exhibited significantly improved readability of the text after style transfer, when juxtaposed with the baseline approach. Additionally, given that this study involves cross-language style transfer (necessitating that the input text for the target style differs linguistically from the content text), our proposed method remained impressively resilient to linguistic disparities. As an illustration, refer to the second row of
Figure 3, where the baseline method incorrectly incorporated Chinese characters even though the intended content is exclusively in English. Conversely, our method not only accurately reproduced the intended textual content but also maintained the defining attributes of the target style, including text color and stroke thickness, among others. Remarkably, in certain experimental outcomes, the method adeptly replicated the curvature traits of the text strokes. Notably, in the third row of
Figure 3, the curvature of the rendered Chinese characters bears a striking resemblance to that of the English source text, a similarity that is discernible even to the casual observer.
Intra-lingual Style Transfer. To further evaluate the results of style transfer within the same language, we set the MOSTEL [
35] algorithm as a comparison method in this section and present the style transfer effects from English to English. As shown in
Figure 4, we compared the results of our proposed algorithm with those of MOSTEL on the English dataset.
It should be noted that there have been few works on cross-language text style transfer, and some of them did not publish their code. Therefore, we selected the SRNet method, whose source code is publicly available, for the cross-language comparison, and we chose MOSTEL as a comparative method for style transfer within the same language. All comparison methods maintained the same configuration during training. As can be seen in
Figure 4, the edited text structure is regular; the font remains consistent with the previous one, the background texture is more reasonable, and the overall feeling is similar to the real image. The quantitative comparison in
Table 3 shows that our method outperformed the comparison methods in all indicators.
Compared to the chosen baseline method, the approach presented in this paper exhibited noteworthy advantages in text clarity, correctness, foreground contrast, and image signal-to-noise ratio. By amalgamating these outcomes with those from prior cross-language model experiments, it is unequivocally evident that the enhancements introduced in this study were remarkably effective. Furthermore,
Table 3 encapsulates the quantitative findings from the intra-language experiments. Among all the experimental data in the tested dataset, our method emerged as the top performer across all three metrics for image similarity and quality assessment. Remarkably, our approach achieved a substantial lead in the FID metric, which came as a pleasant surprise. We further scrutinized the range of improvements suggested in this paper via ablation experiments.
4.3. Results of Scene Text Style Transfer
To further validate the effectiveness of the algorithm, a relevant real-world dataset was adopted as the test set to verify the generalization ability of the algorithm. In particular, considering that the real-world dataset lacked corresponding target images and it was difficult to create them manually, the real dataset was used only for testing. The results are shown in
Figure 5. We performed cross-language style transfer experiments focusing on real-world scene texts within images. The outcomes revealed that our approach effectively transferred attributes like text color and stroke thickness from the reference style image. Notably, the results presented in the fourth row of
Figure 5 show that our method adeptly captured and reproduced the connected writing patterns inherent in the input style text. This finding underscores the role of our algorithm’s robust temporal attention mechanism, which considered the image traits of the transitions between various text elements during the training phase. Undoubtedly, this constitutes a promising outcome.
The statistical results of this experiment are presented in
Table 4, and in this section, we elaborate on the experimental findings. In practical, real-world experiments, the dataset lacked labeled target text specifically for migration purposes. To evaluate the stylistic consistency, we computed FID and IS metrics by comparing the migrated text with the original style text input images (marked as
S-r). It is important to note that the experimental results depicted in
Figure 5 are associated with the S-r setting, which may explain why these outcomes were marginally less impressive when compared to the benchmarks in
Table 4. Despite this, the results unequivocally demonstrated that our method notably surpassed the baseline approach in the field of scene text style transfer. This serves as additional proof of the remarkable generalization capabilities of the method introduced in this paper.
Furthermore, we conducted additional tests by employing the input text content
as the benchmark for the target image, denoted as C-ref. This particular metric served to primarily ascertain the correctness of the text generated by our method. Referring to
Table 4, it becomes evident that our approach excelled among the comparative methods in terms of textual accuracy, aligning seamlessly with the outcomes presented in
Figure 3,
Figure 4 and
Figure 5.
4.4. Ablation Study
To assess the influence of the color constraints introduced in this paper on the ultimate outcomes,
Figure 6 distinctly presents effect diagrams of the text style transfer, comparing scenarios with and without the application of color constraints.
Upon analyzing the ablation experimental results related to color constraints, it becomes evident that the algorithm without the inclusion of color constraints struggled to accurately capture color information during the transfer process, leading to color confusion in the resulting diagrams. Furthermore, in cases where the color information of the generated image is pale, the text font structure could not be fully rendered, significantly compromising the quality of the generated image. In conclusion, the integration of color constraints significantly boosted the algorithm’s capacity to learn color features, thereby effectively enhancing the quality of text style transfer.
Apart from the previously mentioned ablation experiments focusing on color constraints, this paper further conducted ablation studies on the attention mechanisms, text-level style loss, and content encoder loss functions. The outcomes of these ablation experiments are summarized in
Table 5, which utilized three widely recognized style transfer quality evaluation metrics: IS, LPIPS, and FID, to quantitatively assess the effectiveness of the ablation studies.
Through a comparative analysis of the results in
Table 5, the following conclusions can be drawn: The absence of color constraints significantly impacted the IS, LPIPS, and FID metrics, indicating that without color constraints, the consistency between generated images and target images was notably reduced. This suggests that the generated images were highly dependent on color and sensitive to color recognition. In the cases of missing character-level style constraints and text constraints, the FID metric was also affected to some extent, demonstrating that FID is sensitive to text-related features and struggles to generate results consistent with the target image when text information is lacking. Notably, the results without the attention mechanism were worse across all three metrics compared to those without color constraints, highlighting that relying solely on the attention mechanism, without additional technical support, can weaken the overall performance of the algorithm.
4.5. MOOC Applications
Leveraging the proposed text migration methodology, along with established models like text detection, this paper introduces a system tailored for MOOC-specific scene text language migration. This comprehensive system comprises five primary modules: data preprocessing, third-party model inference invocation, cross-language text style transfer inference, data post-processing, and video keyframe extraction, which identifies crucial frames within MOOCs and pinpoints the timestamps where text alterations occur.
The system also incorporates third-party model inferences, specifically utilizing PaddleOCR’s [
37] text detection and recognition models alongside machine translation algorithms. PaddleOCR employs a two-pronged approach, initially employing a text detection algorithm to ascertain the precise coordinates of styled images. The end result of this meticulously designed system is exemplified in
Figure 7, showcasing its efficacy and precision in MOOC-related text migrations.
5. System
This section delves deeply into and introduces the logical flow of the real-time stylized translation text system for videos. The system achieves efficient processing and stylized translation of video content through four carefully designed core steps, successfully applying advanced algorithms to practical scenarios. Leveraging an existing front-end and back-end technology stack, we successfully built a feature-complete, visually interfaced real-time stylized translation text system for videos. This system not only fully realizes its core functions but also incorporates intuitive and user-friendly function buttons and operational interfaces, enabling users to easily operate and manage the process, thereby greatly enhancing user experience and system practicality.
5.1. System Logical Flow
For the specific optimization algorithm in the MOOC scenario, there are three key components: video keyframe extraction, background separation, and cross-language text style transfer. Other components are implemented using third-party open-source algorithms. The logical flow of the system is divided into four main steps:
(1) The front-end interface uploads a video, sends the video file to the back end, and the back end sends the video to the video keyframe extraction module to extract a sequence of keyframes from the video. (2) Text detection is performed on the keyframe sequence to detect the location coordinates of the text in the image. These coordinates are used to extract the corresponding text image as the style image. Then, a text detection algorithm is used to recognize the text information in the text image. Finally, the text is translated through a text translation module into the corresponding language, generating a content image. (3) The text style transfer module and background extraction module process in parallel. The content image and style image are input into the cross-language style transfer module to achieve style transfer for the content image. The background separation module extracts the background image by erasing the text information from the style image. (4) The image fusion module is responsible for fusing the background image and the style-transferred image to produce the output result image. Finally, the resulting image is returned to the video frame based on the text position determined by text detection.
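The sketch below strings these four steps together; every interface in PipelineModules is a hypothetical stand-in for the corresponding system component rather than the system's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]

@dataclass
class PipelineModules:
    # Hypothetical callables standing in for the modules described in the text.
    extract_keyframes: Callable[[str], List[Any]]
    detect_text: Callable[[Any], List[Box]]
    crop: Callable[[Any, Box], Any]
    recognize: Callable[[Any], str]
    translate: Callable[[str], str]
    render_content: Callable[[str], Any]
    style_transfer: Callable[[Any, Any], Any]
    separate_background: Callable[[Any], Any]
    fuse: Callable[[Any, Any], Any]
    paste: Callable[[Any, Any, Box], None]

def process_video(video_path: str, m: PipelineModules):
    keyframes = m.extract_keyframes(video_path)               # step (1): keyframe extraction
    for frame in keyframes:
        for box in m.detect_text(frame):                      # step (2): locate text, crop style image
            style_img = m.crop(frame, box)
            content_img = m.render_content(m.translate(m.recognize(style_img)))
            styled = m.style_transfer(content_img, style_img) # step (3): style transfer ...
            background = m.separate_background(style_img)     #           ... and background erasure
            m.paste(frame, m.fuse(styled, background), box)   # step (4): fuse and write back to the frame
    return keyframes
```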
5.2. System Overall Design
The system is designed based on a B/S architecture, utilizing front-end–back-end separation technology to complete the overall development. As shown in
Figure 8, the front end employs technologies such as H5, Vue, and AntDesign to build an intuitive and interactive user interface. Additionally, Ajax technology is used to implement asynchronous operations, ensuring efficient and smooth data transmission between the front end and back end. The back end is divided into three layers: the business logic layer, the model inference layer, and the database layer. The business logic layer primarily uses Python (version 3.8) and Java (Java 21) languages to handle the business logic, including data preprocessing, model inference invocation, and user data management functions. The model inference layer mainly relies on deep learning frameworks such as PyTorch (version 1.2.0) and Paddle (version 2.1.0), leveraging their rich functional libraries and GPU acceleration technology to efficiently process data and perform calculations. In the database layer, technologies such as SpringBoot, Aop, Mybatis, and Jdbc are used to handle database creation, maintenance, storage, connection, and session management tasks, enabling user registration, login, and data storage functionalities.
5.3. System Function Design
This section focuses on the back-end part of the system, which is mainly divided into five modules, as shown in
Figure 9: data preprocessing, third-party model inference invocation, a cross-language text style transfer inference module, data post-processing, and user data management. Data preprocessing includes video keyframe extraction, background separation algorithms, and data format conversion. Video keyframe extraction is responsible for extracting keyframes from videos and detecting time nodes where text changes in the video. The background separation algorithm erases text traces from the detected text images, preserving non-text areas. Data format conversion involves unifying images to the size required for model inference, normalizing the data, and converting it into tensor operations.
Third-party model inference invocation includes PaddleOCR’s text detection and text recognition models, as well as machine translation algorithms. PaddleOCR’s text detection and text recognition model mainly adopts a two-stage method. First, a text detection algorithm is used to obtain the coordinate positions of style images. Then, the coordinate information is utilized to crop out the style images for text recognition, to identify the content of the text in the current style image. Machine translation translates the currently recognized text content into the corresponding language, such as from Chinese to English, and finally generates a content image in the translated language.
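As an illustration of the two-stage invocation, the snippet below uses the PaddleOCR Python package; argument names follow its commonly documented interface but may vary across versions, and the translate function is a placeholder for the machine translation module.

```python
from paddleocr import PaddleOCR

def translate(text, target_lang="en"):
    # Placeholder for the system's machine translation module.
    return text

ocr = PaddleOCR(use_angle_cls=True, lang="ch")   # loads the detection + recognition models
result = ocr.ocr("keyframe.png", cls=True)       # stage 1: detect boxes, stage 2: recognize text
for box, (text, confidence) in result[0]:
    translated = translate(text)                 # feed recognized text to machine translation
    print(box, text, confidence, translated)
```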
The cross-language style transfer inference algorithm applies text stylization to the content image in the translated language, maintaining the visual appearance of the style image. An image fusion algorithm is then used to fuse the background image with the stylized image, outputting the final result image. The data post-processing module is divided into result image quality assessment and text translation quality assessment. It mainly relies on manual methods to judge whether the results meet subjective human visual quality standards and whether the machine-translated content is accurate. Finally, post-processing is used to implement error correction for the entire system.
User data management is used to manage all user information and data, including user identity information, uploaded video data, and subsequently compiled video data.
5.4. System Interface Diagram
This section presents the most crucial aspects of the implementation of the front-end interface: After logging in through the user interface, users can access the video upload interface. They can select the video file to be uploaded and click the “Open” button directly to upload the video to the front-end interface. The front-end interface sends the video data to the back-end business logic layer and utilizes the data preprocessing module to extract keyframes from the video. Upon completing the parsing of the entire video file, the back end saves the user data, such as uploaded video files and extracted keyframe sequences, by combining user information through user data management in the business logic layer. Finally, the back end returns the keyframes to the front-end interface to display the result. Users can flip through the keyframe sequence using the “Previous” and “Next” buttons, making it convenient for them to observe the keyframes.
When the current keyframe is selected, the system navigates to the image recognition interface. The business logic layer invokes third-party models for inference. It first locates the text in the keyframe through text detection and recognizes the text content. Then, it translates the text using a text translation algorithm. Lastly, it utilizes the text position information obtained during text detection to return the translated text to the original image. The “User Data Management” in the back end saves the output image and sends it to the front-end interface for display. This allows users to easily observe whether the text translation results are correct. When a translation error is found, it can be manually corrected through the text box above, and the changes can be saved by clicking the “Save” button. The “Previous” and “Next” buttons are used to navigate through the text in the current video frame.
After invoking the third-party models, the text style transfer inference model in the back-end business logic layer inputs the prepared data into the cross-language text style transfer, background separation, and image fusion models for inference. Finally, it returns the final result image.
This system demonstrates the process of cross-language text stylization in the simplest way. It not only combines machine translation with text detection and recognition technology to support image content recognition and translation, but also utilizes current text style transfer technology to automatically return the translated text in stylized images to the video frame.
Nevertheless, the current system exhibits several limitations that warrant further refinement. Firstly, the inherent disparities in pronunciation duration and rhythm between Chinese and English lead to challenges in subtitle alignment within MOOC videos. Enforced alignment may cause certain subtitle segments to transition too rapidly, potentially disrupting user engagement. Secondly, the approach adopted in this study necessitates the superimposition of generated cross-language text over the original PPT, which can inadvertently mask transient elements such as symbols or underlines present in the original content. Lastly, the system’s processing time for an entire video remains notably lengthy, highlighting the need for optimization to enhance efficiency.
6. Conclusions
This paper centers on the migration of stylized content from online course videos and texts, including PowerPoint presentations, within MOOC applications. Addressing the limitations of existing methods, we introduced several groundbreaking measures, such as an imposed temporal attention mechanism and color constraint loss. Our approach delivered impressive results in the cross-language style transfer dataset compiled for this study, surpassing the current state-of-the-art (SOTA) methods in numerous comparative metrics. Moreover, our method proved its efficacy in various real-world text editing scenarios, consistently showing notable advantages in experimental outcomes. Through ablation studies, we comprehensively validated the effectiveness of each proposed innovation. Ultimately, our methodology has been tentatively integrated into text translation for MOOC settings, significantly contributing to the global advancement of MOOCs.
Future Work
The current system is not without its imperfections and still requires enhancements to the algorithms and technical capabilities to achieve optimal performance. First, OCR accuracy depends on the language and script; while current OCR algorithms perform well on printed fonts, further optimization of multilingual OCR models is needed. Second, subtitles, incorrect matches, or unnecessary line breaks in PPTs may lead to errors in semantic segmentation of the text; we are exploring solutions based on large models, but no significant progress has been made yet. Third, the current algorithm does not specifically handle cross-language scenarios, resulting in poor readability and inconsistent style in generated fonts, as demonstrated in the experimental section. Fourth, complex backgrounds may cause OCR algorithms to fail to detect text, but MOOC videos typically use solid-color backgrounds to enhance contrast, making this issue rare in practice; in addition, resolution significantly impacts video processing results, and we require the original video resolution to be at least 1080p. Fifth, when there is a significant difference in the length of content before and after translation, layout issues may arise; in the project implementation, we prioritized the readability of the generated text, partially sacrificing layout neatness. Finally, when processing large datasets, excessively long videos or too many PPT slides can increase the computation time. However, actual tests showed that computation time increases linearly with video length, and the system typically processes a MOOC video within two to five minutes, which is considered acceptable.