Sign Language Translation: A Survey of Approaches and Techniques

Abstract: Sign language is the main communication method for deaf and hard-of-hearing (i.e., DHH) people.


Introduction
Sign language is a visual language used by both congenitally DHH and acquired DHH people, and it relies on both manual and nonmanual information [1] for visual communication. Manual information includes the shape, orientation, position, and motion of the hands, while nonmanual information includes body posture, arm movements [2], eye gaze, lip shape, and facial expressions [3]. Sign language is not a simple word-for-word rendering of spoken language; it has its own grammar, semantic structure, and language logic [4]. Continuous changes in hand and body movements represent different units of meaning. According to statistics from the World Federation of the Deaf, there are about 70 million DHH people and over 200 sign languages in the world [5]. Therefore, improving sign language translation technology can bridge the communication gap between DHH and non-DHH individuals.
For the task of sign language translation (SLT), previous works [6][7][8] mainly focused on sign language recognition (SLR), i.e., recognizing sign language as the corresponding glosses. SLT, in contrast, converts the recognized glosses into spoken language text, which differs from directly predicting spoken language text from sign language videos [9]. Unlike spoken language text, a gloss [10] encodes grammatical and semantic information about tense, order, and direction or position in sign language. A gloss may also indicate how many times a sign is repeated. Figure 1 shows the difference between SLR and SLT.
SLR can be divided into two categories: isolated sign recognition [11][12][13] and continuous sign recognition [7,14,15]. The former refers to the fine-grained recognition of individual sign movements, where one video corresponds to a single gloss and segmenting the sign videos requires a large amount of manual effort. The latter maps a continuous sign language video to a sequence of glosses, where the order of the glosses is consistent with the signing order. Drawing on deep neural networks from natural language processing, Camgoz et al. [9] treated SLT as a translation problem into ordinary spoken language text: unlike SLR, which only recognizes sign language as a sequence of glosses, they aimed to generate spoken language text that non-sign language users can also understand. As shown in Figure 2, there are four common frameworks [16]:
• Sign2gloss2text (S2G2T) [17][18][19], which first recognizes the sign language video as gloss annotations and then translates the glosses into spoken language text.
• Sign2text (S2T), which directly generates spoken language text from the sign language video end to end.
• Sign2(gloss+text) (S2(G+T)) [16,20,21], which outputs glosses and text in a multitask fashion and can use external glosses as supervision signals.
• Gloss2text (G2T) [22][23][24], which reflects the translation performance from gloss sequences to text.
In this work, we classify and summarize the literature into three types: improving the accuracy of SLR, improving the performance of SLT through different models, and addressing the scarcity of data resources. The first type of literature introduces how to improve the performance of SLT through the task of SLR, and the second type introduces how to modify the network structure to better capture visual and textual semantic information. The third type aims to address the scarcity of sign language data for SLT. Finally, we introduce the most common datasets and evaluation metrics employed for the task of SLT.
We noticed that there are other well-written review articles on the same topic in the published literature [5,[25][26][27], and our manuscript differs from them in the following aspects. Firstly, our manuscript clearly analyzes the concepts of sign language recognition and translation, explains the common ground between SLR and SLT, and clarifies the boundary between them. SLR can be regarded as a substep of SLT, a two-step process of first recognition and then translation, but SLT can also be implemented without relying on SLR, in an end-to-end way, namely, S2T. Secondly, according to the characteristics of SLT, we analyze and compare the existing technologies and methods, classifying them into three types: improving the performance of SLR, changing the network structure to improve translation performance, and solving the problem of sign language data scarcity. Existing reviews are often organized chronologically, without such a classification, which makes it difficult for readers to grasp the various problems of sign language translation. Finally, we describe the latest developments in the field of SLT, offering readers a broad view and fresh inspiration. Table 1 shows the differences between past review papers and our work. In Section 2, we introduce the basic framework of SLT. In Section 3, we present the methods and models used for SLT. In Sections 4 and 5, we introduce the datasets and metrics employed for the task of SLT. In Section 6, we discuss the current challenges for the task of SLT.

The Background of SLT
SLT is a typical sequence-to-sequence problem that translates continuous sign language videos into fluent spoken language text. The celebrated encoder-decoder architecture of SLT is shown in Figure 3.
The rich semantic information in the sign language video frames is first encoded into dense vectors. The decoder module then takes the dense vectors as input and generates the target spoken text sequentially. The framework of SLT consists of three modules:
• Spatial and word embedding layers, which map the sign language video frames and the spoken text into feature vectors or dense vectors, respectively.
• A tokenization layer, which tokenizes the feature vectors.
• The encoder-decoder module, which predicts the spoken text and adjusts the network parameters through backpropagation to reduce the difference between the target text and the generated text.
The spatial embedding layer extracts features from the visual information of sign language, and various network structures are used, such as 2D-CNNs [28], 3D-CNNs [29], and GCNs [30,31]. In addition to these general structures, multi-cue networks tailored to sign language visual information have also been applied. Cue features, such as facial expressions, body posture, and gesture movements, are fused through corresponding fusion mechanisms and then fed into the tokenization layer.
The word-embedding module learns a dense vector through a linear projection layer. For the tokenization layer, both "frame-level" and "gloss-level" token schemes are available, and RNN-HMM is a typical method for the "gloss level". The encoder-decoder module may consist of multiple RNN or LSTM cells and their variants, such as Bi-LSTM or GRU cells. To address the long-term dependency issue, various attention mechanisms can be incorporated, such as those of Bahdanau et al. [32] and Luong et al. [33]. Moreover, other structures such as graph convolutional networks and transformers [34] have also been employed in SLT. Figure 4 shows the celebrated transformer framework employed in SLT. Firstly, the visual information is obtained through structures such as S3D [35], OpenPose [36], VGG-Net [37], and STMC [1] and undergoes spatial embedding. Next, the semantic representations of the visual and textual information enter N encoder modules based on the self-attention mechanism of the transformer. In the decoder module, the word embeddings enter a masked multihead attention module, where the mask ensures that only already-generated tokens are used when extracting contextual information, preventing the decoder from attending to future tokens. Finally, the predicted words are output via a softmax layer.
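To make the pipeline in Figure 4 concrete, the following minimal sketch wires a spatial embedding, a transformer encoder-decoder with a causal (masked) decoder, and an output projection followed by softmax together in PyTorch. The dimensions, names (e.g., `SLTTransformer`, `feat_dim`), and the choice of `nn.Transformer` are illustrative assumptions rather than the configuration of any specific published SLT system; positional encodings and the gloss supervision branch are omitted for brevity.

```python
# Minimal, hedged sketch of a transformer-based SLT model (PyTorch).
import torch
import torch.nn as nn

class SLTTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, vocab_size=3000,
                 nhead=8, num_layers=3):
        super().__init__()
        # Spatial embedding: project per-frame visual features (e.g., from a
        # 2D/3D-CNN backbone) into the model dimension.
        self.spatial_embed = nn.Linear(feat_dim, d_model)
        # Word embedding for the target spoken-language tokens.
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out_proj = nn.Linear(d_model, vocab_size)  # softmax applied in the loss/decoding step

    def forward(self, frame_feats, tgt_tokens):
        # frame_feats: (batch, num_frames, feat_dim) visual features per frame
        # tgt_tokens:  (batch, tgt_len) target spoken-language token ids (shifted right)
        src = self.spatial_embed(frame_feats)
        tgt = self.word_embed(tgt_tokens)
        # Causal mask: each target position may only attend to earlier tokens,
        # i.e., the "masked" multihead attention in the decoder.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out_proj(hidden)  # logits over the spoken-language vocabulary

# Toy usage: 2 videos with 8 frame features each, 5 target tokens generated so far.
model = SLTTransformer()
logits = model(torch.randn(2, 8, 1024), torch.randint(0, 3000, (2, 5)))
```

In practice, the per-frame features would come from a pretrained visual backbone such as S3D or a 2D-CNN, and decoding would be performed autoregressively, typically with beam search.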

Literature Review of SLT
In this section, we classify the SLT literature into three types: improving the performance of SLR, modifying the network structure to improve translation performance, and solving the problem of sign language data scarcity. The first type improves SLT by improving SLR, the second improves SLT by modifying the network structure, and the third addresses the scarcity of sign language data.

Improving the Performance of SLR
SLR can be regarded as a substep of SLT, a two-step process of first recognition and then translation, but SLT can also be implemented without relying on SLR in an end-to-end way, namely, S2T. Experiments have shown that improving the performance of sign language recognition is an effective way to improve sign language translation. In this section, we review these solutions in detail.
To better capture the global visual semantic information, He et al. [38] employed the Faster R-CNN model to locate and recognize hand gestures in sign language videos. They combined a 3D-CNN network with an LSTM-based encoder-decoder framework for SLR. To meet the high accuracy requirements of sign language video segmentation, Li et al. [4] proposed a temporal semantic pyramid model to partition the video into segments with different levels of granularity. They employed a time-based hierarchical feature learning method and attention mechanisms to learn local information and non-local contextual information. To explore precise action boundaries and learn the temporal cues in sign language videos, Guo et al. [39] proposed a hierarchical fusion model to obtain visual information at different granularities. First, a 3D-CNN framework and a Kinect device were used to extract the RGB features and skeleton descriptors, respectively. Then, an adaptive clip summarization (ACS) framework was proposed for automatically selecting key clips or frames of variable sizes. Next, multilayer LSTMs were employed to learn features at the frame, clip, and viseme/signeme levels. Finally, a query-adaptive model was designed to generate the target spoken text. To fully leverage the significant information in body postures and positions, Gan et al. [40] proposed a skeleton-aware model, which treated the skeletons as a representation of human postures and sliced the video into clips. To improve the robustness of the model, Kim et al. [41] proposed a robust key-point normalization method that normalized the positions of key points through the neck-shoulder framework. The normalized key points were then used as the input sequence for a transformer network. In German sign language, there is an obvious problem that some signs share the same hand gesture and differ only in lip shape. To address this problem, Zheng et al. [3] proposed a semantic focus model for extracting facial expression features for German sign language.
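As an illustration of the key-point normalization idea mentioned above, the sketch below translates skeleton key points so that the neck becomes the origin and scales them by the shoulder distance, making the representation roughly invariant to the signer's position and distance to the camera. The joint indices follow a common OpenPose-style layout and are assumptions for illustration; the exact scheme used by Kim et al. [41] may differ.

```python
# Hedged sketch of skeleton key-point normalization relative to the neck and shoulders.
import numpy as np

def normalize_keypoints(keypoints, neck_idx=1, r_shoulder_idx=2, l_shoulder_idx=5):
    """keypoints: (num_joints, 2) array of (x, y) image coordinates for one frame."""
    neck = keypoints[neck_idx]
    shoulder_width = np.linalg.norm(keypoints[l_shoulder_idx] - keypoints[r_shoulder_idx])
    shoulder_width = max(shoulder_width, 1e-6)  # guard against degenerate poses
    # Translate so the neck is the origin, then scale by shoulder width so that
    # signer position and camera distance are largely factored out.
    return (keypoints - neck) / shoulder_width

frame_keypoints = np.random.rand(25, 2) * 256        # e.g., OpenPose body joints in pixels
normalized = normalize_keypoints(frame_keypoints)     # per-frame input to the transformer encoder
```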
Unlike some works [9,[42][43][44] that focus more on specific appearance features, Rodriguez and Martinez [2] focused on the motion variations in sign language and used optical flow images instead of RGB frames for SLT. They first used a 3D-CNN framework to obtain the optical flow representation and extract spatial motion patterns. Then, they used bidirectional recurrent neural networks for motion analysis and for learning the nontemporal relationships of optical flow. Finally, they combined a gestural attention mechanism with spatiotemporal descriptors to enhance the correlation with spoken language units. Different from some works [14,42,45] that only consider a single feature, such as hand features, Camgoz et al. [46] processed the information from different articulators separately. They designed a multichannel transformer model to improve the performance of the feature extraction. Zhou et al. [1] proposed a spatiotemporal multicue model (STMC) to explore the hidden information contained in sign language videos. Unlike common approaches that process multiple aspects of the visual information contained in sign language, Kan et al. [47] proposed a novel model using graph neural networks to transform sign language features into hierarchical spatiotemporal graphs. Their hierarchical spatiotemporal graphs consist of high-level and fine-level graphs, the former representing the states of the hands and face, while the latter represents more detailed patterns of hand joints and facial regions. Table 2 shows the performance of some celebrated models mentioned above.

Network Structure for Improving the Performance of Translation
To improve the performance of SLT, many researchers have proposed various advanced network frameworks for capturing deep visual and text semantic information. In this section, we review these solutions in detail.
Fang et al. [51] proposed the DeepASL system for American Sign Language (ASL) translation. First, DeepASL treated hand shape, relative position, and hand movement as the skeleton information of ASL, captured with a Leap Motion sensor worn by the user. Then, a hierarchical bidirectional model was used to capture semantic information and generate word-level translations. Finally, connectionist temporal classification was employed for sentence-level translation. Koller et al. [8] proposed a hybrid framework for sequence encoding and employed a Bayesian framework to generate sign language text. Wang et al. [48] proposed the connectionist temporal fusion (CTF) framework for SLT. First, a C3D-ResNet was employed to extract visual information from the video clips. Then, the visual information was fed into temporal convolution (TCOV) and bidirectional GRU (Bi-GRU) modules to obtain the short-term and long-term transitions, respectively. Next, a fusion module (FL) was employed to connect the modules and learn complementary relationships. Finally, a connectionist temporal fusion (CTF) mechanism was designed to generate sentences. Although their proposed model could solve the frame-level alignment problem, it could not address the correspondence between jumbled word order and visual content. Therefore, Guo et al. [43] proposed a hierarchical LSTM (H-LSTM) model, which embedded features at the frame level, clip level, and viseme level. They used a C3D network [29] to extract the visual features and an online adaptive key-segment mining method to remove irrelevant frames. They proposed three pooling strategies to reduce the influence of less important clips.
To the best of our knowledge, Camgoz et al. [9] were the first to translate sign language videos into spoken language text with an end-to-end model. Their model employed a 2D-CNN framework for spatial embedding, an RNN-HMM framework for word segmentation, and a sequence-to-sequence attention model for sequence mapping. To address the issue of vanishing gradients, Arvanitis et al. [52] employed a gated recurrent unit (GRU [53])-based seq2seq framework [32,54,55] for SLT. They employed three different Luong attention mechanisms [33] to calculate the weight parameters of the hidden states. Guo et al. [49] proposed a dense temporal convolutional network (DenseTCN) that captures the details of sign language movements over short-term to long-term ranges. They used a 3D-CNN framework to extract the visual representation and designed a temporal convolution (TC) framework [56] to capture local context information. Inspired by DenseNet [57], they expanded the network into a dense hierarchical structure to capture global context information and computed the CTC loss at the top of each layer to optimize the parameters. Camgoz et al. [16] argued that using glosses as inputs could undermine the performance of the SLT system. To settle this issue, they proposed a transformer model that performs SLR and SLT jointly in an end-to-end way, without requiring an explicit gloss representation. Yin and Read [58] proposed an STMC-Transformer framework for SLT, where the gloss sequence was identified by a transformer-based encoder-decoder network [16]. After exploring spatial and temporal cues with STMC [1], Yin and Read [17] predicted the gloss sequence using Bi-LSTM and CTC frameworks and generated spoken text using a transformer framework. To overcome the difficulty of modeling long-term dependencies and the large resource consumption, Zheng et al. [59] proposed a frame stream density compression framework and a temporal convolution and dynamic hierarchical model for SLT. Voskou et al. [60] proposed an SLT architecture with a novel layer in the transformer network that avoids using gloss sequences as an explicit representation. Qin et al. [61] proposed the video transformer net (VTN) framework for the tasks of SLR and SLT. Their framework is a lightweight SLT architecture that uses ResNet-34 and a transformer framework as encoder and decoder, respectively. To address the weakly supervised problem in SLT, Song et al. [50] proposed a parallel temporal encoder framework to extract global and local information simultaneously. To address the problem of multilanguage SLT (MSLT), Yin et al. [62] proposed a transformer model for translating different types of sign language into the corresponding texts in an end-to-end way. To capture the nonlocal and global semantic information of the video and the spoken language text, Guo et al. [63] proposed a locality-aware transformer (LAT) framework for the task of SLT. To acquire the multicue information of sign language, Zhou et al. [20] proposed a spatial-temporal multicue (STMC) framework for the task of SLT. Table 3 shows the performance of some of the celebrated models mentioned above.
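Several of the models above (e.g., the Bi-LSTM+CTC gloss predictor and DenseTCN's per-layer CTC losses) rely on connectionist temporal classification to align frame-level predictions with gloss sequences without manual segmentation. The snippet below is a generic, hedged illustration of how such a CTC objective is computed in PyTorch; the shapes and vocabulary size are placeholders, not values from any particular paper.

```python
# Hedged sketch of CTC supervision for frame-level gloss recognition.
import torch
import torch.nn as nn

num_frames, batch, gloss_vocab = 60, 2, 1066      # blank token assumed at index 0
frame_logits = torch.randn(num_frames, batch, gloss_vocab).log_softmax(-1)

# Target gloss sequences of variable length, concatenated as expected by nn.CTCLoss.
targets = torch.tensor([5, 12, 7, 3, 9, 41])      # glosses of both samples, concatenated
input_lengths = torch.tensor([num_frames, num_frames])
target_lengths = torch.tensor([4, 2])             # sample 1 has 4 glosses, sample 2 has 2

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(frame_logits, targets, input_lengths, target_lengths)
# CTC marginalizes over all monotonic frame-to-gloss alignments, so no manual
# segmentation of the continuous sign video is required.
```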

Solving the Problem of the Scarcity of Sign Language
As is well known, recording high-quality sign language data in large quantities is extremely expensive; the sign-German pairs of RWTH-PHOENIX-Weather-2014T number fewer than 9000 samples [19], which is an order of magnitude smaller than typical neural machine translation corpora [69]. Data scarcity is a major challenge and bottleneck for the task of SLT. To settle this issue, many solutions have emerged, such as backtranslation, data augmentation, transfer learning, and leveraging generative models. In this section, we review these solutions in detail.
Orbay et al. [64] considered that using glosses as supervision data in SLT could improve translation performance. Since gloss data are limited and expensive to obtain, they explored semisupervised labeling methods. They proposed two labeling methods: the first utilized OpenPose [70] to extract hands from video frames, followed by hand shape recognition with a 2D-CNN; the second employed a pretrained action recognition model. Experiments showed that frame-level labeling may be better than scarce glosses, that 3D-CNNs may be more effective for SLT in the future, and that labeling of the right-hand information contributed more to translation quality. Because of the lack of corpora between Myanmar sign language (MSL) and the Myanmar language, Moe et al. [71] investigated unsupervised neural machine translation (U-NMT) for Myanmar. To settle the issue of sparse data annotations, Albanie et al. [72] proposed an automatic annotation method and introduced the BSL-1K dataset. To improve the performance of SLT, Zhou et al. [19] used a large number of monolingual texts to augment the SLT dataset. Inspired by backtranslation models [73], they introduced the SignBT algorithm, which adds newly generated parallel sample pairs to the dataset, as sketched below. To address the scarcity of sign language data, Nunnari et al. [74] proposed a data augmentation model that can help eliminate the background and personal characteristics of the signer. Gomez et al. [75] proposed a transformer model for text-to-sign gloss translation. Their proposed model recognizes syntactic information and enhances the discriminative power for low-resource SLT tasks without significantly increasing model complexity.
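The following schematic sketch illustrates the general back-translation idea behind SignBT-style augmentation: a reverse (text-to-gloss) model generates pseudo gloss sequences for monolingual spoken-language sentences, and the resulting synthetic pairs are added to the real parallel data. The class and function names are hypothetical placeholders; the actual SignBT pipeline additionally maps the pseudo glosses back to sign feature sequences, which is not shown here.

```python
# Hedged, schematic sketch of back-translation-style data augmentation for SLT/G2T.
class DummyText2Gloss:
    """Stand-in for a trained text-to-gloss model (assumption for illustration)."""
    def generate(self, sentence):
        # Glosses are often written as upper-case lemmas; this is a trivial placeholder.
        return sentence.upper().split()

def augment_with_backtranslation(reverse_model, monolingual_sentences, real_pairs):
    """Create pseudo (gloss, text) pairs from monolingual spoken-language text."""
    pseudo_pairs = [(reverse_model.generate(s), s) for s in monolingual_sentences]
    # The forward gloss-to-text (or sign-to-text) model is then trained on real + synthetic pairs.
    return real_pairs + pseudo_pairs

augmented = augment_with_backtranslation(
    DummyText2Gloss(),
    ["tomorrow it will rain in the north"],
    real_pairs=[(["MORGEN", "REGEN"], "morgen regnet es")])
```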
To address the lack of parallel corpus pairs, Coster et al. [21] proposed a frozen pretrained transformer (FPT) model and initialized the transformer translation model with pretrained BERT-based and mBART-50 models. Coster et al. [65] continued their research on the effectiveness of frozen pretrained models and found that the performance improvement was not due to the changes in their proposed model but rather to the written-language corpora. To settle the issue of small sign language datasets, Zhao et al. [66] aimed to learn the linguistic characteristics of spoken language to improve translation performance. They proposed a framework consisting of a verification model that queries whether words exist in the sign language video, a pretrained conditional sentence generation module that combines the detected words into multiple candidate sentences, and a cross-modal reranking model that selects the best-fit sentence.
To address the low-resource problem, Fu et al. [67] proposed a novel contrastive learning model for the task of SLT (a schematic sketch follows this paragraph). They fed the recognized glosses twice to the transformer translation network and used the resulting hidden-layer representations as two types of "positive examples". Correspondingly, they randomly selected K tokens from the vocabulary that were not present in the current sentence as "negative examples". With the progress of transfer learning in areas such as speech recognition, Mocialov et al. [76] introduced transfer learning into the low-resource task of SLT. They designed two transfer learning techniques for language modeling and used the large Penn Treebank corpus to import English language knowledge into stacked LSTM models. Chen et al. [68] proposed a transfer learning strategy for SLT. Their model was a progressive pretraining approach that utilizes a large quantity of external data from a general domain to pretrain the model step by step. As opposed to the research that focuses on improving SLR, Cao et al. [24] aimed to improve the translation part of SLT. They proposed a task-aware instruction network (TIN) that leverages a pretrained model and a large number of unlabeled corpora to enhance translation performance. To reduce the differences between gloss and text, they proposed a data augmentation strategy that performs upsampling at the token level, sentence level, and dataset level. Moryossef et al. [22] focused on the gloss-to-text task in SLT and proposed two rule-based augmentation strategies to address the problem of scarce resources. They proposed general rules and language-specific rules to generate pseudo-parallel gloss-text pairs, which were then used for backtranslation to improve the model's performance. Ye et al. [77] proposed a domain text generation model for the gloss-to-text translation task, which can generate large-scale spoken language text for backtranslation (BT). To settle the problem of data scarcity and the modality gap between sign video and text, Zhang et al. [78] proposed a novel framework for the task of SLT; to obtain more data resources, they drew on the task of machine translation. Similarly, Ye et al. [79] proposed a cross-modality data augmentation framework to settle the problem of the modality gap between sign and text. Table 3 shows the performance of some of the celebrated models mentioned above.
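As a rough illustration of the token-level contrastive objective described for Fu et al. [67], the sketch below scores each token's two "positive" representations (from two forward passes) against K "negative" token embeddings with an InfoNCE-style loss. The temperature, similarity function, and where the loss is attached are assumptions for illustration, not the published configuration.

```python
# Hedged sketch of a token-level contrastive loss with positives from two forward passes.
import torch
import torch.nn.functional as F

def contrastive_loss(hidden_a, hidden_b, negative_embeds, temperature=0.1):
    # hidden_a, hidden_b: (seq_len, dim) representations of the same gloss sequence
    #                     from two forward passes (the "positive" pair per token).
    # negative_embeds:    (K, dim) embeddings of K tokens absent from the current sentence.
    pos_sim = F.cosine_similarity(hidden_a, hidden_b, dim=-1) / temperature            # (seq_len,)
    neg_sim = F.cosine_similarity(
        hidden_a.unsqueeze(1), negative_embeds.unsqueeze(0), dim=-1) / temperature     # (seq_len, K)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)                         # positive at index 0
    labels = torch.zeros(hidden_a.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(7, 512), torch.randn(7, 512), torch.randn(16, 512))
```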

Datasets
In this section, we introduce the datasets employed for SLT and SLR. Firstly, the datasets employed for the task of SLT contain sign language videos, sign language glosses, and spoken language text. Secondly, the datasets employed for the subtask of SLT (gloss2text) contain spoken language text and sign language glosses. Thirdly, the datasets employed for SLR contain sign language videos and sign language glosses. Additionally, we introduce multilanguage SLT datasets that contain various sign languages and spoken languages. Table 4 lists the datasets used by some of the celebrated models mentioned above.
Currently, RWTH-PHOENIX-Weather-2014T [9] (PHOENIX14T) is the most widely used dataset and serves as the benchmark for most baseline models in SLT. PHOENIX14T consists of German sign language videos, sign language glosses, and spoken language text. PHOENIX14T is segmented into parallel sentences, where each German sign video (consisting of multiple frames) corresponds to a gloss sequence and a German sentence. The sign language videos of PHOENIX14T were collected from nine different signers; the gloss vocabulary size is 1066, and the German spoken-language vocabulary size is 2887. The visuals of PHOENIX14T were sourced from German weather news, the German sign language annotations were provided by deaf specialists, and the German text was sourced from the news speakers. CSL-Daily [19] is a celebrated sign language dataset for Chinese SLT. It covers various themes such as family life, school life, and medical care. The CSL-Daily dataset includes sign language videos from 10 native signers whose signing is both normative and natural. The sign gloss vocabulary of this dataset consists of 2000 words, and the spoken Chinese text vocabulary consists of 2343 words; both were created under the guidance of sign language linguistics experts and sign language teachers.
RWTH-PHOENIX-Weather 2014 (PHOENIX14) [80] is a celebrated sign language dataset for SLR. The sign language videos of PHOENIX14 are from weather news programs and were collected from nine signers, with a gloss vocabulary of 1081.
ASLG-PC12 [81] is an American Sign Language (ASL) dataset created semi-automatically using rule-based methods, and it does not include sign language videos. ASLG-PC12 contains a massive number of gloss-text pairs and can be employed for the task of G2T. The vocabulary sizes of the spoken language text and the American sign glosses are 21,600 and 15,782, respectively.
Spreadthesign-Ten (SP-10) [62] is a multilingual sign language dataset for SLT. The sign language videos and corresponding texts of SP-10 were collected for 10 different languages. Each data point of SP-10 includes 10 sign language videos and 10 spoken language translation texts.
Evaluation Metrics
The WER is an evaluation metric commonly used in speech recognition and NLP; it measures the minimum number of edit operations (substitutions, insertions, and deletions) required to transform the recognized gloss sequence into the reference sequence, normalized by the length of the reference.
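In its standard form, with S, I, and D the numbers of substitutions, insertions, and deletions and N the length of the reference gloss sequence:

```latex
\mathrm{WER} = \frac{S + I + D}{N}
```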
BLEU is a celebrated evaluation metric employed for SLT; it assesses the performance of each model through the similarity between the generated text and a reference translation. The BLEU score is calculated by computing the frequency of shared n-grams [89] between the predicted text and the reference sentence, with scores ranging from zero to one to indicate the degree of similarity. The BLEU score is one if the predicted text and the reference sentence are identical. Depending on the length of the n-grams, BLEU is divided into BLEU-1, BLEU-2, BLEU-3, and BLEU-4.
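As a small, hedged example, sentence-level BLEU can be computed with NLTK as below; the weights select which n-gram orders are averaged, and corpus-level BLEU, as usually reported in SLT papers, is computed analogously over the whole test set.

```python
# Example of computing BLEU-1 and BLEU-4 for a single generated sentence with NLTK.
from nltk.translate.bleu_score import sentence_bleu

reference = [["tomorrow", "it", "will", "rain", "in", "the", "north"]]
hypothesis = ["tomorrow", "it", "will", "rain", "in", "north"]

bleu_1 = sentence_bleu(reference, hypothesis, weights=(1.0, 0, 0, 0))
bleu_4 = sentence_bleu(reference, hypothesis, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-1 = {bleu_1:.3f}, BLEU-4 = {bleu_4:.3f}")
```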
The ROUGE index evaluates translation performance by measuring the degree of matching between the predicted and reference results; it differs from BLEU by emphasizing recall rather than precision when evaluating translation quality.
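In its recall-oriented n-gram form (SLT papers typically report the ROUGE-L variant, which uses the longest common subsequence instead of fixed n-grams):

```latex
\mathrm{ROUGE\text{-}N} = \frac{\sum_{\mathrm{gram}_n \in \mathrm{reference}} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{\mathrm{gram}_n \in \mathrm{reference}} \mathrm{Count}(\mathrm{gram}_n)}
```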
METEOR uses WordNet [90] and combines multiple matching criteria, such as exact word matching, word-order matching, synonym matching, stemming matching, and noun-phrase matching. Table 5 shows the metrics of some of the celebrated models mentioned above.

Conclusions
SLT is a typical multimodal, cross-disciplinary task, which plays an important role in promoting communication within the deaf community and between non-DHH and DHH individuals. At present, SLT faces various challenges, such as the scarcity of dataset resources and the difficulty of obtaining high-quality SLT resources. In this work, building on the background of SLT, we classified the task of SLT and described its pipeline. We extracted the typical framework of SLT and analyzed it with specific examples. We classified and summarized the latest SLT literature and methods into three types: improving the accuracy of SLR, improving the performance of SLT through different models, and addressing the scarcity of data resources. In addition, we introduced the most commonly used datasets and evaluation metrics employed for the task of SLT.
Currently, there are three feasible directions for improving the performance of SLT: firstly, fine-tuning models obtained from deep neural networks; secondly, introducing data resources from other fields, such as gesture recognition and machine translation; and finally, using generative models to produce the required SLT data resources. Whichever of these methods is used, the ultimate step for improving translation performance is to make targeted innovations suited to the domain of SLT, which may be the most important aspect of performance improvement.