Towards Hybrid Multimodal Manual and Non-Manual Arabic Sign Language Recognition: mArSL Database and Pilot Study

Abstract: Sign languages are the main visual communication medium between hard-of-hearing people and their societies. Similar to spoken languages, they are not universal and vary from region to region, but they are relatively under-resourced. Arabic Sign Language (ArSL) is one of these languages and has attracted increasing attention in the research community. However, most of the existing works on sign language recognition focus on manual gestures, ignoring other non-manual information needed for other linguistic signals, such as facial expressions. One of the main obstacles to considering these modalities is the lack of suitable datasets. In this paper, we propose a new multi-modality ArSL dataset that integrates various types of modalities. It consists of 6748 video samples of fifty signs performed by four signers and collected using Kinect V2 sensors. This dataset will be freely available for researchers to develop and benchmark their techniques for further advancement of the field. In addition, we evaluated the fusion of spatial and temporal features of different modalities, manual and non-manual, for sign language recognition using state-of-the-art deep learning techniques. This fusion boosted the accuracy of the recognition system in signer-independent mode by 3.6% compared with using manual gestures alone.


Introduction
According to the World Health Organization (WHO), the deaf and hard-of-hearing community formed around 6.1% of the world's population in 2018, which is close to 470 million people worldwide, and is expected to exceed 900 million by 2050 [1]. Hearing impairment can be classified into several categories ranging from mild to profound. This community depends mainly on sign language to communicate with society. Sign language is a complete natural language that has its own vocabulary and linguistic properties. It is not universal and there are many sign languages worldwide. There is no clear connection between spoken and sign languages, and even countries that speak one language can have different sign languages, such as American Sign Language (ASL) and British Sign Language (BSL) [2]. Moreover, some countries may have several sign languages, in the same way as having several dialects of a spoken language [3].
Unfamiliarity with sign language adds a barrier between deaf people and society. With the advances in computer vision and machine learning, different digital-aid systems have been developed to automatically recognize, synthesize, and translate sign languages. Existing work on sign language can be classified in a variety of ways, e.g., based on the part of the body or the type of features considered. The sign gesture is the basic component of sign language and can be classified, based on motion involvement, as static or dynamic. A static sign does not involve motion and depends largely on the shape and rotation of the signer's hands and fingers during signing [4]. Fingerspelling of alphabet letters and digits in most sign languages is expressed mostly using static signs. In contrast, dynamic signs involve motion of the hands and other parts of the body during signing [5]. The majority of sign gestures can be categorized as dynamic, where the motion plays a crucial role in conveying the meaning. Sign gestures can be manual, non-manual, or a combination of both. Gestures that involve hand and body motion are considered manual gestures. Non-manual gestures depend on other parts of the body, such as facial expressions and head movement, to express thoughts and clarify or emphasize meaning [6]. Manual and non-manual gestures are utilized simultaneously for the majority of signs.
Facial expressions are the dominant component of non-manual gestures in sign languages. They depend on the mouth, eyes, eyebrows, lips, nose, and cheeks to express feelings and emotions that cannot be conveyed by manual gestures. In addition, facial expressions play an important role in expressing the linguistic properties of sign languages. They are used for grammatical structure, lexical distinction, and discourse functions such as negation and adverbial and adjectival content [7]. An example of two signs of German Sign Language (GSL), "BROTHER" and "SISTER", that use the same hand gestures can be found in [8]. The difference between these signs lies in the facial expressions, specifically the lip pattern. The lip pattern is a commonly used parameter of non-manual gestures. A few lip patterns are dedicated to sign languages, whilst the majority correspond to the pronunciation of the signed words in the spoken language. Deaf people are good lip readers, and they read lip patterns to gain a full understanding of the signs, especially from people who can hear.
The eyebrows and forehead are other primary components of facial expressions in sign languages. They can be used alone or in combination with other facial expression components such as lip patterns. Figure 1a shows how the eyebrows and forehead are employed in the face posture of the "UGLY" sign of Arabic sign language. This face posture is also used with the "BEAUTIFUL" sign, but with different facial expressions. Head motion is also an important non-manual articulator of sign languages. A head posture can be used as an independent sign or integrated with a manual gesture, such as the "SLEEP" sign of Arabic sign language, which consists of hand gestures and head motion, as shown in Figure 1b.

Arabic Sign Language (ArSL) is the language used in Arab countries. It is the unified form of the several sign languages that exist in Arabic countries [9]. It was proposed in 1999 by the League of Arab States (LAS) and the Arab League Educational, Cultural and Scientific Organization (ALECSO), and a dictionary consisting of 3200 sign words was published in two parts in 2000 and 2006 [10,11]. This language is currently used mainly in the Arab Gulf countries and is the main sign language used in media channels such as Al-Jazeera. Research on automatic recognition of ArSL is still in its infancy, and one of the main challenges associated with ArSL recognition systems is the lack of databases with sufficient numbers of relevant videos representing the different articulators of the sign language [9]. In this work, we propose a multi-modality ArSL database with a focus on signs that employ manual and non-manual gestures. The proposed database consists of 6748 videos of 50 signs of ArSL performed by four signers. The signs of this database were recorded using Microsoft Kinect V2 sensors. In addition, we propose a hybrid model and quantitatively assess its effectiveness as a baseline for benchmarking future contributions on the dataset.
The proposed model combines manual and non-manual gestures to enhance the recognition rates of a sign language recognition system. The prominent component of non-manual gestures, facial expressions, is integrated with the manual gestures, and a higher accuracy is obtained compared with using manual gestures alone. These experiments are conducted on different input representations of the sign gestures, such as RGB and depth data.
The remainder of this paper is organized as follows: Section 2 briefly reviews the most relevant related work dealing with non-manual features. Section 3 describes the motivation for and details of the constructed database. Section 4 presents a pilot study using several state-of-the-art deep learning techniques for sign language recognition using examples of both manual and non-manual features. Finally, Section 5 concludes the paper and highlights its contributions.

Literature Review
Several techniques have been proposed over the last two decades for automatic recognition of sign language. The majority of these techniques targeted the dominant features of manual gestures; only a few approaches have studied non-manual features (e.g., facial expressions), either alone or in combination with manual features. One of the main challenges for recognition systems is the lack of datasets, especially for ArSL. This section reviews available sign language datasets and surveys the most relevant sign language recognition techniques.

Sign Language Databases
The availability of databases is one of the challenges for advancing the research and development of sign language recognition and translation [12]. Although a large number of sign language videos are available online, these videos are not annotated, which limits their usefulness for recognition and translation systems.
Sign language databases can be grouped into three main categories: fingerspelling, isolated signs, and continuous signing. Fingerspelling databases contain signs that depend mainly on finger shape and orientation. Most of the digits and alphabet letters of sign languages are static and use only the fingers. Isolated sign words are equivalent to spoken words, and they can be static or dynamic. Continuous sign language databases contain more than one sign word performed continuously. This section focuses on isolated sign databases since the other categories are outside the scope of this paper.
Isolated sign language databases can be classified, based on the acquisition device, into sensor-based and vision-based databases. Sensor-based databases are collected using cumbersome sensors worn on the signer's hand or wrist; the most commonly used sensors for this purpose are electronic gloves. The need to wear these sensors during signing was one of the main issues with sensor-based recognition techniques and motivated researchers to use vision-based techniques. Vision-based databases are collected using single- or multi-camera acquisition devices. Single-camera devices provide a single piece of information about the signer, such as a color video stream. A multi-camera device consists of more than one camera and provides different kinds of information about the signer, such as color and depth data. The multi-modal Kinect is one example of such devices, providing several types of information such as color, depth, and joint-point data.
Several aspects are important to consider when evaluating sign language databases, such as variability, size, and sign representation. The number of signers in the database is one of the factors that controls the database's variability. This factor is important for evaluating the generalization of recognition systems: increasing the number of signers serves signer-independent recognition systems, which are evaluated on signers different from those involved in system training. The number of samples per sign is another factor in sign language database evaluation. Having several samples per sign, with some variation per sample, is important for training machine learning-based techniques that require large numbers of samples per sign. The sign representation data are also an important factor for evaluating databases. All the samples of vision-based sign language databases are available in RGB format. However, some databases [13][14][15][16] were recorded using multi-modality devices that provide other representations of the sign sample, such as depth and joint points. Table 1 lists the surveyed sign language databases of non-Arabic sign languages at the sign-word level. As shown in the table, the majority of the databases are for ASL. It is also noticeable that databases published before 2012 are only available in RGB format, since multi-modality acquisition devices were released in 2011. In addition, datasets with large numbers of signs [16][17][18] do not have large numbers of samples per sign relative to their number of signs, compared with databases with small numbers of signs [13,19].

Sign Language Recognition Systems
The correlation between manual and non-manual gestures in sign language was studied by Krnoul et al. [30]. This study was conducted on Czech sign language, and its findings showed that hand and head gestures are correlated mainly in signs with vertical movement of the head and hands. Caridakis et al. [31] discussed the grammatical and syntactic relation of manual and non-manual gestures in sign language. They also investigated the efficiency of including facial expressions in sign language recognition. Sabyrov et al. [32] used logistic regression for Kazakh-Russian sign language recognition. OpenPose was used to extract key points from manual gestures and facial expressions. The reported results show that combining manual key points with mouth key points improved the accuracy by 7%, whereas eyebrow key points improved the accuracy by only 0.5%. This conclusion was also reported by Elons et al. [33], who found that combining facial features with manual gestures improved the accuracy from 88% to 98%.
Paulraj et al. [34] extracted the area and discrete cosine transform (DCT) coefficients from the signer's hand and head separately. These features were combined and classified using a simple neural network model to obtain an accuracy of 92.1% on 32 signs of Malaysian Sign Language. However, this technique depends on wearing colored gloves to facilitate hand segmentation, which makes the approach difficult to deploy in practice. DCT was also used by Rao and Kishore [35] for Indian sign language recognition. The 2D-DCT was used to extract features from the signer's head and hands, which were detected using a Sobel edge detector. This approach was evaluated on a dataset consisting of 18 signs, and an accuracy of 90.58% was reported. DCT with HMM was used by Al-Rausan et al. [36] for ArSL recognition. A dataset consisting of 30 signs performed by 18 signers was used to evaluate this approach, and accuracies of 96.74% and 94.2% were reported in signer-dependent and signer-independent modes, respectively. HMM was also used by Kelly et al. [37] to classify a set of statistical features extracted from the signer's hands and head.
An active appearance model was used by Agris et al. [8] to detect the signer's mouth and eyes, and a numerical description was computed from those components. For the signer's hands, a set of geometric features was extracted and concatenated with the facial expression features. This fusion of features improved the accuracy of GSL recognition by around 1.5%. Sarkar et al. [38] reported an improvement of around 4.0% on 39 signs of ASL by combining manual and non-manual gestures. A support vector machine (SVM) was used by Quesada et al. [39] for classifying manual and non-manual markers captured using a natural user interface device. This device captures hand shapes, body position, and facial expressions using 3D cameras. The approach achieved an accuracy of 91.25% using five face gestures and four handshapes of ASL. Kumar et al. [40] used two sensors to capture the signer's hands and facial expressions: a Leap Motion controller for manual gesture acquisition and a Kinect for capturing facial expressions. HMM was then used to recognize each component separately, and the outputs were combined using a Bayesian classification method. This combination boosted the recognition accuracy over a single modality by 1.04%. Camgoz et al. [41] employed a multi-channel transformer for continuous sign language recognition. Fusing the signer's hands and face improved the results to 19.21%, compared with 16.76% using the signer's hands only. This approach was evaluated on the RWTH-PHOENIX-Weather-2014T [42] dataset, which consists of 1066 signs of GSL performed by nine signers.

mArSL Database
In this section, we present our proposed multi-modality database for Arabic sign language (ArSL). We first explain the motivation for proposing this database and its properties compared with other available ArSL databases. We then describe the recording setup and sign capturing system and discuss the database components and organization.

Motivation
ArSL is a low-resource language, even though there are 22 Arab countries with a total population of more than 430 million (https://worldpopulationreview.com/countryrankings/arab-countries (accessed on 30 May 2021)). Although the vocabulary of this language is limited compared with the spoken language, no database is available that accommodates all sign words. A few datasets have been proposed recently, but with a main focus on signs that depend only on manual gestures. These datasets ignored signs that combine manual and non-manual gestures, either by excluding them explicitly or by not differentiating them from other signs. Consequently, it is difficult to propose and evaluate techniques that incorporate other important non-manual articulators, such as facial expressions, head and shoulder movements, and mouth shapes, which can provide extra information to enrich the meaning, disambiguate similar signs with different meanings, represent grammatical markers, and show emotions and attitudes. This motivated us to propose a database of signs that combine manual and non-manual gestures in order to recognize them correctly. The proposed database involves two types of similarity: intra-class similarity and inter-class similarity. For intra-class similarity, each sign is performed under different conditions, including different signers. For inter-class similarity, signs come in various levels, such as "HUNGRY" and "VERY HUNGRY". This database will also help in studying the linguistic properties of ArSL expressed through facial expressions, as well as the relations and roles of manual and non-manual features. In addition, the proposed database will be freely available for researchers (https://faculty.kfupm.edu.sa/ICS/hluqman/mArSL.html (accessed on 30 May 2021)).

Recording Setup
The mArSL database was recorded in an uncontrolled environment to resemble real-world scenarios. The database was recorded in more than one session to ensure variability in the signers' clothes and settings, without any restrictions on clothes of specific colors. Figure 2a shows the recording room, where signers perform the signs while seated, since the recorded signs require only the face and upper part of the signer's body. The distance between the sensors and the signer was 1.5 m, which is sufficient to capture the signer's body and provide accurate skeleton information.

Sign Capturing Vision System
In order to obtain 3D multi-modal information for the signs of interest, we used the Microsoft Kinect V2 as the acquisition device. The Kinect sensor is a motion-sensing device that was initially designed by Microsoft for a better user experience in video gaming and entertainment [43]. Two versions of Kinect were released by Microsoft, in 2010 and 2015, respectively. Kinect V2 came with new features, such as an increased number of tracked joint points: 25, compared with the 20 provided by Kinect V1.
The Kinect V2 sensor consists of color and infrared (IR) cameras. The color camera outputs a high-resolution RGB video stream at 1920 × 1080 pixels. The IR camera captures the modulated infrared light sent out by the IR projector/emitter to produce depth images/maps, which encode the distance between the sensor and each point in the scene based on the Time-of-Flight (ToF) intensity-modulation technique. The depth images are encoded using 16 bits and have a resolution of 512 × 424 pixels. In addition, the Kinect V2 sensor provides information about the signer's skeleton through 25 joint points, and it is equipped with a microphone array to capture sound, as shown in Figure 2b.
The capturing software packaged with the Kinect device by Microsoft does not align with our requirements of having synchronized recordings of all data sources. To address this issue, we used the tool developed by Terven et al. [44] to capture several modalities synchronously. The recording system was developed using Matlab and provides five modalities for each sign gesture. These are color, depth, joint points, face, and face HD information. The color and depth information were saved in an MP4 format while other modalities were saved as a Matlab matrix. More information about each data modality will be discussed in the following subsections.

Database Statistics
The mArSL database consists of a set of signs that synchronously employ manual and non-manual gestures. The dominant non-manual component, which appears with all database signs, is facial expression, which relies on the movement patterns of the eyes, eyebrows, lips, nose, and cheeks to visually emphasize or express the person's mood or attitude. We intentionally selected this set of signs for words and phrases after studying a large number of signs in the sign language, in order to focus on those requiring both manual and non-manual features.
The database signs can be divided into two parts based on the number of postures in the sign. The first part consists of signs with one posture, e.g., the "HUNGRY" sign shown in Figure 3a. The second category consists of signs with more than one posture, such as the "VERY SMALL" sign shown in Figure 3b.
The proposed database consists of 6748 videos of 50 signs of ArSL. They were performed by four signers trained in ArSL. Each sign was repeated 30 times (except by one signer, who has more than 30 samples per sign) across different sessions. The duration of each sample differs from sign to sign and from signer to signer. The total number of frames in the entire database is 337,991. Table 2 shows a comparison between mArSL and other available ArSL databases. As shown in the table, few databases are available for ArSL compared with other languages such as ASL. In addition, none of the other databases is designed to include non-manual articulations, since their focus was on the dominant manual articulations. The SignsWorld Atlas [45] is the only dataset that has some samples of facial expressions. That database is designed to include signs of different types, such as fingerspelling and continuous signing. However, it is not suitable for recognition systems since the number of samples per sign is not uniform: some signs have only one sample while others have 10 samples. In addition, its non-manual gestures are represented using still images and are not integrated with manual gestures; this makes them suitable for facial expression studies rather than sign language recognition.
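The stated totals can be cross-checked with a quick computation (a sketch under the assumption, consistent with the text, that three of the four signers contributed exactly 30 repetitions per sign; the per-sign count for the fourth signer is inferred, not stated in the paper):

```python
signs, repeats, total_videos = 50, 30, 6748

# Three signers contributed the standard 30 repetitions per sign.
regular = 3 * signs * repeats            # 4500 samples
# The remaining samples would belong to the fourth signer.
extra_signer = total_videos - regular    # 2248 samples
per_sign = extra_signer / signs          # about 45 samples per sign
```

Under this assumption the fourth signer would average roughly 45 samples per sign, consistent with "more than 30 samples per sign".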

Database Organization
Each captured sign is represented by five modalities: color, depth, joint points, face, and faceHD. An illustrative example is shown in Figure 4. For the joint-point modality, each of the 25 joint points is expressed in three coordinate spaces: color space, depth space, and camera space. The color space describes the 2D location of the joint point on the color image provided by the color camera. The depth space describes the 2D location of the joint point on the depth image. The coordinates of the joint point in the camera space are 3D (x, y, z) and are measured in meters. The x and y coordinates can be positive or negative, as they extend in both directions from the sensor, while the z coordinate is always positive as it grows out from the sensor. In addition, the orientation of each joint point is provided by Kinect as a quaternion consisting of four values (q_w, q_x, q_y, q_z), mathematically represented by a real part and a 3D vector as follows: Q = q_w + q_x i + q_y j + q_z k, where i, j, and k are unit vectors in the direction of the x, y, and z axes, respectively.
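A joint orientation quaternion of this form can be applied to a 3D vector as follows (a minimal sketch; `quat_rotate` is our own helper, not part of the Kinect SDK, and assumes a unit quaternion):

```python
def quat_rotate(q, v):
    """Rotate 3D vector v by unit quaternion q = (qw, qx, qy, qz),
    i.e. compute q * v * conj(q) via the equivalent rotation matrix."""
    qw, qx, qy, qz = q
    x, y, z = v
    return (
        (1 - 2*(qy*qy + qz*qz))*x + 2*(qx*qy - qw*qz)*y + 2*(qx*qz + qw*qy)*z,
        2*(qx*qy + qw*qz)*x + (1 - 2*(qx*qx + qz*qz))*y + 2*(qy*qz - qw*qx)*z,
        2*(qx*qz - qw*qy)*x + 2*(qy*qz + qw*qx)*y + (1 - 2*(qx*qx + qy*qy))*z,
    )
```

For example, a 90° rotation about the z axis, q = (cos 45°, 0, 0, sin 45°), maps (1, 0, 0) to approximately (0, 1, 0).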

Pilot Study and Benchmark Results
We evaluated manual and non-manual features for sign language recognition using the proposed dataset. We started by evaluating the manual gestures for automatic recognition of sign language. Then, we extended the experiments by fusing the manual gestures with facial expressions. Two evaluation settings were used to evaluate the proposed systems: signer-dependent and signer-independent settings. The signer-dependent mode evaluates the system on signer(s) already seen in the training data. In contrast, the signer-independent mode evaluates the system on signer(s) unseen in the training data. This type of evaluation is more challenging for machine learning algorithms compared with signer-dependent evaluations.
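The two evaluation protocols can be illustrated with a simple split helper (a sketch using hypothetical sample tuples; the actual database layout and file naming may differ):

```python
# Each sample: (signer_id, sign_label); video paths/frames omitted for brevity.
samples = [
    ("S1", "HUNGRY"), ("S1", "SLEEP"),
    ("S2", "HUNGRY"), ("S2", "SLEEP"),
    ("S3", "HUNGRY"), ("S4", "SLEEP"),
]

def signer_independent_split(samples, held_out_signer):
    """Train on all signers except one; test on the signer unseen in training.
    A signer-dependent split would instead partition each signer's samples
    across both sets."""
    train = [s for s in samples if s[0] != held_out_signer]
    test = [s for s in samples if s[0] == held_out_signer]
    return train, test

train, test = signer_independent_split(samples, "S4")
```

Holding out each signer in turn (leave-one-signer-out) is a common way to report averaged signer-independent accuracy.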

Manual Gestures
Sign language words are dynamic gestures in which motion is a primary part of the sign. Learning these gestures intuitively requires a sequential modeling technique such as long short-term memory (LSTM) networks or hidden Markov models (HMMs). These techniques are efficient at learning temporal data, though they do not pay much attention to the spatial information in the video stream. To address this issue, we used a convolutional neural network (CNN) to extract spatial features from the gesture frames and feed them into a stacked LSTM.
We evaluated transfer learning using several state-of-the-art pre-trained models for extracting spatial information from images. We fine-tuned the Inception-V3 [48], Xception [49], ResNet50 [50], VGG-16 [51], and MobileNet [52] models, which were pre-trained on ImageNet for large-scale image classification with 14,197,122 images and 21,841 subcategories. In addition, we proposed a CNN model consisting of five layers, with the number of kernels ranging from 16 to 64. The first two layers use a kernel of size 5 × 5, while the other layers use a 3 × 3 kernel. These layers are followed by a rectified linear unit (ReLU) activation function, which introduces non-linearity into the model. We also used maximum pooling layers of size 2 × 2 to down-sample the feature maps. The extracted features were fed into a stacked LSTM consisting of two LSTM layers. Each layer consists of 1024 neurons followed by a dropout layer to reduce overfitting. These layers were followed by a Softmax classifier. The Adam optimizer was used with a learning rate of 0.0001. The framework of the proposed system is shown in Figure 5.

Two data representations were used as inputs to the proposed systems: color and depth data. We fed 25 frames of each data input to the proposed models. These frames were selected by taking an interval between consecutive frames relative to the total number of sign frames. For sign samples with fewer than 25 frames, we replicated the last frame.

Table 3 shows the results obtained using color and depth frames in signer-dependent and signer-independent modes. As shown in the table, the accuracy results of the signer-dependent mode using color data are higher than those using depth data for almost all the models. For the signer-dependent mode with color images, the highest average accuracy of 99.7% was obtained with the MobileNet-LSTM model, while the lowest accuracy of 94.9% was obtained using ResNet50-LSTM.
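The 25-frame selection described above can be sketched as follows (the exact index arithmetic used by the authors is not given, so the uniform-stride rule below is an assumption; replicating the last frame for short samples follows the text):

```python
def select_frames(frames, target=25):
    """Uniformly subsample a sign video to `target` frames, replicating
    the last frame when the video has fewer than `target` frames."""
    n = len(frames)
    if n >= target:
        step = n / target                      # assumed uniform stride
        return [frames[int(i * step)] for i in range(target)]
    return frames + [frames[-1]] * (target - n)
```

For a 100-frame sample this keeps every fourth frame; a 10-frame sample is padded to 25 by repeating its last frame.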
In contrast, the results dropped to 99.5% and 72.4% when using depth images with MobileNet-LSTM and ResNet50-LSTM, respectively. This can be caused by signer-specific information being learnt from the color data, leading to overfitting; such information is excluded from the depth data, which explains the better relative performance of the depth representation.

It is also noticeable in Table 3 that the signer-independent mode is more challenging than the signer-dependent one. The sharp decrease in recognition accuracy in the signer-independent mode can be attributed to the models overfitting the signers during training. In addition, the variations between signs performed by different signers affect the recognition accuracy. To address this issue, we excluded the signer identity information and made the models focus only on the signer's motion by computing the optical flow of the input data and feeding it into the proposed models. Optical flow provides discriminative temporal information that helps in gesture recognition. We used the Dual TV-L1 algorithm, which is based on total variation regularization and the L1 norm, to compute the optical flow between two image frames [53,54]. Figure 6 shows a sequence of color frames with their optical flows.

Table 4 shows the recognition accuracies obtained using the optical flow of color and depth data. As shown in the table, there is a significant improvement in the recognition accuracies of both data representations compared with raw color and depth data. It was also noticed that the optical flow of color data outperformed the optical flow of depth data. In addition, MobileNet-LSTM outperformed all other models, with an average accuracy of 72.4% (compared with 54.7% without optical flow).

Table 4. Signer-independent recognition accuracies using the optical flow, computed either from color or from depth data.
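As a toy illustration of frame-to-frame motion estimation (this is not the Dual TV-L1 algorithm, which produces a dense per-pixel flow field; the brute-force helper below only recovers a single dominant translation between two frames):

```python
import numpy as np

def dominant_shift(f0, f1, max_d=4):
    """Estimate the dominant (dy, dx) translation between two grayscale
    frames by brute-force search over small shifts -- a toy stand-in for
    dense optical flow, which estimates a motion vector per pixel."""
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_d, max_d + 1):
        for dx in range(-max_d, max_d + 1):
            err = np.mean((np.roll(f0, (dy, dx), axis=(0, 1)) - f1) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best
```

In practice, a dense Dual TV-L1 flow is typically computed with an optimization-based solver (e.g., the implementation in OpenCV's contrib module) and the resulting two-channel flow images are fed to the network in place of raw frames.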

Non-Manual Gestures
In this subsection, we evaluate an important component of non-manual articulation: facial expressions. We used the animation units (AUs) of the signer's face provided by the Kinect sensor as an input to the proposed system (more information about the face data can be found in Section 3.5).
We started by evaluating the facial expressions alone, and then we fused this information with the best model for manual gesture recognition discussed in the previous section. We used a stacked LSTM model consisting of two LSTM layers with 1024 neurons each to learn the temporal information of the facial landmarks. The extracted face features are fused at the classification level with the manual gesture features extracted in the previous section, as shown in Figure 7.

Table 5 shows the results obtained with facial expressions, using animation units (AU), in the signer-independent mode. We fused these features with the features extracted from the color and depth data of the manual gestures. The table compares the results before and after fusing manual and non-manual gestures. As shown in the column labeled AU, using facial expressions alone for sign recognition does not give good results, since similar facial expressions can be associated with multiple signs, especially when emphasizing something; hence, they are not necessarily linked to specific signs. In addition, only a few signs in sign language depend on non-manual gestures without manual gestures. Therefore, we concatenated these features with those of the best manual model reported in the previous section (namely, MobileNet-LSTM). As shown in the table, the highest recognition accuracy was obtained with the optical flow of the color data, improving by about 3% compared with the results without fusion. Based on these experiments, we can conclude that fusing manual gestures with facial expressions can improve the accuracy of a sign language recognition system. In addition, optical flow is efficient in capturing the motion between frames, which helps improve accuracy in the signer-independent evaluation mode.

Table 5. Signer-independent recognition accuracies of manual gestures and the principal non-manual articulators using facial expressions (AU in the table refers to animation units representing facial expressions alone).
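The classification-level fusion described above can be sketched as a concatenation of the two branch feature vectors followed by a softmax layer (a minimal sketch: the feature widths follow the 1024-neuron LSTM layers in the text, but the equal branch widths and the random toy weights are illustrative assumptions, not the paper's trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

manual_feat = rng.normal(size=1024)  # MobileNet-LSTM branch output (width per paper)
face_feat = rng.normal(size=1024)    # AU-LSTM branch output (same width assumed)

# Late fusion: concatenate the two branches before the classifier.
fused = np.concatenate([manual_feat, face_feat])

W = 0.01 * rng.normal(size=(50, fused.size))   # toy softmax weights, 50 signs
logits = W @ fused
probs = np.exp(logits - logits.max())          # numerically stable softmax
probs /= probs.sum()
pred = int(np.argmax(probs))
```

In training, the softmax weights would be learned jointly so the classifier can weigh the manual and facial evidence against each other.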

Conclusions and Future Work
This paper introduced a new multi-modality video database for sign language recognition. Unlike existing databases, its focus is on signs that require both manual and non-manual articulators, which can be used in a variety of studies related to sign language recognition. Although the signs are performed to match the guidelines of Arabic sign language (ArSL), which is still in its developmental stage, the database can also benefit other researchers, such as those working on pattern recognition and machine learning. Moreover, the paper presented a baseline pilot study to evaluate and compare six models based on state-of-the-art deep learning techniques for spatial and temporal processing of the sign videos in the database. Two cases were considered for the signer-dependent and signer-independent modes using manual and non-manual features. In the first case, we used color and depth images directly, whereas in the second case we used optical flow to extract features more relevant to the signs themselves rather than to the signers. The best results were obtained using MobileNet-LSTM with transfer learning and fine-tuning: 99.7% and 72.4% for the signer-dependent and signer-independent modes, respectively. As future work, more analysis of the effectiveness of each component of the non-manual gestures will be conducted. In addition, we plan to explore other deep learning approaches for isolated sign language recognition and investigate the generalization of the proposed techniques to other sign language datasets. Moreover, we will target continuous sign language recognition of ArSL, as this problem has not been explored as deeply as isolated sign language recognition.