FASSD-Net Model for Person Semantic Segmentation

This paper proposes the use of the FASSD-Net model for the semantic segmentation of human silhouettes. These silhouettes can later be used in various applications that require specific characteristics of human interaction observed in video sequences, such as the understanding of human activities or human identification; these applications are classified as high-level semantic-understanding tasks. Since semantic segmentation is presented as one solution for human silhouette extraction, it is concluded that convolutional neural networks (CNNs) have a clear advantage over traditional computer vision methods, based on their ability to learn feature representations appropriate for the segmentation task. In this work, the FASSD-Net model is used as a novel proposal that promises real-time segmentation of high-resolution images at more than 20 FPS. To evaluate the proposed scheme, we use the Cityscapes database, which consists of sundry scenarios that represent human interaction with its environment; these scenarios contain semantic segmentation of people that is difficult to solve, which favors the evaluation of our proposal. To adapt the FASSD-Net model to human silhouette semantic segmentation, the indexes of the 19 classes traditionally used for Cityscapes were modified, leaving only two labels: one for the class of interest, labeled as person, and one for the background. The Cityscapes database includes the category "human", composed of the "rider" and "person" classes, in which the rider class contains incomplete human silhouettes due to self-occlusions caused by the activity or means of transport; for this reason, we train the model using only the person class rather than the human category. The implementation of the FASSD-Net model with only two classes shows promising results, both qualitatively and quantitatively, for the segmentation of human silhouettes.


Introduction
There are many high-level computer vision tasks that rely on human detection in video sequences, such as intelligent video surveillance. The objective of an intelligent video surveillance system (IVSS) is to efficiently detect interesting events in a large number of videos in order to prevent dangerous situations [1], where the main interest, of course, is the normal and abnormal behavior of human beings [2]. The applications of IVSS are becoming more specific, e.g., environmental home monitoring related to human activities, such as remote monitoring and automatic fall detection for elderly people at home [3]. Another main application of IVSS is video storage and retrieval, where the surveillance system may record video only when human beings are in the scene, saving time, data storage and, of course, resources. Nowadays, another major application of high-level computer vision and IVSS is the Human-Computer Interface (HCI), in which identity recognition and, specifically, human identification are based on gait analysis. Although the applications seem very broad, all of them share the same issues, i.e., human detection in video, human pose estimation, human tracking, and the analysis and understanding of time-series data [4]. To address the task of human detection in video sequences, researchers first split the video into individual frames and apply different approaches; for example, image classification, whose main objective is to assign one or more category labels to a whole image, identifying which objects exist in the image under analysis, e.g., semantic concepts such as a person, car, road, or building are detected, but without the locations of the objects in the image.
In order to obtain the regions of the objects, the next approach, i.e., object detection, assigns category labels and also locates the objects with annotated rectangles in the images; nevertheless, the rectangles may contain pixels belonging to other classes or to the background. To achieve more specific and meaningful results, this research adopts another approach, namely semantic segmentation. Its main objective is to assign each pixel a predefined category label and, in consequence, partition each object region from the background region. Although there are many studies using different predefined labels or classes, the present work focuses on two labels, "person" and "background"; the idea is to use this model in future work to extract specific features for the comprehension of human activities and for human identification. To address the main issues in this high-level semantic comprehension of human activities based on video surveillance, it is necessary to extract the foreground (human silhouettes) from the scene at the pixel level.
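The per-pixel labeling described above can be sketched minimally as follows: given one score map per class, as a segmentation network head would produce, each pixel is assigned the class with the highest score. This is an illustrative pure-Python sketch, not the FASSD-Net inference code; the class names and the `segment` helper are hypothetical.

```python
# Semantic segmentation as per-pixel classification: each pixel takes the
# label whose score map is highest at that position. "segment" and the toy
# class names are illustrative, not part of the FASSD-Net implementation.

def segment(scores):
    """scores: dict mapping class name -> 2D list of per-pixel scores."""
    classes = list(scores)
    h = len(scores[classes[0]])
    w = len(scores[classes[0]][0])
    return [[max(classes, key=lambda c: scores[c][y][x]) for x in range(w)]
            for y in range(h)]

# Toy 2x2 image with two score maps, "person" and "background".
scores = {
    "person":     [[0.9, 0.2], [0.1, 0.8]],
    "background": [[0.1, 0.8], [0.9, 0.2]],
}
labels = segment(scores)
```

The result partitions the image into a label map where every pixel belongs either to a person region or to the background, which is exactly the two-label output used throughout this work.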
Since semantic segmentation appears to be the natural solution for human silhouette extraction, we note that the enormous success of recent state-of-the-art approaches for semantic segmentation is based on Convolutional Neural Networks (CNNs). Their distinctive advantage over traditional machine learning methods is the ability to learn appropriate feature representations for segmentation tasks in an end-to-end training fashion, instead of using hand-crafted features that require domain expertise [5] and cannot adjust themselves after an incorrect prediction [6]. Of course, there is vast evolution and research in CNNs [7], leading to different algorithms and methods focused on specific objectives, such as real-time operation, accuracy, a reduced number of parameters, computational cost, low energy consumption, and storage memory.
As stated above, many high-level tasks for understanding human interaction in video sequences are based on accurate semantic segmentation of human silhouettes; this also requires that the implementation can be executed on high-resolution images and in real time. Therefore, this paper proposes the use of the novel neural network entitled FASSD-Net [8], adapted specifically for the semantic segmentation of two classes of interest, "person" (human silhouette) and "background", encouraging the use of human silhouettes in future applications; for example, human identification [9] by gait analysis with a holistic approach, or translating Mexican Sign Language into text.
The main contributions are summarized as follows:
• Adaptation of the FASSD-Net model for two-class semantic segmentation ("human silhouette" and "background").
• Reduction of the computational complexity of the original FASSD-Net model, which requires 45.1 GFLOPS to segment 19 classes, to 11.25 GFLOPS for two-class segmentation.

FASSD-Net
Although different algorithms are available for semantic segmentation, we focus on FASSD-Net (Fast and Accurate Semantic Segmentation with dilated asymmetric convolutions) [8], which was proposed as a solution for real-time semantic segmentation, validated on urban street scenes [8].
The authors of the FASSD-Net model declare two main contributions over the baseline Harmonic DenseNet (HarDNet): the Dilated Asymmetric Pyramidal Fusion (DAPF) module, and the Multi-resolution Dilated Asymmetric (MDA) module [8].
Both modules exploit contextual information without excessively increasing the computational complexity, by using asymmetric convolutions. As a result, FASSD-Net provides the following advantages:
• Reduced computational complexity, allowing its use in real-time applications.
• State-of-the-art mean intersection over union (mIoU) on the urban-scene validation set.
• Better learning by using two different stages of the network, simultaneously refining spatial and contextual information.
• Three versions of the model, FASSD-Net, FASSD-Net-L1 and FASSD-Net-L2, to maintain a better trade-off between speed and accuracy.
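The complexity saving from asymmetric convolutions can be illustrated with a back-of-the-envelope count: a standard k × k convolution costs k² multiply-accumulates per output element and channel pair, while factoring it into a k × 1 followed by a 1 × k convolution costs only 2k. This is an illustrative sketch under simplifying assumptions (same channel count through both factors), not the exact FASSD-Net FLOP accounting; the helper names are hypothetical.

```python
# Illustrative multiply-accumulate (MAC) counts for a standard k x k
# convolution versus its asymmetric (k x 1 then 1 x k) factorization,
# over an h x w output with c_in input and c_out output channels.
# A simplified sketch, not the paper's exact complexity analysis.

def macs_standard(k, h, w, c_in, c_out):
    return k * k * h * w * c_in * c_out

def macs_asymmetric(k, h, w, c_in, c_out):
    # Two 1-D passes of length k each.
    return 2 * k * h * w * c_in * c_out

# For k = 3 the asymmetric pair needs 6/9 = 2/3 of the operations;
# the saving grows with kernel size (10/25 for k = 5).
```

This is why the DAPF and MDA modules can enlarge the receptive field with dilated asymmetric convolutions while keeping the model fast enough for real-time use.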
The baseline model for FASSD-Net is FC-HarDNet-70, based on HarDNet for the task of semantic segmentation [10]. FC-HarDNet-70 is a U-shape-like architecture [11] composed of five encoder blocks and four decoder blocks, all of them HarDBlocks (Harmonic Dense Blocks), which are specifically designed to address the problem of GPU memory traffic. The architecture of FASSD-Net comprises the encoder, the decoder, and the MDA and DAPF modules, as shown in Figure 1. The DAPF module is designed to increase the receptive field of the last stage of the network (encoder), as shown in Figure 2, obtaining high-quality contextual characteristics. It is possible to change the number of pyramidal feature maps within the DAPF so that it fits the number of input feature maps, which significantly reduces the computational complexity of this module [8]. The second module is the Multi-resolution Dilated Asymmetric (MDA) module. In this module, the input feature maps are processed simultaneously by an asymmetric branch and a non-asymmetric branch: the asymmetric branch seeks to exploit the contextual information of the input feature maps and to recover details, taking advantage of dilated convolutions, while the non-asymmetric branch focuses on refining the details [8].
FASSD-Net uses the decoder modules to recover the information lost when the resolution is shrunk in the encoder: it concatenates the encoder feature maps with upsampled feature maps from the decoder at each stage, forming a ladder-shaped structure. This allows the decoder at each stage to re-learn the relevant characteristics that are lost during pooling in the encoder [8].

Cityscapes
The Cityscapes dataset is composed of a set of stereo video sequences recorded on the streets of 50 different cities. It contains 5000 images with high-quality pixel-level annotations, as well as about 20,000 additional coarsely annotated images. The annotations are divided based on the frequency of occurrence within the obtained images, leaving 19 classes for evaluation [12].
The images are divided into different sets: training, validation, and testing. The coarsely annotated images serve only as additional training data. The data are not divided randomly, but in a way that guarantees that each split is representative of the variability of different street-scene scenarios. The underlying division criteria imply a balanced distribution of geographic location and population, as well as of the size of the individual cities [12].

Database Pre-Processing
The Cityscapes dataset has 30 visual classes, of which 19 are widely used for evaluation purposes. In order to fit the FASSD-Net model, the images in the Cityscapes training dataset, with their respective labels, are pre-processed by changing the indexes of the other 18 classes, leaving only two labels: one for the background and another for the class of interest, labeled as person. In addition, all the regions with a "void" label are also assigned to the background. This pre-processing stage is repeated on the Synscapes database to perform the experiments [12].
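The remapping described above can be sketched as a simple index substitution. The sketch below is pure Python over a small list (a real pipeline would operate on whole label images, e.g. with NumPy), and it assumes the standard Cityscapes train-ID encoding, in which "person" is ID 11 and 255 marks void pixels; the helper name is illustrative.

```python
# Collapse the Cityscapes label indexes to two classes: every index other
# than "person" (including "rider" and void) becomes background (0), and
# person becomes 1. Assumes the standard Cityscapes train-ID encoding.

PERSON_TRAIN_ID = 11          # "person" in the usual Cityscapes train IDs
BACKGROUND, PERSON = 0, 1

def remap_labels(label_map):
    """label_map: 2D list of Cityscapes train IDs -> binary person mask."""
    return [[PERSON if v == PERSON_TRAIN_ID else BACKGROUND for v in row]
            for row in label_map]

row = [11, 12, 255, 0]        # person, rider, void, road
# Note that rider (12) is mapped to background, as discussed in this section.
```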
An analysis of the class composition of the Cityscapes training set shows that it contains the category "human", which includes the rider class; however, using the rider class leads to segmentation problems for further tasks of interest that require the semantic segmentation of people in a scene. This is because people in the rider class can include drivers, passengers, or riders of bicycles, motorcycles, scooters, skateboards, horses, Segways, (inline) skates, wheelchairs, road cleaning cars, or convertibles [12].
Note that a visible driver of a closed car may be seen only through the window. If we include such people from the rider class, we will have instances of a class lacking some extremities (e.g., legs), biasing the training with scenes containing people with some kind of occlusion; therefore, we avoided using the rider class.

Model Training
The FASSD-Net network [8] is trained on the Cityscapes database, modifying the original code, which trains the model with 19 classes. After customizing the network, it is trained with only 2 classes and later evaluated. Performance is primarily measured by intersection over union (IoU), mean intersection over union (mIoU), and frames per second (FPS). Experiments are carried out in order to improve the recognition of the person class, for which each of the databases is pre-processed, homogenizing the indexes of the other classes. The training is carried out with each of the aforementioned databases.

Implementation Details
The proposed FASSD-Net model is implemented with the same configuration used in [13]: PyTorch 1.0 with CUDA 10.2. The training setup uses Stochastic Gradient Descent (SGD) with a weight decay of 5 × 10^−4 and a momentum of 0.9 as the optimizer. We employ the "poly" learning rate strategy, lr = (initial lr) × (1 − iter/total_iter)^0.9, with an initial learning rate of 0.02. The cross-entropy loss is calculated following the online bootstrapping strategy [14].
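The "poly" schedule above can be written out as a small helper. This is a sketch of the formula, not the authors' training code; the function name is illustrative.

```python
# "Poly" learning-rate schedule: lr = initial_lr * (1 - iter/total_iter)**power,
# with power = 0.9 as in the training setup described above.

def poly_lr(initial_lr, iteration, total_iters, power=0.9):
    return initial_lr * (1.0 - iteration / total_iters) ** power

# With the settings used here (initial lr 0.02, 200,000 iterations), the
# rate starts at 0.02 and decays smoothly to 0 at the final iteration.
lr_start = poly_lr(0.02, 0, 200_000)
lr_mid = poly_lr(0.02, 100_000, 200_000)
```

The sub-linear exponent (0.9) keeps the rate relatively high for most of training and only drops it sharply near the end, a common choice for semantic segmentation training.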
We trained the model for 200,000 iterations with a batch size of 8, setting the initial learning rate to 0.02. The inference speed (in FPS) was measured on an Intel Core i9-9900K desktop with one NVIDIA RTX 2080 Ti. The speed was calculated as the average FPS over 10,000 iterations measured on images of size 1024 × 2048 × 3. As shown in [8], considering the use of the 19 traditional Cityscapes classes, the original FASSD-Net model has a reported computational complexity of 45.1 GFLOPS and requires 2.85 M parameters; however, the FASSD-Net model trained to segment human silhouettes, i.e., with only two classes rather than the traditional 19, has a computational complexity of 11.25 GFLOPS; the difference from the original model is due to the reduction in the number of classes. The two-class model requires 2.84 M parameters, which is close to the number of parameters of the original FASSD-Net. The time required for training with an NVIDIA RTX 2080 Ti is up to 31 h.
As an example of our proposal, we used the Cityscapes dataset [12] to measure its accuracy in a qualitative and quantitative manner. To train the FASSD-Net model, we used a "from scratch" approach, using only Cityscapes training-set images [12], and a "pre-trained" approach, in which the weights are initialized with those obtained by the "from scratch" approach.

Results and Discussions
This research work uses datasets belonging to Cityscapes [12], which represent human interaction with various urban landscapes, together with their respective labels, as proofs of concept in order to adapt the FASSD-Net model [8]. Once the tests were carried out, the indexes of the other 18 classes were modified, leaving only two labels: one for the background and another for the class of interest, labeled as person. An analysis of the class composition of the Cityscapes training set shows that it contains the category "human", which includes the rider class; however, using the rider class leads to segmentation problems for future tasks of interest that require semantic segmentation of people in a scene. Training the FASSD-Net model [8] with only two classes presents promising results for the segmentation of human silhouettes, preparing the data for other applications such as human identification. For this reason, only the validation set is used to measure the accuracy of the proposed models, and it is necessary to perform the same pre-processing stage on the validation set, to obtain only two labels: background and person.

Evaluation Methods
As a metric to evaluate the semantic segmentation reported in Table 1, we use Intersection over Union (IoU), one of the most frequently used metrics. The per-class value IoU_c is calculated with the following equation: IoU_c = TP_c / (TP_c + FP_c + FN_c), where TP_c, FP_c, and FN_c indicate the number of true positive, false positive, and false negative pixels of class c [21].
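The per-class IoU and its average over classes can be sketched directly from the definition above. This is a minimal pure-Python illustration over flattened pixel lists; the helper names are illustrative, and a real evaluation would run over whole label images.

```python
# Per-class IoU from predicted and ground-truth pixel labels:
# IoU_c = TP_c / (TP_c + FP_c + FN_c), then mIoU averages over classes.

def iou_per_class(pred, gt, cls):
    tp = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    fp = sum(1 for p, g in zip(pred, gt) if p == cls and g != cls)
    fn = sum(1 for p, g in zip(pred, gt) if p != cls and g == cls)
    return tp / (tp + fp + fn) if tp + fp + fn else 0.0

def mean_iou(pred, gt, classes):
    return sum(iou_per_class(pred, gt, c) for c in classes) / len(classes)

# Toy example with labels 0 (background) and 1 (person):
pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 0, 1, 1]
# person: TP = 2, FP = 1, FN = 1 -> IoU = 2 / 4 = 0.5
```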
After obtaining the IoU per class, the mean IoU is calculated as mIoU = (1/|C|) Σ_{c ∈ C} IoU_c, i.e., the average of IoU_c over all classes in C. The per-class IoU (%) results for different models on the Cityscapes validation set are shown in Table 2, with the Person-class column highlighted. The first row shows the results obtained by the ERFNet model [15] with its proposed "from scratch" training strategy, which uses only Cityscapes images, resulting in an IoU of 73.0% for the Person class, while the second row shows its second, "pre-trained" strategy, in which the weights are initialized by training the network on a larger dataset such as ImageNet, obtaining an IoU of 75.2% for the Person class [15]. The third row shows the results obtained by the Fully Convolutional Residual Network (FCRN) model [14], resulting in an IoU of 77.1% for the Person class, while the fourth row shows the results obtained by the FCRN with its online bootstrapping method, obtaining an IoU of 79.8% for the Person class [14]. Finally, row 5 shows the result obtained by the ContextNet model with its full-resolution branch (cn124), obtaining 70.9% for the Person class. The training and evaluation conditions of the methods presented in Table 2 can be easily reproduced, allowing a fair comparison between previous models and the proposed model, which obtains, in our case, an IoU of 79.86% [16], reaching the same level as that obtained by FCRN+BS [14] (Table 2, row 4). Although the quantitative results seem similar, in the rest of this section a qualitative evaluation is performed to determine the performance of the proposed method compared to some existing ones. In Figure 3, column (a) shows the original images; column (b), the ground truth; column (c), the segmentation results obtained by the LBN-AA DASPP model on the Cityscapes validation set [12]; and column (d), our segmentation results.
The results showing an improvement in the segmentation of the human silhouette have been highlighted by green rectangles, achieving mainly greater detail in the segmentation of the limbs, i.e., legs, feet, arms, and neck. Yellow rectangles highlight some errors of the obtained segmentation: it can be noticed that there are regions segmented as the Person class that contrast with the original image and the ground truth, where no people are observed; yellow rectangles also highlight some coarse segmentation of human silhouettes. In Figure 4, column (c) presents some results obtained by the ERFNet model on images of the Cityscapes validation set [12], while column (d) presents the results obtained with our proposal. In the results column, green rectangles frame regions where the segmentation quality obtained by our algorithm exceeds that obtained by the ERFNet model. Differences in the segmentation results can be observed, involving better segmentation details, specifically in the extremities, i.e., leg-foot separation and arm-torso separation; in general, a more detailed segmentation. In the first row of column (d) of Figure 3, a segmentation error in our results can be observed, highlighted by a blue rectangle, because no human silhouette is labeled in the ground truth in the region where our model presents a positive result. However, a thorough analysis, as shown in Figure 6, reveals that there are actually two human silhouettes under difficult lighting conditions, so this region was a challenge even for the manual labeling process used for tag assignment in the ground truth. Figure 5 presents the segmentation results obtained with the ESNet [22], PSPNet [23], LEDNet [24], and ContextNet [16] models and with our model, respectively, applied to the same images, enabling a better qualitative comparison of the results.
Row (a) shows the original images and row (b) the ground truth. The segmentation results for the original image in the first column are listed below: the first column of row (c) shows the result obtained using PSPNet [23]; the first column of row (d), the result obtained using ContextNet [16]; and the first column of row (e), the result obtained with our model. Our model is not able to recognize the person riding a bicycle as human, and fails in the detection of the person behind the bicycle. It is important to remember that we intentionally left the rider class out of the training of our model, which is why it has problems segmenting human silhouettes on bicycles or motorcycles. Considering the original image in the second column, we present its segmentation results: row (c) of the second column shows the result obtained using PSPNet [23]; row (d) of the second column, the result obtained using ContextNet [16]; and row (e) of the second column, the segmentation results of our model. The results observed for the original image of the second column show that our model presents a better and more detailed segmentation of people, even for distant human silhouettes. However, all the models still present some obvious errors in the extremities, particularly the arm of one of the silhouettes framed in a yellow rectangle in our results. Finally, the segmentation results for the original image of the third column are presented: row (c) of the third column shows the segmentation results obtained using the ESNet model [22]; row (d) of the third column, the segmentation results using the LEDNet model [24]; and row (e) of the third column, the results obtained with our model. It can be observed that, in general terms, a model specifically trained to detect people performs a finer detection of human silhouettes than its counterparts that use multiple classes for training. Figure 6 shows, in the first row of the first column (a), the original image from the validation set.
The second row of the first column shows the ground truth (b), and the third row of the first column (c) shows the result obtained with the proposed FASSD-Net model [8]. The first row of the second column (d) shows the section of the original image where our model marks the existence of a human being; by increasing the brightness of the image, as shown in the second row of the second column (e), the silhouette of a sitting human being and a head can be observed, both detected by the proposed model in the third row of the second column (f). Since these silhouettes are not recorded in the ground truth, this may decrease the reported metrics, even though the detection was correct.

Conclusions and Future Work
The FASSD-Net model trained with only two classes presents promising results for human silhouette segmentation, preparing the data for further applications such as human identification and automatic fall detection for elderly people at home. The evaluation results show that the proposed scheme using FASSD-Net provides better results, both quantitatively and qualitatively, compared to other previously proposed schemes. Furthermore, it may be more accurate if a larger number of images is used to train the model. Nevertheless, data annotation is a problem, since the databases must be homogenized for training. Therefore, in future work, it will be possible to obtain a new training model using more databases, in addition to using the segmented images for other high-level semantic-understanding tasks such as person identification, gait analysis, and the extraction of ideogram features of Mexican Sign Language.