Scene Description for Visually Impaired People with Multi-Label Convolutional SVM Networks

Abstract: In this paper, we present a portable camera-based method for helping visually impaired (VI) people to recognize multiple objects in images. The method relies on a novel multi-label convolutional support vector machine (CSVM) network for the coarse description of images. The core idea of CSVM is to use a set of linear SVMs as filter banks for feature-map generation. During the training phase, the weights of the SVM filters are obtained using a forward-supervised learning strategy, rather than the backpropagation algorithm used in standard convolutional neural networks (CNNs). To handle multi-label detection, we introduce a multi-branch CSVM architecture, where each branch is used for detecting one object in the image. This architecture exploits the correlation between the objects present in the image by means of an opportune fusion mechanism applied to the intermediate outputs provided by the convolution layers of each branch. The high-level reasoning of the network is done through binary classification SVMs that predict the presence/absence of objects in the image. Experiments on two indoor datasets and one outdoor dataset, acquired from a portable camera mounted on a lightweight shield worn by the user and connected via a USB wire to a laptop processing unit, are reported and discussed.


Introduction
Chronic blindness may result from various causes, such as cataract, glaucoma, age-related macular degeneration, corneal opacities, diabetic retinopathy, trachoma, and eye conditions in children (e.g., caused by vitamin A deficiency) [1]. Recent factsheets from the World Health Organization, as of October 2018 [1], indicate that 1.3 billion people suffer from some form of vision impairment, including 36 million people who are considered blind. These facts highlight an urgent need to improve the quality of life for people with vision disability, or at least to lessen its consequences.
Toward this end, assistive technology ought to play an essential role. On this point, the latest advances have given rise to several designs and prototypes, which can be regarded from two distinct but complementary perspectives, namely (1) assistive mobility and obstacle avoidance, and (2) object perception and recognition. The first perspective enables visually impaired (VI) persons to navigate more independently, while the latter emphasizes consolidating their comprehension of the nature of nearby objects, if any.
Navigation-focused technology constitutes the bulk of the literature. Many works make use of ultrasonic sensors, which probe for nearby obstacles by transmitting and subsequently receiving ultrasonic waves; the time consumed by this round trip is commonly used to estimate the distance to the obstacle.
More recently, convolutional neural networks (CNNs) have become reference tools for visual recognition. These networks have the ability to learn richer representations in a hierarchical way compared to methods based on handcrafted features. Modern CNNs are made up of several alternating convolution blocks with repeated structures, and the whole architecture is trained end-to-end using the backpropagation algorithm [32].
Usually, CNNs perform well on datasets with large amounts of labeled data. However, they are prone to overfitting when dealing with datasets with very limited labeled data, as in the context of our work. In this case, it has been shown in many studies that it is more appealing to transfer knowledge from CNNs (such as AlexNet [25], VGG-VD [33], GoogLeNet [24], and ResNet [23]) pre-trained on an auxiliary recognition task with very large labeled data, instead of training from scratch [34][35][36][37]. The possible knowledge-transfer solutions include fine-tuning the pre-trained network on the labeled data of the target dataset, or exploiting the network feature representations with an external classifier. We refer the reader to [35], where the authors present several factors affecting the transferability of these representations.
In this paper, we propose an alternative solution suitable for datasets with limited training samples, based on convolutional SVM networks (CSVMs). SVMs are among the most popular supervised classifiers in the literature. They rely on the margin maximization principle, which makes them less sensitive to overfitting, and they have been widely used for solving various recognition problems. Additionally, they are commonly placed on top of a CNN feature extractor for carrying out the classification task [35]. In a recent development, these classifiers have been extended to act as convolutional filters for the supervised generation of feature maps for single-object detection in remote sensing imagery [38]. Compared to standard CNNs, CSVM introduces a new convolution trick based on SVMs and does not rely on the backpropagation algorithm for training. Basically, this network is based on several alternating convolution and reduction layers followed by a classification layer. Each convolution layer uses a set of linear SVMs as filter banks, which are convolved with the feature maps produced by the preceding layer to generate a new set of feature maps; for the first convolution layer, the SVM filters are convolved with the original input images. The SVM weights of each convolution layer are computed in a supervised way by training on patches extracted from the previous layer. The feature maps produced by the convolution layers are then fed to a pooling layer. Finally, the high-level representations obtained by the network are fed to a linear SVM classifier for carrying out the classification task. In this work, we extend CSVMs to the case of multi-label classification. In particular, we introduce a novel multi-branch CSVM architecture, where each branch is used to detect one object in the image.
We exploit the correlation between the objects present in the image by fusing the intermediate outputs provided by the convolution layers of each branch by means of an opportune fusion mechanism. In the experiments, we validate the method on images obtained from different indoor and outdoor spaces.
The rest of this paper is organized as follows. In Section 2, we provide a description of the proposed multi-label CSVM (M-CSVM) architecture. The experimental results and discussions are presented in Sections 3 and 4, respectively. Finally, conclusions and future developments are reported in Section 5.

Proposed Methodology
Let us consider a set of M training RGB images {X_i, y_i}_{i=1}^{M} of size r × c acquired by a portable digital camera mounted on a lightweight shield worn by the user and connected via a USB wire to a laptop processing unit, where X_i ∈ R^{r×c×3} and (r, c) refer to the number of rows and columns of the images. Let us also assume that y_i = [y_{i1}, y_{i2}, . . . , y_{iK}]^T is the corresponding label vector, where K represents the total number of targeted classes. In a multi-label setting, the label y_{ik} is set to 1 if the corresponding object is present in the image; otherwise, it is set to 0. Figure 1 shows a general view of the proposed M-CSVM classification system, which is composed of K branches. In the next sub-sections, we detail the convolution and fusion layers, which are the main ingredients of the proposed method.
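As a small illustration of this multi-label setting, the label vector y_i can be built as a multi-hot vector over the K targeted classes. The class names below are hypothetical placeholders for illustration only, not the actual label sets of our datasets:

```python
# Minimal sketch of the multi-hot label encoding y_i used in the multi-label
# setting: y_ik = 1 if object k is present in image X_i, else 0.
# The class list is a hypothetical placeholder, not the paper's label set.
CLASSES = ["board", "bins", "door", "elevator"]

def encode_labels(present_objects, classes=CLASSES):
    """Return the multi-hot label vector for one image."""
    return [1 if c in present_objects else 0 for c in classes]
```

For instance, an image containing a board and a door would be encoded as the vector [1, 0, 1, 0] under this placeholder class list.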


A. Convolution Layer
In this section, we present the SVM convolution technique for the first convolution layer; the generalization to subsequent layers is straightforward. In a binary classification setting, the training set {X_i, y_i}_{i=1}^{M} is supposed to be composed of M positive and negative RGB images, whose class labels are y_i ∈ {+1, −1}. The positive images contain the object of interest, whereas the negative ones represent background. From each image X_i, we extract a set of patches of size h × h × 3 and represent them as feature vectors x_i of dimension d, with d = h × h × 3. After processing the M training images, we obtain a large training set Tr^(1) = {x_i, y_i}_{i=1}^{m^(1)} of size m^(1), as shown in Figure 2. Next, we learn a set of SVM filters on different sub-training sets of size l randomly sampled from Tr^(1). The weight vector w ∈ R^d and bias b ∈ R of each SVM filter are determined by solving the following optimization problem [39,40]:

min_{w,b} (1/2)‖w‖^2 + C Σ_{i=1}^{l} ξ(w, b; x_i, y_i)

where C is a penalty parameter, which can be estimated through cross-validation. As loss function, we use ξ(w, b; x_i, y_i) = max(1 − y_i(w^T x_i + b), 0), referred to as the hinge loss. After training, we reshape the weights of the kth SVM filter into a matrix w_k ∈ R^{h×h×3}, with n^(1) being the number of filters. The complete weights of the first convolution layer are then grouped into a four-dimensional filter bank W^(1) ∈ R^{h×h×3×n^(1)}.
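The filter-learning step above can be sketched in a few lines of numpy. This is a minimal illustration, not the implementation used in the paper: we replace the solver of [39,40] with a plain subgradient descent on the hinge-loss objective, and all hyperparameter values are assumptions:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=50, lr=0.01, seed=0):
    """Train a linear SVM with the hinge loss by simple subgradient descent.

    X: (m, d) patch vectors; y: (m,) labels in {+1, -1}.
    Minimizes (1/2)||w||^2 + C * sum_i max(1 - y_i (w.x_i + b), 0).
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Subgradient includes both the regularizer and the hinge term.
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:
                w -= lr * w
    return w, b

def build_filter_bank(patches, labels, n_filters, h, l, seed=0):
    """Learn n_filters SVMs, each on a random sub-sample of size l of the
    patch training set, and reshape their weights into h x h x 3 kernels."""
    rng = np.random.default_rng(seed)
    filters = []
    for _ in range(n_filters):
        idx = rng.choice(len(patches), size=l, replace=False)
        w, _ = train_linear_svm(patches[idx], labels[idx])
        filters.append(w.reshape(h, h, 3))
    return np.stack(filters, axis=-1)  # filter bank of shape (h, h, 3, n_filters)
```

The returned array corresponds to the four-dimensional filter bank W^(1) described above.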
In order to generate the feature maps, we simply convolve each training image {X_i}_{i=1}^{M} with the obtained filters, as is usually done in standard CNNs, to produce a set of 3D hyper-feature maps H_i^(1), the new feature representation of image X_i composed of n^(1) feature maps (Figure 3). To obtain the kth feature map h_k^(1), we convolve the kth filter with a set of sliding windows of size h × h × 3 (with a predefined stride) over the training image X_i, as shown in Figure 4:

h_k^(1) = f(X_i * w_k + b_k)

where * is the convolution operator and f is the activation function. In the following algorithm, we present the implementation of this convolution layer. The generalization to subsequent convolution layers is simply made by considering the obtained feature maps as new input to the next convolution layer.
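The sliding-window convolution above can be sketched as follows. This is an illustrative, unoptimized implementation; the choice of tanh as the activation function f is an assumption for the example, since the text leaves f generic:

```python
import numpy as np

def conv_feature_maps(image, W, b, stride=1, activation=np.tanh):
    """Convolve an RGB image with a bank of SVM filters.

    image: (r, c, 3); W: (h, h, 3, n) filter bank; b: (n,) biases.
    Returns the n feature maps h_k = f(X * w_k + b_k) as an array
    of shape (out_r, out_c, n).
    """
    h, _, _, n = W.shape
    r, c, _ = image.shape
    out_r = (r - h) // stride + 1
    out_c = (c - h) // stride + 1
    maps = np.empty((out_r, out_c, n))
    for i in range(out_r):
        for j in range(out_c):
            # Sliding window of size h x h x 3 at the current position.
            win = image[i * stride:i * stride + h, j * stride:j * stride + h, :]
            for k in range(n):
                maps[i, j, k] = activation(np.sum(win * W[..., k]) + b[k])
    return maps
```

In practice this loop would be replaced by a vectorized or GPU convolution, but the arithmetic is the same.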

Algorithm 1 Convolution layer
Input: Training images {X_i, y_i}_{i=1}^{M}; number of SVM filters n^(1); filter parameters (width h and stride); l: size of the sampled training set used to learn a single filter.
Output: Feature maps {H_i^(1)}_{i=1}^{M}.
1: Extract patches of size h × h × 3 from the training images and arrange them as the patch training set Tr^(1).
2: For k = 1, . . . , n^(1): randomly sample a subset of size l from Tr^(1), train a linear SVM on it, and reshape its weights into the filter w_k.
3: Convolve each image X_i with the filter bank W^(1) and apply the activation function to obtain the feature maps H_i^(1).

B. Fusion Layer
In a multi-label setting, we run multiple CSVMs depending on the number of objects. Each CSVM applies a set of convolutions on the image under analysis, as shown in Figure 1. Each convolution layer is then followed by a spatial reduction layer. This reduction layer is similar to the spatial pooling layer in standard CNNs: it is commonly used to reduce the spatial size of the feature maps by selecting the most useful features for the next layers. It takes small blocks from the resulting feature maps and sub-samples them to produce a single output from each block. Here, we use the average pooling operator for carrying out the reduction. Then, to exploit the correlation between the objects present in the image, we propose to fuse the feature maps provided by each branch. In particular, we opt for the max-pooling strategy in order to highlight the objects detected by each branch. Figure 4 shows an example of the fusion process for two branches. The input image contains two objects, Laboratories (object1) and Bins (object2). The first CSVM tries to highlight the first object, while the second one is devoted to the second object. The output maps provided by the pooling operation are fused using the max rule in order to obtain a new feature map where the two concerned objects are highlighted, as can be seen in Figure 4. We recall that the feature maps obtained by this operation will be used as input to the next convolution layer for each branch.
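The reduction-plus-fusion step can be sketched as follows. This is a minimal numpy illustration under the assumption of a non-overlapping pooling block (the block size here is an arbitrary example value):

```python
import numpy as np

def average_pool(maps, block=2):
    """Non-overlapping average pooling over block x block windows.

    maps: (r, c, n) feature maps; returns (r//block, c//block, n).
    """
    r, c, n = maps.shape
    r2, c2 = r // block, c // block
    m = maps[:r2 * block, :c2 * block, :].reshape(r2, block, c2, block, n)
    return m.mean(axis=(1, 3))

def fuse_branches(branch_maps, block=2):
    """Average-pool the maps of each CSVM branch, then fuse them with the
    element-wise max rule so that objects highlighted by any branch survive."""
    pooled = [average_pool(m, block) for m in branch_maps]
    return np.maximum.reduce(pooled)
```

The fused map is then used as the common input to the next convolution layer of every branch, as stated in Algorithm 2.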

Algorithm 2 Pooling and fusion
Input: Feature maps H^(i), i = 1, . . . , K, produced by the ith CSVM branch.
Output: Fusion result H^(f1).
1: Apply an average pooling to each H^(i) to generate a set of activation maps of reduced spatial size.
2: Fuse the resulting activation maps using the max rule to generate the feature map H^(f1). These maps will be used as a common input to the next convolution layers in each CSVM branch.

C. Feature Generation and Classification
After applying several convolution, reduction, and fusion layers, the high-level reasoning of the network is done by training K binary SVM classifiers to detect the presence/absence of the objects in the image. Let {H_i^(L)}_{i=1}^{M} be the hyper-feature maps obtained by the last computing layer (convolution or reduction, depending on the network architecture), and suppose that each hyper-feature map H_i^(L) is composed of n^(L) feature maps. A possible solution for extracting the high-level feature vector z_i ∈ R^{n^(L)} for the training image X_i is then simply to compute the mean or max value of each feature map.
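This feature-extraction and prediction step can be sketched as follows. The linear scoring of the K binary classifiers is shown in its simplest form, with the classifier weights Wc and biases bc assumed to have been trained beforehand:

```python
import numpy as np

def extract_feature_vector(hyper_maps, mode="mean"):
    """Collapse each of the n^(L) final feature maps to a single scalar,
    giving the high-level vector z_i (mean- or max-pooling over each map)."""
    op = np.mean if mode == "mean" else np.max
    return np.array([op(hyper_maps[..., k]) for k in range(hyper_maps.shape[-1])])

def predict_presence(z, Wc, bc):
    """Score z with K binary linear SVMs (one per object).

    Wc: (n_L, K) classifier weights; bc: (K,) biases. A positive score
    means the corresponding object is predicted present.
    """
    return (z @ Wc + bc > 0).astype(int)
```

The output of `predict_presence` is a multi-hot vector of the same form as the label vectors y_i used during training.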

Dataset Description
In the experiments, we evaluate the proposed method on three datasets taken by a portable camera mounted on a lightweight shield worn by the user and connected via a USB wire to a laptop processing unit. This system incorporates navigation and recognition modules. In a first step, the user runs the application to load the offline-stored information related to recognition and navigation. The user can then control the system using verbal commands, as shown in Figure 5a. For the sake of clarity, we also provide a general view of the application, where the user asks to go to the 'elevator'. Upon arrival at the desired destination using a path-planning module, the prototype notifies the user that the destination has been reached. Figure 5b shows the current view of the camera, where the destination (elevators) is displayed. The system also features a virtual environment emulating the real movement of the user within the indoor space. As can be seen, the user is symbolized by the black top silhouette emitting two lines: the blue line refers to the user's current frontal view; the green point refers to the destination estimated by the path-planning module; and the red dot highlights the final destination. The interface also displays markers, shown as thick lines lying on the walls, to help with localization. In our work, we have used this system to acquire the images used for developing the recognition module based on M-CSVM.


The first and second datasets acquired by this system are composed of images of size 320 × 240 pixels. Both datasets were taken at two different indoor spaces of the Faculty of Science of the University of Trento (Italy). The first dataset contains 58 training and 72 testing images, whereas the second contains 61 training and 70 testing images. The third dataset, on the other hand, is related to an outdoor environment and was acquired over different locations across the city of Trento. The locations were selected based on their importance as well as the density of people frequenting them. This dataset comprises 200 images of size 275 × 175 pixels, equally divided into 100 training and 100 testing images. It is noteworthy that the training images for all datasets were selected in such a way as to cover all the predefined objects in the considered indoor and outdoor environments. To this end, we selected the objects deemed to be the most important ones in the considered spaces. Regarding the first dataset, 15 objects were considered, including 'External Window', 'Board', and 'Table'.

In the experiments, we assessed the performance of the method in terms of the sensitivity (SEN) and specificity (SPE) measures:

SEN = TP / (TP + FN)

SPE = TN / (TN + FP)

The sensitivity expresses the classification rate of real positive cases, i.e., the efficiency of the algorithm in detecting existing objects. The specificity, on the other hand, underlines the tendency of the algorithm to detect the true negatives, i.e., the non-existing objects. We also compute the average of these two measures:

AVG = (SEN + SPE) / 2 (5)
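These per-object measures can be computed directly from the binary ground-truth and predicted label vectors, as in the following sketch:

```python
def sen_spe_avg(y_true, y_pred):
    """Sensitivity, specificity, and their average for one object class.

    y_true, y_pred: sequences of 0/1 presence labels over the test images.
    SEN = TP / (TP + FN); SPE = TN / (TN + FP); AVG = (SEN + SPE) / 2.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sen = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0
    return sen, spe, (sen + spe) / 2
```

Averaging these values over all object classes gives the dataset-level scores reported in the results.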

Results
The architecture of the proposed CSVM involves many parameters. To identify a suitable architecture, we investigate three main parameters: the number of layers of the network, the number of feature maps generated by each convolution layer, and the spatial sizes of the kernels. To compute the best parameter values, we use three-fold cross-validation. Due to the limited number of training samples and the large number of possible combinations, we set the maximum number of layers to 3, explore the number of SVMs in each layer as powers of two 2^i (i = 0, . . . , 9), i.e., up to 512, and fix the maximum kernel size for each layer to 10% of the size of the current map (taking the minimum of the height and the width as the reference size), explored with a step of 2. The best parameter values obtained by cross-validation are listed in Table 1. This table indicates that only one layer is enough for the first dataset, whereas the two other datasets require two layers to reach the best performance. Concerning the number of SVMs and the related spatial size, dataset 3 presents the simplest architecture, with just one SVM at the first layer and two SVMs at the second one, with a spatial size of 3. In order to evaluate our method, we compare it with the results obtained using three different pre-trained CNNs, namely ResNet [23], GoogLeNet [24], and VGG16 [33]. All the results in terms of accuracy are reported in Table 2. From this table, it can be seen that in eight cases out of nine, our proposed method by far outperforms the different pre-trained CNNs; in seven cases the improvement is clearly important (more than 2%). Only in one case (ResNet, dataset 2) does a CNN give a slightly better result than our method (92.96% compared to 92.90%).
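The search space described above can be enumerated as in the following sketch. The exact starting kernel size and step offsets are assumptions for illustration, since the text only fixes the upper bounds and the steps:

```python
import itertools

def candidate_kernel_sizes(map_h, map_w, step=2):
    """Kernel sizes up to 10% of the smaller spatial dimension, in steps of 2.
    Starting the range at 1 is an assumption for this sketch."""
    limit = max(1, int(0.1 * min(map_h, map_w)))
    return list(range(1, limit + 1, step))

def candidate_architectures(map_h, map_w, max_layers=3):
    """Yield candidate architectures as tuples of per-layer settings, each
    setting being (number of SVM filters, kernel size)."""
    per_layer = list(itertools.product([2**i for i in range(10)],   # 1 .. 512
                                       candidate_kernel_sizes(map_h, map_w)))
    for depth in range(1, max_layers + 1):
        for arch in itertools.product(per_layer, repeat=depth):
            yield arch
```

Each candidate would then be scored by three-fold cross-validation on the training set, keeping the best-performing configuration.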

Discussion
Besides classification accuracy, another important performance parameter is the runtime. Table 3 shows the training time of the proposed method for the three datasets. It can be seen clearly that the training of M-CSVM is fast, needing just a few seconds to a few minutes in the worst case. In detail, dataset 1 presents the highest runtime (76 s), which is due to the high number of filters used for this dataset (512), while training on dataset 3 is much faster (just 8 s) due to the simplicity of the related network (see Table 1). Regarding the runtime of the prediction phase, which includes feature extraction and classification, the M-CSVM method presents different runtimes for the three datasets, depending on the complexity of the adopted architecture. For instance, as we can see in Table 4, the highest runtime is obtained on dataset 1, with 0.200 s per image, which is again due to the high number of filters adopted for it (512). In contrast, the third dataset requires only 0.002 s to extract features and estimate the classes for each image; this short time is due to the small number of SVMs used in the network for this dataset. It is also important to mention that the runtime of our method outperforms the three pre-trained CNNs on two datasets (datasets 2 and 3), especially on dataset 3, where the difference is significant; only on dataset 1 is GoogLeNet slightly faster, due to the complexity of the network adopted for that dataset.

Conclusions
In this paper, we have presented a novel M-CSVM method for describing the image content for VI people. This method has the following important properties: (1) it allows SVMs to act as convolutional filters; (2) it uses a forward supervised learning strategy for computing the weights of the filters; and (3) it estimates each layer locally, which reduces the complexity of the network. The experimental results obtained on three datasets with limited training samples confirm the promising capability of the proposed method with respect to state-of-the-art methods based on pre-trained CNNs. For future developments, we plan to investigate architectures based on residual connections, as in modern networks, and to explore advanced strategies based on reinforcement learning for finding an optimized M-CSVM architecture. Additionally, we plan to extend this method to act as a detector by localizing the detected objects in the image.