Special Issue on Advances in Deep Learning

Introduction
Deep learning is currently among the fastest-growing research fields in machine learning and has a tremendous impact on a plethora of daily life applications, ranging from security and surveillance to autonomous driving, automatic indexing and retrieval of media content, text analysis, speech recognition, automatic translation, and many others. The lightning-fast progress of research in deep learning is attested by the success of this special issue, which attracted promising works in several fields, such as computer vision, signal processing, and natural language processing. These works are also characterized by the heterogeneity of the proposed approaches, which span discriminative and generative methods, adversarial perturbations, and reinforcement learning. For better readability, the rest of this editorial summarizes the works published in this special issue on a per-theme basis.

Content
Computer vision. Most of the works accepted to this special issue deal with computer vision applications, in particular image classification and object detection, with authors putting considerable effort into proposing novel, state-of-the-art classification techniques. Several contributions tackle, with interesting results, challenging problems such as object detection and person re-identification in the wild. For instance, the authors of both [1] and [2] propose deep learning approaches aimed at recognizing body parts. In [1], the semantic relationship between body segments is automatically learned from videos and used to improve pedestrian detection when occlusions are present. Along the same lines, the authors of [2] use the detected body parts to improve person re-identification. The method combines four CNN branches: one encodes the whole-body appearance, and the other three extract robust embeddings from image patches corresponding to different body parts (head, torso, and lower body), which are segmented through U-net [3]. The final person identification layer then leverages a classifier trained on the fused outputs of these CNNs. The authors of [4] approach the identification of anomalous events in video surveillance with a 3D CNN trained only on "normal" events. The resulting model is thus a robust outlier detector capable of adapting to rare events.
Infrared imagery is also gaining great interest due to its relevance in critical applications. Inspired by Faster R-CNN [5], the approach in [6] addresses object detection in infrared streetscape images by exploiting both fine- and coarse-grained image features. Among applications in real-world scenarios, text detection is also gaining attention in the automotive industry. In [7], the authors focus on text detection from Google Street View images. By combining a region proposal (obtained through an attention-based CNN) with a semantic segmentation (which results from a fully connected network), the proposed technique is capable of reliably extracting text from multiple image regions in parallel.
Industry is another area of great interest for the research community. A specific problem in this context is the development of robust and reliable image-based diagnostic tools. As an example, the authors of [8] investigate the use of CNNs for recognizing attributes of heated metal, such as heating temperature or duration, cooling mode, and relative humidity, with particular attention to low computational complexity models that allow deployment on mobile devices. Remote sensing imagery is another scenario where classical computer vision applications like object detection, scene classification, and scene retrieval are relevant. In this regard, the survey in [9] collects the most recent contributions in these specific application fields.
Other works accepted to this special issue address more general questions related to the classification performance of deep neural networks. As an example, the authors of [10,11] present two different approaches to reducing the complexity of CNNs, thus allowing a lighter training phase and faster inference. In particular, the authors of [10] propose a CNN compression method that exploits kernel density estimation to perform a fast 4-bit quantization of the network weights without impairing classification performance. In contrast, the authors of [11] introduce the Layer Selective Learning framework, which aims to improve the distillation of a very deep and well-trained network into a shallower (and much faster) one. Furthermore, the work shows that the proposed distillation approach also improves the generalization capability of deep classifiers when tackling new domains where a limited amount of labeled data is available. Finally, the work in [12] addresses the difficulty of tuning the hyperparameters of the architecture to be trained. The authors propose an evolutionary approach that automatically sets (for each specific task or dataset) a variety of hyperparameters, such as the number of layers and their size.
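To make the idea of 4-bit weight quantization concrete, the following is a minimal sketch that maps each weight to one of 16 evenly spaced levels. It is not the method of [10], which places the levels using kernel density estimation; uniform spacing is swapped in here for brevity, and the function names and weight values are hypothetical.

```python
def quantize_4bit(weights):
    """Map each float weight to one of 16 evenly spaced levels (4-bit codes).

    Assumption: uniform levels between min and max, not the KDE-based
    placement described in [10].
    """
    lo, hi = min(weights), max(weights)
    # Distance between adjacent levels; guard against a constant vector.
    step = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((w - lo) / step) for w in weights]  # integers in 0..15
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate float weights from the 4-bit codes."""
    return [lo + c * step for c in codes]


# Hypothetical weights of a single layer.
weights = [-0.75, -0.22, 0.04, 0.31, 0.75]
codes, lo, step = quantize_4bit(weights)
restored = dequantize(codes, lo, step)

# Rounding to the nearest level bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is then stored as a 4-bit code plus two shared floats (`lo` and `step`), which is where the memory saving comes from.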
Generative models. Generative models are a vast research field in deep learning, as witnessed by the ability of recent approaches to learn data distributions and generate new complex samples, such as human faces that are barely distinguishable from real ones. One application of such models is the inpainting of occluded or missing image regions. As an example, the authors of [13] present a method that transforms depth image inpainting into the dual problem of image denoising, which is then solved with the help of learned CNN denoiser priors. A novel generative model (Encapsulated Variational Autoencoders) is presented in [14]. Thanks to their bottleneck-shaped architecture, autoencoders can learn general and effective latent representations for generative tasks. The proposed approach stacks two variational autoencoders and controls their mutual dependence in order to improve both the generative and the discriminative power of the whole architecture. Generative Adversarial Networks (GANs) are another approach to (deep) generative modeling. The authors of [15] present an example of their application to image-to-image translation, where a generator takes as input an image of the source domain and outputs a related image belonging to the target domain. The model is trained with an adversarial loss that discriminates between identical and generated image pairs.
Another extensive application area for generative models is domain adaptation (DA). DA has rapidly gained attention from the scientific community because it helps address real-world cases in which a sufficient amount of training data is not available. As shown in [16], GANs and autoencoders can be combined to provide an unsupervised DA method for image classification. However, GAN-based DA poses several challenges, such as deciding when to stop training, properly evaluating the training performance when building a validation set is not possible, and guaranteeing the reliability of the model output in case of significant domain shifts. To mitigate these issues, the authors of [17] introduce novel confidence metrics to predict the generalization ability of the model and thus stop its training without the need for a separate validation set.
Another use of generative models is the development of powerful adversarial attacks, where adding tiny, purposefully crafted perturbations to a clean input can lead the classifier to output the label of any desired class. In [18], the authors present an efficient method to improve the attack success rate in the most challenging scenario, i.e., when the attacker has little or no knowledge about the model to attack. The proposed approach exploits data augmentation to train an ensemble of substitute models that are used to craft adversarial examples. Given the relevance of the threat posed by such adversarial attacks, the survey in [19] summarizes recent advances in this rapidly growing subject area, from both the defensive and offensive points of view.
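As a rough illustration of how a small, bounded perturbation can flip a prediction, the sketch below applies a single sign-gradient step to a hypothetical linear classifier. It is not the attack of [18]: the weights, the input, and the budget `eps` are made-up values, and a linear model is assumed so that the gradient of the score with respect to the input is simply the weight vector.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def predict(w, b, x):
    """Return class 1 if the linear score is positive, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def targeted_perturbation(w, x, eps, target):
    """Shift every feature by at most eps in the direction favoring target.

    For a linear score w.x + b the input gradient is w, so moving along
    sign(w) raises the score (class 1) and moving against it lowers it.
    """
    direction = 1 if target == 1 else -1
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]


# Hypothetical classifier and clean input (assumed values).
w, b = [0.9, -0.6, 0.3], -0.1
x_clean = [0.10, 0.20, 0.05]

x_adv = targeted_perturbation(w, x_clean, eps=0.1, target=1)
largest_change = max(abs(a - c) for a, c in zip(x_adv, x_clean))
```

Here the clean input is assigned class 0, while the perturbed input, which differs by at most `eps` in each feature, is assigned the target class 1.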
One-dimensional signals. Besides computer vision applications, deep learning is widespread in one-dimensional signal processing. As an example, in [20], speech audio is enhanced through deep learning. This work presents an interesting denoising approach based on disentangling the representations of the useful signal and the noise. These two components are then discriminated in the latent space by a neural network trained through an adversarial paradigm. Another relevant application related to audio processing is recognizing emotions in human speech. This information can then be used in various domains, from customer care to security. The authors of [21] tackle this problem by feeding a Long Short-Term Memory (LSTM) network with spectrograms and handcrafted features capturing voice timbre. The final emotion classification is then obtained through a Support Vector Machine. Similar approaches are adopted to address signal classification and information extraction in different domains. As an example, in [22], a CNN and particle swarm optimization are jointly employed to predict large-scale wind power, while in [23], a self-attention CNN is used to classify the heartbeat signal of hatching eggs in order to distinguish dead eggs from live ones in commercial poultry breeding. In [24], the authors present an attention-based deep learning approach to automatic signal modulation recognition, which is robust against transmission errors. Finally, in [25,26], the classification of vibration signals is carried out through recurrent neural networks. In the first work [25], bearing fault diagnosis is performed to prevent accidents and reduce production costs, while the second paper [26] is concerned with surface roughness prediction for milling process quality assessment.
Text. Deep learning approaches also appear in many applications related to text processing, in both structured (e.g., web pages and source code) and unstructured domains (e.g., Natural Language Processing (NLP)). This special issue presents two approaches to structured text processing, both leveraging recurrent models but with very different aims. In [27], track metadata and user behavioral data are jointly used to train an LSTM, which learns user preferences and can improve personalized music recommendation. Instead, the authors of [28] try to predict the risk of chronic disease in patients by analyzing the data collected from a national survey designed to assess the health and nutritional status of adults and children in Korea. In particular, they build their approach around Char-RNN [29], a deep recurrent network that efficiently models language by also handling missing data in input sequences.
Concerning unstructured text (NLP), various applications seek a data representation at a high level of abstraction, which helps extract information correctly and independently of the words used to express it. In this regard, two works accepted to this special issue propose sequence-to-sequence summarization approaches. The first [30] promotes diversity in the final summary through an attention-based approach. The second [31] proposes a generative method (based on CNNs and equipped with both an attention mechanism and a module for identifying rare/unseen words) that can spot key sentences and capture the hierarchical structure of textual documents. Other NLP approaches are discussed in [32], which adopts Self-Organizing Maps and Bag of Words to measure text similarity, and in [33], which investigates the use of capsule networks (initially proposed for computer vision applications [34]) for sentence classification and sentiment analysis. The work in [35] tackles a different problem, i.e., providing human-understandable justifications of the classification output for sentiment analysis from unstructured text. The approach exploits a Discretized Interpretable Multi-Layer Perceptron architecture to transform the last connected layer of a CNN into a set of rules capable of explaining the classification decision with high fidelity. Finally, the work in [36] uses deep learning to jointly process multiple sources of heterogeneous information. In particular, the authors propose a framework that separately processes text (through an LSTM) and images attached to emails (through a CNN) and then fuses their classification probabilities to increase the robustness of spam filtering.
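The fusion of classification probabilities from a text model and an image model can be illustrated with a minimal late-fusion sketch. The fusion rule used in [36] is not detailed here, so a weighted average is assumed; the function name, the weight `w_text`, and the probability values are hypothetical.

```python
def fuse(p_text, p_image, w_text=0.5):
    """Weighted average of two class-probability vectors, renormalized.

    Assumption: a simple convex combination; [36] may fuse differently.
    """
    fused = [w_text * pt + (1 - w_text) * pi
             for pt, pi in zip(p_text, p_image)]
    total = sum(fused)
    return [p / total for p in fused]


CLASSES = ["ham", "spam"]

# Hypothetical outputs of the two branches for one email.
p_text = [0.60, 0.40]    # text branch leans towards "ham"
p_image = [0.10, 0.90]   # image branch is confident about "spam"

fused = fuse(p_text, p_image)
label = CLASSES[fused.index(max(fused))]
```

With equal weights, the confident image branch outweighs the uncertain text branch, so the fused decision is "spam" even though the text alone would have passed the filter.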
Reinforcement learning. Finally, a recent trend in research is adopting the reinforcement learning paradigm in a variety of real-world applications ranging from robotics to automotive, gaming, personalized recommendations, and advertising. This technique enables agents to learn the best possible action in the environment by assigning penalties and rewards to all possible choices. In this special issue, we have two examples of reinforcement learning applications: the dredger control system presented in [37], which is trained on historical data acquired during human-controlled dredging operations, and an intelligent controller of the water level in a multi-input multi-output communicating two-tank system [38].
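The mechanism of learning from rewards and penalties can be sketched with the tabular Q-learning update on a toy five-state chain. This is unrelated to the controllers of [37,38]: the environment, reward values, and constants are assumptions, and exhaustive sweeps over all state-action pairs replace sampled exploration so that the result is deterministic.

```python
N_STATES = 5            # states 0..4; state 4 is the goal (terminal)
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumed)


def step(state, action):
    """Return (next_state, reward, done) for the toy chain."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    if nxt == N_STATES - 1:
        return nxt, 1.0, True    # reward for reaching the goal
    return nxt, -0.01, False     # small penalty for every other move


# One Q-value per (state, action) pair, initialized to zero.
q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(200):
    # Sweep every non-terminal state and action instead of sampling them.
    for s in range(N_STATES - 1):
        for a in ACTIONS:
            nxt, reward, done = step(s, a)
            target = reward if done else reward + GAMMA * max(q[nxt])
            # Q-learning update: move the estimate towards the target.
            q[s][a] += ALPHA * (target - q[s][a])

# Greedy policy: the best action in each non-terminal state.
policy = [q[s].index(max(q[s])) for s in range(N_STATES - 1)]
```

After training, the greedy policy selects "move right" in every non-terminal state, since that is the shortest path to the rewarded goal.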

Conclusions
The diversity of the works published in this Special Issue confirms the pervasiveness of deep learning in several aspects of our lives. In the scientific community, new learning strategies are regularly proposed while, concurrently, more robust techniques are presented for very challenging scenarios. This expansion is catalyzed by the availability of more data and more powerful computational resources. This special issue collects recent deep learning strategies applied to very different problems, in the hope of inspiring the cross-fertilization of ideas. However, constant bibliography updates remain necessary to keep track of the fast evolution of this research field.
Author Contributions: Contribution is equally divided among all authors. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.