Machine Learning Methods with Noisy, Incomplete or Small Datasets

In this article, we present a collection of fifteen novel contributions on machine learning methods for low-quality or imperfect datasets, which were accepted for publication in the Special Issue “Machine Learning Methods with Noisy, Incomplete or Small Datasets” of Applied Sciences (ISSN 2076-3417). These papers provide a variety of novel approaches to real-world machine learning problems in which the available datasets suffer from imperfections such as missing values, noise or artifacts. The contributions span medical applications, epidemic management tools, methodological work, and industrial applications, among others. We believe that this Special Issue will bring new ideas for solving this challenging problem and will provide clear examples of their application in real-world scenarios.


Introduction
In many machine learning applications, the available datasets are incomplete, noisy or affected by artifacts. In supervised scenarios, the label information may also be of low quality, which includes unbalanced training sets, noisy labels and related problems. Moreover, in practice, it is very common that the available data samples are not sufficient to derive useful supervised or unsupervised models. All of these issues are commonly referred to as the low-quality data problem. Machine learning researchers and practitioners have been working on various strategies to correctly handle low-quality data in recent years. Far from being solved, this problem still represents a fundamental and classic challenge for the artificial intelligence community.
The aim of this Special Issue was to collect novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas for solving this challenging problem, and to provide clear examples of application in real scenarios. Despite the COVID-19 crisis and the lockdowns in most countries, this Special Issue attracted great attention among researchers worldwide. A total of twenty-one papers were submitted, and fifteen of them were accepted after appropriate revisions. We were pleasantly surprised by the diversity of the contributors' nationalities and by the variety of problems addressed in the applied sciences, ranging from medical and health applications to specific industrial case studies. The authors of the published papers come from nine countries in Europe, America, Africa and Asia.
In the following sections, the accepted papers and their most relevant contributions are summarized, grouped into the following categories: medical applications, epidemic monitoring and management tools, methodological papers, industrial applications, and other applications.

Medical Applications
Interestingly, the majority of the contributions are related to specific applications in medicine. Three papers addressed different problems or diseases in neuroscience. For example, in [1], Caiafa et al. (Argentina-Spain-Japan) reviewed recent approaches for dealing with incomplete or noisy measurements by applying signal decomposition methods and showed their usefulness in the classification of epileptic intracranial electroencephalogram (iEEG) signals, among other applications. Localizing the epileptic focus from iEEG is usually difficult, mainly because datasets labeled by expert medical doctors are scarce. In [2], Tong et al. (China-South Africa) proposed a few-shot learning method for the severity assessment of Parkinson's disease based on a small gait dataset. The proposed algorithm tackles the small-data problem by using permutation variable importance (PVI) and the persistent entropy of topological imprints, and applies a support vector machine (SVM) classifier to perform the severity classification of Parkinson's disease patients. In [3], Wang et al. (China) addressed the problem of small and unbalanced datasets in functional magnetic resonance imaging (fMRI) for neuroscience studies. Their technique combines independent component analysis (ICA) for dimensionality reduction, data augmentation to balance the classes, and a convolutional gated recurrent unit (GRU) network; results on episodic memory evaluation are reported.
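To make the flavor of such small-data pipelines concrete, the following is a minimal, illustrative sketch (in Python) that combines ICA-based dimensionality reduction with naive random oversampling and an off-the-shelf SVM classifier. It is not the implementation of [3] (which uses a convolutional GRU network): the data, shapes and hyperparameters are all assumptions chosen only to keep the example self-contained.

```python
# Minimal sketch of a small-data classification pipeline (illustrative only):
# ICA for dimensionality reduction, naive random oversampling to balance the
# classes, and an SVM classifier. Synthetic data stands in for real features.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assumed toy dataset: 60 subjects, 500 features, unbalanced labels (45 vs. 15).
X = rng.normal(size=(60, 500))
y = np.array([0] * 45 + [1] * 15)

# Naive random oversampling of the minority class. Note: oversampling before
# cross-validation can leak information between folds; it is done here only to
# keep the sketch short.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=30, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# ICA reduces the 500 features to a few independent components, after which a
# standard SVM performs the classification.
model = make_pipeline(FastICA(n_components=10, random_state=0),
                      SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, X_bal, y_bal, cv=5)
print("Cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```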
The other papers that addressed medical applications are described as follows. In [4], Yasutomi et al. (Japan) introduced a deep learning method based on an auto-encoder architecture to detect and remove shadow artifacts in ultrasound images. The model can be trained on unlabeled data (unsupervised) or with a few pixel-level labels available (semi-supervised). The method has been applied to fetal heart diagnosis. In [5], Ahmad et al. (Saudi Arabia) investigated a machine learning approach to predict diabetes mellitus based on a small set of features obtained from simple laboratory tests, allowing for a cost-effective and rapid screening tool. They compared different machine learning classifiers and provided a set of recommendations based on those analyses. In [6], Qiao et al. (China) proposed a method to measure the root canal length, which is crucial for the effective treatment of endodontic disease and periapicalitis. The authors employed a neural network on multifrequency impedance measurements.
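As a rough illustration of auto-encoder-based artifact handling of the kind described in [4], the sketch below defines a small convolutional auto-encoder in PyTorch and trains it to reconstruct its inputs; artifacts can then be highlighted through the reconstruction residual. This is a generic toy model on synthetic data, not the authors' architecture, and every layer size and training setting is an assumption.

```python
# Minimal convolutional auto-encoder sketch for artifact suppression in images
# (illustrative only). Layer sizes and training details are assumptions.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2),    # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),     # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Unsupervised training loop on stand-in grayscale images: the model learns to
# reconstruct its inputs; regions it cannot reconstruct well (e.g., shadow
# artifacts) can be localized from the reconstruction residual.
model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
images = torch.rand(8, 1, 64, 64)  # stand-in batch of ultrasound-like images
for _ in range(5):
    recon = model(images)
    loss = loss_fn(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final reconstruction loss:", loss.item())
```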

Epidemic Monitoring and Management Tools
Machine learning has been shown to play an important role in dealing with infectious diseases and epidemics. In this collection, two contributions are devoted to the development of tools addressing aspects of the COVID-19 and dengue epidemics. More specifically, in [7], Gibert Oliveras et al. (Spain) reported the results of a project developed in Catalonia, Spain, aimed at helping in the COVID-19 crisis. The project allowed for quick territory screening, providing relevant information to support informed decision-making and strategy and policy design. The authors proposed a data-driven methodology for dealing with small subgroups of the population while preserving statistical secrecy. In [8], Silitonga et al. (Indonesia) developed prediction models to estimate the severity level of dengue based on patients' laboratory test results, using artificial neural networks (ANNs) and discriminant analysis (DA) applied to very small datasets.
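As a schematic view of the kind of comparison reported in [8], the sketch below evaluates a small neural network and linear discriminant analysis on a tiny synthetic tabular dataset with stratified cross-validation, a standard precaution when datasets are this small. The features, labels and hyperparameters are assumptions and do not correspond to the actual dengue data.

```python
# Sketch: comparing a small neural network and discriminant analysis on a very
# small tabular dataset (illustrative only). Synthetic data replaces the real
# laboratory test results, and all hyperparameters are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 6))          # 60 patients, 6 laboratory features
y = np.repeat([0, 1, 2], 20)          # 3 severity levels, 20 patients each

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
models = {
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(16,),
                                       max_iter=2000, random_state=1)),
    "DA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```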

Methodological Articles
Four contributions proposed general methods for machine learning with low-quality datasets. In [1], the authors provided a unified review of decomposition methods, including linear decomposition, low-rank matrix/tensor factorization, sparse matrix/tensor decomposition and empirical mode decomposition (EMD) models. This paper illustrates the ability of these decomposition models to impute missing features, denoise data and artificially generate additional data samples (data augmentation), with examples in brain-computer interface (BCI) and epileptic EEG analysis, among others. In [9], Lee et al. (South Korea) developed feature extraction methods based on the non-negative matrix factorization (NMF) algorithm and applied them to weakly supervised sound event detection.
The algorithm considers learning from both strongly and weakly labeled data. On the other hand, in [10], Gil et al. (Spain) investigated the use of optimization in the preprocessing step of time series joining. More specifically, the authors proposed an error function to measure the adequateness of the joining and demonstrated the effectiveness of the proposed method on synthetic datasets and a real industrial process scenario. Finally, in [11], Wang et al. (China-Japan) proposed a novel multi-label feature selection approach based on embedding label correlations (dubbed ELCs) in order to eliminate irrelevant and redundant features, also referred to as noisy features.
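To give a concrete sense of NMF-based feature extraction as used in [9], the following sketch factorizes a synthetic magnitude spectrogram and uses the resulting activations as frame-level features; the matrix sizes and parameters are assumptions, and the weakly supervised training itself is omitted.

```python
# Sketch: NMF-based feature extraction from a synthetic magnitude spectrogram
# (illustrative only). The basis matrix W captures spectral patterns and the
# activations H can serve as frame-level features for a downstream classifier.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
spectrogram = np.abs(rng.normal(size=(128, 400)))  # 128 freq bins x 400 frames

nmf = NMF(n_components=20, init="nndsvda", max_iter=500, random_state=2)
W = nmf.fit_transform(spectrogram)   # (128, 20) spectral basis vectors
H = nmf.components_                  # (20, 400) per-frame activations

# The activation matrix H (transposed to frames x components) is the feature
# representation that a weakly supervised event classifier could consume.
features = H.T
print(features.shape)  # (400, 20)
```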

Industrial Applications
This Special Issue also includes two papers studying the application of machine learning to specific practical problems in different industries: the fishing and smart building industries. In [12], Marti-Puig et al. (Spain) addressed the problem of distinguishing between different Mediterranean demersal fish species that share a remarkably similar form and that are also used for the evaluation of marine resources. The authors tackled both binary and multi-class classification problems based on very small datasets with unreliable labels. In [13], Ge et al. (Japan-China) proposed a unified and practical framework for knowledge inference inside smart buildings.

Other Applications
Face recognition and natural language processing are two very important machine learning problems, and both were addressed in this Special Issue for cases with low-quality datasets. In [14], Lee et al. (South Korea) studied the problem of training a facial recognition system when only one sample per identity is available. The authors proposed a data augmentation technique that introduces pixel-level variations in face images by extracting a binary weighted interpolation map (B-WIM) from neutral and variational images in an auxiliary set. In [1], the EMD method was applied to remove noise from face images, thus improving the classification accuracy of a machine learning classifier. Finally, in [15], Mouratidis et al. (Greece) provided an application to natural language processing. They developed a deep learning scheme for machine translation evaluation (English-Greek and English-Italian), based on different categories of information (linguistic features, natural language processing metrics and embeddings), using a machine learning model trained on noisy and small datasets.
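The B-WIM augmentation of [14] is specific to that work, but the underlying idea of expanding a single sample per identity can be illustrated with a purely generic sketch that produces perturbed copies of one face image via flips, shifts and brightness jitter; the function and all parameters below are hypothetical choices for illustration only.

```python
# Generic single-sample augmentation sketch (not the B-WIM method of [14]):
# given one face image per identity, create several perturbed copies by
# flipping, shifting and rescaling brightness. Parameters are illustrative.
import numpy as np

def augment_face(image: np.ndarray, n_copies: int = 4, seed: int = 0):
    """Return a list of simple augmented versions of a single grayscale image."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        img = image.copy()
        if rng.random() < 0.5:
            img = np.fliplr(img)                            # horizontal flip
        shift = rng.integers(-3, 4)
        img = np.roll(img, shift, axis=1)                   # small horizontal shift
        img = np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
        copies.append(img)
    return copies

face = np.random.default_rng(3).random((64, 64))            # stand-in face image
augmented = augment_face(face)
print(len(augmented), augmented[0].shape)
```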

Conclusions
The correct handling of noisy, incomplete or small datasets remains an open problem in the artificial intelligence community. However, this Special Issue collects fifteen research papers providing general approaches to some low-quality datasets problems and clear practical examples in different applied sciences disciplines. This collection of papers represents a good reference for the current state-of-the-arts, also providing an excellent starting point for developing new advanced methods in the future.