Special Issue "Machine Learning Methods with Noisy, Incomplete or Small Datasets"

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 December 2020).

Special Issue Editors

Prof. Dr. Jordi Solé-Casals
Guest Editor
Department of Engineering, University of Vic - Central University of Catalonia, Vic, Catalonia, Spain
Interests: biomedical signal processing; machine learning / deep learning; signal processing theory and methods; neurosciences
Dr. Sun Zhe
Guest Editor
RIKEN National Science Institute, Tokyo, Japan
Interests: brain simulation; connectome; brain mapping; brain signal processing
Dr. César F. Caiafa
Guest Editor
IAR-CONICET / University of Buenos Aires, Buenos Aires, Argentina
Interests: machine learning; tensor decompositions; compressed sensing; sparse representations; brain diffusion MRI
Dr. Pere Marti-Puig
Guest Editor
Department of Engineering, University of Vic - Central University of Catalonia, Vic, Catalonia, Spain
Interests: signal processing; fast algorithms; tensor analysis; machine learning/deep learning
Prof. Dr. Toshihisa Tanaka
Guest Editor
Tokyo University of Agriculture and Technology, Tokyo 183-8538, Japan
Interests: biosignal informatics (biomedical signal processing and machine learning); brain–computer interfaces; epilepsy; neuromusicology

Special Issue Information

Dear colleagues,

In many machine learning applications, measurements are incomplete, noisy or affected by artifacts, resulting in missing features. In other cases, and for different reasons, the datasets are small to begin with, and therefore few data samples are available to derive useful supervised or unsupervised classification methods. Correct handling of incomplete or small datasets in machine learning is a fundamental and classic challenge.

The problem of training a classifier on a dataset with missing features has traditionally been solved by first using imputation methods to complete the dataset and then designing the classifier. Training a classifier with a small dataset, on the other hand, has been addressed using data augmentation or artificial sample generation to increase the size of the available training data. New strategies have recently been introduced to address the problem of training with incomplete datasets, and new data augmentation strategies have been proposed to generate more useful artificial data.
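
As a minimal illustration of the traditional two-stage approach, the following Python sketch covers only the first stage: completing a dataset by column-mean imputation before any classifier is designed. The function name `impute_column_means`, the use of `None` to mark a missing feature, and the toy data are assumptions made for this example; the Special Issue does not prescribe a particular imputation method.

```python
from statistics import mean


def impute_column_means(rows):
    """Return a copy of rows with each None replaced by its column mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: every column has at least one observed value.
        col_means.append(mean(observed))
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


# Toy dataset (hypothetical): two features, two missing entries.
data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
completed = impute_column_means(data)
```

The completed rows could then be passed to any classifier. Mean imputation is only the simplest baseline and ignores relationships between features.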

The aim of this Special Issue is to invite active researchers to submit original papers that focus on the development of algorithms for machine learning based on incomplete or small datasets and/or on the application of these techniques, to contribute to the dissemination of new ideas to solve this challenging problem and to encourage their application in real scenarios.

Prof. Dr. Jordi Solé-Casals
Dr. Sun Zhe
Dr. César F. Caiafa
Dr. Pere Marti-Puig
Prof. Dr. Toshihisa Tanaka
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2000 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Incomplete data
  • Data augmentation
  • Machine learning
  • Deep learning
  • Noisy data
  • Outliers
  • Tensor methods
  • Imputation methods
  • Imperfect learning
  • Matrix/tensor completion
  • Weakly supervised learning

Published Papers (15 papers)


Research

Open Access | Feature Paper | Article
The INSESS-COVID19 Project. Evaluating the Impact of the COVID19 in Social Vulnerability While Preserving Privacy of Participants from Minority Subpopulations
Appl. Sci. 2021, 11(7), 3110; https://doi.org/10.3390/app11073110 - 31 Mar 2021
Viewed by 279
Abstract
In this paper, the results of the project INSESS-COVID19 are presented, as part of a special call to help with the COVID19 crisis in Catalonia. The technological infrastructure and methodology developed in this project allow the quick screening of a territory for a quick and reliable diagnosis in the face of an unexpected situation by providing relevant decisional information to support informed decision-making and strategy and policy design. One of the challenges of the project was to extract valuable information from direct participatory processes where specific target profiles of citizens are consulted and to distribute the participation along the whole territory. Having a lot of variables with a moderate number of citizens involved (in this case about 1000) implies the risk of violating statistical secrecy when multivariate relationships are analyzed, thus putting at risk the anonymity of the participants as well as their safety when vulnerable populations are involved, as is the case of INSESS-COVID19. In this paper, the entire data-driven methodology developed in the project is presented, and the handling of small population subgroups to preserve statistical secrecy is described. The methodology is reusable with any other underlying questionnaire, as the data science and reporting parts are fully automated.
(This article belongs to the Special Issue Machine Learning Methods with Noisy, Incomplete or Small Datasets)

Open Access | Article
Severity Classification of Parkinson’s Disease Based on Permutation-Variable Importance and Persistent Entropy
Appl. Sci. 2021, 11(4), 1834; https://doi.org/10.3390/app11041834 - 19 Feb 2021
Viewed by 417
Abstract
Parkinson’s disease (PD) is a neurodegenerative disease that causes chronic and progressive motor dysfunction. As PD progresses, patients show different symptoms at different stages of the disease. The severity assessment is inefficient and subjective when it comes to manual diagnosis. However, abnormal gait was contingent and the subject selection was limited. Therefore, few-shot learning based on small sample sets is critical to solving the problem of insufficient sample data in PD patients. Using datasets from PhysioNet, this paper presents a method based on permutation-variable importance (PVI) and persistent entropy of topological imprints, and uses a support vector machine (SVM) as a classifier to achieve the severity classification of PD patients. The method includes the following steps: (1) Take the data as gait cycles, and calculate the gait characteristics of each cycle. (2) Use the random forest (RF) method to obtain the leading factors differentiating the gait of patients at different severity levels. (3) Use time-delay embedding to map the data into a topological space, and use topological data analysis based on persistent homology to obtain the persistent entropy. (4) Use the Borderline-SMOTE (BSM) method to balance the sample data. (5) Use the SVM to classify the samples for the severity levels of PD. An accuracy of 98.08% was achieved by 10-fold cross-validation, so our method can be used as an effective means of computer-aided diagnosis of PD, and has important practical value.
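
Step (3) of the abstract above relies on time-delay embedding. The Python sketch below shows the general technique only, not the authors' implementation: it maps a one-dimensional series to a point cloud of delayed coordinates. The function name `time_delay_embedding`, the parameters `dim` and `delay`, and the toy signal are assumptions chosen for illustration.

```python
def time_delay_embedding(series, dim, delay):
    """Map a 1-D series to points (x[t], x[t+delay], ..., x[t+(dim-1)*delay])."""
    n_points = len(series) - (dim - 1) * delay
    if n_points <= 0:
        raise ValueError("series is too short for this dim/delay")
    return [
        tuple(series[t + k * delay] for k in range(dim))
        for t in range(n_points)
    ]


# Toy signal (hypothetical); real gait data would be much longer.
signal = [0, 1, 2, 3, 4, 5]
cloud = time_delay_embedding(signal, dim=3, delay=2)
```

A point cloud of this kind is the usual input to a topological data analysis library, which would then compute the persistence diagram from which a persistent entropy can be derived.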

Open Access | Article
Investigating Health-Related Features and Their Impact on the Prediction of Diabetes Using Machine Learning
Appl. Sci. 2021, 11(3), 1173; https://doi.org/10.3390/app11031173 - 27 Jan 2021
Viewed by 445
Abstract
Diabetes Mellitus (DM) is one of the most common chronic diseases leading to severe health complications that may cause death. The disease influences individuals, community, and the government due to the continuous monitoring, lifelong commitment, and the cost of treatment. The World Health Organization (WHO) considers Saudi Arabia as one of the top 10 countries in diabetes prevalence across the world. Since most of its medical services are provided by the government, the cost of the treatment in terms of hospitals and clinical visits and lab tests represents a real burden due to the large scale of the disease. The ability to predict the diabetic status of a patient with only a handful of features can allow cost-effective, rapid, and widely-available screening of diabetes, thereby lessening the health and economic burden caused by diabetes alone. The goal of this paper is to investigate the prediction of diabetic patients and compare the role of HbA1c and FPG as input features. By using five different machine learning classifiers, and using feature elimination through feature permutation and hierarchical clustering, we established good performance for accuracy, precision, recall, and F1-score of the models on the dataset, implying that our data or features are not bound to specific models. In addition, the consistent performance across all the evaluation metrics indicates that there was no trade-off or penalty among the evaluation metrics. Further analysis was performed on the data to identify the risk factors and their indirect impact on diabetes classification. Our analysis showed strong agreement with the risk factors of diabetes and prediabetes stated by the American Diabetes Association (ADA) and other health institutions worldwide. We conclude that by performing analysis of the disease using selected features, important factors specific to the Saudi population can be identified, whose management can result in controlling the disease. We also provide some recommendations learned from this research.

Open Access | Article
Shadow Estimation for Ultrasound Images Using Auto-Encoding Structures and Synthetic Shadows
Appl. Sci. 2021, 11(3), 1127; https://doi.org/10.3390/app11031127 - 26 Jan 2021
Cited by 1 | Viewed by 434
Abstract
Acoustic shadows are common artifacts in medical ultrasound imaging. The shadows are caused by objects that reflect ultrasound such as bones, and they are shown as dark areas in ultrasound images. Detecting such shadows is crucial for assessing the quality of images. This serves as a pre-processing step for further image processing or recognition aimed at computer-aided diagnosis. In this paper, we propose an auto-encoding structure that estimates the shadowed areas and their intensities. The model first splits an input image into an estimated shadow image and an estimated shadow-free image through its encoder and decoder. Then, it combines them to reconstruct the input. By generating plausible synthetic shadows based on relatively coarse domain-specific knowledge on ultrasound images, we can train the model using unlabeled data. If pixel-level labels of the shadows are available, we also utilize them in a semi-supervised fashion. In experiments on ultrasound images for fetal heart diagnosis, we show that our method achieved 0.720 in the DICE score and outperformed conventional image processing methods and a segmentation method based on deep neural networks. The capability of the proposed method to estimate the intensities of shadows and the shadow-free images is also demonstrated through the experiments.
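
The DICE score reported above measures the overlap between a predicted mask and a reference mask. As a hedged sketch of how such a score is commonly computed, the Python example below evaluates 2|A∩B| / (|A| + |B|) on two flattened binary masks. The function name `dice_score`, the convention of returning 1.0 when both masks are empty, and the toy masks are assumptions for this example, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical): 1 marks a shadow pixel.
pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
score = dice_score(pred, truth)
```

Here two of the three predicted shadow pixels coincide with the three reference pixels, so the score is 4/6.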

Open Access | Article
Data-Dependent Feature Extraction Method Based on Non-Negative Matrix Factorization for Weakly Supervised Domestic Sound Event Detection
Appl. Sci. 2021, 11(3), 1040; https://doi.org/10.3390/app11031040 - 24 Jan 2021
Viewed by 311
Abstract
In this paper, feature extraction methods are developed based on the non-negative matrix factorization (NMF) algorithm to be applied in weakly supervised sound event detection. Recently, various features and systems have been developed to tackle the problems of acoustic scene classification and sound event detection. However, most of these systems use data-independent spectral features, e.g., Mel-spectrogram, log-Mel-spectrum, and gammatone filterbank. Some data-dependent feature extraction methods, including the NMF-based methods, recently demonstrated the potential to tackle the problems mentioned above for long-term acoustic signals. In this paper, we further develop the recently proposed NMF-based feature extraction method to enable its application in weakly supervised sound event detection. To achieve this goal, we develop a strategy for training the frequency basis matrix using a heterogeneous database consisting of strongly- and weakly-labeled data. Moreover, we develop a non-iterative version of the NMF-based feature extraction method so that the proposed feature extraction method can be applied as a part of the model structure similar to the modern “on-the-fly” transform method for the Mel-spectrogram. To detect the sound events, the temporal basis is calculated using the NMF method and then used as a feature for the mean-teacher-model-based classifier. The results are improved for the event-wise post-processing method. To evaluate the proposed system, simulations of the weakly supervised sound event detection were conducted using the Detection and Classification of Acoustic Scenes and Events 2020 Task 4 database. The results reveal that the proposed system has F1-score performance comparable with the Mel-spectrogram and gammatonegram and exhibits 3–5% better performance than the log-Mel-spectrum and constant-Q transform.

Open Access | Article
Comparison of Dengue Predictive Models Developed Using Artificial Neural Network and Discriminant Analysis with Small Dataset
Appl. Sci. 2021, 11(3), 943; https://doi.org/10.3390/app11030943 - 21 Jan 2021
Cited by 1 | Viewed by 325
Abstract
In Indonesia, dengue has become one of the hyperendemic diseases. Dengue consists of three clinical phases: the febrile phase, the critical phase, and the recovery phase. Many patients have died in the critical phase due to the lack of proper and timely treatment. Therefore, we developed models that can predict the severity level of dengue based on the laboratory test results of the corresponding patients using Artificial Neural Network (ANN) and Discriminant Analysis (DA). In developing the models, we used a very small dataset. It is shown that ANN models developed using logistic and hyperbolic tangent activation functions with 70% training data yielded the highest accuracy (90.91%), sensitivity (91.11%), and specificity (95.51%). This is the proposed model in this research. The proposed model will be able to help physicians in predicting the severity level of dengue patients before they enter the critical phase. Furthermore, it will help physicians treat dengue patients early, so fatal cases or deaths can be avoided.
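
The abstract above reports accuracy, sensitivity, and specificity. As a brief sketch of how these three metrics are defined, the Python example below computes them from true and predicted labels in the binary case. The binary simplification, the function name `binary_metrics`, and the toy labels are assumptions for illustration; the paper's severity levels and data are not reproduced here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall) and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: both classes occur in y_true, so no denominator is zero.
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy labels (hypothetical): 1 = severe, 0 = not severe.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

With a very small dataset, each of these ratios rests on few samples, so a single misclassification can shift a metric by several percentage points.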

Open Access | Article
Applying Knowledge Inference on Event-Conjunction for Automatic Control in Smart Building
Appl. Sci. 2021, 11(3), 935; https://doi.org/10.3390/app11030935 - 20 Jan 2021
Viewed by 305
Abstract
Smart buildings are one of the emerging IoT-based applications, in which energy efficiency, human comfort, automation, and security could be managed even better. However, at the current stage, a unified and practical framework for knowledge inference inside the smart building is still lacking. In this paper, we present a practical proposal of knowledge extraction on event-conjunction for automatic control in smart buildings. The proposal consists of a unified API design, an ontology model, and an inference engine for knowledge extraction. Two types of models, finite state machines (FSMs) and Bayesian networks (BNs), have been used for capturing the state transition and sensor data fusion. In particular, to solve the problem that the size of time interval observations between two correlated events was too small to be approximated for estimation, we utilized the Markov Chain Monte Carlo (MCMC) sampling method to optimize the sampling on time intervals. The proposal has been put into use in a real smart building environment. A 78-day data collection of the light states and elevator states has been conducted for evaluation. Several events have been inferred in the evaluation, such as room occupancy, elevator moving, as well as the event conjunction of both. The inference on the users’ waiting time of elevator-using revealed the potential and effectiveness of the automatic control on the elevator.

Open Access | Article
Innovatively Fused Deep Learning with Limited Noisy Data for Evaluating Translations from Poor into Rich Morphology
Appl. Sci. 2021, 11(2), 639; https://doi.org/10.3390/app11020639 - 11 Jan 2021
Viewed by 809
Abstract
Evaluation of machine translation (MT) into morphologically rich languages has not been well studied despite its importance. This paper proposes a classifier, that is, a deep learning (DL) schema for MT evaluation, based on different categories of information (linguistic features, natural language processing (NLP) metrics and embeddings), by using a model for machine learning based on noisy and small datasets. The linguistic features are string based for the language pairs English (EN)–Greek (EL) and EN–Italian (IT). The paper also explores the linguistic differences that affect evaluation accuracy between different kinds of corpora. A comparative study between using a simple embedding layer (mathematically calculated) and pre-trained embeddings is conducted. Moreover, an analysis of the impact of feature selection and dimensionality reduction on classification accuracy has been conducted. Results show that using a neural network (NN) model with different input representations produces results that clearly outperform the state of the art for MT evaluation for EN–EL and EN–IT, by an increase of almost 0.40 points in correlation with human judgments on pairwise MT evaluation. It is observed that the proposed algorithm achieved better results on noisy and small datasets. In addition, for a more integrated analysis of the accuracy results, a qualitative linguistic analysis has been carried out in order to address complex linguistic phenomena.

Open Access | Article
An Effective Multi-Label Feature Selection Model Towards Eliminating Noisy Features
Appl. Sci. 2020, 10(22), 8093; https://doi.org/10.3390/app10228093 - 15 Nov 2020
Viewed by 422
Abstract
Considerable effort has consistently been devoted to feature selection for dimension reduction in various machine learning tasks. Existing feature selection models focus on selecting the most discriminative features for learning targets. However, this strategy is weak in handling two kinds of features, that is, the irrelevant and redundant ones, which are collectively referred to as noisy features. These features may hamper the construction of optimal low-dimensional subspaces and compromise the learning performance of downstream tasks. In this study, we propose a novel multi-label feature selection approach by embedding label correlations (dubbed ELC) to address these issues. Particularly, we extract label correlations for reliable label space structures and employ them to steer feature selection. In this way, label and feature spaces can be expected to be consistent and noisy features can be effectively eliminated. An extensive experimental evaluation on public benchmarks validated the superiority of ELC.

Open Access | Article
Convolution-GRU Based on Independent Component Analysis for fMRI Analysis with Small and Imbalanced Samples
Appl. Sci. 2020, 10(21), 7465; https://doi.org/10.3390/app10217465 - 23 Oct 2020
Viewed by 504
Abstract
Functional magnetic resonance imaging (fMRI) is a commonly used method in brain research. However, due to the complexity and particularity of the fMRI task, it is difficult to find enough subjects, resulting in a small and, often, imbalanced dataset. A dataset with small samples causes overfitting of the learning model, and the imbalance will make the model insensitive to the minority class, which has been a problem in classification. It is of great significance to classify fMRI data with small and imbalanced samples. In the present study, we propose a three-step method on a small and imbalanced fMRI dataset from a word-scene memory task. The steps of the method are as follows: (1) An independent component analysis is performed to reduce the dimension of data; (2) The synthetic minority oversampling technique is used to generate new samples of the minority class to balance data; (3) A convolution-Gated Recurrent Unit (GRU) network is used to classify the independent component signals, indicating whether the subjects are performing episodic memory tasks. The accuracy of the proposed method is 72.2%, which improves the classification performance compared with traditional classifiers such as support vector machines (SVM), logistic regression (LGR), linear discriminant analysis (LDA) and k-nearest neighbor (KNN), and this study gives a biomarker for evaluating the reactivation of episodic memory.
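
Step (2) of the abstract above uses the synthetic minority oversampling technique. The Python sketch below illustrates the core idea in a simplified form: each synthetic sample is a random interpolation between a minority sample and one of its nearest minority neighbours. It is not the authors' code; the function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy points are assumptions made for this example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        # Rank the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (hypothetical): four points on the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling of this kind should be applied to the training split only, so that no information from test samples leaks into training.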

Open Access | Article
Multifrequency Impedance Method Based on Neural Network for Root Canal Length Measurement
Appl. Sci. 2020, 10(21), 7430; https://doi.org/10.3390/app10217430 - 22 Oct 2020
Viewed by 363
Abstract
Root canal therapy is the most fundamental and effective approach for treating endodontics and periapicalitis. The length of the root canal must be accurately measured to clean the pathogenic substances in it. This study aims to present a multifrequency impedance method based on a neural network for root canal length measurement. A circuit system was designed which generates currents at frequencies from 100 Hz to 20 kHz in order to augment the data of impedance ratios with different combinations of frequencies. Several impedance ratios and other quantified characteristics, such as the type of tooth and file, were selected as features to train a neural network model that could predict the distance between the file and apical foramen. The model uses leave-one-out cross-validation, adopts the Adam optimizer and regularization, and has two hidden layers with nine and five nodes, respectively. The neural network-based multifrequency impedance method exhibits nearly 95% accuracy, compared with the dual-frequency impedance ratio method (which demonstrated no more than 85% accuracy in some situations). This method may eliminate the influence of human and environmental factors on measurement of the root canal length, thereby increasing measurement robustness.
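
The abstract above mentions leave-one-out cross-validation, which suits small datasets because every sample serves once as the test case. The Python sketch below shows the evaluation loop only. A 1-nearest-neighbour rule stands in for the paper's neural network, and the names `leave_one_out_accuracy` and `nearest_neighbour` as well as the toy data are assumptions for illustration.

```python
def leave_one_out_accuracy(samples, labels, fit_predict):
    """Hold out each sample once, train on the rest, and score the guess."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbour(train_x, train_y, query):
    """Stand-in model: return the label of the closest training sample."""
    best = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], query)),
    )
    return train_y[best]


# Toy data (hypothetical): one feature, two well-separated classes.
X = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
y = [0, 0, 0, 1, 1, 1]
acc = leave_one_out_accuracy(X, y, nearest_neighbour)
```

Any model exposing the same `fit_predict(train_x, train_y, query)` shape could be substituted, at the cost of one training run per sample.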

Open Access | Article
Training Set Enlargement Using Binary Weighted Interpolation Maps for the Single Sample per Person Problem in Face Recognition
Appl. Sci. 2020, 10(19), 6659; https://doi.org/10.3390/app10196659 - 23 Sep 2020
Viewed by 614
Abstract
We propose a method of enlarging the training dataset for a single-sample-per-person (SSPP) face recognition problem. The appearance of the human face varies greatly, owing to various intrinsic and extrinsic factors. In order to build a face recognition system that can operate robustly in an uncontrolled, real environment, it is necessary for the algorithm to learn various images of the same person. However, owing to limitations in the collection of facial image data, only one sample can typically be obtained, causing difficulties in the performance and usability of the method. This paper proposes a method that analyzes the changes in pixels in face images associated with variations by extracting the binary weighted interpolation map (B-WIM) from neutral and variational images in the auxiliary set. Then, a new variational image for the query image is created by combining the given query (neutral) image and the variational image of the auxiliary set based on the B-WIM. In comparative face recognition experiments on SSPP training data from various facial-image databases, the proposed method shows superior performance compared with other methods.

Open Access | Article
Learning Optimal Time Series Combination and Pre-Processing by Smart Joins
Appl. Sci. 2020, 10(18), 6346; https://doi.org/10.3390/app10186346 - 11 Sep 2020
Viewed by 534
Abstract
In industrial applications of data science and machine learning, most of the steps of a typical pipeline focus on optimizing measures of model fitness to the available data. Data preprocessing, instead, is often ad hoc, and not based on the optimization of quantitative measures. This paper proposes the use of optimization in the preprocessing step, specifically studying a time series joining methodology, and introduces an error function to measure the adequacy of the joining. Experiments show how the method allows monitoring preprocessing errors for different time slices, indicating when a retraining of the preprocessing may be needed. Thus, this contribution helps quantify the implications of data preprocessing on the result of data analysis and machine learning methods. The methodology is applied to two case studies: synthetic simulation data with controlled distortions, and a real scenario of an industrial process.

Open Access | Feature Paper | Article
Automatic Classification of Morphologically Similar Fish Species Using Their Head Contours
Appl. Sci. 2020, 10(10), 3408; https://doi.org/10.3390/app10103408 - 14 May 2020
Cited by 1 | Viewed by 575
Abstract
This work deals with the task of distinguishing between different Mediterranean demersal species of fish that share a remarkably similar form and that are also used for the evaluation of marine resources. The experts who are currently able to classify these types of species do so by considering only a segment of the contour of the fish, specifically its head, instead of using the entire silhouette of the animal. Based on this knowledge, a set of features to classify contour segments is presented to address both a binary and a multi-class classification problem. Beyond the difficulty of discriminating between very similar forms, we are limited to small, unreliably labeled image datasets. The results obtained were comparable to those obtained by trained experts.

Review

Open Access | Review
Decomposition Methods for Machine Learning with Small, Incomplete or Noisy Datasets
Appl. Sci. 2020, 10(23), 8481; https://doi.org/10.3390/app10238481 - 27 Nov 2020
Viewed by 893
Abstract
In many machine learning applications, measurements are sometimes incomplete or noisy, resulting in missing features. In other cases, and for different reasons, the datasets are originally small, and therefore, more data samples are required to derive useful supervised or unsupervised classification methods. Correct handling of incomplete, noisy or small datasets in machine learning is a fundamental and classic challenge. In this article, we provide a unified review of recently proposed methods based on signal decomposition for missing-feature imputation (data completion), classification of noisy samples and artificial generation of new data samples (data augmentation). We illustrate the application of these signal decomposition methods in diverse selected practical machine learning examples including brain–computer interfaces, classification of epileptic intracranial electroencephalogram signals, face recognition/verification and water network data analysis. We show that a signal decomposition approach can provide valuable tools to improve machine learning performance with low-quality datasets.
