Next Article in Journal
Dynamics of the Stochastic Belousov-Zhabotinskii Chemical Reaction Model
Next Article in Special Issue
A Novel and Simple Mathematical Transform Improves the Perfomance of Lernmatrix in Pattern Classification
Previous Article in Journal
A New Kind of Nonlinear Quasicontractions in Metric Spaces
Previous Article in Special Issue
Deep Hybrid Model Based on EMD with Classification by Frequency Characteristics for Long-Term Air Quality Prediction
Open AccessArticle

Improving the Accuracy of Convolutional Neural Networks by Identifying and Removing Outlier Images in Datasets Using t-SNE

Oxford Institute for Sustainable Development, School of the Built Environment, Oxford Brookes University, Oxford OX3 0BP, UK
Author to whom correspondence should be addressed.
Mathematics 2020, 8(5), 662;
Received: 22 March 2020 / Revised: 23 April 2020 / Accepted: 23 April 2020 / Published: 27 April 2020
(This article belongs to the Special Issue Advances in Machine Learning Prediction Models)
In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data that is used to train the model. The presence of unwanted outliers in the data could significantly reduce the accuracy of a model or, even worse, result in a biased model leading to an inaccurate classification. Identifying the presence of outliers and eliminating them is, therefore, crucial for building good quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and also prepare datasets in a way that best expose the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our proposed method is based on utilising a technique called t-distributed stochastic neighbour embedding (t-SNE) to reduce high-dimensional map of features into a lower, two-dimensional, probability density distribution and then use a simple descriptive statistical method called interquartile range (IQR) to identifying any outlier values from the density distribution of the features. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method on a dataset containing images for training a convolutional neural network model (ConvNet) for an image classification problem. The dataset contains four different classes of images: three classes contain defects in construction (mould, stain, and paint deterioration) and a no-defect class (normal). We used the transfer learning technique to modify a pre-trained VGG-16 model. We used this model as a feature extractor and as a benchmark to evaluate our method. We have shown that, when using this method, we can identify and remove the outlier images in the dataset. After removing the outlier images from the dataset and re-training the VGG-16 model, the results have also shown that the accuracy of the classification has significantly improved and the number of misclassified cases has also dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on developing techniques for handling outliers in high-dimensional data which can be used to improve the quality of machine learning problems involving images such as ConvNet models for image classification and object detection problems. View Full-Text
Keywords: convolutional neural network; t-SNE; outliers convolutional neural network; t-SNE; outliers
Show Figures

Figure 1

MDPI and ACS Style

Perez, H.; Tah, J.H.M. Improving the Accuracy of Convolutional Neural Networks by Identifying and Removing Outlier Images in Datasets Using t-SNE. Mathematics 2020, 8, 662.

Show more citation formats Show less citations formats
Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers. See further details here.

Article Access Map by Country/Region

Search more from Scilit
Back to TopTop