Survey on Preprocessing Techniques for Big Data Projects

Abstract: In the era of big data, vast amounts of data are being produced. This raises two main issues when trying to discover knowledge from these data: much of the information is not relevant to the problem at hand, and the data contain many imperfections and errors. Preprocessing these data is therefore a key step before applying any kind of learning algorithm. Reducing the number of features to a relevant subset (feature selection) and reducing the possible values of continuous variables (discretisation) are two of the main preprocessing techniques. This paper reviews different methods for completing these two steps, focusing on the big data context and giving examples of projects where they have been applied.


Introduction
With the advent of the "big data" phenomenon, massive amounts of data are generated daily. These data are normally available in a raw format and need to be treated before any knowledge can be extracted from them. This step in the big data chain is usually referred to as preprocessing, and a wide range of techniques exists [1].
The main approaches to preprocessing big data are discretisation and feature selection. The former transforms continuous data into a limited set of values, while the latter aims to reduce the number of attributes [1,2].
The remainder of this paper is organised as follows: Section 2 introduces the different preprocessing techniques, dividing them into feature selection and discretisation. For each of these techniques, a classification with different examples for each category is presented. Section 3 concludes the paper and suggests a future line of research.

Data Preprocessing
Different feature selection and discretisation techniques are presented in this section based on big data projects where they have been applied.

Feature Selection
The different feature selection techniques for big data mining can be classified into filter methods, wrapper methods, and embedded methods [1].

Filter Methods
Features are selected according to the value of different metrics, usually certain statistical criteria.
In the context of text mining, it is common to use the bag-of-words approach so that each word is taken as a unique feature. Chi-squared was used to filter the most relevant terms in a text mining algorithm to estimate credit score at Deutsche Bank [3].
Also in the context of text mining, in [4] tweets are analysed in order to determine the impact of their sentiment on stock market movements. The authors likewise use a filter method, the Fisher score, to select the most relevant features.
Based on Chi-squared and the GUIDE regression tree, Loh [5,6] presents a technique to perform feature selection in a large genomic dataset.
Some work has been done to adapt these methods to the big data context, such as in [11], where a framework to parallelise and scale some of these algorithms is introduced.
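To make the filter idea concrete, the chi-squared criterion mentioned above can be sketched in a few lines of plain Python for binary features (a simplified, self-contained illustration, not the implementation used in the cited works):

```python
from collections import Counter

def chi2_score(feature, labels):
    """Chi-squared statistic between a binary feature and a binary label."""
    n = len(feature)
    counts = Counter(zip(feature, labels))
    f_tot, l_tot = Counter(feature), Counter(labels)
    score = 0.0
    for f in (0, 1):
        for l in (0, 1):
            observed = counts.get((f, l), 0)
            expected = f_tot[f] * l_tot[l] / n  # count under independence
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

def select_top_k(X, y, k):
    """Return the indices of the k features with the highest chi-squared score."""
    scores = [(chi2_score([row[j] for row in X], y), j) for j in range(len(X[0]))]
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])
```

Features whose observed counts deviate most from what independence with the class would predict score highest and survive the filter; no learning algorithm is involved, which is what distinguishes filters from wrappers and embedded methods.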

Embedded Methods
Feature selection is performed in the process of fitting a model to a given dataset. SVM-RFE (Support Vector Machine Recursive Feature Elimination), introduced in [12] to analyse DNA microarrays, has shown its power in several applications, such as in bioinformatics [13].
The Feature Selection-Perceptron (FS-P) [14] technique has been used on a proton (1H) magnetic resonance spectroscopy (MRS) database to select the features that best predict brain tumours.
Based on a more complex neural network, the embedded method BlogReg is introduced in [15], where it is applied to data collected from the sensors of a robot.
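The elimination loop at the heart of SVM-RFE can be sketched as follows. Here a Pearson-correlation weight is used as a hypothetical stand-in for the SVM weight vector, purely to keep the example self-contained; real SVM-RFE refits an SVM at every iteration:

```python
import statistics

def rfe(X, y, weight_fn, n_keep):
    """Recursive feature elimination: repeatedly score the remaining features
    and drop the one with the smallest absolute weight (the idea behind SVM-RFE)."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        cols = [[row[j] for row in X] for j in remaining]
        weights = weight_fn(cols, y)  # one weight per remaining feature
        drop = min(range(len(remaining)), key=lambda i: abs(weights[i]))
        remaining.pop(drop)
    return remaining

def correlation_weight(cols, y):
    """Hypothetical surrogate for the SVM weight vector: Pearson correlation
    of each feature with the label."""
    def corr(c):
        mx, my = statistics.fmean(c), statistics.fmean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(c, y))
        den = (sum((a - mx) ** 2 for a in c) * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den if den else 0.0
    return [corr(c) for c in cols]
```

Because the weights come from the fitted model itself, the selection is "embedded" in training rather than computed from the data alone as in a filter.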

Wrapper Methods
Wrapper methods refer to an iterative process in which a subset of features is evaluated at a time.
A wrapper method based on the decision tree C4.5 has been used for many years [16]. However, developments based on this method are still ongoing, such as the one from [17], which is applied to healthcare data (Medical Internet of Things).
Another wrapper method is based on the SVM algorithm [18]. It has been widely used since its creation, such as in [19], predicting arrhythmias from cardiac data.
FSSEM (Feature Subset Selection wrapped around EM clustering) [20] is also a wrapper method, as is the popular stepwise approach for regression problems [21].
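The iterative loop shared by these wrapper methods can be sketched as a greedy forward search. The evaluator below is a deliberately naive majority-vote scorer, a stand-in for the cross-validated accuracy of C4.5, an SVM, or EM clustering that an actual wrapper would use:

```python
def forward_selection(X, y, evaluate, n_select):
    """Greedy wrapper: at each step add the candidate feature whose
    inclusion yields the best score from `evaluate`."""
    selected = []
    candidates = set(range(len(X[0])))
    while candidates and len(selected) < n_select:
        best = max(candidates, key=lambda j: evaluate(selected + [j], X, y))
        selected.append(best)
        candidates.discard(best)
    return selected

def majority_match(features, X, y):
    """Toy evaluator (hypothetical stand-in for a learner's accuracy):
    fraction of rows where the majority vote of the chosen binary
    features equals the label."""
    hits = 0
    for row, label in zip(X, y):
        vote = sum(row[j] for j in features) * 2 >= len(features)
        hits += int(vote == label)
    return hits / len(y)
```

The subset is evaluated by actually running the learner at every step, which is why wrappers tend to be accurate but expensive on big data.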

Discretisation
Discretisation is the step in which continuous variables are transformed into categorical ones [2]. Multiple classifications of discretisation techniques exist, but here the division into unsupervised and supervised discretisation is adopted [2].

Unsupervised
Unsupervised discretisation methods do not take into account the target of the learning algorithm when the features are discretised.
Equal width interval discretisation and equal frequency interval discretisation need to be adapted to the big data streaming context, as done in [22].
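In their basic (non-streaming) form, these two interval methods reduce to a few lines; a minimal sketch:

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant variable
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bins so that each of the k bins holds (roughly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins
```

Neither function ever looks at a class label, which is what makes these methods unsupervised; the streaming adaptations in [22] address recomputing the interval boundaries as new data arrive.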
In [23], k-means [24] discretisation is used to transform the target for road detection.
Other methods based on the k-means algorithm have been proposed, such as Cokmeans and Bikmeans, used in [25] in the context of microarrays.
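Restricted to one dimension, k-means discretisation amounts to Lloyd's algorithm followed by replacing each value with the index of its nearest centroid; a minimal sketch (the initialisation here is chosen for simplicity, not taken from the cited works):

```python
def kmeans_discretise(values, k, iters=20):
    """Discretise a 1-D variable by k-means: each value becomes
    the index of its nearest centroid."""
    # Initialise centroids spread over the distinct sorted values.
    s = sorted(set(values))
    centroids = [s[i * (len(s) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
```

Unlike equal-width binning, the cut points adapt to where the values actually cluster.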

Supervised
Supervised discretisation does take into account the target of the learning algorithm. One of the most popular methods is based on entropy [26]. This algorithm was parallelised in [27].
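The core step of such entropy-based discretisers is choosing the boundary that minimises the weighted class entropy of the two resulting intervals; a single-cut sketch (the full methods apply this split recursively, with a stopping criterion):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Entropy-minimising cut point: try each candidate boundary and keep
    the one giving the lowest weighted class entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best
```

The dependence on the labels is what makes the method supervised: the chosen boundaries are exactly those that best separate the classes.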
The previously presented approaches are univariate, but there also exist supervised multivariate discretisation (SMD) techniques, such as the one in [32].

Conclusions
Due to space limitations, this paper has only presented a selection of feature selection and discretisation techniques, mentioning some up-to-date examples of where they are used. There is growing interest in adapting these techniques so that they can perform efficiently in the big data context. In this direction, a future line of work is to create a comprehensive and complete taxonomy of up-to-date feature selection and discretisation techniques, together with experimental comparisons in the big data context.
Funding: This research received no external funding.