Proceeding Paper

A Complete Classification and Clustering Model to Account for Continuous and Categorical Data in Presence of Missing Values and Outliers

by Guillaume Revillon and Ali Mohammad-Djafari *
L2S, CentraleSupélec-Univ Paris Saclay, 91192 Gif-sur-Yvette, France
* Author to whom correspondence should be addressed.
Presented at the 39th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Garching, Germany, 30 June–5 July 2019.
These authors contributed equally to this work.
Proceedings 2019, 33(1), 23; https://doi.org/10.3390/proceedings2019033023
Published: 9 December 2019

Abstract: Classification and clustering problems are closely connected with pattern recognition, where many general algorithms have been developed and used in various fields. Depending on the complexity of the patterns in the data, classification and clustering procedures should take into consideration both continuous and categorical data, which can be partially missing and erroneous due to mismeasurements and human errors. However, most algorithms cannot handle missing data, and imputation methods are required to complete the data before such algorithms can be applied. Hence, the main objective of this work is to define a classification and clustering framework that handles both outliers and missing values. Here, an approach based on mixture models is preferred, since mixture models provide a mathematically grounded, flexible and meaningful framework for the wide variety of classification and clustering requirements. More precisely, a scale mixture of Normal distributions is extended to handle outliers and missing data for both types of data. Then, variational Bayesian inference is used to find approximate posterior distributions of the parameters and to provide a lower bound on the model log evidence, which is used as a criterion for selecting the number of clusters. Finally, experiments are carried out to demonstrate the effectiveness of the proposed model through an application in Electronic Warfare.

1. Introduction

Classification and clustering problems are closely connected with pattern recognition [1], where many general algorithms [2,3,4] have been developed and used in various fields [5,6]. Depending on the complexity of the patterns in the data, classification and clustering procedures should take into consideration both continuous and categorical data, which can be partially missing and erroneous due to mismeasurements and human errors. However, most algorithms cannot handle missing data, and imputation methods [7] are required to complete the data before such algorithms can be applied. Hence, the main objective of this work is to define a classification and clustering framework that handles both outliers and missing values. Here, an approach based on mixture models is preferred, since mixture models provide a mathematically grounded, flexible and meaningful framework for the wide variety of classification and clustering requirements [8]. Two families of models emerge from finite mixture models fitting mixed-type data:
  • The location mixture model [9] that assumes that continuous variables follow a multivariate Gaussian distribution conditionally on both component and categorical variables.
  • The underlying variables mixture model [10] that assumes that each discrete variable arises from a latent continuous variable and that all continuous variables follow a Gaussian mixture model.
In this work, the location mixture model approach is retained, since it better models the relations between continuous and categorical features when data patterns are mostly designed by first choosing patterns of categorical features to achieve a specific goal and then choosing continuous features that meet constraints related to the chosen patterns and the problem environment. Indeed, from a clustering point of view, each cluster groups observations that share the same combinations of categorical features while their continuous features belong to a specific subset. Hence, the location mixture model naturally accommodates that dependence structure by assuming that continuous variables are normally distributed conditionally on categorical variables. More precisely, a scale mixture of conditional Gaussian distributions [11] is extended to handle outliers and missing data for any type of data. Then, variational Bayesian inference [12] is used to find approximate posterior distributions of the parameters and to provide a lower bound on the model log evidence, which is used as a criterion for selecting the number of clusters.

An application of the resulting model in Electronic Warfare [13] is proposed to perform Source Emission Identification, which is a key asset for decision making in military tactical situations. By providing information about the presence of threats, classification and clustering of radar emitters play a significant role in ensuring that countermeasures against enemies are well chosen and in enabling the detection of unknown radar signals to update databases. As a pulse-to-pulse modulation pattern [14], a radar signal pattern is decomposed into a relevant arrangement of sequences of pulses, where each pulse is defined by continuous features and each sequence is characterized by categorical features. However, a radar signal is often partially observed due to the presence of many radar emitters in the electromagnetic environment, causing mismeasurements and measurement errors. The proposed model is therefore suitable for radar emitter classification and clustering.

The outline of the paper is as follows. Assumptions on mixed-type data are presented in Section 2. Then, the proposed model and the inference procedure are introduced in Section 3. Finally, the model is evaluated through different experiments on radar emitter datasets in Section 4.

2. Mixed-Type Data

In this section, a joint distribution for mixed data is introduced to model the dependence structure between continuous and categorical data. Then, outliers and missing values are tackled by taking advantage of the joint distribution.

2.1. Assumptions on Mixed-Type Data

Data $x$ consist of $J$ observations $(x_j)_{j=1}^{J}$ gathering continuous features $x_q = (x_{qj})_{j=1}^{J}$ and categorical features $x_c = (x_{cj})_{j=1}^{J}$. Let $x_j = (x_{qj}, x_{cj})$ denote the $j$-th observation vector of mixed variables, where
  • $x_{qj} \in \mathbb{R}^d$ is a vector of $d$ continuous variables,
  • $x_{cj} = (x_{cj}^0, \ldots, x_{cj}^{q-1}) \in \mathcal{C}^q$ is a vector of $q$ categorical variables, where $\mathcal{C}^q = \mathcal{C}_0 \times \cdots \times \mathcal{C}_{q-1}$ is the product space gathering each space $\mathcal{C}_i = \{m_1^i, \ldots, m_{|\mathcal{C}_i|}^i\}$ of events that $x_{cj}^i$ can take, $\forall i \in \{0, \ldots, q-1\}$.

2.2. Distribution of Mixed-Type Data

Considering that the retained approach focuses on conditioning continuous data $x_q = (x_{qj})_{j=1}^{J}$ on categorical data $x_c = (x_{cj})_{j=1}^{J}$, the following joint distribution is introduced:
$$\forall j \in \{1, \ldots, J\}, \quad p(x_{qj}, x_{cj}) = \prod_{c \in \mathcal{C}^q} \left[ \pi_c \, \mathcal{N}(x_{qj} \mid \mu_c, \Sigma) \right]^{\delta_{x_{cj} c}} \qquad (1)$$
where continuous variables $x_{qj}$ are normally distributed according to categorical variables $x_{cj}$ with means $(\mu_c)_{c \in \mathcal{C}^q}$ and variance $\Sigma$. As for categorical variables $x_{cj}$, they are jointly distributed according to a multivariate categorical distribution $\mathcal{MC}(x_{cj} \mid \pi)$, parametrized by weights $\pi = (\pi_c)_{c \in \mathcal{C}^q}$ and defined by
$$\mathcal{MC}(x_{cj} \mid \pi) = \prod_{c \in \mathcal{C}^q} \pi_c^{\delta_{x_{cj} c}} \qquad (2)$$
where $\forall c = (c_0, \ldots, c_{q-1}) \in \mathcal{C}^q = \mathcal{C}_0 \times \cdots \times \mathcal{C}_{q-1}$:
$$\sum_{c \in \mathcal{C}^q} \pi_c = 1 \quad \text{and} \quad \delta_{x_{cj} c} = \begin{cases} 1 & \text{if } x_{cj}^0 = c_0, \ldots, x_{cj}^{q-1} = c_{q-1}, \\ 0 & \text{otherwise.} \end{cases}$$
This multivariate categorical distribution is proposed to tackle issues related to missing data by modelling a dependence structure for $x_{cj}$ that enables inference on missing categorical features.
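To make the dependence structure concrete, the short sketch below evaluates the joint density in (1) for a single observation. It is only an illustration: the $|\mathcal{C}^q|$ categorical combinations are assumed to be enumerated as integer codes, and the array names are not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_density(x_q, c, pi, mu, Sigma):
    """p(x_qj, x_cj) = pi_c * N(x_qj | mu_c, Sigma), cf. Equation (1),
    with the categorical combination x_cj encoded as the integer c."""
    return pi[c] * multivariate_normal.pdf(x_q, mean=mu[c], cov=Sigma)

# toy setup: d = 2 continuous features, |C^q| = 3 categorical combinations
pi = np.array([0.5, 0.3, 0.2])                        # weights, summing to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-2.0, 1.0]])  # conditional means mu_c
Sigma = np.eye(2)                                     # shared covariance

print(joint_density(np.array([0.1, -0.2]), c=0, pi=pi, mu=mu, Sigma=Sigma))
```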

2.3. Outlier Handling

Outliers are only considered for continuous data $x_q = (x_{qj})_{j=1}^{J}$, since only reliable categorical variables are assumed to be filled in databases and unreliable ones are processed as missing data. Continuous outliers are then handled by introducing scale latent variables $u = (u_j)_{j=1}^{J}$, conditioned on categorical data $x_c$ through the dependence structure established in (1), such that
$$\forall j \in \{1, \ldots, J\}, \quad x_{qj} \mid u_j, x_{cj} \sim \prod_{c \in \mathcal{C}^q} \mathcal{N}\!\left( x_{qj} \mid \mu_c, u_j^{-1} \Sigma \right)^{\delta_{x_{cj} c}} \quad \text{and} \quad u_j \mid x_{cj} \sim \prod_{c \in \mathcal{C}^q} \mathcal{G}(u_j \mid \alpha_c, \beta_c)^{\delta_{x_{cj} c}},$$
where each $u_j$ follows, conditionally on the categorical data $x_{cj}$, a Gamma distribution with shape and rate parameters $(\alpha_c, \beta_c) \in \mathbb{R}_+^* \times \mathbb{R}_+^*$.
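As a quick numerical illustration of why this scale mixture is robust, the sketch below draws from $\mathcal{N}(\mu_c, u^{-1}\Sigma)$ with $u \sim \mathcal{G}(\alpha_c, \beta_c)$ for one fixed combination $c$; marginally over $u$ this produces a heavier-tailed, Student-t-like distribution, so occasional extreme values are absorbed by small draws of $u$ rather than distorting $\mu_c$ and $\Sigma$. Parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 10_000, 2
mu_c, Sigma = np.zeros(d), np.eye(d)
alpha_c, beta_c = 2.0, 2.0                                  # Gamma shape and rate

u = rng.gamma(shape=alpha_c, scale=1.0 / beta_c, size=n)    # scale latent variables u_j
x = np.array([rng.multivariate_normal(mu_c, Sigma / u_j) for u_j in u])
g = rng.multivariate_normal(mu_c, Sigma, size=n)            # plain Gaussian baseline

# the scale mixture produces far more extreme samples than the Gaussian
print("largest |coordinate|, scale mixture:", np.abs(x).max())
print("largest |coordinate|, Gaussian     :", np.abs(g).max())
```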

2.4. Missing Data Handling

Both continuous and categorical data $(x_{qj}, x_{cj})_{j=1}^{J}$ can be partially observed. Hence, $(x_{qj}, x_{cj})_{j=1}^{J}$ are decomposed into observed features $(x_{qj}^{\mathrm{obs}}, x_{cj}^{\mathrm{obs}})_{j=1}^{J}$ and missing features $(x_{qj}^{\mathrm{miss}}, x_{cj}^{\mathrm{miss}})_{j=1}^{J}$ such that $\forall j \in \{1, \ldots, J\}$:
$$x_{qj} = (x_{qj}^{\mathrm{miss}}, x_{qj}^{\mathrm{obs}}) \;\text{ with }\; (x_{qj}^{\mathrm{miss}}, x_{qj}^{\mathrm{obs}}) \in \mathbb{R}^{d_j^{\mathrm{miss}}} \times \mathbb{R}^{d_j^{\mathrm{obs}}} \;\text{ and }\; d_j^{\mathrm{miss}} + d_j^{\mathrm{obs}} = d,$$
$$x_{cj} = (x_{cj}^{\mathrm{miss}}, x_{cj}^{\mathrm{obs}}) \;\text{ with }\; (x_{cj}^{\mathrm{miss}}, x_{cj}^{\mathrm{obs}}) \in \mathcal{C}^{q_j^{\mathrm{miss}}} \times \mathcal{C}^{q_j^{\mathrm{obs}}} \;\text{ and }\; q_j^{\mathrm{miss}} + q_j^{\mathrm{obs}} = q,$$
where $(\mathbb{R}^{d_j^{\mathrm{miss}}}, \mathcal{C}^{q_j^{\mathrm{miss}}})$ and $(\mathbb{R}^{d_j^{\mathrm{obs}}}, \mathcal{C}^{q_j^{\mathrm{obs}}})$ are disjoint subsets of $(\mathbb{R}^d, \mathcal{C}^q)$ embedding the missing features $(x_{qj}^{\mathrm{miss}}, x_{cj}^{\mathrm{miss}})$ and the observed features $(x_{qj}^{\mathrm{obs}}, x_{cj}^{\mathrm{obs}})$. Missing continuous data $x_q^{\mathrm{miss}} = (x_{qj}^{\mathrm{miss}})_{j=1}^{J}$ are handled by taking advantage of the properties of the multivariate normal distribution to obtain a distribution for the missing values. Due to the dependence structure established in (1), missing continuous data are distributed conditionally on the observed continuous data $x_q^{\mathrm{obs}} = (x_{qj}^{\mathrm{obs}})_{j=1}^{J}$ and the categorical data $x_c$ as follows:
$$\forall j \in \{1, \ldots, J\}, \quad x_{qj}^{\mathrm{miss}} \mid x_{qj}^{\mathrm{obs}}, x_{cj} \sim \prod_{c \in \mathcal{C}^q} \mathcal{N}\!\left( x_{qj}^{\mathrm{miss}} \mid \mu_{jc}^{x_q^{\mathrm{miss}}}, \Sigma^{x_q^{\mathrm{miss}}} \right)^{\delta_{x_{cj} c}}, \qquad x_{qj}^{\mathrm{obs}} \mid x_{cj} \sim \prod_{c \in \mathcal{C}^q} \mathcal{N}\!\left( x_{qj}^{\mathrm{obs}} \mid \mu_{jc}^{x_q^{\mathrm{obs}}}, \Sigma^{x_q^{\mathrm{obs}}} \right)^{\delta_{x_{cj} c}},$$
where $\forall j \in \{1, \ldots, J\}$, $\forall c \in \mathcal{C}^q$:
$$\mu_{jc}^{x_q^{\mathrm{miss}}} = \mu_c^{\mathrm{miss}} + \Sigma^{\mathrm{cov}} (\Sigma^{\mathrm{obs}})^{-1} \left( x_{qj}^{\mathrm{obs}} - \mu_c^{\mathrm{obs}} \right), \qquad \mu_{jc}^{x_q^{\mathrm{obs}}} = \mu_c^{\mathrm{obs}},$$
$$\Sigma^{x_q^{\mathrm{miss}}} = \Sigma^{\mathrm{miss}} - \Sigma^{\mathrm{cov}} (\Sigma^{\mathrm{obs}})^{-1} \Sigma^{\mathrm{cov}} \quad \text{and} \quad \Sigma^{x_q^{\mathrm{obs}}} = \left( (\Sigma^{\mathrm{obs}})^{-1} + 2\, (\Sigma^{\mathrm{obs}})^{-1} \Sigma^{\mathrm{cov}} (\Sigma^{x_q^{\mathrm{miss}}})^{-1} \Sigma^{\mathrm{cov}} (\Sigma^{\mathrm{obs}})^{-1} \right)^{-1}.$$
Noting that the dependence structure between categorical features is modeled through the Kronecker symbols $(\delta_{x_{cj} c})_{c \in \mathcal{C}^q}$, this structure can be exploited to handle missing features: the missing features $x_{cj}^{\mathrm{miss}}$ follow a multivariate categorical distribution conditionally on the observed features $x_{cj}^{\mathrm{obs}}$, given by
$$p\!\left( x_{cj}^{\mathrm{miss}} = c^{\mathrm{miss}} \mid x_{cj}^{\mathrm{obs}} = c^{\mathrm{obs}} \right) = \frac{\pi_{c^{\mathrm{miss}}, c^{\mathrm{obs}}}}{\displaystyle\sum_{c^{\mathrm{miss}} \in \mathcal{C}^{q_j^{\mathrm{miss}}}} \pi_{c^{\mathrm{miss}}, c^{\mathrm{obs}}}}$$
where $\pi_{c^{\mathrm{miss}}, c^{\mathrm{obs}}}$ is the joint probability $\pi_c$ defined in (2) for $c = (c^{\mathrm{miss}}, c^{\mathrm{obs}}) \in \mathcal{C}^{q_j^{\mathrm{miss}}} \times \mathcal{C}^{q_j^{\mathrm{obs}}}$.
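Both conditionals above reduce to standard operations: a Schur-complement update of the Gaussian block and a slice-and-renormalise of the categorical weight table. A minimal sketch under those assumptions, with illustrative variable names only:

```python
import numpy as np

def conditional_gaussian(x_obs, mu, Sigma, obs_idx, miss_idx):
    """Mean and covariance of x_miss | x_obs for a joint Gaussian N(mu, Sigma)."""
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(miss_idx, obs_idx)]
    S_mm = Sigma[np.ix_(miss_idx, miss_idx)]
    mean = mu[miss_idx] + S_mo @ np.linalg.solve(S_oo, x_obs - mu[obs_idx])
    cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    return mean, cov

def conditional_categorical(pi, obs_value, obs_axis):
    """p(x_c^miss | x_c^obs) obtained by slicing the joint table pi and renormalising."""
    sliced = np.take(pi, obs_value, axis=obs_axis)
    return sliced / sliced.sum()

# toy example: d = 3 with feature 1 missing; q = 2 categorical variables, the second observed
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
print(conditional_gaussian(np.array([0.4, 1.8]), mu, Sigma, obs_idx=[0, 2], miss_idx=[1]))

pi = np.array([[0.1, 0.2], [0.3, 0.4]])                     # joint weights over C_0 x C_1
print(conditional_categorical(pi, obs_value=1, obs_axis=1)) # p(x_c^0 | x_c^1 = 1)
```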

3. Model and Inference

In this section, the proposed model is briefly presented as a hierarchical latent variable model handling missing values and outliers. Then, the inference procedure is developed through a variational Bayesian approximation. Finally, classification and clustering algorithms based on the proposed model are introduced.

3.1. Model

According to a dataset $x^{\mathrm{obs}}$ of i.i.d. observations, independent latent variables $h = (x^{\mathrm{miss}}, u, z)$, parameters $\Theta = (a, \pi, \alpha, \beta, \mu, \Sigma)$ of the $K$ clusters and the assumptions on mixed-type data defined in Section 2.1, the complete likelihood of the proposed mixture model can be expressed as
$$p(x^{\mathrm{obs}}, h \mid \Theta, K) = \prod_{j=1}^{J} \prod_{k=1}^{K} \left[ a_k \prod_{c \in \mathcal{C}^q} \left[ \pi_{kc} \, \mathcal{N}\!\left( (x_{qj}^{\mathrm{miss}}, x_{qj}^{\mathrm{obs}}) \mid \mu_{kc}, u_j^{-1} \Sigma_k \right) \mathcal{G}(u_j \mid \alpha_{kc}, \beta_{kc}) \right]^{\delta_{(x_{cj}^{\mathrm{miss}}, x_{cj}^{\mathrm{obs}})\, c}} \right]^{\delta_{z_j k}}$$
where
  • $x^{\mathrm{obs}} = (x_{qj}^{\mathrm{obs}}, x_{cj}^{\mathrm{obs}})_{j=1}^{J}$ are the observed features,
  • $x^{\mathrm{miss}} = (x_{qj}^{\mathrm{miss}}, x_{cj}^{\mathrm{miss}})_{j=1}^{J}$ are the latent variables modelling the missing features,
  • $z = (z_j)_{j=1}^{J}$ are the independent labels of the continuous and categorical observations $x = (x_{qj}, x_{cj})_{j=1}^{J}$,
  • $u = (u_j)_{j=1}^{J}$ are the scale latent variables handling outliers for quantitative data $x_q$, distributed according to a Gamma distribution with shape and rate parameters $(\alpha, \beta) = (\alpha_{kc}, \beta_{kc})_{(k,c) \in \{1,\ldots,K\} \times \mathcal{C}^q}$,
  • $a = (a_k)_{k=1}^{K}$ are the weights of the component distributions,
  • $(\mu, \Sigma) = ((\mu_{kc})_{c \in \mathcal{C}^q}, \Sigma_k)_{k=1}^{K}$ are the mean and variance parameters of quantitative data $x_q$ for each cluster,
  • $\pi = (\pi_k)_{k=1}^{K}$ are the weights of the multivariate categorical distribution of categorical data $x_c$ for each cluster.
Finally, the Bayesian framework requires the specification of a prior distribution $p(\Theta \mid K)$ for $\Theta$, which is chosen as
$$p(\Theta \mid K) = p(a \mid K)\, p(\pi \mid K)\, p(\alpha, \beta \mid K)\, p(\mu, \Sigma \mid K) = \mathcal{D}(a \mid \kappa_0) \prod_{k=1}^{K} \mathcal{D}(\pi_k \mid \pi_0) \prod_{c \in \mathcal{C}^q} p(\alpha_{kc}, \beta_{kc} \mid p_0, q_0, s_0, r_0)\, \mathcal{N}(\mu_{kc} \mid \mu_{0kc}, \eta_{0kc}^{-1} \Sigma_k)\, \mathcal{IW}(\Sigma_k \mid \gamma_0, \Sigma_0)$$
where $\mathcal{D}(\cdot \mid \cdot)$ and $\mathcal{IW}(\cdot \mid \cdot)$ denote the Dirichlet and Inverse-Wishart distributions and $p(\cdot, \cdot \mid p, q, s, r)$ is a particular distribution designed to avoid a non-closed-form posterior distribution for $(\alpha, \beta)$, such that $\forall (\alpha, \beta) \in \mathbb{R}_+^* \times \mathbb{R}_+^*$, $p(\alpha, \beta \mid p, q, s, r) \propto p^{\alpha - 1} e^{-q\beta} \dfrac{\beta^{s\alpha}}{\Gamma(\alpha)^{r}}$.
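For intuition, the complete likelihood can be read generatively: draw a cluster, then a categorical combination, then a scale variable, then the continuous features. A minimal sampling sketch, assuming the combinations of $\mathcal{C}^q$ are flattened to integer codes; all array names are illustrative and fully observed data are generated (no missingness).

```python
import numpy as np

def sample_observation(a, pi, mu, Sigma, alpha, beta, rng):
    """One draw (z_j, x_cj, u_j, x_qj) following the hierarchy of Section 3.1.

    a     : (K,)      cluster weights
    pi    : (K, C)    categorical weights per cluster (C = number of combinations)
    mu    : (K, C, d) conditional means
    Sigma : (K, d, d) cluster covariances
    alpha, beta : (K, C) Gamma shape and rate parameters of the scale variable
    """
    k = rng.choice(len(a), p=a)                               # cluster label z_j
    c = rng.choice(pi.shape[1], p=pi[k])                      # categorical combination x_cj
    u = rng.gamma(shape=alpha[k, c], scale=1.0 / beta[k, c])  # scale latent u_j
    x_q = rng.multivariate_normal(mu[k, c], Sigma[k] / u)     # continuous features x_qj
    return k, c, u, x_q

rng = np.random.default_rng(0)
K, C, d = 2, 3, 2
a = np.array([0.6, 0.4])
pi = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
mu = rng.normal(size=(K, C, d))
Sigma = np.stack([np.eye(d)] * K)
alpha = np.full((K, C), 2.0)
beta = np.full((K, C), 2.0)
print(sample_observation(a, pi, mu, Sigma, alpha, beta, rng))
```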

3.2. Variational Bayesian Inference

The intractable posterior distribution $P = p(h, \Theta \mid x^{\mathrm{obs}}, K)$ is approximated by a tractable one, $Q = q(h, \Theta \mid K)$, whose parameters are chosen via a variational principle to minimize the Kullback-Leibler (KL) divergence
$$\mathrm{KL}(Q \,\|\, P) = \int q(h, \Theta \mid K) \log \frac{q(h, \Theta \mid K)}{p(h, \Theta \mid x^{\mathrm{obs}}, K)} \, \mathrm{d}h \, \mathrm{d}\Theta = \log p(x^{\mathrm{obs}} \mid K) - \mathcal{L}(q \mid K)$$
with $\mathcal{L}(q \mid K)$ a lower bound on the log evidence $\log p(x^{\mathrm{obs}} \mid K)$, given by
$$\mathcal{L}(q \mid K) = \mathbb{E}_{h, \Theta}\!\left[ \log p(x^{\mathrm{obs}}, h, \Theta \mid K) \right] - \mathbb{E}_{h, \Theta}\!\left[ \log q(h, \Theta \mid K) \right], \qquad (3)$$
where $\mathbb{E}_{h,\Theta}[\cdot]$ denotes the expectation with respect to $q(h, \Theta \mid K)$. Then, minimizing the KL divergence is equivalent to maximizing $\mathcal{L}(q \mid K)$. Assuming that $q(h, \Theta \mid K)$ can be factorized over the latent variables $h$ and the parameters $\Theta$, a free-form maximization with respect to $q(h \mid K)$ and $q(\Theta \mid K)$ leads to the following update rules:
$$\text{VBE step:}\quad q(h \mid K) \propto \exp\!\left( \mathbb{E}_{\Theta}\!\left[ \log p(x^{\mathrm{obs}}, h \mid \Theta, K) \right] \right), \qquad \text{VBM step:}\quad q(\Theta \mid K) \propto \exp\!\left( \mathbb{E}_{h}\!\left[ \log p(x^{\mathrm{obs}}, h, \Theta \mid K) \right] \right).$$
Thereafter, the algorithm iteratively updates the variational posteriors by increasing the bound $\mathcal{L}(q \mid K)$. Even if the latent variables $h$ and the parameters $\Theta$ are assumed to be independent a posteriori, their conditional structures are preserved as follows:
$$q(h \mid K) = q(x_q^{\mathrm{miss}} \mid u, x_c^{\mathrm{miss}}, z, K)\, q(u \mid x_c^{\mathrm{miss}}, z, K)\, q(x_c^{\mathrm{miss}} \mid z, K)\, q(z \mid K), \qquad q(\Theta \mid K) = q(a \mid K)\, q(\pi \mid K)\, q(\alpha, \beta \mid K)\, q(\mu, \Sigma \mid K).$$
Eventually, the following conjugate variational posterior distributions are obtained according to the previous assumptions:
$$q(h \mid K) = \prod_{j=1}^{J} \prod_{k=1}^{K} \left[ \tilde{r}_{jk} \prod_{c^{\mathrm{miss}} \in \mathcal{C}^{q_j^{\mathrm{miss}}}} \left[ \tilde{r}_{jk c^{\mathrm{miss}}}^{x_c^{\mathrm{miss}}} \prod_{c^{\mathrm{obs}} \in \mathcal{C}^{q_j^{\mathrm{obs}}}} \left[ \mathcal{N}\!\left( x_{qj}^{\mathrm{miss}} \mid \tilde{\mu}_{jkc}^{x_q^{\mathrm{miss}}}, u_j^{-1} \tilde{\Sigma}_k^{x_q^{\mathrm{miss}}} \right) \mathcal{G}\!\left( u_j \mid \tilde{\alpha}_{jkc}, \tilde{\beta}_{jkc} \right) \right]^{\delta_{x_{cj}^{\mathrm{obs}} c^{\mathrm{obs}}}} \right]^{\delta_{x_{cj}^{\mathrm{miss}} c^{\mathrm{miss}}}} \right]^{\delta_{z_j k}},$$
$$q(\Theta \mid K) = \mathcal{D}(a \mid \tilde{\kappa}) \prod_{k=1}^{K} \mathcal{D}(\pi_k \mid \tilde{\pi}_k) \prod_{c \in \mathcal{C}^q} p(\alpha_{kc}, \beta_{kc} \mid \tilde{p}_k, \tilde{q}_k, \tilde{s}_k, \tilde{r}_k)\, \mathcal{N}(\mu_{kc} \mid \tilde{\mu}_{kc}, \tilde{\eta}_{kc}^{-1} \Sigma_k)\, \mathcal{IW}(\Sigma_k \mid \tilde{\gamma}_k, \tilde{\Sigma}_k).$$
Their respective parameters are estimated during the VBE and VBM steps by developing the expectations $\mathbb{E}_{\Theta}\!\left[ \log p(x^{\mathrm{obs}}, h \mid \Theta, K) \right]$ and $\mathbb{E}_{h}\!\left[ \log p(x^{\mathrm{obs}}, h, \Theta \mid K) \right]$.
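The VBE/VBM alternation is a coordinate-ascent scheme, and the closed-form updates of the full mixed-data model are too lengthy to reproduce here. The sketch below therefore illustrates the same mechanism on a deliberately simple conjugate model (a univariate Gaussian with unknown mean and precision under a Normal-Gamma prior); it is not the paper's model, only a compact example of how factorized posteriors are updated in turn until the bound stabilises.

```python
import numpy as np

def cavi_normal_gamma(y, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate-ascent VB for y_i ~ N(mu, tau^-1) with a Normal-Gamma prior.
    q(mu) = N(mu_n, lam_n^-1) and q(tau) = Gamma(a_n, b_n); the two updates
    play the roles of the VBE and VBM steps."""
    n, ybar = len(y), float(np.mean(y))
    a_n = a0 + (n + 1) / 2.0                      # fixed by conjugacy
    b_n = b0                                      # initial guess, refined below
    for _ in range(n_iter):
        e_tau = a_n / b_n                         # E_q[tau]
        mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # expectations of the squared terms under the current q(mu)
        e_data = np.sum((y - mu_n) ** 2) + n / lam_n
        e_prior = (mu_n - mu0) ** 2 + 1.0 / lam_n
        b_n = b0 + 0.5 * (e_data + lam0 * e_prior)
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=0.5, size=200)
mu_n, lam_n, a_n, b_n = cavi_normal_gamma(y)
print(mu_n, a_n / b_n)   # posterior mean close to 2.0, E[tau] close to 1/0.25 = 4
```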

3.3. Classification and Clustering

According to the degree of supervision, three problems can be distinguished: supervised classification, semi-supervised classification and unsupervised classification, known as clustering. The supervised classification problem is decomposed into a training step and a prediction step. The training step consists in estimating the parameters $\Theta$ given the number of classes $K$ and a set of training data $x$ with known labels $z$. Then, the prediction step associates the label $z^*$ of a new sample $x^*$ with its class $k^*$, chosen as the Maximum A Posteriori (MAP) solution
$$k^* = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; q(z^* = k \mid K)$$
given the previously estimated parameters $\Theta$. In semi-supervised classification, only the number of classes $K$ is known, and both the labels $z$ of the dataset $x$ and the parameters $\Theta$ have to be determined. As for the prediction step, the MAP criterion is retained for assigning observations to classes such that
$$k^* = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; q(z = k \mid K).$$
Given a set of data $x$, the clustering problem aims to determine the number of clusters $\tilde{K}$, the labels $z$ and the parameters $\Theta$. Selecting the appropriate $\tilde{K}$ is a model selection issue and is usually based on a maximized likelihood criterion, given by
$$\tilde{K} = \underset{K}{\arg\max}\; \log p(x \mid K) = \underset{K}{\arg\max}\; \log \int p(x, \Theta \mid K)\, \mathrm{d}\Theta. \qquad (4)$$
Unfortunately, $\log p(x \mid K)$ is intractable, and the lower bound in (3) is preferred to penalized likelihood criteria [8,15,16] since it does not rely on asymptotic assumptions and does not require Maximum Likelihood estimates. Then, given an a priori range of numbers of clusters $\{K_{\min}, \ldots, K_{\max}\}$, the semi-supervised classification is performed for each $K \in \{K_{\min}, \ldots, K_{\max}\}$ and both $z_K$ and $\Theta_K$ are estimated. Finally, the number of classes $\tilde{K}$ in (4) is chosen as the maximizer of the lower bound $\mathcal{L}(q \mid K)$:
$$\tilde{K} = \underset{K}{\arg\max}\; \mathcal{L}(q \mid K). \qquad (5)$$
After determining $\tilde{K}$, only $z_{\tilde{K}}$ and $\Theta_{\tilde{K}}$ are kept as the estimated labels and parameters.
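In practice, the clustering procedure therefore amounts to running the semi-supervised fit over the candidate range of $K$ and keeping the run with the highest lower bound. A schematic sketch, where fit_vb is a placeholder for any routine returning the fitted posteriors together with $\mathcal{L}(q \mid K)$ (it is not a function defined in the paper):

```python
def select_number_of_clusters(x_obs, k_min, k_max, fit_vb):
    """Fit the model for each K in [k_min, k_max] and keep the best lower bound."""
    fits = {}
    for k in range(k_min, k_max + 1):
        q_h, q_theta, lower_bound = fit_vb(x_obs, k)   # one semi-supervised VB run
        fits[k] = (q_h, q_theta, lower_bound)
    k_best = max(fits, key=lambda k: fits[k][2])       # K maximising L(q | K)
    return k_best, fits[k_best]
```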

4. Application

In this section, the proposed method is applied to a radar emitter dataset. For comparison, a standard neural network (NN), the k-nearest neighbours (KNN) algorithm, Random Forests (RdF) and the k-means algorithm are also evaluated. Two experiments are carried out to evaluate classification and clustering performance with respect to a range of percentages of missing values.

4.1. Data

Realistic data are generated from an operational database gathering 55 radar emitters presenting various patterns. Each pattern consists of a sequence of pulses which are defined by a triplet of continuous features (pulse features) and a quartet of categorical features (pulse modulations) listed among 42 combinations of the categorical features. For each radar emitter, 100 observations $(x_j)_{j=1}^{100}$ are simulated from its pattern of pulses, such that an observation $x_j = (x_{qj}, x_{cj})$ is made up of continuous features $x_{qj}$ and categorical features $x_{cj}$ related to one of the pulses. Extra missing values are added to evaluate the limits of the proposed approach by randomly deleting coordinates of $(x_{qj})_{j=1}^{100}$ and $(x_{cj})_{j=1}^{100}$ for each of the 55 radar emitters. Imputation methods [17] are therefore used to handle missing data for the comparison algorithms. Continuous missing data are handled through the mean and k-nearest neighbours imputation methods, whereas missing categorical data are handled through the k-nearest neighbours and mode imputation methods. These imputation methods are compared with the proposed approach, where missing continuous data are reconstructed through the variational posterior marginal mean of the missing continuous data, given by, $\forall j \in \{1, \ldots, J\}$,
$$\tilde{x}_{qj}^{\mathrm{miss}} = \mathbb{E}_{q(x_{qj}^{\mathrm{miss}}, u_j, x_{cj}^{\mathrm{miss}}, z_j)}\!\left[ x_{qj}^{\mathrm{miss}} \right] = \sum_{k=1}^{K} \tilde{r}_{jk} \sum_{c^{\mathrm{obs}} \in \mathcal{C}^{q_j^{\mathrm{obs}}}} \delta_{x_{cj}^{\mathrm{obs}} c^{\mathrm{obs}}} \sum_{c^{\mathrm{miss}} \in \mathcal{C}^{q_j^{\mathrm{miss}}}} \tilde{r}_{jk c^{\mathrm{miss}}}^{x_c^{\mathrm{miss}}} \, \tilde{\mu}_{jk (c^{\mathrm{obs}}, c^{\mathrm{miss}})}^{x_q^{\mathrm{miss}}} \qquad (6)$$
and missing categorical data are reconstructed through the variational posterior marginal mode of the missing categorical data, given by, $\forall j \in \{1, \ldots, J\}$,
$$\tilde{x}_{cj}^{\mathrm{miss}} = \underset{c^{\mathrm{miss}} \in \mathcal{C}^{q_j^{\mathrm{miss}}}}{\arg\max} \int q(x_{cj}^{\mathrm{miss}}, z_j)\, \mathrm{d}z_j = \underset{c^{\mathrm{miss}} \in \mathcal{C}^{q_j^{\mathrm{miss}}}}{\arg\max} \sum_{k=1}^{K} \tilde{r}_{jk}\, \tilde{r}_{jk c^{\mathrm{miss}}}^{x_c^{\mathrm{miss}}}. \qquad (7)$$
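Concretely, given the variational responsibilities, reconstruction (6) is a weighted average of conditional means and reconstruction (7) a weighted argmax. A minimal sketch for one observation, with the categorical combinations flattened to integer codes and all array names illustrative:

```python
import numpy as np

def reconstruct_continuous(r_jk, r_jk_cmiss, mu_tilde):
    """Posterior mean of x_qj^miss in the spirit of Equation (6).

    r_jk       : (K,)                cluster responsibilities for observation j
    r_jk_cmiss : (K, C_miss)         responsibilities of the missing categorical combinations
    mu_tilde   : (K, C_miss, d_miss) variational conditional means (observed combination fixed)
    """
    return np.einsum('k,kc,kcd->d', r_jk, r_jk_cmiss, mu_tilde)

def reconstruct_categorical(r_jk, r_jk_cmiss):
    """Posterior mode of x_cj^miss in the spirit of Equation (7)."""
    return int(np.argmax(r_jk @ r_jk_cmiss))

r_jk = np.array([0.7, 0.3])
r_jk_cmiss = np.array([[0.6, 0.4], [0.2, 0.8]])
mu_tilde = np.array([[[0.0], [1.0]], [[2.0], [3.0]]])
print(reconstruct_continuous(r_jk, r_jk_cmiss, mu_tilde))   # weighted mean of the missing block
print(reconstruct_categorical(r_jk, r_jk_cmiss))            # most probable missing combination
```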

4.2. Classification Experiment

The classification experiment evaluates the ability of each algorithm to assign unlabeled data to one of the K classes trained on a set of labeled data. Since the comparison algorithms cannot handle datasets including missing values, a complete dataset is used for their training. During the prediction step, incomplete observations are completed using the mean and KNN imputation methods and the posterior reconstructions defined in (6) and (7). Results of the classification experiment are shown in Figure 1. Without missing data, the algorithms cannot perfectly classify the 55 radar emitters on the two datasets: they reach accuracies of 90 % for the continuous dataset and 98 % for the mixed dataset. This performance can be explained by the fact that the continuous and categorical datasets are not totally separable, since the 55 emitters share 42 combinations of categorical features and some intervals of continuous features. Nonetheless, when mixed data are taken into consideration, the dataset becomes more separable, leading to higher performance for all algorithms. When the proportion of missing values increases, the proposed model outperforms the comparison algorithms on each dataset. It achieves accuracies of 80 % and 95 % with 90 % of deleted continuous and mixed values, whereas the accuracies of the comparison algorithms are lower than 65 % and 75 % when missing data are imputed by standard methods. This higher performance reveals that the proposed method embeds a more efficient inference mechanism than the other imputation methods. That result is confirmed in Figure 1 when the comparison algorithms are applied to data reconstructed by the proposed model: with the proposed inference, the comparison algorithms reach the same performance as the proposed model and manage to handle missing data even with 90 % of deleted values.
The effectiveness of the proposed model can then be explained by the fact that missing data imputation methods can create outliers that deteriorate the performance of classification algorithms, whereas the inference on missing data and the label prediction are jointly performed in the proposed model. Indeed, embedding the inference procedure into the model framework allows properties of the model, such as outlier handling, to counterbalance drawbacks of imputation methods, such as outlier creation.

4.3. Clustering Experiment

The clustering experiment tests the ability of each algorithm to find the true number of clusters $\tilde{K}$ among $\{35, \ldots, 85\}$. The lower bound (3) and the average Silhouette score [18] are the criteria used to select the optimal number of clusters for the proposed model and the k-means algorithm. Results of the clustering experiment are shown in Figure 2, which presents the numbers of clusters selected by the lower bound and the average Silhouette score for the proposed model and the k-means algorithm for different proportions of missing values and imputation methods. Without missing data, the correct number of clusters (K = 55) is selected by both criteria for the k-means algorithm and the proposed model when continuous and mixed data are clustered. In the presence of missing values, the average Silhouette score mainly selects K = 65 when the k-means algorithm is run on the two datasets completed by standard imputation methods. When the k-means algorithm performs clustering on the posterior reconstructions, the average Silhouette score correctly selects K = 55 up to 60 % of missing values for continuous data and 40 % of missing values for mixed data. Finally, when the proposed model performs clustering, both criteria select the correct number of clusters K = 55 up to 70 % of missing values for continuous and mixed data. These results show two main advantages of the proposed model. As previously, the proposed model provides a more robust inference on missing data, since the average Silhouette score chooses a more representative number of clusters when the k-means algorithm is run on the posterior reconstructions than on data completed by standard imputation methods. Furthermore, since the lower bound criterion selects the same correct number of clusters as the average Silhouette score, it can be used as a valid criterion for selecting the optimal number of clusters; unlike the Silhouette score, it does not require extra computational cost, since it is computed during the model parameter estimation. The proposed approach therefore provides a more robust inference on missing data and a criterion for selecting the optimal number of clusters without extra computations.

5. Conclusions

In this paper, a mixture model handling both continuous and categorical data is developed. More precisely, an approach based on the conditional Gaussian mixture model is investigated by establishing conditional relations between continuous and categorical data. Benefiting from a dependence structure designed for mixed-type data, the proposed model shows its efficiency in inferring missing data, performing classification and clustering tasks, and selecting the correct number of clusters. Since the posterior distribution is intractable, model learning is carried out through a variational Bayesian approximation, where variational posterior distributions are proposed for continuous and categorical missing data. Experiments point out that the proposed approach can handle mixed-type data even in the presence of missing values and can outperform standard algorithms in classification and clustering tasks. Indeed, the main advantage of our approach is that it counterbalances the drawbacks of imputation methods by embedding the inference procedure into the model framework.

References

  1. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany, 2006.
  2. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1979, 28, 100–108.
  3. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise; Kdd: Portland, OR, USA, 1996; Volume 96, pp. 226–231.
  4. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  5. Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. Density-Based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Discov. 1998, 2, 169–194.
  6. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666.
  7. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525.
  8. Biernacki, C.; Celeux, G.; Govaert, G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 719–725.
  9. Lawrence, C.J.; Krzanowski, W.J. Mixture separation for mixed-mode data. Stat. Comput. 1996, 6, 85–92.
  10. Everitt, B. A finite mixture model for the clustering of mixed-mode data. Stat. Probab. Lett. 1988, 6, 305–309.
  11. Andrews, D.F.; Mallows, C.L. Scale Mixtures of Normal Distributions. J. R. Stat. Soc. Ser. B (Methodol.) 1974, 36, 99–102.
  12. Waterhouse, S.; MacKay, D.; Robinson, T. Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1996; pp. 351–357.
  13. Schleher, D.C. Introduction to Electronic Warfare; Technical report; Eaton Corp., AIL Div.: Deer Park, NY, USA, 1986.
  14. Richards, M.A. Fundamentals of Radar Signal Processing; McGraw-Hill Education: New York, NY, USA, 2005.
  15. Akaike, H. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike; Springer: New York, NY, USA, 1998; pp. 199–213.
  16. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  17. García-Laencina, P.J.; Sancho-Gómez, J.L.; Figueiras-Vidal, A.R. Pattern classification with missing data: A review. Neural Comput. Appl. 2010, 19, 263–282.
  18. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 344.
Figure 1. Classification performance is presented for the proposed model (PM) in blue, the NN in red, the RdF in green and the KNN in cyan. For each figure, solid lines represent accuracies with a posteriori reconstructed missing data, dot-dashed lines stand for accuracies with mean/mode imputation, whereas dashed lines show accuracies with KNN imputation for the comparison algorithms.
Figure 2. Estimation of the number of clusters using the lower bound (LB) and the silhouette score (S) for the proposed model and only the silhouette score (S) for the k-means algorithm.
