Article

Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem

by Stefan Michael Stroka * and Christian Heumann
Department of Statistics, Ludwig-Maximilians-University Munich, 80539 Munich, Germany
* Author to whom correspondence should be addressed.
Stats 2024, 7(4), 1189-1208; https://doi.org/10.3390/stats7040070
Submission received: 21 August 2024 / Revised: 11 October 2024 / Accepted: 14 October 2024 / Published: 17 October 2024

Abstract:
The growing interest in data privacy and anonymization presents challenges, as traditional methods such as ordinal discretization often result in information loss by coarsening metric data. Current research suggests that modeling the latent distributions of ordinal classes can reduce the effectiveness of anonymization and increase traceability. In fact, combining probability distributions with a small training sample can effectively infer true metric values from discrete information, depending on the model and data complexity. Our method uses metric values and ordinal classes to model latent normal distributions for each discrete class. This approach, applied with both linear and Bayesian linear regression, aims to enhance supervised learning models. Evaluated with synthetic datasets and real-world datasets from UCI and Kaggle, our method shows improved mean point estimation and narrower prediction intervals compared to the baseline. With 5–10% training data randomly split from each dataset population, it achieves an average 10% reduction in MSE and a ~5–10% increase in R² on out-of-sample test data overall.

1. Introduction

Today, both private and business data are incredibly valuable and are targeted by various interest groups at every opportunity. As a result, data privacy has become a highly relevant issue affecting both individuals and organizations. Often, however, data are handled carelessly, with an overreliance on existing anonymization methods, leading to potential risks.
The permanent recording and insufficiently regulated sale of anonymized data present significant risks of re-identification [1], which, according to current research, cannot be completely ruled out despite protective measures [2]. Experts in data governance warn against the false sense of security provided by data anonymization. Advanced machine learning models and statistical techniques, such as modeling probability distributions, are increasingly uncovering methods to re-identify supposedly anonymized data, thereby compromising anonymity [3].
Common challenges in this field include the small sample size problem, whereby the sample size has a significant impact on the generalizability of research findings [4]; according to common recommendations, a pilot study sample should not be less than 10% of the planned study size [5]. Additionally, computational complexity and the reliability of model predictions pose hurdles, as forecasts depend on the reliability of the given data [6] and on the quantification of uncertainties in the predictions [7].
This paper addresses the traceability and reversibility of anonymized metric data or target values through discretization, which may lead to security or information loss. Techniques such as ordinalization [8] and rounding [9] coarsen the metric space into adjacent ordinal classes, leading to information loss depending on the number of ordinal classes. Previous research has explored the evaluation of data based on discrete classes, considering the uncertainty of data [10], the optimum between data usability and maximum anonymity [11], and data diversity and characteristic details [12]. It has also investigated applying k-anonymization for de-identification [13], the conversion of ordinal classes back to metric values, focusing on assumptions about underlying distributions [14], and an unsupervised learning approach combined with discriminative information [15]. Other approaches have examined how rough set theory, as an example, benefits from discretizing continuous value ranges [16], the conditions under which it makes sense to convert the data [17], and how ordinal classes can be treated as a continuous space [18].
Recent studies have also shown improved methods for enhancing data privacy and anonymity [19,20,21,22,23,24,25], where the main focus is on providing a systematic overview of existing anonymization techniques, especially in light of the increasing availability of data from social networks [19], as well as a comparative study of five current techniques for anonymizing collected data, assessing their strengths and weaknesses [20]. Further research has reviewed existing methods such as generalization and bucketization [21], compared suppression with slicing with other common techniques [22], highlighted weaknesses related to the compatibility of independently generalized data [23], and explored anonymization through pseudonymization [24] and a pseudo creation technique [25]. Furthermore, it is interesting to note how the use of latent influencing factors based on ordinal classes improves Bayesian analysis [26] and generates more accurate classifications compared to traditional classification methods [27].
Modeling probabilistic distributions for latent categorical variables suggests that assuming a continuous latent distribution within ordinal classes allows for precise value derivation from limited data using machine learning models. This implies that probability distributions for ordinal classes, even with small datasets, can provide a good distributional model and a reliable approximation of the underlying continuous latent distribution. This paper aims to demonstrate that data coarsening for anonymization can be misleading and not fully reliable. We propose a new approach that enables high prediction accuracy of true metric values from anonymized data using deterministic or probabilistic supervised learning regression models. The de-anonymized results are also analyzed for uncertainty.
In the following sections, we apply our approach to simulation studies on both low- and high-complexity synthetic data and conduct a benchmarking study with publicly available datasets from various application domains.

2. De-Anonymization of Metric Data

To ensure anonymity in surveys or, in general, in data protection, data discretization is an often-applied approach. This approach involves dividing metric or continuous data into classes, thereby coarsening it into discrete information. In this section, we describe our novel methodology for reliably reverting anonymized information to its true (metric) values, highlighting the issues of this anonymization technique.

2.1. Introduction to the Method

Our methodology is based on the assumption that even a small set of precise, non-discretized information (a very small training dataset with metric values) is sufficient to train models that are capable of de-anonymizing coarsened data and inferring the metric values of out-of-sample data, considering uncertainties. In the process, we model latent distributions (of each class) with normal distributions based on the available precise information and use these to generate probabilities for the discretized observations. The goal is to retrospectively reverse the discretization with minimal bias.

2.2. Process of Discretization

The new approach promises that reliable inferences for the entire discrete ordinal class can be drawn from just a few metric data points per class and their distributions. This logic describes the partitioning of the distribution of the entire metric space into ordered, discrete classes, each associated with a specific sampling distribution. Consequently, the division and choice of class boundaries significantly influence the sampling distribution within each class.
The grouping of data inevitably leads to a loss of information. However, statistical clustering methods that minimize squared errors can assist in optimally establishing group boundaries, thereby facilitating an optimal classification of normally distributed data concerning the frequency distribution within the class [28]. Other methodologies aim to reduce the complexity of continuous distributions while preserving as much information as possible through Representative Points (RPs). RPs can be generated using techniques such as Monte Carlo sampling, deterministic point selection, or MSE-based clustering, thereby optimizing classification by minimizing a loss function. The commonly used k-means algorithm can also be employed as an approach in this context [29].
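As an illustration of the clustering route, the following sketch derives class boundaries from a metric target with k-means; scikit-learn, the sample data, and all variable names are our illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
y = rng.normal(loc=4.0, scale=3.0, size=10_000)  # metric target to coarsen

K = 4  # number of ordinal classes
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(y.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())

# Place boundaries halfway between adjacent cluster centers.
inner = (centers[:-1] + centers[1:]) / 2
classes = np.digitize(y, inner)  # ordinal class 0..K-1 for every observation
```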
In contrast to statistical grouping, there is also the possibility of a predefined, data-independent, or random classification into ordered classes. In practical applications, the methodology may have to work with pre-existing classes. Therefore, the objective of this paper is to model given classes and boundaries, even when they contradict an optimal statistical clustering.

2.3. Theoretical Formulation

Let $X = \{x_1, x_2, x_3, \ldots, x_m\}$ be the feature matrix of $m$ independent variables, where each variable is assumed to be independently and identically distributed (i.i.d.). Let $y$ be the dependent target variable with the value range $W_y$. We address a regression problem
$$f: \mathbb{R}^m \to W \subseteq \mathbb{R}, \qquad f: X \mapsto y$$
with a very small training sample relative to the test data (i.e., the small sample size problem). We loosely define the grid with $K+2$ subclasses as
$$grid_y = \{class_0, class_1, class_2, \ldots, class_K, class_{K+1}\}, \qquad |grid| = K + 2,$$
comprising disjoint ordinal classes that depend on the target variable $y$. The $grid$ refers to a finite set of ordered, ordinally scaled classes $\{0, 1, 2, \ldots, K+1\}$. Consequently, the following holds:
$$\{class_0, class_1, class_2, \ldots, class_{K+1}\} = \{0, 1, 2, \ldots, K, K+1\}.$$
Each class $k$ from the set $\{0, 1, 2, \ldots, K+1\}$ is defined by two threshold values, $a_k$ and $a_{k+1}$. These thresholds satisfy the following conditions:
$$-\infty = a_0 < a_1 < \cdots < a_K < a_{K+1} < a_{K+2} = \infty.$$
As a result, the following holds:
$$\forall k \in \{1, \ldots, K+1\}\ \forall i \in \{1, \ldots, n\}: \quad y_i \in class_k \iff y_i \in [a_k, a_{k+1}) \iff a_k \le y_i < a_{k+1},$$
and for $k = 0$:
$$\forall i \in \{1, \ldots, n\}: \quad y_i \in class_0 \iff y_i \in (a_0, a_1) \iff a_0 < y_i < a_1.$$
Thus, $K+2$ classes are defined by $K+3$ thresholds. Hence, each class has a lower threshold $a_k$ and an upper threshold $a_{k+1}$, which partition the continuous value range $W$ of $y$ into discrete, adjacent subgroups. Specifically, this can be expressed as:
$$grid|_y = [a_0, a_1, \ldots, a_K, a_{K+1}, a_{K+2}] = W.$$
In the following application, we focus on a selected finite subset of the classes and their associated thresholds. Consequently, we disregard the outer classes $class_0 = (a_0, a_1)$ and $class_{K+1} = [a_{K+1}, a_{K+2})$. This restriction does not impact the results because:
$$\forall i \in \{1, \ldots, n\}: \quad y_i \in [a_1, \ldots, a_{K+1}] \subset [a_0, a_1, \ldots, a_{K+1}, a_{K+2}] = W.$$
In the transformation $T$, the metric values $y$ are mapped to classes based on these thresholds:
$$T: W \subseteq \mathbb{R} \to \{class_1, class_2, \ldots, class_K\} \subset \mathbb{N}_0, \qquad T: y \mapsto \{1, \ldots, K\}.$$
The resulting vector $class_y$ is defined as a new, optionally applicable feature $X_{m+1}$, which is subsequently used as an ordinal-scaled variable and one-hot encoded.
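A minimal sketch of the transformation $T$ and the one-hot encoding, using the thresholds later employed in the application example of Section 3; np.digitize and the helper name are our choices, not the authors' implementation.

```python
import numpy as np

thresholds = np.array([-8.9, 0.0, 5.0, 10.0, 16.8])  # a_1, ..., a_{K+1}, K = 4
K = len(thresholds) - 1

def transform_T(y):
    """Map metric values to the ordinal classes 1..K
    (0 and K+1 mark the disregarded outer classes)."""
    return np.digitize(y, thresholds)

y = np.array([-3.2, 1.4, 7.7, 12.9])
cls = transform_T(y)            # -> array([1, 2, 3, 4])
one_hot = np.eye(K)[cls - 1]    # class_y as one-hot-encoded feature X_{m+1}
```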
Accordingly, each class for each observation represents a discretized continuous value $y$. We now split the given data with observations $\{1, \ldots, n\}$ into training data $\{1, \ldots, l\}$ and test data $\{l+1, \ldots, n\}$. For each $k$-th ordinal class $class_k$, a normal distribution $N(\mu_k, \sigma_k^2)$ is fitted by estimating its parameters with the sample mean and standard deviation, approximating the histogram of the training data for $y \in class_k$. With mean $\mu_k$ and standard deviation $\sigma_k$ of the normal distribution for the $k$-th class, $k \in \{1, \ldots, K\}$, it follows that:
$$\forall i \in \{1, \ldots, l\}: \quad y_i^{train} \in class_k \mid class_k \sim N(\mu_k, \sigma_k^2).$$
The parametric modeling of the normal distributions allows for determining the parameters $\mu$ and $\sigma$ for each class, and thus, the densities for each observation can be calculated. While random sampling from the distribution could be used to capture the properties of the density function as an input feature, we aim to provide more precise information for each observation. Therefore, we determine the absolute density $f_i$ and relative frequency $p_i$ of $y_i$ based on the density of its class. For each $k$-th class, we determine the mean and standard deviation as follows:
$$\forall k = 1, \ldots, K: \quad \mu(y \in class_k) = \mu_k, \quad \sigma(y \in class_k) = \sigma_k.$$
Then, we calculate the probability density function (PDF) value for the given training data $y_i^{train}$ given the $k$-th class:
$$f(y_i^{train} \mid class_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left(-\frac{(y_i^{train} - \mu_k)^2}{2\sigma_k^2}\right) = f_{ik}.$$
The PDF indicates how densely the random variable $y_i^{train}$ is distributed around a specific value. For the relative frequency density, we have:
$$p_{ik}^{train} = \frac{f_{ik}^{train}}{N_k^{train}},$$
where $N_k^{train}$ is the total number of training samples in class $k$, and:
$$\sum_{k=1}^{K} N_k = \sum_{k=1}^{K} N_k^{train} + \sum_{k=1}^{K} N_k^{test} = \sum_{k=1}^{K} \left(N_k^{train} + N_k^{test}\right) = n.$$
The relative frequency $p_{ik}^{train}$ thus provides a value that is proportional to the probability density at $y_i$, indicating how likely it is that $y_i$ lies in a small interval around the given point. By integrating the density over a continuous range
$$\left[b - \frac{\epsilon}{2},\ b + \frac{\epsilon}{2}\right]$$
with a sufficiently small $\epsilon$, the actual probability can be approximated by:
$$P\left(b - \frac{\epsilon}{2} \le y_i \le b + \frac{\epsilon}{2} \,\middle|\, class_k\right) = \int_{b - \epsilon/2}^{b + \epsilon/2} f(y_i \mid class_k)\, dy.$$
Since $f(y_i \mid class_k)$ is nearly constant within a sufficiently small interval around $b$, it follows that:
$$P\left(b - \frac{\epsilon}{2} \le y_i \le b + \frac{\epsilon}{2} \,\middle|\, class_k\right) \approx \epsilon \cdot f(b \mid class_k),$$
where $\epsilon$ is independent of the density function $f$.
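The following sketch shows how the class-wise normal distributions and the relative frequencies $p_{ik} = f_{ik} / N_k^{train}$ could be computed; the function names are hypothetical, and SciPy is our tooling choice.

```python
import numpy as np
from scipy.stats import norm

def fit_class_normals(y_train, cls_train, K):
    """Estimate (mu_k, sigma_k, N_k^train) for each ordinal class k = 1..K."""
    params = {}
    for k in range(1, K + 1):
        y_k = y_train[cls_train == k]
        params[k] = (y_k.mean(), y_k.std(ddof=1), len(y_k))
    return params

def probability_features(values, params, K):
    """Matrix P with p_ik = N(values_i | mu_k, sigma_k^2) / N_k for all K classes."""
    P = np.empty((len(values), K))
    for k, (mu, sigma, n_k) in params.items():
        P[:, k - 1] = norm.pdf(values, loc=mu, scale=sigma) / n_k
    return P
```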
To enable this for out-of-sample applications where $y$-values are not available, we employ a simple linear regression model as a transfer learning method. In the first step, we predict the $y^{test}$-values for the out-of-sample data. In the next step, we determine the proportional probabilities, via PDF values, that the $i$-th prediction $\hat{y}_i^{test}$, $i \in \{l+1, \ldots, n\}$, falls within the given classes. With the linear prediction model
$$\hat{y}^{test} = X \hat{\beta} + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon),$$
and the class distributions based on the training data $y^{train}$ (with $\mu_k$, $\sigma_k$), we generate $K$ new input features for the test data for these $K$ classes as follows:
$$f(\hat{y}_i^{test} \mid class_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left(-\frac{(\hat{y}_i^{test} - \mu_k)^2}{2\sigma_k^2}\right) = f_{ik}^{test}$$
and
$$p_{ik}^{test} = \frac{f_{ik}^{test}}{N_k^{test}},$$
with $N_k^{test}$ as the total number of test samples in class $k$. Finally, this results in a new feature matrix $P$ with $n$ observations for $K$ classes, and therefore:
$$P = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & & \vdots \\ p_{n1} & \cdots & p_{nK} \end{pmatrix} = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & & \vdots \\ p_{l1} & \cdots & p_{lK} \\ p_{(l+1)1} & \cdots & p_{(l+1)K} \\ \vdots & & \vdots \\ p_{n1} & \cdots & p_{nK} \end{pmatrix} = P^{train} \oplus P^{test},$$
with $\oplus$ as the row-wise binding operator.
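Continuing the sketches above (and reusing their hypothetical helpers transform_T, fit_class_normals, and probability_features), the transfer step could look as follows; the data generation, the 5/95 split, and the seed are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=10_000)
cls = transform_T(y)                         # ordinal classes 1..K

train = rng.random(len(y)) < 0.05            # ~5% with exact (metric) answers
params = fit_class_normals(y[train], cls[train], K)

reg = LinearRegression().fit(X[train], y[train])
y_hat_test = reg.predict(X[~train])          # step 1: predict y for test rows

P_train = probability_features(y[train], params, K)    # exact y available
P_test = probability_features(y_hat_test, params, K)   # predicted y only
P = np.vstack([P_train, P_test])             # row-wise binding into P
```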

2.4. Features and Model Definition

The choice of regression model is flexible and independent of our proposed method. For demonstration purposes, this paper uses linear regression, but other supervised learning regression models can also be employed. We differentiate between four models based on the features used. Each of these models is evaluated using both linear regression and Bayesian linear regression approaches.

2.4.1. Linear Regression

Let $X^{train}$ be the $(l \times m)$-matrix of the training data, $class^{train}$ $(= class^{train}|y)$ the $(l \times K)$-matrix of ordinal classes dependent on the target variable of the training data ($class^{train}$ is the one-hot-encoded version of an $l \times 1$ vector), and $P^{train}$ the $(l \times K)$-probability matrix (with the probability of each of the $K$ classes for all $l$ training observations). In the following, we modify the input feature matrix $X$ for four different model approaches. For all models, the following applies:
$$y^{train} = X^{(\cdot)} \beta + \epsilon, \qquad \epsilon \sim N(\mu_\epsilon, \sigma_\epsilon),$$
where $X^{(\cdot)}$ $(\cdot \in \{1, \ldots, 4\})$ serves as a placeholder for the respective input feature matrix. It follows that:
Model 1:
$$X^{(1)}_{l \times (1+m)} = \left[\mathbf{1}_{l \times 1} \mid X^{train}_{l \times m}\right] \quad (\text{with } \beta_{(1+m) \times 1})$$
Model 2:
$$X^{(2)}_{l \times (1+m+K)} = \left[\mathbf{1}_{l \times 1} \mid X^{train}_{l \times m} \mid P^{train}_{l \times K}\right] \quad (\text{with } \beta_{(1+m+K) \times 1})$$
Model 3:
$$X^{(3)}_{l \times (1+m+K)} = \left[\mathbf{1}_{l \times 1} \mid X^{train}_{l \times m} \mid class^{train}_{l \times K}\right] \quad (\text{with } \beta_{(1+m+K) \times 1})$$
Model 4:
$$X^{(4)}_{l \times (1+m+2K)} = \left[\mathbf{1}_{l \times 1} \mid X^{train}_{l \times m} \mid class^{train}_{l \times K} \mid P^{train}_{l \times K}\right] \quad (\text{with } \beta_{(1+m+2K) \times 1}).$$
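The four feature settings amount to column-wise concatenations, as the following sketch illustrates; the random stand-in matrices merely keep the snippet runnable and are not part of the method.

```python
import numpy as np

l, m, K = 500, 3, 4
rng = np.random.default_rng(2)
X_metric = rng.normal(size=(l, m))               # stand-in for X^train
one_hot = np.eye(K)[rng.integers(0, K, size=l)]  # stand-in for class^train
P_train = rng.random((l, K))                     # stand-in for P^train

ones = np.ones((l, 1))
X1 = np.hstack([ones, X_metric])                    # model 1: intercept + X
X2 = np.hstack([ones, X_metric, P_train])           # model 2: + probabilities
X3 = np.hstack([ones, X_metric, one_hot])           # model 3: + ordinal classes
X4 = np.hstack([ones, X_metric, one_hot, P_train])  # model 4: + both
```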

2.4.2. Bayesian Linear Regression

Following the deterministic analysis, it is essential to consider whether uncertainties behave similarly. We therefore extend the models to a probabilistic perspective by incorporating prior distributions for $\beta$ in each model, using the standard priors
$$\beta \sim N(0,\ 10^2 I) \quad \text{and} \quad \sigma \sim \mathrm{HalfNormal}(1).$$
The linear predictor is then given by
$$\mu = X \beta$$
with the likelihood
$$y \mid \beta, \sigma \sim N(\mu, \sigma^2).$$
Finally, the posterior distribution is derived as:
$$p(\beta \mid y) \propto LH \cdot p(\beta)\, p(\sigma) \quad \text{with } LH:\ p(y \mid \beta, \sigma) = \prod_{i=1}^{n} N\left(y_i \mid x_i^\top \beta, \sigma^2\right).$$
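A sketch of this Bayesian variant with the stated priors, written in PyMC as our tooling choice (the paper names no library); X_design and y_obs are stand-ins for one of the four feature settings and the training target.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(3)
X_design = rng.normal(size=(100, 4))                   # stand-in feature setting
y_obs = X_design @ np.array([4.0, 3.0, 0.5, -1.0]) + rng.normal(size=100)

with pm.Model() as blr:
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=X_design.shape[1])  # beta ~ N(0, 10^2 I)
    sigma = pm.HalfNormal("sigma", sigma=1.0)                              # sigma ~ HalfNormal(1)
    mu = pm.math.dot(X_design, beta)                                       # mu = X beta
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)                     # likelihood N(mu, sigma^2)
    idata = pm.sample(1_000, tune=1_000, chains=2, random_seed=0)
```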

2.5. Evaluation Metrics

In the following sections, we examine various use cases and compare them based on selected metrics. For the deterministic analysis, we use the mean squared error (MSE) and the coefficient of determination ($R^2$). The evaluation is conducted using a train-test data split, which enables the assessment of the model's performance (in-sample evaluation) and its predictive accuracy (out-of-sample evaluation) using MSE and $R^2$.
In the probabilistic analysis, we extend these metrics to include the coverage rate, the average width of the prediction interval (PI width), and the ratio of coverage to width (ratio).
The Bayesian linear regression enables sampling $M$ prediction vectors $\hat{y}^{(1)}, \ldots, \hat{y}^{(M)}$ from the posterior predictive distribution, where each draw $\hat{y}^{(m)} = (\hat{y}_{l+1}^{(m)}, \ldots, \hat{y}_n^{(m)})$ represents an out-of-sample forecast using the same parameter draw. This results in the following out-of-sample prediction matrix:
$$\hat{y}^{post} = \begin{pmatrix} \hat{y}_{l+1}^{(1)} & \cdots & \hat{y}_n^{(1)} \\ \vdots & & \vdots \\ \hat{y}_{l+1}^{(M)} & \cdots & \hat{y}_n^{(M)} \end{pmatrix}^{T}.$$
From this matrix, we calculate the vector of posterior mean estimates by averaging over the $M$ draws:
$$mean^{post} = \begin{pmatrix} \bar{y}_{l+1} \\ \vdots \\ \bar{y}_n \end{pmatrix} = \begin{pmatrix} \frac{1}{M} \sum_{m=1}^{M} \hat{y}_{l+1}^{(m)} \\ \vdots \\ \frac{1}{M} \sum_{m=1}^{M} \hat{y}_n^{(m)} \end{pmatrix}.$$
The vector of standard deviations is then given by:
$$std^{post} = \begin{pmatrix} std(\hat{y}_{l+1}) \\ \vdots \\ std(\hat{y}_n) \end{pmatrix} = \begin{pmatrix} \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(\hat{y}_{l+1}^{(m)} - \bar{y}_{l+1}\right)^2} \\ \vdots \\ \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(\hat{y}_n^{(m)} - \bar{y}_n\right)^2} \end{pmatrix}.$$
For a 95% prediction interval (PI), the interval boundaries are calculated as:
$$PI: \quad mean^{post} \pm 1.96 \cdot std^{post}.$$
Due to the simplicity of the regression problem, using a small $M$ is sufficient for the normal approximation of the PI. However, for more complex data problems, $M$ should be large (e.g., $M \ge 100$) to accurately determine the credibility interval for the 2.5% and 97.5% quantiles of the $M$ values with respect to the model parameters. The coverage rate is calculated as:
$$coverage\ rate = \frac{\sum_{i=l+1}^{n} \mathbb{1}\left(\hat{y}_i \in PI_i\right)}{n - l}.$$
Thus, the PI boundaries are based on discrete forecast values. Using these boundaries, the mean distance between them is calculated as follows:
$$width = \frac{\sum_{i=l+1}^{n} \left[\left(mean^{post}_i + 1.96 \cdot std^{post}_i\right) - \left(mean^{post}_i - 1.96 \cdot std^{post}_i\right)\right]}{n - l}.$$
Finally, the ratio is given by:
$$ratio = \frac{coverage\ rate}{width}.$$
This ratio represents the number of predictions within the PI divided by the mean width of the PI boundaries.
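Given a matrix of posterior predictive draws, the three probabilistic metrics could be computed as follows; y_post and y_test are hypothetical stand-ins for the draws and the held-out truth.

```python
import numpy as np

rng = np.random.default_rng(4)
M, n_test = 200, 1_000
y_test = rng.normal(size=n_test)                            # stand-in truth
y_post = y_test + rng.normal(scale=0.5, size=(M, n_test))   # stand-in draws (M x n_test)

mean_post = y_post.mean(axis=0)
std_post = y_post.std(axis=0)
lower = mean_post - 1.96 * std_post
upper = mean_post + 1.96 * std_post

coverage_rate = np.mean((y_test >= lower) & (y_test <= upper))
pi_width = np.mean(upper - lower)          # equals 2 * 1.96 * mean(std_post)
ratio = coverage_rate / pi_width
```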

3. Explanatory Application Example

To gain a deeper understanding of probability distributions and models, we start with a straightforward regression problem. We compare various test-training data ratios for standard and Bayesian linear regression.

3.1. Application Example Design

For the multivariate application problem, 10,000 observations are drawn i.i.d. with $X \sim N(0, 1)$, and the target variable is determined using the function:
$$y = f(x) = 4 + 3X + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon).$$
Ordinal class boundaries dependent on y are randomly set (boundaries = {−8.9, 0, 5, 10, 16.8} with K = 4 classes), which determine the ordinal class for each observation. The generated population is divided into different ratios of test and training datasets for the analysis. The split is stratified based on the original classes to ensure that the class distribution remains proportional and all classes are represented in the training data. Each ordinal class thus has training data used to estimate the parameters (mean and standard deviation) of a normal distribution for that class.
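A sketch of this data generation and the class-stratified 5/95 split; scikit-learn and the seed are illustrative choices, and the outer classes are folded into the inner ones as described in Section 2.3.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=10_000)

boundaries = [-8.9, 0, 5, 10, 16.8]
classes = np.clip(np.digitize(y, boundaries), 1, 4)  # inner classes 1..4 only

# Stratify on the ordinal classes so every class appears in the 5% train set.
X_tr, X_te, y_tr, y_te, cls_tr, cls_te = train_test_split(
    X, y, classes, train_size=0.05, stratify=classes, random_state=0)
```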
In the subsequent analysis, X and the ordinal classes for the population are assumed to be given. The goal is to achieve the best possible out-of-sample prediction for the target variable (for the test data), taking uncertainties into account. We compare the four models mentioned with their respective input feature combinations for both standard linear regression and Bayesian linear regression.

3.2. Input Features

The given input features for the population are the observations X and the ordinal class c l a s s y . To account for the probability distributions of the classes P , we use the target variable from the training data y t r a i n to determine the probability of being in a class. Since the target variable for the test data y t e s t is unknown during model training, we use a simple linear model to obtain an estimate of the target variable and, based on this estimate and the probability distributions, determine the probabilities approximately.
Thus, the prediction is:
$$\hat{y}^{test} = X^{test} \hat{\beta} \;\rightarrow\; \left\{P_k^{test}\right\}_{k=1}^{K}.$$
This results in new input feature variables with probabilities for the test data, depending on the number of classes (and also introducing variability in the prior distribution).

3.3. Models to Compare

We compare standard and Bayesian linear regression using the following feature settings:
| Model | 1 (Standard) | 2 (Baseline) | 3 | 4 |
|---|---|---|---|---|
| Input features $X^{(\cdot)}$ | $X^{(1)}$ | $X^{(2)}$ | $X^{(3)}$ | $X^{(4)}$ |

3.4. Descriptive Statistics

The goal of our new methodology is to provide robust out-of-sample predictions with limited training data. A critical factor in this is the test-training data ratio. Figure 1 illustrates the mean squared error (MSE) and the Jensen-Shannon divergence between the kernel density estimation (KDE)-modeled distribution of the training data and the distribution of the target variable in the entire population. Various ratios ranging from 0 to 1 are evaluated. The idea is to compare the approximated distribution of the small training sample with the distribution of the ground truth.
Figure 1 indicates that, in the case of low complexity, there is no significant increase in discrepancy up to a test proportion of approximately 95%. This suggests that beyond a certain size of the training dataset, the distribution from the training data, despite the small sample size, can probably closely approximate the ground truth. Based on this evaluation, we next examine training data proportions of 5%, 10%, 20%, and 50% from the population.
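The comparison behind Figure 1 could be reproduced along the following lines; SciPy's jensenshannon (which returns the JS distance, squared here to obtain the divergence) and the sample data are our assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
y_pop = 4 + 3 * rng.normal(size=10_000) + rng.normal(size=10_000)
y_train = rng.choice(y_pop, size=500, replace=False)   # e.g., 5% training sample

grid = np.linspace(y_pop.min(), y_pop.max(), 512)
p = gaussian_kde(y_pop)(grid)       # KDE of the population target
q = gaussian_kde(y_train)(grid)     # KDE of the small training sample

js_div = jensenshannon(p, q) ** 2   # divergence = squared JS distance
```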
The following figures display the described data in x-y plots, showing the corresponding probability distributions for each class based on the training-test split with 5/95 (Figure 2) and 80/20 ratios (Figure 3).
Figure 2 visually demonstrates that even with just 5% of the training data, the probability distribution is similar to that based on the entire dataset. Increasing the proportion of training data to 80% leads to an even closer approximation of the class distribution from the total data, as shown in Figure 3. However, despite the significantly larger training dataset, there is no exceptionally strong visual improvement in the comparison between Figure 2 and Figure 3. This supports our hypothesis for this application case: with a less complex frequency distribution of y , having as little as 5% of the training data can already yield comparably good results for predicting the true value of y .

3.5. Analysis and Evaluation

In the detailed analysis, we examine how well the models fit depending on the given features. Figure 4 displays the test and training data along with the linear regression predictions for the out-of-sample forecasts across all feature settings. It is important to note that while linear regression can appear non-linear in the subsequent figures, this is due to the graphical representation being limited to a 2D x-y view. Other influencing factors are still incorporated into the predictions. Therefore, several dimensions (influencing factors) used in the analysis are reduced to two for visual clarity.
Figure 4 illustrates the significant impact of including probability features and, especially, class features on model performance. A visual comparison in 2D between (1) and (4) shows that our method allows for a more flexible (and even seemingly nonlinear) modeling compared to standard linear regression. This is because the hyperplane in multivariate standard linear regression is extended by 2 K additional planes, where K is the number of classes. In the following Table 1, we prove this statement with regression evaluation metrics.
The out-of-sample results in Table 1 indicate that expanding the model from input feature setting (1) to (2) leads to a significant reduction in M S E (approximately 5–10%) and an average improvement in R 2 of about 1% across all proportions. Notably, as the proportion of training data increases, the benefit from the probability distributions diminishes. This, combined with the in-sample results, suggests that a higher proportion may increase the likelihood of overfitting. However, the new features provide substantial benefits in out-of-sample evaluation, especially when training data are limited (i.e., in small sample size problems). This advantage is also observed in the comparison between settings (3) and (4). The comparison between (1) and (3) is less relevant because the additional categorical variable has a significant impact on the model, making such a comparison less meaningful.
In the next step, we examine the same evaluation considering uncertainties. For Bayesian linear regression, we extend point estimates with a PI. Figure 5 shows, similar to Figure 4, how different feature combinations (1)–(4) affect out-of-sample predictions.
Comparing Figure 5 to Figure 4, the average predictions remain nearly identical. However, when examining the PIs, it becomes apparent that adding probability features increases the uncertainty, as seen in the comparison between models (1) and (2) and between (3) and (4). Nevertheless, the very narrow PI for (1) seems to underestimate the actual uncertainty of the test data.
Regarding Figure 5, Table 2 confirms that while the absolute coverage rate increases in comparisons between (1) and (2) and between (3) and (4), the coverage relative to the width of the PI (Ratio) generally decreases.

4. Simulation Study

In this section, we examine a more complex simulation study modeled on a real-world application with respect to variable names and value ranges. This approach allows us to gain comprehensive insights into the entire population, unlike using potentially biased or skewed samples. As before, the data are divided into training and test sets.

4.1. Simulation Design

The generated synthetic data include four input features that influence the log-normally distributed target variable. Let X be the feature matrix. This multivariate data problem is designed to simulate a real-world application where, in a street survey, individuals are asked about their age, work experience, education level, weekly hours, and the target variable, salary. A relatively small portion is asked for their exact salary (training data), while the remainder are asked to categorize their salary into ranked (ordinal) classes based on predefined thresholds (test data), such as ‘low income’ or ‘high income’.
For dataset 1, the following equation applies:
$$salary = 0.02\, X_1 + 0.05\, X_2 + 0.1\, X_3 + 0.03\, X_4 + \epsilon_1$$
and for dataset 2:
$$salary = 0.03\, X_1 + 0.06\, X_2 + 0.04\, X_4 + 0.1\, X_5 + \epsilon_2$$
with
$$salary \sim LogN(\mu, \sigma^2)$$
and
$X_1$: age [in years] $\sim U(20, 65)$
$X_2$: experience [in years] $\sim U(0, 40)$
$X_3$: educational level $\sim U(1, 5)$
$X_4$: hours per week [in h] $\sim U(20, 60)$
$X_5$: performance rating $\sim U(1, 6)$
$\epsilon_1$: noise $\sim N(0, 0.1)$
$\epsilon_2$: noise $\sim N(0, 0.2)$.
The coefficients are chosen randomly. In the simulation, $n$ samples are drawn such that the input matrix has dimension $\dim(X) = (n \times 4)$. Samples are randomly drawn and evaluated for $n = 3000$ and $n = 5000$ for each dataset. For reproducibility, a random seed is used per dataset. Figure 6 shows a simulation example ($n = 3000$) with the correlations between the independent and dependent variables on the left and the approximately log-normally distributed target variable (salary) in detail on the right.
The simulation allows us to apply the methodology to a known and thoroughly understandable ground truth. We compare different train-test split ratios of 5/95, 10/90, and 20/80. This simulates the applicability to small sample size problems to re-anonymize ordinal data. Only for the training data, the exact salary values are provided. In the next step, we evaluate the simulated datasets for the application of the methodology.
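A sketch of how simulated dataset 1 could be generated; since the design states both the linear predictor and a log-normal salary, exponentiating the predictor is our assumption to reconcile the two, and the noise scale is read as a standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)   # illustrative per-dataset seed
n = 3_000
age = rng.uniform(20, 65, n)          # X1
experience = rng.uniform(0, 40, n)    # X2
education = rng.uniform(1, 5, n)      # X3
hours = rng.uniform(20, 60, n)        # X4
eps1 = rng.normal(0, 0.1, n)          # noise, std 0.1 assumed

linpred = 0.02 * age + 0.05 * experience + 0.1 * education + 0.03 * hours + eps1
salary = np.exp(linpred)              # approximately log-normal target (assumption)

X = np.column_stack([age, experience, education, hours])  # dim(X) = (n, 4)
```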

4.2. Simulation Results

Table 3 shows the evaluation of the simulated datasets. Notably, in all cases, using the linear baseline model (2) instead of the standard model (1) provides a significant improvement. This confirms that even without including the ordinal class (3) as a categorical feature, exceptional out-of-sample results can be achieved. In the comparison between models (3) and (4), no substantial improvement was observed. Often, model (4) tends to overfit, indicating that model (3) is sufficiently trained. However, in some cases, such as dataset 3 with 5% training data, an improvement in the out-of-sample value by 0.5% was achieved despite the already very good result of ~95%. We conclude that, depending on the data complexity and the supervised learning regression model used, further improvements are possible by incorporating probabilities even when using categories as features. Regarding the MSE, it is also very clear that using probabilities (2) results in a 3–5 times lower MSE compared to standard linear regression (1).

5. Benchmarking

To evaluate real-world applicability, we use multiple datasets from the UCI Database and Kaggle to conduct a benchmark study comparing various feature combinations and models.

5.1. Datasets

For a comprehensive analysis, we select datasets that differ in terms of the ratio between the number of observations and the number of features. We examine two datasets for each category: average, low, and high sample/feature ratio. Table 4 shows the settings for class size and thresholds for each dataset, and Table 5 provides additional descriptive information.

5.2. Evaluation

In the evaluation, we use models (1) and (3) as benchmarks for models (2) and (4), respectively, because models (1) and (2), as well as models (3) and (4), are comparable based on the given input. Looking at Table 6, it is evident that the use of probability distributions as additional input in model (2) often provides a significant improvement compared to the standard model (1). This also holds true in comparisons between (1) and (3). However, when comparing models (2) and (3), their performance is often comparable. Specific exceptions include the Boston Housing dataset for 5% and 10% training data, where the use of probability distributions with models (2) and (4) misleads the model and results in worse performance. Comparing models (3) and (4) confirms that overfitting is a concern, particularly with model (4). The datasets for Automobile and Student Performance highlight a key limitation of both linear regression and our new approach: models with a low sample-to-feature ratio struggle to produce reliable results, even when probabilistic features are incorporated. In these cases, our method combined with linear regression reaches its limits. This limitation is particularly evident in the Student Performance dataset, where increasing the proportion of training data leads to improved model results. This suggests that a low sample-to-feature ratio significantly restricts performance. A similar trend is observed in the AutoMPG and Boston Housing datasets, which have a moderately higher sample-to-feature ratio. While we observe improvements from model (1) to (2) and even (3), model (4) appears to overfit, occasionally yielding exceptionally poor results. In contrast, model (4) achieved the best results across all evaluations for the California Housing and Bike Sharing datasets. This demonstrates that, when well calibrated, the use of probabilistic features can marginally improve performance compared to the categorical variable model (3).

6. Conclusions and Discussion

This paper aimed to explore the feasibility of tracing anonymized data from very small sample sizes. We developed a methodology that combines latent probabilistic distributions over ordinal classes—i.e., anonymized data—with a small sample approach. This combination enables significantly improved predictions of actual values using linear regression, applicable to both simple and complex data structures. In the context of increasing data protection concerns, our new method demonstrates how standard anonymization techniques, such as discretizing metric data in street surveys or similar contexts, can be accurately reversed. While it might seem counterintuitive to infer exact values from discrete classes using latent distributions, this approach aligns with existing methodologies. The use of distributions to describe latent relationships within a class provides notable advantages.
Our application results show that even a small training dataset can outperform standard linear regression when using latent probabilistic distributions. However, when comparing models that include ordinal class variables, probabilistic distributions often do not provide substantial additional benefits and may even lead to overfitting. This methodology is particularly versatile for any supervised learning regression task. While data quantity and quality can be limited, the approach remains effective. One limitation is that latent probabilistic distributions require a minimum amount of data, which, in our cases, did not need to be excessively large.
Future research should focus on a more detailed examination of the distribution of latent influencing factors, while considering the potential for optimizing class boundaries. Considering the significant impact of class boundaries, combining the optimal clustering solution and given class boundaries could further improve the methodology.
This paper used normally distributed modeling of ordinal classes and considered these as influencing features. Future work could explore more detailed modeling with Gaussian mixture models or other distributions.
Overall, beyond data protection and anonymization, our approach offers universal applicability and could improve various supervised learning regression problems, particularly when latent probabilistic distributions are not independently or sufficiently recognized by the model.

Author Contributions

Conceptualization, S.M.S. and C.H.; methodology, S.M.S.; software, S.M.S.; validation, S.M.S. and C.H.; formal analysis, S.M.S.; investigation, S.M.S.; resources, S.M.S.; data curation, S.M.S.; writing—original draft preparation, S.M.S.; writing—review and editing, S.M.S.; visualization, S.M.S.; supervision, C.H.; project administration, S.M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are reproducible or openly accessible.

Acknowledgments

This work utilized generative artificial intelligence (AI) tools to assist with translation and ensure grammatical correctness.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lubarsky, B. Re-Identification of “Anonymized Data”. Georg. Law Technol. Rev. 2010. Available online: https://www.georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017 (accessed on 10 September 2021).
  2. Porter, C.C. De-Identified Data and Third Party Data Mining: The Risk of Re-Identification of Personal Information. Shidler JL Com. Tech. 2008, 5, 1. [Google Scholar]
  3. Senavirathne, N.; Torra, V. On the Role of Data Anonymization in Machine Learning Privacy. In Proceedings of the 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Guangzhou, China, 29 December 2020–1 January 2021; pp. 664–675. [Google Scholar]
  4. Ercikan, K. Limitations in Sample-to-Population Generalizing. In Generalizing from Educational Research; Routledge: Abingdon, UK, 2008; ISBN 978-0-203-88537-6. [Google Scholar]
  5. Hertzog, M.A. Considerations in Determining Sample Size for Pilot Studies. Res. Nurs. Health 2008, 31, 180–191. [Google Scholar] [CrossRef]
  6. Li, T.; Li, N.; Zhang, J. Modeling and Integrating Background Knowledge in Data Anonymization. In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 29 March–2 April 2009; pp. 6–17. [Google Scholar]
  7. Stickland, M.; Li, J.D.-Y.; Tarman, T.D.; Swiler, L.P. Uncertainty Quantification in Cyber Experimentation; Sandia National Lab. (SNL-NM): Albuquerque, NM, USA, 2021. [Google Scholar]
  8. Oertel, H.; Laurien, E. Diskretisierung. In Numerische Strömungsmechanik; Vieweg+Teubner Verlag: Wiesbaden, Germany, 2003; pp. 126–214. ISBN 978-3-528-03936-3. [Google Scholar]
  9. Senavirathne, N.; Torra, V. Rounding Based Continuous Data Discretization for Statistical Disclosure Control. J. Ambient Intell. Humaniz. Comput. 2023, 14, 15139–15157. [Google Scholar] [CrossRef]
  10. Inan, A.; Kantarcioglu, M.; Bertino, E. Using Anonymized Data for Classification. In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 29 March–2 April 2009; pp. 429–440. [Google Scholar]
  11. Pors, S.J. Using Discretization and Resampling for Privacy Preserving Data Analysis: An Experimental Evaluation. Master’s Thesis, Utrecht University, Utrecht, The Netherlands, 2018. [Google Scholar]
  12. Milani, M.; Huang, Y.; Chiang, F. Data Anonymization with Diversity Constraints. IEEE Trans. Knowl. Data Eng. 2021, 35, 3603–3618. [Google Scholar] [CrossRef]
  13. Bayardo, R.J.; Agrawal, R. Data Privacy through Optimal K-Anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; pp. 217–228. [Google Scholar]
  14. Robitzsch, A. Why Ordinal Variables Can (Almost) Always Be Treated as Continuous Variables: Clarifying Assumptions of Robust Continuous and Ordinal Factor Analysis Estimation Methods. Front. Educ. 2020, 5, 589965. [Google Scholar] [CrossRef]
  15. Zouinina, S.; Bennani, Y.; Rogovschi, N.; Lyhyaoui, A. A Two-Levels Data Anonymization Approach. In Artificial Intelligence Applications and Innovations; Maglogiannis, I., Iliadis, L., Pimenidis, E., Eds.; IFIP Advances in Information and Communication Technology; Springer International Publishing: Cham, Switzerland, 2020; Volume 583, pp. 85–95. ISBN 978-3-030-49160-4. [Google Scholar]
  16. Xin, G.; Xiao, Y.; You, H. Discretization of Continuous Interval-Valued Attributes in Rough Set Theory and Its Application. In Proceedings of the 2007 International Conference on Machine Learning and Cybernetics, Hong Kong, China, 19–22 August 2007; Volume 7, pp. 3682–3686. [Google Scholar]
  17. Rhemtulla, M.; Brosseau-Liard, P.É.; Savalei, V. When Can Categorical Variables Be Treated as Continuous? A Comparison of Robust Continuous and Categorical SEM Estimation Methods under Suboptimal Conditions. Psychol. Methods 2012, 17, 354. [Google Scholar] [CrossRef]
  18. Jorgensen, T.D.; Johnson, A.R. How to derive expected values of structural equation model parameters when treating discrete data as continuous. Struct. Equ. Model. A Multidiscip. J. 2022, 29, 639–650. Available online: https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=Jorgensen%2C+T.D.%3B+Johnson%2C+A.R.+How+to+Derive+Expected+Values+of+Structural+Equation+Model+Parameters+When+Treating+Discrete+Data+as+Continuous.&btnG= (accessed on 10 October 2024). [CrossRef]
  19. Zhou, B.; Pei, J.; Luk, W. A Brief Survey on Anonymization Techniques for Privacy Preserving Publishing of Social Network Data. ACM Sigkdd Explor. Newsl. 2008, 10, 12–22. [Google Scholar] [CrossRef]
  20. Murthy, S.; Bakar, A.A.; Rahim, F.A.; Ramli, R. A Comparative Study of Data Anonymization Techniques. In Proceedings of the 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing,(HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), Washington, DC, USA, 27–29 May 2019; pp. 306–309. [Google Scholar]
  21. Mogre, N.V.; Agarwal, G.; Patil, P. A Review on Data Anonymization Technique for Data Publishing. Int. J. Eng. Res. Technol. IJERT 2012, 1, 1–5. [Google Scholar]
  22. Kaur, P.C.; Ghorpade, T.; Mane, V. Analysis of Data Security by Using Anonymization Techniques. In Proceedings of the 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence), Noida, India, 14–15 January 2016; pp. 287–293. [Google Scholar]
  23. Martinelli, F.; SheikhAlishahi, M. Distributed Data Anonymization. In Proceedings of the 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Fukuoka, Japan, 5–8 August 2019; pp. 580–586. [Google Scholar]
  24. Marques, J.F.; Bernardino, J. Analysis of Data Anonymization Techniques. In Proceedings of the KEOD 2020—12th International Conference on Knowledge Engineering and Ontology Development, Online Streaming, 2–4 November 2020; pp. 235–241. [Google Scholar]
  25. Abd Razak, S.; Nazari, N.H.M.; Al-Dhaqm, A. Data Anonymization Using Pseudonym System to Preserve Data Privacy. IEEE Access 2020, 8, 43256–43264. [Google Scholar] [CrossRef]
  26. Muthukumarana, S.; Swartz, T.B. Bayesian Analysis of Ordinal Survey Data Using the Dirichlet Process to Account for Respondent Personality Traits. Commun. Stat.-Simul. Comput. 2014, 43, 82–98. [Google Scholar] [CrossRef]
  27. Sha, N.; Dechi, B.O. A Bayes Inference for Ordinal Response with Latent Variable Approach. Stats 2019, 2, 321–331. [Google Scholar] [CrossRef]
  28. Cox, D.R. Note on Grouping. J. Am. Stat. Assoc. 1957, 52, 543–547. [Google Scholar] [CrossRef]
  29. Fang, K.-T.; Pan, J. A Review of Representative Points of Statistical Distributions and Their Applications. Mathematics 2023, 11, 2930. [Google Scholar] [CrossRef]
Figure 1. Mean squared error (MSE) and Jensen-Shannon divergence (J-S divergence) as evaluation metrics for comparing KDE-modeled distributions between test data proportions and the population. The train-test split is performed based on the x-range values. The blue line represents the mean result of the evaluation based on repeated train-test splits, while the orange line indicates the corresponding standard deviation.
Figure 2. Comparison of KDE-modeled distributions for a 5/95 train-test split of the population (depicted by the gray distribution). The distributions are modeled based on values within the respective thresholds for each ordinal class using training data (middle inset) and test data (right inset). The test and training datasets are split randomly but with class stratification. Despite the visually apparent lower dispersion in the training data, the variability of both datasets is similar.
Figure 3. Comparison of KDE-modeled distributions for an 80/20 train-test split (i.e., standard cross validation ratio) of the population (gray distribution). The distributions are modeled based on values within the respective thresholds for each ordinal class, using train data (middle inset) and test data (right inset). The test and training datasets are split randomly but with class stratification.
Figure 4. Comparison of four linear regression models based on different input features for a 5/95 train-test split. The insets show a 2D cross-section of the multivariate models, where all features are used. The test and training datasets are split randomly but with class stratification. Despite the visually apparent lower dispersion in the training data, the variability of both datasets is similar.
Figure 5. Comparison of four Bayesian linear regression models based on different input features for a 5/95 train-test split. The insets show a 2D cross-section of the multivariate models, where all features are used. The test and training datasets are split randomly but with class stratification. Despite the visually apparent lower dispersion in the training data, the variability of both datasets is similar.
Figure 6. Heatmap and histogram with an approximated log-normal distribution for a simulated example with n = 3000 .
Table 1. Comparison of the different feature input settings (with linear regression) based on the training data proportion for in- and out-of-sample predictions. Evaluation metrics are MSE, $R^2$, and the computational time (CT). The best results per comparison are highlighted in bold.
| Train Data Proportion [in %] | Model | MSE in * | MSE out ** | $R^2$ in * [%] | $R^2$ out ** [%] | CT [s] |
|---|---|---|---|---|---|---|
| 5 | 1 | 0.876 | 1.001 | 89.47 | 89.39 | 1.27 |
| 5 | 2 | 0.646 | 0.893 | 92.24 | 90.53 | 1.39 |
| 5 | 3 | 0.499 | 0.594 | 94.00 | 93.71 | 1.90 |
| 5 | 4 | 0.427 | 0.527 | 94.86 | 94.41 | 1.99 |
| 20 | 1 | 1.011 | 0.991 | 88.41 | 89.62 | 1.50 |
| 20 | 2 | 0.724 | 0.873 | 91.70 | 90.85 | 1.66 |
| 20 | 3 | 0.527 | 0.544 | 93.95 | 94.30 | 1.45 |
| 20 | 4 | 0.459 | 0.501 | 94.73 | 94.75 | 1.47 |
| 50 | 1 | 1.066 | 0.921 | 88.19 | 90.54 | 1.22 |
| 50 | 2 | 0.739 | 0.864 | 91.82 | 91.11 | 1.23 |
| 50 | 3 | 0.465 | 0.594 | 94.85 | 93.89 | 1.25 |
| 50 | 4 | 0.426 | 0.547 | 95.28 | 94.38 | 1.14 |
| 80 | 1 | 0.998 | 0.977 | 89.09 | 90.52 | 1.08 |
| 80 | 2 | 0.730 | 0.987 | 92.02 | 90.42 | 0.95 |
| 80 | 3 | 0.485 | 0.691 | 94.69 | 93.30 | 0.87 |
| 80 | 4 | 0.467 | 0.616 | 94.89 | 94.02 | 0.91 |
* in-sample. ** out-of-sample.
Table 2. Comparison of different feature input settings (using Bayesian linear regression) was conducted by evaluating the coverage of model predictions within the PI (out-of-sample). Evaluation metrics are coverage rate, PI width, ratio, and the computational time (CT). The best results per comparison are highlighted in bold.
| Train Data Proportion [in %] | Input Feature Setting | Cov. Rate [in %] | PI Width | Ratio [in %] | CT [s] |
|---|---|---|---|---|---|
| 5 | 1 | 31.68 | 1.31 | 24.25 | 53.94 |
| 5 | 2 | 53.89 | 3.75 | 14.38 | 56.27 |
| 5 | 3 | 96.21 | 3.25 | 29.61 | 78.09 |
| 5 | 4 | 97.89 | 3.43 | 28.55 | 97.43 |
| 20 | 1 | 15.13 | 0.56 | 27.14 | 50.89 |
| 20 | 2 | 24.63 | 1.57 | 15.66 | 55.21 |
| 20 | 3 | 96.00 | 2.96 | 32.49 | 96.06 |
| 20 | 4 | 95.63 | 2.83 | 33.78 | 138.68 |
| 50 | 1 | 9.80 | 0.40 | 24.79 | 50.48 |
| 50 | 2 | 20.80 | 0.89 | 23.45 | 53.81 |
| 50 | 3 | 94.00 | 2.71 | 34.66 | 118.48 |
| 50 | 4 | 93.2 | 2.62 | 35.55 | 191.8 |
| 80 | 1 | 7.50 | 0.31 | 24.34 | 49.81 |
| 80 | 2 | 15.00 | 0.62 | 24.10 | 54.81 |
| 80 | 3 | 94.00 | 2.75 | 34.15 | 154.44 |
| 80 | 4 | 93.50 | 2.72 | 34.37 | 215.88 |
Table 3. Comparison of the different feature input settings (with linear regression) and training data proportion based on the simulation study data for in- and out-of-sample. Evaluation metrics are MSE and $R^2$. The best results per comparison are highlighted in bold.
| Dataset | Train Data Proportion | Model | MSE in * | MSE out ** | $R^2$ in * | $R^2$ out ** |
|---|---|---|---|---|---|---|
| 1 (# 3000) | 5 | 1 | 0.00246 | 0.00358 | 84.97 | 80.85 |
| 1 (# 3000) | 5 | 2 | 0.00050 | 0.00118 | 96.97 | 93.72 |
| 1 (# 3000) | 5 | 3 | 0.00055 | 0.00113 | 96.65 | 93.97 |
| 1 (# 3000) | 5 | 4 | 0.00048 | 0.00130 | 97.06 | 93.03 |
| 1 (# 3000) | 10 | 1 | 0.00366 | 0.00352 | 80.16 | 81.09 |
| 1 (# 3000) | 10 | 2 | 0.00111 | 0.00098 | 93.98 | 94.73 |
| 1 (# 3000) | 10 | 3 | 0.00104 | 0.00098 | 94.39 | 94.71 |
| 1 (# 3000) | 10 | 4 | 0.00092 | 0.00102 | 95.00 | 94.51 |
| 1 (# 3000) | 20 | 1 | 0.00342 | 0.00354 | 80.17 | 81.30 |
| 1 (# 3000) | 20 | 2 | 0.00104 | 0.00108 | 93.95 | 94.30 |
| 1 (# 3000) | 20 | 3 | 0.00099 | 0.00099 | 94.27 | 94.75 |
| 1 (# 3000) | 20 | 4 | 0.00094 | 0.00093 | 94.56 | 95.07 |
| 2 (# 3000) | 5 | 1 | 0.00488 | 0.00528 | 66.18 | 65.13 |
| 2 (# 3000) | 5 | 2 | 0.00054 | 0.00121 | 96.25 | 92.00 |
| 2 (# 3000) | 5 | 3 | 0.00059 | 0.00114 | 95.92 | 92.49 |
| 2 (# 3000) | 5 | 4 | 0.00053 | 0.00113 | 96.32 | 92.54 |
| 2 (# 3000) | 10 | 1 | 0.00474 | 0.00534 | 66.99 | 64.89 |
| 2 (# 3000) | 10 | 2 | 0.00087 | 0.00119 | 93.92 | 92.20 |
| 2 (# 3000) | 10 | 3 | 0.00093 | 0.00109 | 93.55 | 92.84 |
| 2 (# 3000) | 10 | 4 | 0.00084 | 0.00110 | 94.15 | 92.80 |
| 2 (# 3000) | 20 | 1 | 0.00517 | 0.00523 | 66.32 | 65.31 |
| 2 (# 3000) | 20 | 2 | 0.00084 | 0.00138 | 94.51 | 90.84 |
| 2 (# 3000) | 20 | 3 | 0.00095 | 0.00109 | 93.78 | 92.75 |
| 2 (# 3000) | 20 | 4 | 0.00083 | 0.00130 | 94.59 | 91.37 |
| 3 (# 5000) | 5 | 1 | 0.00311 | 0.00285 | 80.75 | 81.05 |
| 3 (# 5000) | 5 | 2 | 0.00107 | 0.00093 | 93.41 | 93.86 |
| 3 (# 5000) | 5 | 3 | 0.00086 | 0.00081 | 94.71 | 94.59 |
| 3 (# 5000) | 5 | 4 | 0.00073 | 0.00074 | 95.46 | 95.10 |
| 3 (# 5000) | 10 | 1 | 0.00297 | 0.00287 | 80.54 | 81.02 |
| 3 (# 5000) | 10 | 2 | 0.00085 | 0.00082 | 94.40 | 94.56 |
| 3 (# 5000) | 10 | 3 | 0.00090 | 0.00080 | 94.13 | 94.72 |
| 3 (# 5000) | 10 | 4 | 0.00083 | 0.00075 | 94.58 | 95.01 |
| 3 (# 5000) | 20 | 1 | 0.00316 | 0.00278 | 80.19 | 81.32 |
| 3 (# 5000) | 20 | 2 | 0.00124 | 0.00080 | 92.20 | 94.64 |
| 3 (# 5000) | 20 | 3 | 0.00104 | 0.00074 | 93.50 | 95.02 |
| 3 (# 5000) | 20 | 4 | 0.00098 | 0.00072 | 93.88 | 95.19 |
| 4 (# 5000) | 5 | 1 | 0.00501 | 0.00497 | 67.83 | 66.18 |
| 4 (# 5000) | 5 | 2 | 0.00096 | 0.00107 | 93.86 | 92.69 |
| 4 (# 5000) | 5 | 3 | 0.00105 | 0.00103 | 93.24 | 92.97 |
| 4 (# 5000) | 5 | 4 | 0.00095 | 0.00107 | 93.87 | 92.70 |
| 4 (# 5000) | 10 | 1 | 0.00403 | 0.00502 | 70.46 | 66.24 |
| 4 (# 5000) | 10 | 2 | 0.00052 | 0.00113 | 96.17 | 92.40 |
| 4 (# 5000) | 10 | 3 | 0.00062 | 0.00108 | 95.46 | 92.76 |
| 4 (# 5000) | 10 | 4 | 0.00051 | 0.00111 | 96.28 | 92.54 |
| 4 (# 5000) | 20 | 1 | 0.00480 | 0.00492 | 66.17 | 66.91 |
| 4 (# 5000) | 20 | 2 | 0.00117 | 0.00116 | 91.76 | 92.24 |
| 4 (# 5000) | 20 | 3 | 0.00095 | 0.00100 | 93.32 | 93.30 |
| 4 (# 5000) | 20 | 4 | 0.00089 | 0.00094 | 93.76 | 93.69 |
* in-sample. ** out-of-sample.
Table 4. Descriptive information about class size and thresholds for the multivariate Benchmark datasets.
| Dataset | Target Min | Target Mean | Target Max | Class Size | Class Thresholds |
|---|---|---|---|---|---|
| AutoMPG | 9 | 23.52 | 46.6 | 4 | [8, 16, 24, 32.5, 48] |
| Boston Housing | 5 | 22.53 | 50 | 4 | [4, 15, 25, 35, 51] |
| Student Performance | 0 | 11.91 | 19 | 4 | [−1, 9, 12, 15, 20] |
| Automobile * | - | - | - | 2 | [−3.5, 1, 3.5] |
| California Housing | 14,999 | 206,855 | 500,001 | 4 | [14,998, 136,249, 257,500, 378,751, 500,002] |
| Bike Sharing | 1 | 189.46 | 977 | 4 | [0, 150, 350, 500, 1000] |
* Automobile has an ordinal-scaled regression target. The target does not have to be metric scaled to further coarsen into classes.
Table 5. Descriptive information for the normalized multivariate Benchmark datasets.
| Dataset | Sample Size | Feature Size | Sample/Feature Ratio | Target | Target Unit | Target Mean | Target IQR |
|---|---|---|---|---|---|---|---|
| AutoMPG | 398 | 8 | 49.8 | MPG | miles/gallon | 0.3860 | 0.3059 |
| Boston Housing | 506 | 14 | 36.1 | MEDV | 1k USD | 0.3896 | 0.1772 |
| Student Performance | 649 | 57 | 11.4 | G3 | Points | 0.6266 | 0.2105 |
| Automobile | 205 | 69 | 3.0 | Risk | Level | 0.5668 | 0.4000 |
| California Housing | 20,640 | 14 | 1474.3 | MHV | USD | 0.3956 | 0.2992 |
| Bike Sharing | 17,379 | 14 | 1241.4 | Count | Bikes | 0.1931 | 0.2469 |
Table 6. Comparison of the different feature input settings (with linear regression) and training data proportion based on the Benchmark datasets for in- and out-of-sample. Evaluation metrics are MSE and $R^2$. The best results per comparison are highlighted in bold.
| Dataset | Train Data Proportion | Model | MSE in * | MSE out ** | $R^2$ in * | $R^2$ out ** |
|---|---|---|---|---|---|---|
| AutoMPG | 5 | 1 | 0.00273 | 0.01178 | 92.92 | 72.77 |
| AutoMPG | 5 | 2 | 0.00079 | 0.00975 | 97.96 | 77.47 |
| AutoMPG | 5 | 3 | 0.00087 | 0.00507 | 97.75 | 88.28 |
| AutoMPG | 5 | 4 | 0.00041 | 0.03101 | 98.93 | 28.36 |
| AutoMPG | 10 | 1 | 0.00642 | 0.00938 | 85.46 | 78.14 |
| AutoMPG | 10 | 2 | 0.00249 | 0.00398 | 94.37 | 90.73 |
| AutoMPG | 10 | 3 | 0.00231 | 0.00362 | 94.76 | 91.57 |
| AutoMPG | 10 | 4 | 0.00202 | 0.00528 | 95.42 | 87.71 |
| AutoMPG | 20 | 1 | 0.00625 | 0.00852 | 86.02 | 80.04 |
| AutoMPG | 20 | 2 | 0.00155 | 0.00439 | 96.52 | 89.71 |
| AutoMPG | 20 | 3 | 0.00167 | 0.00380 | 96.27 | 91.10 |
| AutoMPG | 20 | 4 | 0.00130 | 0.00424 | 97.10 | 90.08 |
| Boston Housing | 5 | 1 | 0.00318 | 0.01769 | 92.32 | 57.55 |
| Boston Housing | 5 | 2 | 0.00216 | >1 | 94.79 | <1 × 10⁻⁵ |
| Boston Housing | 5 | 3 | 0.00057 | 0.01033 | 98.63 | 75.21 |
| Boston Housing | 5 | 4 | 0.00037 | 0.01878 | 99.11 | 54.95 |
| Boston Housing | 10 | 1 | 0.00349 | 0.02233 | 88.87 | 47.85 |
| Boston Housing | 10 | 2 | 0.00147 | 0.01741 | 95.32 | 59.35 |
| Boston Housing | 10 | 3 | 0.00122 | 0.00656 | 96.11 | 84.69 |
| Boston Housing | 10 | 4 | 0.00093 | 5.82780 | 97.03 | <1 × 10⁻⁵ |
| Boston Housing | 20 | 1 | 0.00653 | 0.01684 | 83.81 | 59.93 |
| Boston Housing | 20 | 2 | 0.00258 | 0.00642 | 93.59 | 84.72 |
| Boston Housing | 20 | 3 | 0.00185 | 0.00329 | 95.40 | 92.18 |
| Boston Housing | 20 | 4 | 0.00165 | 0.00373 | 95.91 | 91.12 |
| Automobile | 5 | 1 | <1 × 10⁻⁵ | 0.07078 | 100.00 | −12.98 |
| Automobile | 5 | 2 | <1 × 10⁻⁵ | 0.07211 | 100.00 | −15.10 |
| Automobile | 5 | 3 | <1 × 10⁻⁵ | 0.06992 | 100.00 | −11.61 |
| Automobile | 5 | 4 | <1 × 10⁻⁵ | 0.07121 | 100.00 | −13.66 |
| Automobile | 10 | 1 | <1 × 10⁻⁵ | 0.08464 | 100.00 | −36.34 |
| Automobile | 10 | 2 | <1 × 10⁻⁵ | 0.08549 | 100.00 | −37.70 |
| Automobile | 10 | 3 | <1 × 10⁻⁵ | 0.05165 | 100.00 | 16.80 |
| Automobile | 10 | 4 | <1 × 10⁻⁵ | 0.05027 | 100.00 | 19.04 |
| Automobile | 20 | 1 | <1 × 10⁻⁵ | 0.17308 | 100.00 | −181.87 |
| Automobile | 20 | 2 | <1 × 10⁻⁵ | 0.14112 | 100.00 | −129.83 |
| Automobile | 20 | 3 | <1 × 10⁻⁵ | 0.16912 | 100.00 | −175.43 |
| Automobile | 20 | 4 | <1 × 10⁻⁵ | 0.14078 | 100.00 | −129.27 |
| Student Performance | 5 | 1 | <1 × 10⁻⁵ | 0.08539 | 100.00 | −197.44 |
| Student Performance | 5 | 2 | <1 × 10⁻⁵ | 0.07496 | 100.00 | −161.09 |
| Student Performance | 5 | 3 | <1 × 10⁻⁵ | 0.02395 | 100.00 | 16.56 |
| Student Performance | 5 | 4 | <1 × 10⁻⁵ | 0.02728 | 100.00 | 4.99 |
| Student Performance | 10 | 1 | 0.00910 | 0.07507 | 66.09 | −158.08 |
| Student Performance | 10 | 2 | 0.00186 | 0.03561 | 93.05 | −22.41 |
| Student Performance | 10 | 3 | 0.00151 | 0.01194 | 94.37 | 58.95 |
| Student Performance | 10 | 4 | 0.00100 | 0.02383 | 96.27 | 18.07 |
| Student Performance | 20 | 1 | 0.01423 | 0.03304 | 47.32 | −12.67 |
| Student Performance | 20 | 2 | 0.00403 | 0.01178 | 85.06 | 59.82 |
| Student Performance | 20 | 3 | 0.00312 | 0.00659 | 88.43 | 77.53 |
| Student Performance | 20 | 4 | 0.00305 | 0.00676 | 88.70 | 76.95 |
| California Housing | 5 | 1 | 0.01949 | 0.02040 | 65.61 | 63.96 |
| California Housing | 5 | 2 | 0.00497 | 0.00532 | 91.23 | 90.60 |
| California Housing | 5 | 3 | 0.00382 | 0.00389 | 93.27 | 93.12 |
| California Housing | 5 | 4 | 0.00381 | 0.00389 | 93.28 | 93.13 |
| California Housing | 10 | 1 | 0.01783 | 0.02049 | 68.19 | 63.84 |
| California Housing | 10 | 2 | 0.00455 | 0.00518 | 91.87 | 90.87 |
| California Housing | 10 | 3 | 0.00366 | 0.00384 | 93.48 | 93.22 |
| California Housing | 10 | 4 | 0.00363 | 0.00386 | 93.52 | 93.19 |
| California Housing | 20 | 1 | 0.01969 | 0.02027 | 64.68 | 64.32 |
| California Housing | 20 | 2 | 0.00521 | 0.00523 | 90.65 | 90.80 |
| California Housing | 20 | 3 | 0.00371 | 0.00381 | 93.35 | 93.29 |
| California Housing | 20 | 4 | 0.00369 | 0.00380 | 93.37 | 93.31 |
| Bike Sharing | 5 | 1 | 0.02051 | 0.02143 | 40.93 | 37.93 |
| Bike Sharing | 5 | 2 | 0.00343 | 0.00339 | 90.13 | 90.17 |
| Bike Sharing | 5 | 3 | 0.00255 | 0.00297 | 92.65 | 91.41 |
| Bike Sharing | 5 | 4 | 0.00249 | 0.00296 | 92.84 | 91.43 |
| Bike Sharing | 10 | 1 | 0.02122 | 0.02117 | 39.20 | 38.63 |
| Bike Sharing | 10 | 2 | 0.00333 | 0.00347 | 90.47 | 89.94 |
| Bike Sharing | 10 | 3 | 0.00282 | 0.00294 | 91.93 | 91.48 |
| Bike Sharing | 10 | 4 | 0.00280 | 0.00292 | 91.98 | 91.53 |
| Bike Sharing | 20 | 1 | 0.02141 | 0.02111 | 39.42 | 38.52 |
| Bike Sharing | 20 | 2 | 0.00411 | 0.00371 | 88.36 | 89.19 |
| Bike Sharing | 20 | 3 | 0.00295 | 0.00291 | 91.65 | 91.54 |
| Bike Sharing | 20 | 4 | 0.00291 | 0.00289 | 91.78 | 91.60 |
* in-sample. ** out-of-sample.