Transformer-based Parameter Estimation in Statistics

Parameter estimation is one of the most important tasks in statistics, and is key to helping people understand the distribution behind a sample of observations. Traditionally, parameter estimation is done either by closed-form solutions (e.g., maximum likelihood estimation for the Gaussian distribution), or by iterative numerical methods such as the Newton-Raphson method when a closed-form solution does not exist (e.g., for the Beta distribution). In this paper we propose a transformer-based approach to parameter estimation. Compared with existing solutions, our approach does not require a closed-form solution or any mathematical derivation. It does not even require knowing the probability density function, which is needed by numerical methods. After the transformer model is trained, only a single inference is needed to estimate the parameters of the underlying distribution from a sample of observations. In the empirical study we compare our approach with maximum likelihood estimation on commonly used distributions such as the normal, exponential, and beta distributions. The results show that our approach achieves similar or better accuracy as measured by mean-square-error.


Introduction
Parameter estimation is one of the most important tasks in statistics. Given the type of the underlying distribution and a sample of observations, a parameter estimation method should produce an estimate of the distribution's parameters. Take the one-dimensional normal distribution as an example. Given a sample $x_1, \ldots, x_N$ drawn from a normal distribution $N(\mu, \sigma^2)$, we hope to estimate its parameters µ and σ. The most commonly used method is maximum likelihood estimation (MLE). It can be easily proven that the MLE of µ is $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$. The proof can be found in [1]. Parameter estimation for some other distributions can be much more complex. For example, the maximum likelihood estimate of the parameters of a Beta distribution has no closed form [2], and one needs to resort to numerical optimization methods such as the Newton-Raphson method or gradient descent to estimate its parameters.
In recent years transformer-based approaches have been successfully applied to many diverse tasks, including math tasks such as solving word problems [3], mathematical reasoning [4], symbolic regression [5], and time-series forecasting [6]. In this paper we propose an approach to perform parameter estimation using deep learning. For each type of distribution, we generate a training set, with each training example containing a sample drawn from a distribution with predetermined parameters. Then we convert each sample into a sequence of embeddings, and train a transformer to predict the parameter(s) with the sequence as its input. We aim at producing estimates that are as close as possible to the true parameters, i.e., we minimize the mean-square-error of the estimated parameters.
The advantage of our approach is that it does not require any mathematical derivation. Thus the mathematical complexity of the probability density function does not pose any burden for the transformer model, even when there is no closed-form solution to the MLE of the distribution's parameters. Iterative methods (such as the Newton-Raphson method or gradient descent) have been used for estimating the parameters of a distribution when there is no closed-form solution (e.g., [12]). Compared to such methods, our approach has two advantages. Firstly, our approach is not iterative and requires just a single inference of a transformer model. Secondly, and more importantly, our approach does not even require knowing the probability density function of the distribution. For example, suppose a manufacturer is testing the percentage of two materials in the filaments of electric bulbs. The life of a bulb follows a distribution with a single parameter, which is the percentage of the first material. Even if the distribution is unknown to us, we can still learn to estimate the parameter from many samples of bulb lives. As another example, when testing the hyperparameters of a particular deep learning model, the prediction error on a random test case follows a distribution that is unknown to us, but we can estimate the hyperparameters based on samples of such errors.
In general, we make the following contributions in this paper:
- We present a novel approach that uses transformers for parameter estimation.
- We propose a way to convert a sample of a distribution into a sequence of embeddings, which can be easily consumed by a transformer and can carry precise information about the sample.
- We conduct a comprehensive empirical study, which compares our approach with maximum likelihood estimation for various distributions.
- We show that, when measured by mean-square-error (MSE), our approach outperforms MLE on most commonly used distributions, such as the normal and exponential distributions.
Please note that this does not indicate that MLE is not a good method, as by definition it maximizes the likelihood of the observed sample. Our experiments simply indicate that our method beats MLE in terms of mean-square-error in most scenarios, which is one common way to evaluate a parameter estimation method.
The rest of the paper is organized as follows. Related work is presented in Section 2. Section 3 describes our approach for parameter estimation with a transformer. Experimental results are reported in Section 4, and Section 5 concludes this paper.

Related Work
In recent years there have been many studies that successfully applied transformer-based approaches to a diverse set of math tasks, including solving word problems [3], mathematical reasoning [4], symbolic regression [5], and time-series forecasting [6].
In contrast, there have not been many studies applying transformers to statistics. In [7] a method was proposed to cast Bayesian parameter estimation as a classification problem, which was then solved using a multi-layer perceptron network. This approach is very useful in determining whether a sample was drawn from a distribution with a particular set of predetermined parameters. But it is not capable of performing parameter estimation, as it requires a hypothesis of what the parameters are.
Although deep learning has not been used for parameter estimation in statistics, it has been used to estimate parameters in various applications. In [8] the authors proposed a method that uses a CNN to estimate the parameters of events based on LIGO data (for gravitational waves). [9] describes a method to estimate magnetic Hamiltonian parameters from electron microscopy images using a CNN. In [10] the authors used a GRU for parameter estimation in MRI, with an application in pancreatic cancer.
Our work is different from the above in several aspects: (1) We study the problem of parameter estimation in statistics, which is key to understanding the underlying distribution of data. (2) We use a transformer, which has become the state of the art in most important applications, including both NLP and computer vision. (3) Unlike the above work, which uses the raw data of a specific problem as the input to a deep learning model, we convert a sample of an arbitrary distribution into a sequence of embeddings, which can be consumed by a transformer.

Problem Definition
Our goal is to train a model that can predict the parameters of a distribution, using a sample drawn from it. Let us take the normal distribution as an example. A normal distribution $N(\mu, \sigma^2)$ has two parameters µ and σ, and its probability density function is $p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. We will train a model to predict the two parameters µ and σ, using a random sample drawn from the distribution as the model's input. The model will be trained on a large number of samples, each from a different normal distribution with different parameters. In this paper we use mean-square-error as the loss function, and evaluate the accuracy of our model by the mean-square-error of its predictions.
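As an illustration of the setup above, the following is a minimal sketch of how one training example might be generated for the normal case. It assumes NumPy; the function name `make_normal_example` is hypothetical, and the default parameter ranges and sample size are borrowed from the normal-distribution experiments reported later in the paper.

```python
import numpy as np

def make_normal_example(rng, n=30, mu_range=(-5.0, 5.0), sigma_range=(1.0, 10.0)):
    """Draw random parameters, then a sample of size n from N(mu, sigma^2).

    Returns the sample (model input before encoding) and the target [mu, sigma].
    """
    mu = rng.uniform(*mu_range)
    sigma = rng.uniform(*sigma_range)
    sample = rng.normal(loc=mu, scale=sigma, size=n)
    target = np.array([mu, sigma])
    return sample, target

rng = np.random.default_rng(0)
sample, target = make_normal_example(rng)
```

Because the parameters are redrawn for every example, each call yields a sample from a different normal distribution, which is the property the training procedure relies on.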
In this paper we only study univariate distributions, although our method can be easily extended to multivariate distributions.

Data Normalization
Some distributions can be shifted and stretched along the x-axis (e.g., the normal distribution), simply by replacing x with another variable $x' = \beta(x - \alpha)$. Therefore, given a data sample, we first normalize its range to [0, 1], so that it can be easily converted into a sequence of tokens, as described later in this section. After the model makes a prediction, the predicted parameters can be easily converted back. For example, if we replace x with $x' = \beta(x - \alpha)$ during normalization and then predict the parameters of a normal distribution to be $\mu'$ and $\sigma'$, then our predicted parameters in the original scale are $\hat{\mu} = \frac{\mu'}{\beta} + \alpha$ and $\hat{\sigma} = \frac{\sigma'}{\beta}$. Unlike the normal distribution, some distributions, such as the exponential distribution, can only be stretched along the x-axis but cannot be shifted. For such distributions we also normalize the range of each sample to [0, 1], and again the estimated parameters can be easily converted back.
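The forward transform and the back-conversion of the parameters can be sketched as follows. This is a minimal example assuming NumPy; the helper names are hypothetical, and the sample mean and standard deviation of the normalized data stand in for the model's predictions µ′ and σ′.

```python
import numpy as np

def normalize(x, alpha, beta):
    """Apply the transform x' = beta * (x - alpha)."""
    return beta * (np.asarray(x, dtype=np.float64) - alpha)

def denormalize_normal_params(mu_p, sigma_p, alpha, beta):
    """Map parameters predicted on the normalized scale back to the original scale."""
    return mu_p / beta + alpha, sigma_p / beta

x = np.array([2.0, 4.0, 6.0, 8.0])
# Choose alpha and beta so that the sample spans exactly [0, 1].
alpha, beta = x.min(), 1.0 / (x.max() - x.min())
x_norm = normalize(x, alpha, beta)

# Stand-in for the model output: moments of the normalized sample.
mu_hat, sigma_hat = denormalize_normal_params(x_norm.mean(), x_norm.std(), alpha, beta)
```

Since the transform is linear, converting the moments of the normalized sample back recovers the moments of the original sample, which is a quick sanity check on the direction of the division by β.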

Data Representation
We use a transformer with at most L input embeddings, each of size K. Each dimension of each embedding represents one possible value of an observation. For example, if L = 1024 and K = 384, we can represent L × K = 393,216 (i.e., 384 × 1024) different values. We tried two ways to represent a value:
- Seq-first: First divide the range [0, 1] into L intervals, each corresponding to an embedding in our sequence of length L. Then divide each of the L intervals into K sub-intervals, each represented by one dimension of the embedding.
- Embed-first: First divide the range [0, 1] into K intervals, each corresponding to a particular dimension in all the L positions. Then divide each of the K intervals into L sub-intervals, which determine the position in which the value resides.
Figure 1 illustrates how the Seq-first conversion is done. We first divide the range [0, 1] into 1024 intervals, and then each interval into 384 sub-intervals. The sample on the horizontal axis contains 6 observations. The leftmost observation is mapped to the first dimension of the first embedding. The observation 0.500001 maps to somewhere between the first and second dimensions of the 512th embedding. The first dimension of the 512th embedding maps to the interval of approximately [0.5, 0.5000025], and the second dimension maps to approximately [0.5000025, 0.500005]. The observation 0.500001 falls into the interval of the first dimension. Instead of assigning all its weight to the first dimension, we assign part of its weight to the second dimension, according to its position within the first dimension's interval, so that we can precisely represent where the observation is. In this case the observation lies about 40% of the way through the first interval, and thus we keep 60% of its weight in the first dimension and move 40% of its weight to the second dimension.
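The Seq-first conversion described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes NumPy, zero-based indices, a slot position of x · L · K, and clamping at the last slot so that the value 1 lands on the last dimension of the last embedding. The function name `seq_first_encode` is hypothetical.

```python
import numpy as np

def seq_first_encode(sample, L=1024, K=384):
    """Encode observations in [0, 1] as an (L, K) array of weights.

    Each observation contributes a total weight of 1, split linearly
    between the slot it falls into and the next slot.
    """
    x = np.clip(np.asarray(sample, dtype=np.float64), 0.0, 1.0)
    pos = np.minimum(x * (L * K), L * K - 1)   # continuous slot position
    lo = np.floor(pos).astype(np.int64)        # slot containing the observation
    frac = pos - lo                            # relative position inside that slot
    hi = np.minimum(lo + 1, L * K - 1)         # neighbouring slot
    flat = np.zeros(L * K, dtype=np.float32)
    np.add.at(flat, lo, 1.0 - frac)            # np.add.at accumulates repeated indices
    np.add.at(flat, hi, frac)
    # Row-major reshape: slot j -> embedding j // K, dimension j % K.
    return flat.reshape(L, K)

emb = seq_first_encode([0.0, 0.500001, 1.0])
# With zero-based indices, 0.500001 lands in embedding 512 and splits its
# weight roughly 0.61 / 0.39 between dimensions 0 and 1.
```

Using `np.add.at` rather than plain indexed assignment matters when several observations share a slot, since their weights must add up rather than overwrite each other.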
If we use the Embed-first representation, we distribute an observation's weight between two adjacent positions within the same dimension, in a similar way as in the Seq-first representation.
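For comparison, a minimal sketch of the Embed-first layout is given below, under the same assumptions as one might make for Seq-first (NumPy, zero-based indices, slot position x · K · L, clamping at the last slot). The function name `embed_first_encode` is hypothetical, and the handling of the boundary between two dimensions is a choice made here rather than something specified in the text.

```python
import numpy as np

def embed_first_encode(sample, L=1024, K=384):
    """Encode observations in [0, 1] as an (L, K) array, dimension-major.

    The range is first split into K intervals (one per dimension), and each
    interval into L sub-intervals (one per sequence position).
    """
    x = np.clip(np.asarray(sample, dtype=np.float64), 0.0, 1.0)
    pos = np.minimum(x * (K * L), K * L - 1)   # continuous slot position
    lo = np.floor(pos).astype(np.int64)
    frac = pos - lo
    hi = np.minimum(lo + 1, K * L - 1)
    emb = np.zeros((L, K), dtype=np.float32)
    # Slot j -> dimension j // L, position j % L.
    np.add.at(emb, (lo % L, lo // L), 1.0 - frac)
    np.add.at(emb, (hi % L, hi // L), frac)
    return emb

emb_ef = embed_first_encode([0.0, 0.500001, 1.0])
# Here 0.500001 falls into dimension 192 and splits its weight between
# positions 0 and 1 of that dimension.
```

The only difference from a Seq-first layout is the mapping from slot index to (position, dimension); the linear weight split is identical.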

Transformer Model
As shown in Figure 2, our model is a typical transformer, which takes the sequence of embeddings as its input, and outputs the predicted parameter(s) through an output layer added on top of the first output embedding of the last layer in the transformer. Since this is a regression problem, we define the loss function as the mean-square-error averaged over the parameters to be predicted.
Each training example is a sample drawn from a distribution with random parameters. Therefore, our model should never see the same example twice during training. If the sample has been stretched or shifted as described in Section 3.2, we first convert the predicted parameters back to the original scale, and then compare them to the true parameters of the distribution for loss computation. If there are multiple parameters, we use the average mean-square-error between the predicted parameter values and their true values as the loss.

Experiment settings
We test our approach on three types of commonly seen distributions: normal distributions, exponential distributions, and beta distributions. The training and testing data are randomly generated as needed, and thus the model should never see the same example twice. We use the RoBERTa model [11] downloaded from Hugging Face as our transformer model. All experiments are done on a machine with an A6000 GPU, with Ubuntu 18.04, CUDA 11.7 and PyTorch 2.0.0. Float32 is used because numerical precision is key to our approach.
Unless otherwise mentioned, we train our model on 9.9M randomly generated examples in each setting, and test its accuracy on 100K examples. We choose our hyperparameters for efficiency reasons, and the chosen hyperparameters make a negligible difference in accuracy compared with the default hyperparameters of RoBERTa. Training usually takes 60 to 70 hours using the default setting. Please note that for each type of distribution we only need to train once, and 60 to 70 hours of machine time is negligible compared with the time a human would need to derive a formula or algorithm for parameter estimation.

Transformer Hyperparameters
We first test various hyperparameters for our transformer model, in order to select the best settings. We start with the open-source RoBERTa, which accepts 512 embeddings as its input, each having 768 dimensions. In order to increase the precision, we increase the input length to 1024, which quadruples the memory consumption of multi-head self-attention. Due to the limitation of our GPU memory, we had to change the number of layers from 12 to 6, and the embedding size from 768 to 384. As discussed in Section 3.3, we use Seq-first by default, and tried Embed-first as well.
We use each of the above settings to estimate the parameters of normal distributions, with 9.9M samples (of size 30) for training, and 100K samples for testing. More details are described in Section 4.3.1. The results are shown in Table 1. We can see that RoBERTa and RoBERTa with 1024-input-len have very similar accuracies. A t-test shows there is no statistically significant difference between their MSEs. We choose RoBERTa with 1024-input-len because its training time is much lower.

Exponential Distribution
We start from the exponential distribution because of its simplicity. We represent the p.d.f. of the exponential distribution in the same way as NumPy: $f(x; \beta) = \frac{1}{\beta} e^{-x/\beta}$. Here β is the only parameter and represents the scale of the distribution. It is equivalent to the alternative p.d.f. $f(x; \lambda) = \lambda e^{-\lambda x}$, where $\lambda = 1/\beta$. When generating samples, we draw β from a uniform distribution on [0.5, 2].
We test our approach and maximum likelihood estimation on 100K samples, and record the average mean-square-error. Each approach is tested in two settings:
1. Known Parameter Range: The range of each parameter is known to the approach.
2. Unknown Parameter Range: The approach is not aware of the range of any parameter.
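The evaluation protocol for the MLE baseline can be sketched as below. This is a minimal illustration assuming NumPy, with β drawn uniformly from [0.5, 2] as in the text and the sample mean used as the MLE of the scale β; the function name `mle_mse_exponential` and the seed are hypothetical, and no capping of the estimate is applied here.

```python
import numpy as np

def mle_mse_exponential(n_samples=100_000, sample_size=30, seed=0):
    """Average squared error of the MLE of beta over many random samples."""
    rng = np.random.default_rng(seed)
    beta = rng.uniform(0.5, 2.0, size=n_samples)            # true parameters
    # One row per test example; the scale broadcasts across each row.
    samples = rng.exponential(scale=beta[:, None], size=(n_samples, sample_size))
    beta_hat = samples.mean(axis=1)                         # MLE of the scale
    return float(np.mean((beta_hat - beta) ** 2))

mse_mle = mle_mse_exponential(n_samples=20_000)
```

The same loop structure would apply to any estimator: only the line computing `beta_hat` changes, which is what makes the mean-square-error comparison between methods straightforward.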

Known Parameter Range
When each parameter's range is known, we normalize each sample into [0, 1] by a fixed linear transform. When β ∈ [0.5, 2], the probability of x > 20 is less than 0.5%. Therefore, we cap each value of a sample within [0, 20], and normalize each capped value $x_i$ by $x'_i = x_i / 20$. The final fully-connected layer for predicting parameters (as in Figure 2) directly predicts the value of the parameter.
The MLE of β is simply the sample mean, $\hat{\beta} = \frac{1}{N}\sum_{i=1}^{N} x_i$, and we cap its estimate of β within [0.5, 2].
We test with three different sample sizes (10, 30, and 100), and another setting in which the sample size is randomly drawn from a log-uniform distribution on [10, 100]. The results are shown in Figure 3(a), in which our method outperforms MLE for each sample size. Table 2 shows the mean and standard deviation of the mean-square-error of the two approaches for each sample size. One can see that our proposed approach solidly beats MLE for every sample size, with p-values close to zero in two-sample t-tests.
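The known-range preprocessing and the capped MLE baseline can be sketched as follows. This is a minimal example assuming NumPy and using the sample mean as the MLE of the scale β; the helper names are hypothetical.

```python
import numpy as np

def normalize_known_range(sample, cap=20.0):
    """Cap each value within [0, cap] and rescale to [0, 1]."""
    return np.clip(sample, 0.0, cap) / cap

def mle_beta_capped(sample, lo=0.5, hi=2.0):
    """MLE of the exponential scale (the sample mean), capped to the known range."""
    return float(np.clip(np.mean(sample), lo, hi))

rng = np.random.default_rng(0)
beta_true = rng.uniform(0.5, 2.0)
sample = rng.exponential(scale=beta_true, size=30)

x_norm = normalize_known_range(sample)   # what the model would see
beta_mle = mle_beta_capped(sample)       # what the baseline would report
```

Capping the baseline's estimate gives MLE access to the same range information as the model, so the comparison in this setting is between two methods that both know β ∈ [0.5, 2].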
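Drawing a sample size from a log-uniform distribution on [10, 100] can be sketched as below. The text does not say how the continuous draw is turned into an integer, so rounding to the nearest integer is an assumption here, as is the function name `sample_size_log_uniform`.

```python
import numpy as np

def sample_size_log_uniform(rng, low=10, high=100):
    """Draw an integer sample size whose logarithm is uniform on [log(low), log(high)]."""
    n = np.exp(rng.uniform(np.log(low), np.log(high)))
    # Rounding is an assumption; the clip guards against floating-point overshoot.
    return int(np.clip(np.rint(n), low, high))

rng = np.random.default_rng(0)
sizes = [sample_size_log_uniform(rng) for _ in range(5000)]
```

Under a log-uniform draw, small sample sizes are more frequent than large ones: about half of the draws fall below roughly 32 (the geometric mean of 10 and 100).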

Unknown Parameter Range
Suppose a sample is drawn from an exponential distribution exp(β), and the range of β is unknown to the parameter estimation method (i.e., the method needs to assume that β can take any value). For MLE, we can simply use its formula (as in Section 4.2.1) to estimate the parameter. But a slightly more complex method is needed for our approach.
In order to hide the parameter's range from our transformer model, we first normalize each sample s into the range [0, 1]: let b = max(s), and normalize each value $x_i$ by $x'_i = x_i / b$. In this way our model can only see the relative shape of the sample, without knowing its original range. Suppose the model's prediction is $\beta^*$. To compute the loss, we first recover the estimated parameter into the original range: the estimate of β is $\hat{\beta} = \beta^* \cdot b$. Then we compare it with the true parameter, as follows: loss = mean-square-error($\hat{\beta}$, β). Please note that b is never seen by the model, and thus the model is unaware of the range of the sample or the parameter. The loss function is just the mean-square-error between the estimated parameters and the true parameters, which is what we are optimizing.
Again we test with three different sample sizes (10, 30, and 100), and with a random sample size from the range [10, 100]. The results are shown in Figure 3(b) and Table 3. We can see that our method outperforms MLE in most cases, except when the sample size is large.
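The range-hiding step for the exponential case can be sketched as follows. This is a minimal illustration assuming NumPy and assuming that the scaling factor b is the sample maximum, so that the normalized sample spans [0, 1] by stretching only; the helper names are hypothetical, and the mean of the normalized sample stands in for the model's output β*.

```python
import numpy as np

def normalize_by_max(sample):
    """Stretch a non-negative sample into [0, 1]; b is kept outside the model."""
    sample = np.asarray(sample, dtype=np.float64)
    b = sample.max()
    return sample / b, b

def recover_beta(beta_star, b):
    """Map the prediction on the normalized scale back to the original scale."""
    return beta_star * b

def mse(pred, true):
    return float(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

sample = np.array([0.4, 1.0, 2.5, 0.1])
x_norm, b = normalize_by_max(sample)

beta_star = x_norm.mean()            # stand-in for the model output
beta_hat = recover_beta(beta_star, b)
loss = mse([beta_hat], [1.5])        # compared against a true beta of 1.5
```

Because the loss is computed after rescaling by b, the model is trained on errors in the original scale even though it only ever sees the normalized sample.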

Normal Distribution
We then test our approach on the normal distribution and compare it with maximum likelihood estimation. The parameter µ is drawn from a uniform distribution on [−5, 5], and σ from a uniform distribution on [1, 10], independently of µ. For MLE, we cap the estimates $\hat{\mu}$ and $\hat{\sigma}$ within their respective ranges ([−5, 5] and [1, 10]) when they fall outside those ranges.
For the known-parameter-range setting, the results are shown in Figure 4(a), and Table 4 shows the mean and standard deviation of the mean-square-error of the two approaches for each sample size. One can see that our approach solidly beats maximum likelihood estimation for every sample size (with p-values close to zero in two-sample t-tests), especially when the sample size is small.
For the unknown-parameter-range setting, the results are shown in Figure 4(b) and Table 5. One can see that our proposed approach performs similarly to MLE in most cases (no statistically significant difference), but is not as good when the sample size is 100.

Beta Distribution
There is no closed-form solution for the maximum likelihood estimate of the parameters of a Beta distribution. Numerical solutions have been proposed (e.g., [2]), but they are complex and, to our knowledge, no open-source implementation is available. In the special case where the two parameters α and β are between 0 and 1, one can use the method of moments [13] to estimate the parameters, and we compare that to our approach.
Table 6. The mean-square-error of the method of moments and our approach for each sample size for Beta distribution with known parameter range (0 < α, β < 1), together with two-sample t-test results.
Table 6 shows our results of parameter estimation for the Beta distribution, compared with the method of moments. Even in the special case where 0 < α, β < 1, our approach outperforms the method of moments by a large margin. In addition, our approach can estimate the parameters of Beta distributions with arbitrary parameters (results shown in Table 7).
We can see that our approach is well suited to distributions without a readily known formula for parameter estimation. Traditionally it can take years of research time to derive a closed form or design an algorithm for parameter estimation of a particular type of distribution. In comparison, our approach only requires about 60 to 70 hours of machine time to train a model for each type of distribution.
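For reference, the standard method-of-moments estimator for the Beta distribution can be sketched as below. This is a minimal example assuming NumPy; it uses the textbook formulas based on the sample mean and variance and is not necessarily the exact variant of [13] used in the experiments. The function name `beta_method_of_moments` is hypothetical.

```python
import numpy as np

def beta_method_of_moments(sample):
    """Estimate (alpha, beta) of a Beta distribution from sample moments.

    Uses alpha = m * c and beta = (1 - m) * c with c = m * (1 - m) / v - 1,
    where m is the sample mean and v the sample variance.
    """
    sample = np.asarray(sample, dtype=np.float64)
    m = sample.mean()
    v = sample.var()          # population variance; ddof=1 is another common choice
    c = m * (1.0 - m) / v - 1.0
    return m * c, (1.0 - m) * c

rng = np.random.default_rng(0)
alpha_hat, beta_hat = beta_method_of_moments(rng.beta(0.5, 0.8, size=200_000))
```

The estimator needs only two moments and no iteration, which makes it a convenient baseline, although it is generally less efficient than maximum likelihood on small samples.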

Conclusions
In this paper we propose a new method for parameter estimation, which converts a sample into a sequence of embeddings that can be consumed by a transformer model. The empirical study shows that the proposed approach outperforms maximum likelihood estimation (in terms of mean-square-error) in most scenarios, especially when the parameters' ranges are known. In real-world applications the parameters' ranges are usually known, which makes our approach a practical solution for parameter estimation.

Fig. 1. Converting a sample into a sequence of L embeddings (L = 1024), each of size K (K = 384). The sample contains 6 observations, ranging from 0 to 1 (after normalization). The leftmost observation is mapped to the first dimension of the first embedding. The rightmost observation is mapped to the last dimension of the last embedding. The observation 0.500001 maps to somewhere between the first and second dimensions of the 512th embedding, and thus its weight is distributed between these two dimensions.

Fig. 2. Architecture of our transformer model. Predicted parameters are converted back to the original scale and compared with the true parameters of the distribution for loss computation. If there are multiple parameters, the loss is the average mean-square-error between the predicted parameter values and their true values.

Fig. 3. (a) The mean-square-error versus the number of training examples for exponential distributions with known parameter ranges. The horizontal lines represent the mean-square-errors of MLE, and the curves represent those of our approach. (b) The same for exponential distributions with unknown parameter ranges.

4.3.1 Known Parameter Range
When each parameter's range is known, our approach normalizes each sample into [0, 1] by a fixed linear transform. In this case it caps each value of a sample within [−35, 35] (because 35 is at least three standard deviations away from the mean), and normalizes each capped value $x_i$ by $x'_i = (x_i + 35)/70$. For maximum likelihood estimation, the MLE of µ is $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$, and that of σ is $\hat{\sigma} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2}$.
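The known-range preprocessing for the normal case and the capped MLE baseline can be sketched as follows. This is a minimal example assuming NumPy, with the parameter ranges [−5, 5] for µ and [1, 10] for σ taken from the experiment description; the helper names are hypothetical.

```python
import numpy as np

def normalize_normal_known_range(sample, bound=35.0):
    """Cap each value within [-bound, bound] and map it linearly to [0, 1]."""
    return (np.clip(sample, -bound, bound) + bound) / (2.0 * bound)

def mle_normal_capped(sample, mu_range=(-5.0, 5.0), sigma_range=(1.0, 10.0)):
    """MLE of (mu, sigma) for a normal sample, capped to the known ranges."""
    sample = np.asarray(sample, dtype=np.float64)
    mu_hat = sample.mean()
    sigma_hat = np.sqrt(np.mean((sample - mu_hat) ** 2))   # 1/N, not 1/(N-1)
    return float(np.clip(mu_hat, *mu_range)), float(np.clip(sigma_hat, *sigma_range))

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=3.0, size=30)
x_norm = normalize_normal_known_range(sample)
mu_mle, sigma_mle = mle_normal_capped(sample)
```

Note that the MLE of σ divides by N rather than N − 1, so it is slightly biased downward on small samples, which is one reason a learned estimator can have lower mean-square-error there.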

Fig. 4. The mean-square-error versus the number of training examples for normal distributions with known parameter ranges. The horizontal lines represent the mean-square-errors of MLE, and the curves represent those of our approach.

4.3.2 Unknown Parameter Range
Suppose a sample is drawn from a normal distribution $N(\mu, \sigma^2)$, and the ranges of µ and σ are unknown. As in the case of the exponential distribution, we first normalize each sample s into the range [0, 1], in order to hide the parameters' ranges from our model. Let a = min(s) and b = max(s). Each value $x_i$ is normalized by $x'_i = (x_i - a)/(b - a)$. Suppose the model's outputs are $\mu^*$ and $\sigma^*$. To compute the loss, we first recover them into the original range: the estimate of µ is $\hat{\mu} = \mu^* \cdot (b - a) + a$, and that of σ is $\hat{\sigma} = \sigma^* \cdot (b - a)$. Then we compare them with the true parameters, as follows: loss = mean-square-error($[\hat{\mu}, \hat{\sigma}]$, $[\mu, \sigma]$).
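The min-max normalization, the recovery of the parameters, and the loss for the normal case can be sketched as below. This is a minimal illustration assuming NumPy; the helper names are hypothetical, the moments of the normalized sample stand in for the model outputs µ* and σ*, and the degenerate case a = b is not handled.

```python
import numpy as np

def min_max_normalize(sample):
    """Map a sample into [0, 1]; a and b are kept outside the model."""
    sample = np.asarray(sample, dtype=np.float64)
    a, b = sample.min(), sample.max()
    return (sample - a) / (b - a), a, b

def recover_normal_params(mu_star, sigma_star, a, b):
    """Convert predictions on the normalized scale back to the original scale."""
    return mu_star * (b - a) + a, sigma_star * (b - a)

def mse_loss(pred, true):
    return float(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

sample = np.array([3.0, 5.0, 9.0, 7.0])
x_norm, a, b = min_max_normalize(sample)

mu_star, sigma_star = x_norm.mean(), x_norm.std()     # stand-in for model outputs
mu_hat, sigma_hat = recover_normal_params(mu_star, sigma_star, a, b)
loss = mse_loss([mu_hat, sigma_hat], [6.0, 2.0])      # true (mu, sigma) = (6, 2)
```

The mean is both rescaled and shifted, while the standard deviation is only rescaled, mirroring how a shift of the data affects location but not spread.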

Table 1. Mean-square-error and training time of various settings for normal distribution.

Table 2. The mean-square-error of MLE and our approach for each sample size for exponential distribution with known parameter range, together with two-sample t-test results.

Table 3. The mean-square-error of MLE and our approach for each sample size for exponential distribution with unknown parameter range, together with two-sample t-test results.

Table 4. The mean and standard deviation of the mean-square-error of MLE and our approach for each sample size for normal distribution with known parameter range, together with t-value and p-value from a two-sample t-test.

Table 5. The mean-square-error of MLE and our approach for each sample size for normal distribution with unknown parameter range, together with two-sample t-test results.