Data Descriptor

Small Samples’ Permille Cramér–Von Mises Statistic Critical Values for Continuous Distributions as Functions of Sample Size

by
Lorentz Jäntschi
Department of Physics and Chemistry, Technical University of Cluj-Napoca, 103-105 Muncii Blvd., 400641 Cluj-Napoca, Romania
Data 2025, 10(11), 181; https://doi.org/10.3390/data10110181
Submission received: 5 August 2025 / Revised: 1 November 2025 / Accepted: 3 November 2025 / Published: 5 November 2025

Abstract

Along with other order statistics, the Cramér–von Mises (CM) statistic can be used to assess goodness of fit. The CM statistic has no explicit formula for its cumulative distribution function; the alternative is to obtain its critical values from a Monte Carlo (MC) experiment. A high-resolution experiment was deployed to generate a large number of simulated values of the CM statistic. Twenty-one repetitions of the experiment were conducted, and in each case, critical values of the CM statistic were obtained for all permilles and sample sizes from 2 to 30. The raw data presented here can serve to interpolate and extract probabilities associated with the CM statistic directly, or to obtain a mathematical model for the bivariate dependence.
Dataset: The dataset was submitted and will be published as a supplement to this article in the journal Data. The dataset is available electronically as supplementary materials.

1. Introduction

The Cramér–von Mises (CM) test is often used in biological sciences. It is applied to compare observed distributions of phenomena against theoretical models (see [1]). The test is also used to determine if certain data follow a theoretical distribution. This is a crucial prerequisite for applying parametric statistical methods. Examples of data tested include gene expression levels [2] and physiological measurements [3].
A typical use of CM is assessing the goodness of fit [4]. The need for an accurate probability associated with the CM statistic is obvious, and in most instances, it is obtained from a Monte Carlo simulation (see, for instance, [5], where R (version 4.2.1) is reported in the supplementary material to provide calculations for the probability associated with the statistic).
Having the probability associated with the statistic is a better alternative than operating with critical levels. A few typical cases are listed below:
  • Recognizing that innovations may follow heavy-tailed distributions, [6] supports using these for Lee–Carter residuals and mortality index differences. They compare six distributions using critical thresholds. However, probability-based comparisons better identify the best-fitting model.
  • Analyzing data characterized by skewness, heavy tails, and diverse hazard behaviors, [7] introduces a new distribution. The dependence of the p-value [8] gives an alternative to the information criteria as goodness-of-fit measures.
  • Testing if data follow specific distributions is common in biological/environmental sciences [9]. Multiple distribution options complicate selection, but combining goodness-of-fit probabilities [10] provides a robust solution.
The CM statistic [11,12] is an order statistic-based goodness-of-fit test, which measures the discrepancy between an empirical distribution function and a theoretical distribution function. It has been applied to testing both discrete [13] and continuous [14] distributions, as well as to parameter estimation problems [15]. Beyond hypothesis testing, the CM statistic has also been adapted for use as a distance metric in various statistical applications [16].

2. Data Description

For a sample $x$ of size $n$, with values not necessarily ordered or distinct, drawn from, or following, a theoretical distribution for which the cumulative distribution function (CDF) is available, the cumulative probability $q_i$ associated with $x_i$ can be expressed by Equation (1).
$$q_i = \mathrm{CDF}(x_i; \alpha_1, \ldots, \alpha_j), \tag{1}$$
where $\alpha_1, \ldots, \alpha_j$, the parameters of the theoretical distribution, can be predefined or estimated from the sample. One can calculate the degrees of freedom $\nu = n - l$, where $l$ is the number of equations based on sample data used to calculate the parameters [17,18].
The CDF brings an arbitrary distribution $(x_i)_{1 \le i \le n}$ to the standard continuous uniform distribution (of $(q_i)_{1 \le i \le n}$, $q_i \in [0, 1]$). Let $s_1, \ldots, s_n$ be defined by Equation (2):
$$\{s_1, \ldots, s_n\} = \mathrm{Sort}(\{q_1, \ldots, q_n\}) \tag{2}$$
with sorting producing an ascending order ($s_1 \le \ldots \le s_i \le \ldots \le s_n$).
Then, for an arbitrary sample $s_j$, the Cramér–von Mises sample statistic can be computed ($y_j$ in Equation (3)).
$$y_j = \frac{1}{12n} + \sum_{i=1}^{n} \left( \frac{2i-1}{2n} - s_{j,i} \right)^2 \tag{3}$$
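Equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration, not the author's Pascal implementation: the function name `cm_statistic` and the example values are hypothetical, and the example assumes a standard uniform null distribution so that $q_i = x_i$.

```python
def cm_statistic(q):
    """Cramér–von Mises statistic of Equation (3) for cumulative probabilities q."""
    n = len(q)
    s = sorted(q)  # Equation (2): ascending order
    # Equation (3) uses i = 1..n; enumerate() starts at 0, hence 2*i + 1.
    return 1.0 / (12 * n) + sum(
        ((2 * i + 1) / (2 * n) - s_i) ** 2 for i, s_i in enumerate(s)
    )


# Hypothetical sample, assumed to come from the standard uniform distribution,
# for which CDF(x) = x and therefore q_i = x_i in Equation (1).
sample = [0.62, 0.10, 0.35, 0.90]
print(cm_statistic(sample))
```

The statistic reaches its minimum, $1/(12n)$, when every sorted value sits exactly at $(2i-1)/(2n)$, and it does not depend on the order in which the values are supplied.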
The Monte Carlo (MC) simulation consists of generating a number ($M$, conveniently chosen) of samples $q_1, \ldots, q_n$ (of the same size $n$) and calculating the CM statistic for each of them.
Let $z_1, \ldots, z_M$ be defined by Equation (4):
$$\{z_1, \ldots, z_M\} = \mathrm{Sort}(\{y_1, \ldots, y_M\}) \tag{4}$$
with sorting producing an ascending order. In Equation (4), $(z_j)_{1 \le j \le M}$ and $(y_j)_{1 \le j \le M}$ represent the same values, possibly listed in a different order. If $\sigma$ is the permutation which sorts $(y_j)_{1 \le j \le M}$, then $z_i = y_{\sigma(i)}$ and $z_1 \le \ldots \le z_j \le \ldots \le z_M$.
In Equation (4), $(z_j)_{1 \le j \le M}$ is the simulated CDF. If one cuts it into $k$ (preferably equally sized) pieces, then the relative positions of the cuts represent the significance levels, and the right ends of the cuts represent the critical values. If $k = 100$, one cuts into 100 pieces and the cuts estimate the percentiles of CM; if $k = 1000$, one cuts into 1000 pieces and the cuts estimate the permilles of CM. Here, the latter was used.

3. Methods

$M$ was conveniently chosen as a multiple of $k$. Let $\zeta_l$ be defined by Equation (5):
$$\zeta_l = z_{l \cdot M / k} \tag{5}$$
with $l = 1, 2, \ldots, k$.
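The simulation and cutting steps of Equations (4) and (5) can be sketched as follows. This is a small-scale illustration under stated assumptions, not the experiment reported here: `mc_critical_values` is a hypothetical name, $M$ is far smaller than in the actual study, percentiles ($k = 100$) are used instead of permilles to keep the run short, and Python's built-in generator (also a Mersenne Twister) stands in for the author's generator.

```python
import random


def cm_statistic(q):
    """Cramér–von Mises statistic of Equation (3)."""
    n = len(q)
    s = sorted(q)
    return 1.0 / (12 * n) + sum(
        ((2 * i + 1) / (2 * n) - s_i) ** 2 for i, s_i in enumerate(s)
    )


def mc_critical_values(n, M, k, seed=0):
    """Return [zeta_1, ..., zeta_k] from M simulated uniform samples of size n."""
    if M % k != 0:
        raise ValueError("M must be a multiple of k")
    rng = random.Random(seed)  # seeded for reproducibility
    y = [cm_statistic([rng.random() for _ in range(n)]) for _ in range(M)]
    z = sorted(y)  # Equation (4)
    step = M // k
    # Equation (5): zeta_l = z_{l*M/k}; the text is 1-based, Python is 0-based.
    return [z[l * step - 1] for l in range(1, k + 1)]


zeta = mc_critical_values(n=5, M=20000, k=100)
print(zeta[94])  # estimate of the 95th percentile (upper tail 0.05) for n = 5
```

With such a small $M$ the estimate is noisy, which is exactly why the text goes on to discuss very large $M$ and the combination of replicates.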
The value of $M$ is limited only by the available memory. If working with single-precision values (singles), the constant $mem$ is a fifth of the memory limit, and then $M = mem - (mem \bmod (kn + k))$, where mod is the remainder of division. Greater $M$ values give greater accuracy. Choosing $M$ as a multiple of $(kn + k)$ has the following rationale: on the one hand, for a sample of size $n$, the binomial expansion of $(1 + 1)^n$ has $n + 1$ terms, which need to be stored separately as weights (see Algorithm 1 in [8]). On the other hand, a subset of $k$ values needs to be extracted from the data, and these values are more conveniently extracted if the size of the data is a multiple of $k$.
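The rounding of $M$ described above amounts to one line of integer arithmetic. The sketch below assumes that $mem$ is already expressed as a count of values; the function name `choose_M` and the figure of $10^9$ are hypothetical and are not taken from the experiment.

```python
def choose_M(mem, n, k):
    """Largest multiple of k*n + k that does not exceed mem."""
    block = k * n + k  # equals k * (n + 1)
    return mem - mem % block


# Hypothetical budget of 10**9 values, for n = 30 and permilles (k = 1000).
M = choose_M(10**9, n=30, k=1000)
print(M, M % 1000, M % 31)
```

Because $kn + k = k(n + 1)$, the resulting $M$ is divisible both by $k$ (so the cuts of Equation (5) fall on integer positions) and by $n + 1$ (the number of weights).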
If $k = 2$, then $\zeta_1 = z_{M/2}$ is an estimate for the end of the second quartile ($Q_2$, corresponding to $P(X \le z_{M/2}) \approx 0.5$), and $\zeta_2 = z_M$ is an estimate for the end of the last quartile ($Q_4$, corresponding to $P(X \le z_M) \approx 1.0$). If $k = 4$, then $\zeta_1 = z_{M/4}$ is an estimate for the end of the first quartile ($Q_1$, corresponding to $P(X \le z_{M/4}) \approx 0.25$), $\zeta_2 = z_{M/2}$ is an estimate for the end of the second quartile ($Q_2$, corresponding to $P(X \le z_{M/2}) \approx 0.5$), $\zeta_3 = z_{3M/4}$ is an estimate for the end of the third quartile ($Q_3$, corresponding to $P(X \le z_{3M/4}) \approx 0.75$), and $\zeta_4 = z_M$ is an estimate for the end of the last quartile ($Q_4$, corresponding to $P(X \le z_M) \approx 1.0$).
The key step in obtaining a statistic-probability map for CM is to generate a large amount of data (a very large $M$). However, one can argue that no value of $M$ is ever large enough, and that might be correct. This is why a statistic that combines several samples is needed.
To exemplify and particularize, let us fix $k = 1000$ in the following discussion. Let there be $H$ samples of size $M$ obtained from independent MC simulations for CM. Let $\zeta_{h,1}, \zeta_{h,2}, \ldots, \zeta_{h,1000}$ be the cuts from sample $h$ ($1 \le h \le H$). Each cut is a sample statistic (in this instance, the permille critical value). For each cut, the $H$ values of this sample statistic can be sorted (Equation (6)).
$$\{\eta_{1,l}, \ldots, \eta_{H,l}\} = \mathrm{Sort}(\{\zeta_{1,l}, \ldots, \zeta_{H,l}\}) \tag{6}$$
with $l = 1, 2, \ldots, k$ (with $l = 1, 2, \ldots, 1000$ here).
Which would be the overall unbiased estimate of the observed permilles of the big sample (supposing that someone joined all $H$ samples)?
Since one deals with ordered sets (the $\zeta$ values were extracted from the series, after sorting, as the values placed at certain positions), the proper statistic would be the median (using the notation from Equation (6), with $H$ conveniently chosen as an odd number, the $l$-permille estimator is $\eta_{(H+1)/2,\, l}$).
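The median-of-replicates estimator built on Equation (6) can be sketched as below. The function name `combine_replicates` and the three short replicates are hypothetical and serve only to show the indexing; the experiment itself used 21 replicates.

```python
def combine_replicates(cuts):
    """cuts[h][l] is cut l of replicate h; return the median of each cut."""
    H = len(cuts)
    if H % 2 == 0:
        raise ValueError("H should be odd so that the median is an observed value")
    # zip(*cuts) gathers the H values of each cut l; sorting them is Equation (6),
    # and the estimator is the middle element eta_{(H+1)/2, l} (1-based).
    return [sorted(column)[(H + 1) // 2 - 1] for column in zip(*cuts)]


# Three hypothetical replicates with two cuts each.
replicates = [
    [0.2104, 0.4541],
    [0.2106, 0.4543],
    [0.2105, 0.4540],
]
print(combine_replicates(replicates))
```

Choosing $H$ odd means the estimator is always one of the simulated values, so no averaging of two middle values is needed.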
When generating large amounts of data, one cannot avoid randomness. A uniform-distribution random number generator based on the Mersenne Twister was used (see Rand in Algorithm 1 from [8]). To benefit from speed, the Lazarus environment (a free development environment for Pascal) was used for the software implementation, and a 64-bit Windows binary executable was generated (available from the author upon request).

4. Brief Comparison with Existing Tabulated CM Values

The proposed data have improved accuracy. To exemplify, Table 1 provides a comparison with previously reported data [19]:
In Table 1, the values from ref. [19] consistently fall outside the range calculated from the 21 replicates. Since 1/21 is less than 0.05 (5%), the values of ref. [19] are, in other words, systematically in error. However, the comparison serves very well to prove the argued point. The tendencies are, for the most part, exactly opposite. Thus,
  • For $\alpha = 25\%$, ref. [19] data are ascending with $n$, while our data indicate the opposite;
  • For $\alpha = 15\%$, ref. [19] data are ascending with $n$, the same as our data indicate;
  • For $\alpha \le 10\%$, ref. [19] data are descending with $n$, while our data indicate the opposite.
On the other hand, when looking at Kolmogorov–Smirnov (KS) data [20], one can notice that the value of the KS statistic in the upper tail is increasing with sample size. Thus, the reported values correct a trend misreported in the literature.
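The trends listed above can be checked directly against Table 1. The sketch below copies three of the eight levels from the table (for $n$ = 3, 5, 8, 10, 20) and, as a simplifying assumption, represents each 21-replicate range by its lower end; the helper `trend` is a hypothetical name.

```python
# alpha -> (values from ref. [19], lower ends of the ranges from our data)
table1 = {
    0.250: ([0.198, 0.207, 0.209, 0.209, 0.209],
            [0.213384, 0.211632, 0.210698, 0.210424, 0.209903]),
    0.150: ([0.282, 0.284, 0.284, 0.284, 0.284],
            [0.279629, 0.283004, 0.283486, 0.283622, 0.283848]),
    0.050: ([0.472, 0.467, 0.464, 0.463, 0.462],
            [0.439365, 0.446893, 0.452274, 0.454137, 0.457727]),
}


def trend(values):
    """Classify a sequence as ascending, descending, or mixed (ties allowed)."""
    pairs = list(zip(values, values[1:]))
    if all(a <= b for a, b in pairs):
        return "ascending"
    if all(a >= b for a, b in pairs):
        return "descending"
    return "mixed"


for alpha, (ref19, ours) in table1.items():
    print(alpha, "ref. [19]:", trend(ref19), "| our data:", trend(ours))
```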

5. User Notes

With $z_1, \ldots, z_M$ of Equation (4), or, since $M$ is too large to be used for practical purposes, with the cut values $\zeta_l$ from Equation (5), one can construct a table listing values by sample size $n$ and relative position $l$ (corresponding to statistical significance levels).
Furthermore, when comparing the calculated value of CM for a sample under consideration with the tabulated values, one needs to obtain the relative position expressing a probability to observe such a (large) value in practice simply by chance.
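One way to obtain that relative position is linear interpolation between neighbouring tabulated critical values. The sketch below is an illustration only: it uses the eight upper-tail levels of Table 1 for $n = 10$ (lower ends of the 21-replicate ranges) rather than the 999 permilles of the supplementary file, and the name `upper_tail_probability` is hypothetical.

```python
from bisect import bisect_left

# Critical values for n = 10 (lower ends of the ranges in Table 1)
# and the matching upper-tail probabilities.
crit = [0.210424, 0.283622, 0.345004, 0.454137,
        0.565833, 0.714325, 0.826036, 1.079760]
alpha = [0.250, 0.150, 0.100, 0.050, 0.025, 0.010, 0.005, 0.001]


def upper_tail_probability(cm):
    """Linearly interpolate P(CM >= cm) between tabulated critical values."""
    if not crit[0] <= cm <= crit[-1]:
        raise ValueError("value outside the tabulated range")
    j = bisect_left(crit, cm)  # first index with crit[j] >= cm
    if crit[j] == cm:
        return alpha[j]
    w = (cm - crit[j - 1]) / (crit[j] - crit[j - 1])
    return alpha[j - 1] + w * (alpha[j] - alpha[j - 1])


# Hypothetical observed CM value for a sample of size 10.
print(upper_tail_probability(0.40))
```

With the full permille table the neighbouring levels are only 0.001 apart, so the interpolation error is correspondingly smaller than in this coarse example.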
The standard deviation ($\sigma$) of each permille value was calculated on the sample from the supplementary material and used to generate Figure 1.
In Figure 1, the MATLAB trisurf function was used for visual effects, while the delaunay function was used for triangulation. Figure 1 reveals smooth transitions, well suited for linear interpolations within the domain when needed.
The values in the data given (see supplementary material) range as follows:
  • n (sample size): from 2 to 30;
  • l (permille): from 1 to 999;
  • $\zeta_l(n)$ (CM critical value): from $0.0175378 \pm 0.0000005$ (at $n = 30$ and $l = 1$) to $1.13924 \pm 0.00004$ (at $n = 30$ and $l = 999$).
There is an increase in the variability from smaller to bigger values of the CM statistic obtained from the MC experiment. Most of the variability increase can be attributed to the simultaneous increase in the value of the statistic. However, the standard deviation plot from Figure 1 shows that the estimation noise is below $10^{-4}$ most of the time.

6. Conclusions

The generated data have little noise and can therefore estimate the CM statistic well for probabilities within the range [0.001, 0.999] and sample sizes within the range [2, 30]. The generated data can be used directly for goodness-of-fit tests with small to moderate samples. The permille-level resolution enables precise critical values and p-values, improving hypothesis testing confidence in quality control, bioinformatics, and financial modeling, to give only some examples. Practitioners can evaluate how theoretical distributions fit observed data, particularly for small samples, influencing model selection through robust, reproducible criteria. Developers can enhance small-sample inference modules, improving accuracy and user trust. In industrial quality control with limited samples, CM statistic data enable reliable detection of distribution deviations, improving early identification of manufacturing issues, while in medical studies with small patient cohorts, statisticians can conduct distributional assessments of biomarkers or treatment effects with improved accuracy, facilitating rigorous non-parametric testing under clinical trial constraints.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/data10110181/s1. Raw data: Cramér–von Mises statistic critical values for permilles and sample sizes ranging from 2 to 30.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the supplementary material.

Acknowledgments

A paper using these data to construct a model for CM statistic has been published in the journal Symmetry, doi 10.3390/sym17091542.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

CDF: cumulative distribution function, $\mathrm{CDF}(x) = P(X \le x)$
$P(A)$: probability of event $A$
CM: Cramér–von Mises (statistic)
MC: Monte Carlo (simulation)
Sort: a function sorting a numeric array (ascending)
$\sigma$: standard deviation ($\mu = \frac{\sum_{i=1}^{n} \xi_i}{n}$; $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (\xi_i - \mu)^2}{n - 1}}$)

References

  1. Ross, G.J.; Adams, N.M. Two nonparametric control charts for detecting arbitrary distribution changes. J. Qual. Technol. 2012, 44, 102–116.
  2. Qiu, X.; Xiao, Y.; Gordon, A.; Yakovlev, A. Assessing stability of gene selection in microarray data analysis. BMC Bioinform. 2006, 7, 50.
  3. Merkle, E.C.; Zeileis, A. Tests of measurement invariance without subgroups: A generalization of classical methods. Psychometrika 2013, 78, 59–82.
  4. Ashkar, F.; Aucoin, F.; Choulakian, V.; Vautour, C. Cramér-von Mises and Anderson-Darling goodness-of-fit tests for the two-parameter kappa distribution. J. Hydrol. Eng. 2013, 18, 1749–1757.
  5. Chotsiri, P.; Yodsawat, P.; Hoglund, R.M.; Simpson, J.A.; Tarning, J. Pharmacometric and statistical considerations for dose optimization. CPT Pharmacometrics Syst. Pharmacol. 2025, 2025, 279–291.
  6. Wang, C.W.; Huang, H.C.; Liu, I.C. A quantitative comparison of the Lee-Carter model under different types of non-Gaussian innovations. Geneva Pap. Risk Insur.-Issues Pract. 2011, 36, 675–696.
  7. Obulezi, O.J.; Semary, H.E.; Nadir, S.; Igbokwe, C.P.; Orji, G.O.; Al-Moisheer, A.; Elgarhy, M. Type-I Heavy-Tailed Burr XII Distribution with Applications to Quality Control, Skewed Reliability Engineering Systems and Lifetime Data. Comput. Model. Eng. Sci. 2025, 144, 2991.
  8. Jäntschi, L. The Cramér–Von Mises Statistic for Continuous Distributions: A Monte Carlo Study for Calculating Its Associated Probability. Symmetry 2025, 17, 1542.
  9. Scholze, M.; Boedeker, W.; Faust, M.; Backhaus, T.; Altenburger, R.; Grimme, L.H. A general best-fit method for concentration-response curves and the estimation of low-effect concentrations. Environ. Toxicol. Chem. 2001, 20, 448–457.
  10. Jäntschi, L. Detecting extreme values with order statistics in samples from continuous distributions. Mathematics 2020, 8, 216.
  11. Cramér, H. On the composition of elementary errors. Scand. Actuar. J. 1928, 1, 13–74.
  12. Von Mises, R. Wahrscheinlichkeit, Statistik und Wahrheit; Springer: Berlin, Germany, 1928.
  13. Traison, T.; Vaidyanathan, V. Goodness-of-Fit Tests for COM-Poisson Distribution Using Stein’s Characterization. Austrian J. Stat. 2025, 54, 85–100.
  14. Muhammad, M.; Abba, B. A Bayesian inference with Hamiltonian Monte Carlo (HMC) framework for a three-parameter model with reliability applications. Kuwait J. Sci. 2025, 52, 100365.
  15. Singh Nayal, A.; Ramos, P.L.; Tyagi, A.; Singh, B. Improving inference in exponential logarithmic distribution. Commun. Stat.-Simul. Comput. 2024, 1–25.
  16. Chen, Y.; Ding, T.; Wang, X.; Zhang, Y. A robust and powerful metric for distributional homogeneity. Stat. Neerl. 2025, 79, e12370.
  17. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. London. Ser. A Contain. Pap. A Math. Phys. Character 1922, 222, 309–368.
  18. Fisher, R.A. Theory of statistical estimation. In Mathematical Proceedings of the Cambridge Philosophical Society; Cambridge University Press: Cambridge, UK, 1925; Volume 22, pp. 700–725.
  19. Elmore, K.L. Alternatives to the chi-square test for evaluating rank histograms from ensemble forecasts. Weather Forecast. 2005, 20, 789–795.
  20. Stephens, M.A. EDF statistics for goodness of fit and some comparisons. J. Am. Stat. Assoc. 1974, 69, 730–737.
Figure 1. A 3D plot for CDF of standard deviations ( σ ) in CM replicas as a function of sample size (n) and critical level (l) and 2D contour of the same.
Table 1. Comparing critical values of the CM statistic.

| Sample Size (n) | Upper Tail (α) | Ref. [19] | Our Data (from 21 Replicates) |
|---|---|---|---|
| 3 | 0.250 | 0.198 | [0.213384, 0.213408] |
| 5 | 0.250 | 0.207 | [0.211632, 0.211669] |
| 8 | 0.250 | 0.209 | [0.210698, 0.210726] |
| 10 | 0.250 | 0.209 | [0.210424, 0.210441] |
| 20 | 0.250 | 0.209 | [0.209903, 0.209921] |
| 3 | 0.150 | 0.282 | [0.279629, 0.279662] |
| 5 | 0.150 | 0.284 | [0.283004, 0.283059] |
| 8 | 0.150 | 0.284 | [0.283486, 0.283528] |
| 10 | 0.150 | 0.284 | [0.283622, 0.283657] |
| 20 | 0.150 | 0.284 | [0.283848, 0.283880] |
| 3 | 0.100 | 0.351 | [0.337834, 0.337877] |
| 5 | 0.100 | 0.350 | [0.342342, 0.342372] |
| 8 | 0.100 | 0.348 | [0.344413, 0.344463] |
| 10 | 0.100 | 0.348 | [0.345004, 0.345046] |
| 20 | 0.100 | 0.347 | [0.346142, 0.346205] |
| 3 | 0.050 | 0.472 | [0.439365, 0.439406] |
| 5 | 0.050 | 0.467 | [0.446893, 0.446962] |
| 8 | 0.050 | 0.464 | [0.452274, 0.452317] |
| 10 | 0.050 | 0.463 | [0.454137, 0.454180] |
| 20 | 0.050 | 0.462 | [0.457727, 0.457835] |
| 3 | 0.025 | 0.603 | [0.533164, 0.533223] |
| 5 | 0.025 | 0.590 | [0.550486, 0.550601] |
| 8 | 0.025 | 0.584 | [0.562079, 0.562164] |
| 10 | 0.025 | 0.583 | [0.565833, 0.565986] |
| 20 | 0.025 | 0.581 | [0.573278, 0.573385] |
| 3 | 0.010 | 0.783 | [0.639724, 0.639880] |
| 5 | 0.010 | 0.750 | [0.683402, 0.683510] |
| 8 | 0.010 | 0.749 | [0.706920, 0.707071] |
| 10 | 0.010 | 0.748 | [0.714325, 0.714639] |
| 20 | 0.010 | 0.581 | [0.729055, 0.729229] |
| 3 | 0.005 | 0.922 | [0.706601, 0.706741] |
| 5 | 0.005 | 0.888 | [0.780500, 0.780656] |
| 8 | 0.005 | 0.877 | [0.814893, 0.815198] |
| 10 | 0.005 | 0.874 | [0.826036, 0.826387] |
| 20 | 0.005 | 0.871 | [0.848019, 0.848251] |
| 3 | 0.001 | 1.215 | [0.821365, 0.821671] |
| 5 | 0.001 | 1.204 | [0.985550, 0.985945] |
| 8 | 0.001 | 1.179 | [1.056848, 1.057377] |
| 10 | 0.001 | 1.175 | [1.079760, 1.080267] |
| 20 | 0.001 | 1.170 | [1.124318, 1.124957] |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
