Appendix A. Specimen Attribute RV Pre-Assessments
As already mentioned in the narrative, the Box–Cox power and Manly exponential data transformations attempt to align an attribute RV with a normal distribution, and in doing so stabilize the RV’s variance to a normal distribution’s constant dispersion. In their inverse forms, these transformations tend to be more applicable to RVs whose observations exhibit right-skewness, tending to concentrate relatively close to zero ([3] (p. 29)) within their non-negative support. A noteworthy difference between the inverse polynomial and negative exponential functions is that the former suggests a more complex distribution, whereas the latter indicates a simple distribution. Therefore, when exponents are outside of the [−2, 2] Tukey power ladder interval, parsimony argues for swapping these descriptive equations; this is the same type of argument backing Table 9. This replacement occurs three times in Table A1 (the entries marked †): Dallas County 40–49 years of age density (Box–Cox exponent = −4.98), and DFW MSA professional (exponent = −2.64) and wholesale (exponent = −4.51) employment percentages.
Table A1.
Some relevant facets of reciprocal Manly exponential data transformation oriented attribute RVs.
Attribute | Standardization | Geographic Landscape | Manly Transformation Coefficient | MSE | S-W p-Value | Skewness | Excess Kurtosis |
---|---|---|---|---|---|---|---|
owner occupied housing units | density | DC | −0.09605 | 0.02 | <0.0001 | −0.042 | −0.359 |
20–29 years of age | % | DC | −0.05560 | 0.03 | <0.0001 | 0.067 | 1.117 |
20–29 years of age | % | DFW MSA | −0.06414 | 0.01 | <0.0001 | 0.043 | 0.403 |
30–39 years of age | % | DC | −0.03990 | 0.04 | <0.0001 | −0.009 | 1.990 |
30–39 years of age | % | DFW MSA | −0.04609 | 0.02 | <0.0001 | 0.009 | 1.354 |
40–49 years of age | % | DFW MSA | −0.02020 | 0.04 | <0.0001 | −0.090 | 2.318 |
40–49 years of age | density | DC † | −0.25250 | <0.01 | 0.0688 | −0.029 | −0.261 |
50–64 years of age | % | DC | −0.04494 | 0.01 | 0.0076 | 0.026 | 0.868 |
50–64 years of age | % | DFW MSA | −0.03367 | <0.01 | 0.0284 | 0.019 | 0.213 |
50–64 years of age | density | DC | −0.14757 | 0.01 | 0.0115 | −0.016 | −0.114 |
65+ years of age | density | DC | −0.30756 | 0.01 | <0.0001 | −0.033 | −0.370 |
manufacturing employment | % | DC | −0.04699 | <0.01 | 0.1743 | 0.004 | 0.082 |
manufacturing employment | % | DFW MSA | −0.04324 | <0.01 | 0.1044 | 0.008 | 0.136 |
wholesale employment | % | DFW MSA † | −0.11909 | <0.01 | 0.0001 | −0.044 | −0.234 |
retail employment | % | DC | −0.04737 | 0.01 | 0.0035 | 0.104 | 0.900 |
retail employment | % | DFW MSA | −0.03975 | 0.01 | 0.0003 | 0.063 | 0.685 |
professional employment | % | DC | −0.04216 | <0.01 | 0.1129 | −0.005 | −0.057 |
professional employment | % | DFW MSA † | −0.04950 | <0.01 | 0.0247 | −0.009 | −0.120 |
education employment | % | DFW MSA | −0.01346 | <0.01 | 0.0109 | 0.016 | 0.452 |
miscellaneous employment | % | DC | −0.10405 | <0.01 | 0.2357 | −0.006 | −0.126 |
miscellaneous employment | % | DFW MSA | −0.10412 | <0.01 | 0.0203 | −0.003 | −0.022 |
The literature cited in this paper, as well as other readily available publications, furnishes a preponderance of evidence attesting to these two reciprocal transformations being very efficient and effective when undertaking their data modification task: empirical frequency distribution makeovers that deform them into mimicking a bell-shaped curve. In this paper, the S-W statistic provides an index of success for such metamorphoses. Hoeffding [46] posits a theorem concerning moment matching and the convergence in probability of density functions. For normal approximations, the first and second moments are of limited importance because they minimally impact density function shape; kurtosis governs the relative heaviness of tails with respect to variance size. A positive support often accompanies reciprocal transformations; certainly, this support cannot contain zero, whose inverse is undefined. In addition, variance must be finite. Meanwhile, Romano and Siegel [47] (pp. 48–49), for example, note counter-examples to the claim that two distributions with the same moments are identical. The notion of a normal approximation already concedes their point. Nevertheless, if one distribution imitates another, then some of their moments should harmonize. For a bell-shaped curve, the intuitive synchronization expectation is for those moments affiliated with skewness and kurtosis: ideal normal and after-transformation histograms should reflect symmetry and peakedness similarities.
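As a minimal sketch of the two shape diagnostics named above, the following Python function computes moment-based sample skewness and excess kurtosis (kurtosis minus three, so that a normal RV scores zero on both). The function name and the toy data are illustrative assumptions; statistical packages may apply small-sample bias corrections that this sketch omits, so its output need not match software-reported values digit for digit.

```python
def skewness_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments, no bias correction."""
    n = len(values)
    mean = sum(values) / n
    # Second, third, and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # subtract the normal RV benchmark of 3
    return skew, excess_kurt


# Hypothetical toy data: one symmetric sample and one right-skewed sample.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]

print(skewness_excess_kurtosis(symmetric))
print(skewness_excess_kurtosis(right_skewed))
```

A symmetric sample yields zero skewness, whereas a sample with a long upper tail yields a positive value, which is the pattern a successful reciprocal transformation is expected to remove.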
Table A1 and Table A2 tabulate these summary statistics for the attribute RVs discussed in this paper. Both theoretical values of interest are zero: the balance of symmetry begets zero, and excess kurtosis equals kurtosis minus three, the theoretical value for a normal RV. Each of these two tables presents three simultaneous statistical examinations, requiring a multiple testing correction; the Bonferroni adjustment is for a two-tailed 5% significance level, creating the following confidence intervals:
skewness for Dallas County of ± 0.254, and for the DFW MSA of ± 0.161; and,
kurtosis for Dallas County of ± 0.509, and for the DFW MSA of ± 0.322.
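The skewness bounds can be approximated with a few lines of Python. This is a hedged sketch rather than the paper's own computation: it assumes the standard error of sample skewness under normality, sqrt(6n(n − 1)/((n − 2)(n + 1)(n + 3))), and it assumes that the sample sizes of 529 and 1324 quoted later in this appendix belong to Dallas County and the DFW MSA, respectively. The function names are illustrative. Kurtosis bounds are left out because the source does not state which kurtosis standard error it employs.

```python
import math
from statistics import NormalDist


def skewness_standard_error(n):
    """Standard error of sample skewness for a normal RV (assumed formula)."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def bonferroni_bound(n, alpha=0.05, tests=3):
    """Two-tailed critical bound for skewness after a Bonferroni correction."""
    # Split alpha across the simultaneous tests and across the two tails.
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * tests))
    return z * skewness_standard_error(n)


# Assumed pairing of landscapes with sample sizes.
for label, n in (("Dallas County", 529), ("DFW MSA", 1324)):
    print(f"{label}: +/- {bonferroni_bound(n):.3f}")
```

Under these assumptions the sketch returns bounds that round to the ± 0.254 and ± 0.161 skewness values listed above.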
Table A2.
Some relevant facets of reciprocal Box–Cox data transformation oriented attribute RVs.
Attribute | Standardization | Geographic Landscape | Data Transformation δ | Data Transformation γ | MSE | S-W p-Value | Skewness | Excess Kurtosis |
---|---|---|---|---|---|---|---|---|
persons with some college | density | DC | 5.17 | −1.58479 | 0.01 | 0.0183 | −0.017 | −0.227 |
persons with associate degree | % | DC | 12.07 | −0.31589 | <0.01 | 0.0190 | −0.026 | −0.208 |
no public assistance count | density | DC | 12.84 | −1.64439 | 0.01 | 0.0012 | 0.005 | −0.098 |
owner occupied housing units | density | DC | 15.58 | −1.97132 | 0.01 | 0.0011 | 0.010 | −0.083 |
owner occupied housing units | density | DFW MSA | 10.09 | −1.08372 | 0.02 | <0.0001 | −0.088 | −0.373 |
vacant housing units | % | DC | 5.15 | −0.45848 | <0.01 | 0.4316 | −0.002 | −0.111 |
vacant housing units | % | DFW MSA | 1.99 | −0.04822 | <0.01 | 0.4396 | −0.005 | −0.043 |
vacant housing units | density | DC | 0.05 | −0.13852 | <0.01 | 0.1897 | −0.009 | −0.137 |
<20 years of age | density | DC | 2.63 | −0.28963 | <0.01 | 0.2893 | −0.000 | −0.128 |
20–29 years of age | density | DC | 0.61 | −0.23163 | <0.01 | 0.3897 | −0.002 | −0.100 |
30–39 years of age | density | DC | 0.12 | −0.37042 | <0.01 | 0.1980 | −0.003 | −0.099 |
65+ years of age | % | DC | 2.96 | −0.03533 | <0.01 | 0.9467 | −0.001 | −0.090 |
retail employment | density | DC | 0.27 | −0.13021 | <0.01 | 0.0312 | −0.005 | −0.103 |
transportation employment | density | DC | 0.19 | −0.27852 | <0.01 | 0.0123 | −0.029 | −0.197 |
financial employment | % | DC | 24.58 | −1.02556 | <0.01 | 0.2481 | −0.011 | −0.148 |
financial employment | density | DC | 0.32 | −0.30390 | <0.01 | 0.0087 | −0.016 | −0.202 |
professional employment | density | DC | 0.36 | −0.26838 | <0.01 | 0.0259 | −0.018 | −0.155 |
education employment | density | DC | 3.45 | −1.78403 | 0.01 | 0.0088 | −0.025 | −0.171 |
arts employment | % | DC | 5.33 | −0.03122 | <0.01 | 0.7045 | 0.000 | −0.116 |
arts employment | % | DFW MSA | 12.67 | −0.81549 | <0.01 | 0.0223 | −0.004 | −0.060 |
arts employment | density | DC | 0.06 | −0.03833 | <0.01 | 0.3089 | −0.028 | −0.115 |
miscellaneous employment | density | DC | 0.08 | −0.08133 | <0.01 | 0.1671 | −0.020 | −0.108 |
public employment | % | DC | 2.89 | −0.52241 | <0.01 | 0.0001 | −0.130 | −0.494 |
public employment | % | DFW MSA | 5.63 | −0.70713 | <0.01 | <0.0001 | −0.084 | −0.449 |
Hispanic population count | % | DFW MSA | 1.15 | −0.09025 | 0.02 | <0.0001 | 0.037 | −0.636 |
miscellaneous racial/ethnic count | % | DC | 0.23 | −0.08165 | <0.01 | 0.0819 | −0.011 | −0.252 |
miscellaneous racial/ethnic count | % | DFW MSA | 0.33 | −0.05555 | <0.01 | 0.0341 | −0.003 | −0.182 |
miscellaneous racial/ethnic count | density | DC | 0.03 | −0.03085 | <0.01 | 0.5073 | 0.011 | −0.102 |
These tables reveal that the transformations virtually always induce adequate (i.e., near-zero) skewness, but perhaps have a slightly lower chance of also inducing adequate kurtosis. Furthermore, even with near-perfect fits to normal quantile values, as measured by the MSE, they are even less likely to generate a non-significant S-W statistic. As an aside, the relatively large sample sizes of 529 and 1324 complicate this inferential appraisal; as Table 4 and Table 5 coupled with Figure 1 and Figure 2 demonstrate, almost all alignment gains through the use of transformations are substantial, even when transformed data S-W values remain statistically significant; this situation reflects the contemporary need to develop substantive difference criteria to replace statistical inference criteria.
Nevertheless, these larger sample sizes signify a situation in which modest departures from normality tend to be far less problematic. Accordingly, invoking the six-sigma rule here increases the confidence intervals to
skewness for Dallas County of ± 0.516, and for the DFW MSA of ± 0.326; and,
kurtosis for Dallas County of ± 1.236, and for the DFW MSA of ± 0.784.
Unfortunately, the reporting style of SAS software prevents a more precise scrutiny of the <0.0001 S-W p-values. Additionally, because the six-sigma rule classifies only 3.4 per million random samples as extreme outcomes, the natural presence of sampling error does not convincingly account for the few significant kurtosis cases appearing in Table A1; these particular few variable transformations may well be prone to serious specification error, a theme meriting future research.
On the one hand, because the assumption of normality rests upon symmetry, and a prominent characteristic of many non-normal RV probability density functions is asymmetry, skewness could be viewed as the more important of the two moments in a normality diagnosis. In keeping with this viewpoint, DeCarlo [48] suggests that skewness has a higher priority in equality of means tests. On the other hand, Khan and Rayner [49] (p. 204) state: “Both the ANOVA and Kruskal–Wallis tests are vastly more affected by the kurtosis of the error distribution rather than by its skewness.” This incongruity arises because correlation exists between skewness and kurtosis moments; their effects are not completely separable. For example, increasing skewness tends to demand increasing kurtosis in a frequency distribution. Ryu [50] highlights one consequence of this covariation: selected empirical distribution quantile plots disclose a thicker upper tail attributable to skewness as well as a longer upper tail attributable to kurtosis. With regard to data transformations, skewness usually is easier than kurtosis to manipulate: simultaneously and systematically stretching/shrinking measurement scale segments differentially to better center any clustering tendency of values (alluding to the Tukey–Mosteller bulge) can entail less effort than trying to increase/decrease this clustering propensity. Therefore, until some consensus decision-making rationale crystallizes for weighting one of these moments more than the other, data transformation evaluations should treat them equally, which essentially is the tactic taken in this paper.
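The covariation between skewness and kurtosis can be illustrated with a distribution whose shape moments have closed forms. The sketch below uses the lognormal distribution purely as a convenient example (it is not one of this paper's specimen RVs, and the function name and shape values are illustrative assumptions). As the lognormal shape parameter s grows, skewness and excess kurtosis rise together, and every pair respects the general bound that excess kurtosis is at least skewness squared minus two.

```python
import math


def lognormal_shape_moments(s):
    """Closed-form skewness and excess kurtosis of a lognormal RV with shape s."""
    w = math.exp(s * s)
    skew = (w + 2.0) * math.sqrt(w - 1.0)
    excess_kurt = w ** 4 + 2.0 * w ** 3 + 3.0 * w ** 2 - 6.0
    return skew, excess_kurt


# Hypothetical shape values, ordered from nearly symmetric to heavily skewed.
shapes = (0.1, 0.25, 0.5, 1.0)
pairs = [lognormal_shape_moments(s) for s in shapes]

for s, (skew, excess_kurt) in zip(shapes, pairs):
    bound = skew ** 2 - 2.0  # lower bound on excess kurtosis for any distribution
    print(f"s={s}: skewness={skew:.3f}, excess kurtosis={excess_kurt:.3f}, bound={bound:.3f}")
```

Because the two moments move in tandem here, a transformation that removes skewness from such an RV also removes much of its excess kurtosis, which is one reason their effects are hard to separate.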
Finally, Table A2 especially tabulates findings that would, for an overwhelming number of its entries, remain statistically non-significant even if the significance level criterion were more restrictive than that for six-sigma (e.g., the preceding 5% level three-test Bonferroni adjustment). In conclusion, the illustrative reciprocal transformations staged in this paper successfully align their corresponding empirical frequency distributions with a bell-shaped normal curve, when judged by a normal RV lower moments matching yardstick.
Appendix B. Deducing Equations (3) and (4)
In today’s academic world, the nature of mathematical proofs materializes in a multitude of appearances beyond their earlier formalisms, in part coinciding with the unfolding of experimental mathematics. Gone are the days of solely deductive/inductive, counter-example, and complete enumeration demonstrations. Now acceptable proofs also are by simulation [51], with some vigilance, as well as by, again with some caution, computer-assisted algebraic/symbolic manipulations (e.g., [36]). The determination and justification of Equations (3) and (4) are ascribable to both of these avant-garde tools: Mathematica 12.3 aided in the postulating of these two mathematical formulae, and simulation experimentation helps validate the presumable superfluousness of the discarded imaginary parts reported in Mathematica symbolic output. Accordingly, this backdrop insinuates that these two expressions are conjectures rather than theorems, and this appendix outlines the process and rationale used to posit them. Future research needs to convert them into theorems with proofs.
The formulation of Equation (3) begins with the following back-transformation for the reciprocal Manly exponential transformation:
$$X = e^{-\beta Y} \;\Rightarrow\; Y = -\mathrm{LN}(X)/\beta,$$
where e denotes Euler’s number (i.e., 2.71828…), and LN denotes the natural logarithm. The original data transformation e^(−βy) creates X ~ N(μ, σ²), presuming (μ − 6σ) >> 0, whose gap size is relative to the magnitude of the mean and standard deviation, where N denotes a normal RV. The companion Mathematica problem becomes
The computational outcome generated by executing this command is
where the imaginary part appears to be trivial (e.g., see Table 7), ₁F₁ is the Kummer confluent hypergeometric function of the first kind, the superscript (1, 0, 0) denotes the partial derivative with respect to only the first argument of the hypergeometric function ₁F₁, say a in its 3-tuple [a, b, z] argument, and EulerGamma ≈ 0.577216. Setting the imaginary part to zero, and replacing the Mathematica notation Log with the natural logarithmic notation LN, yields
Simulation experiments (e.g., Table 2) verify this reduced result. Nonetheless, future research needs to document definitively that the imaginary number part source term is irrelevant in general.
This last expression may be rewritten as follows, writing latent Pochhammer symbols with summation and product terms:
Theory of equations states that the coefficients for the kth-order polynomial generated by
are given by, for each of its a¹ terms that disappear with the first partial differentiation and after substitution of a = 0 in the resulting derivative, (k − 1)!. Thus, the new reduced expression becomes
which is Equation (3). For this paper, specimen empirical data for Dallas County and the DFW MSA submitted to Mathematica 12.3 supply numerical illustrations employing this expression.
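The kind of simulation check invoked above can be sketched in Python. This is an illustration under stated assumptions, not the paper's experiment: it takes the back-transformation to be Y = −LN(X)/β with X ~ N(μ, σ²), uses hypothetical values of μ, σ, and β chosen so that (μ − 6σ) > 0, and compares a Monte Carlo mean against a fourth-order Taylor (delta-method) approximation of E[LN(X)], which is a stand-in for, and not a reproduction of, Equation (3). All function names are illustrative.

```python
import math
import random


def manly_back_transform_mean_mc(mu, sigma, beta, draws=200_000, seed=12345):
    """Monte Carlo estimate of E[-LN(X)/beta] for X ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x = rng.gauss(mu, sigma)  # assumed positive because mu - 6*sigma > 0
        total += -math.log(x) / beta
    return total / draws


def manly_back_transform_mean_taylor(mu, sigma, beta):
    """Taylor approximation: E[LN(X)] ~ LN(mu) - c^2/2 - 3c^4/4, with c = sigma/mu."""
    c2 = (sigma / mu) ** 2
    expected_log = math.log(mu) - c2 / 2.0 - 3.0 * c2 ** 2 / 4.0
    return -expected_log / beta


# Hypothetical parameter values, not taken from the paper's tables.
mu, sigma, beta = 0.5, 0.05, 0.05
mc_mean = manly_back_transform_mean_mc(mu, sigma, beta)
taylor_mean = manly_back_transform_mean_taylor(mu, sigma, beta)
print(f"simulated mean: {mc_mean:.4f}, Taylor approximation: {taylor_mean:.4f}")
```

Close agreement between the two numbers is the pattern a real-valued closed-form expression should also exhibit when its discarded imaginary part is truly negligible.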
Equation (4) has a similar mathematical pedigree, and hence its derivation parallels the preceding protocol sketched for Equation (3). This new proposition begins with the following back-transformation for the reciprocal Box–Cox polynomial transformation:
$$Y = X^{1/\gamma} - \delta,$$
where, as mentioned in the text of this paper, δ is a translation/shift parameter. This data transformation also creates X ~ N(μ, σ²), presuming (μ − 6σ) >> 0. The companion Mathematica problem becomes
The computational outcome generated by executing this symbolic computer code is
where the imaginary part spawned by
appears to be trivial, enabling its removal. Next, factoring out
from the two terms
and
, and then combining it with
renders Equation (4), once more with the appropriate notational replacements (e.g., Γ for Gamma, and the embedded Pochhammer symbol based summations and products):
Interestingly, although the twice-appearing term causes the solution to be a complex number, trial-and-error experiments reveal that it cannot be deleted from this expression without nontrivial real number part consequences. This undesirable complication warrants future research. In addition, equivalent to the Equation (3) situation for this paper, specimen empirical data for Dallas County and the DFW MSA submitted to Mathematica 12.3 supply confirmatory numerical illustrations employing this final expression, ignoring its imaginary part.
To conclude, these two sets of reasoning deliver new normal curve theory transformation conceptualizations pertaining to inverse data transformations. Table A3 summarizes utilized specimen dataset implementation details for exemplification purposes in this paper; Figure A1 visualizes part of their quality evaluation. No back-transformed mean results reflect error in excess of 10%: Figure A1a portrays a near-perfect linear alignment of these quantities with their corresponding source observed means. Mathematica 12.3 is able to compute the analytical expected value of X² for Equation (4), allowing calculation of its analytical back-transformed standard error. This second moment quantity encompasses noticeably more error (e.g., Figure A1c) than its first moment counterpart, although Figure A1b indicates that even the most extreme case of this error still falls within its applicable linear regression prediction interval.
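A simulation counterpart to these analytical back-transforms can be sketched in Python for a single specimen. The sketch assumes that the reciprocal Box–Cox transformation takes the power form X = (Y + δ)^γ, so that Y = X^(1/γ) − δ; this form is an inference from the tabulated values rather than a formula quoted from the paper. The inputs are the DFW MSA vacant housing units (%) entries: δ = 1.99 and γ = −0.04822 from Table A2, and μ = 0.89882 and σ = 0.02030 from Table A3. The function name, number of draws, and seed are illustrative choices.

```python
import math
import random


def box_cox_back_transform_mc(mu, sigma, delta, gamma, draws=200_000, seed=2024):
    """Monte Carlo mean and standard deviation of Y = X**(1/gamma) - delta."""
    rng = random.Random(seed)
    power = 1.0 / gamma
    # X ~ N(mu, sigma^2); draws are assumed positive because mu - 6*sigma > 0.
    ys = [rng.gauss(mu, sigma) ** power - delta for _ in range(draws)]
    mean = sum(ys) / draws
    variance = sum((y - mean) ** 2 for y in ys) / (draws - 1)
    return mean, math.sqrt(variance)


# Vacant housing units (%), DFW MSA: parameters copied from Tables A2 and A3.
mean_mc, std_mc = box_cox_back_transform_mc(
    mu=0.89882, sigma=0.02030, delta=1.99, gamma=-0.04822
)
print(f"simulated mean: {mean_mc:.3f}, simulated std: {std_mc:.3f}")
```

Under the stated assumption, the simulated moments should land near the analytical back-transform entries reported for this specimen in Table A3 (mean 8.27266, standard deviation 5.18001), up to Monte Carlo error.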
Table A3.
Some relevant facets of reciprocal Box–Cox data transformation oriented attribute RVs.
Attribute | Standardization | Geographic Landscape | Sample Mean | Sample Std | X ~ N(μ, σ²): μ | X ~ N(μ, σ²): σ | Analytical Back-Transform Mean | Analytical Back-Transform Std |
---|---|---|---|---|---|---|---|---|
persons with some college | density | DC | 2.28868 | 1.78452 | 0.04504 | 0.01289 | 2.30168 | 1.82904 |
persons with associate degree | % | DC | 5.30865 | 3.09253 | 0.40835 | 0.02235 | 5.30959 | 3.09170 |
no public assistance count | density | DC | 7.67880 | 7.08938 | 0.00810 | 0.00289 | 7.63801 | 6.40739 |
owner occupied housing units | density | DC | 7.94437 | 7.37202 | 0.00233 | 0.00087 | 7.81829 | 6.36004 |
owner occupied housing units | density | DFW MSA | 5.88595 | 6.01351 | 0.05479 | 0.01523 | 5.88984 | 5.84572 |
vacant housing units | % | DC | 9.26427 | 5.36083 | 0.30598 | 0.04611 | 9.28522 | 5.61065 |
vacant housing units | % | DFW MSA | 8.27112 | 5.18106 | 0.89882 | 0.02030 | 8.27266 | 5.18001 |
vacant housing units | density | DC | 1.00941 | 1.58668 | 1.08762 | 0.15151 | 1.07676 | 3.01539 |
<20 years of age | density | DC | 6.00651 | 5.55741 | 0.56292 | 0.07947 | 6.00618 | 5.49232 |
20–29 years of age | density | DC | 3.75940 | 4.74007 | 0.77874 | 0.13213 | 3.86005 | 6.50218 |
30–39 years of age | density | DC | 3.42150 | 3.49520 | 0.61765 | 0.12209 | 3.47850 | 4.17092 |
65+ years of age | % | DC | 9.32520 | 5.66088 | 0.91842 | 0.01420 | 9.32757 | 5.71179 |
retail employment | density | DC | 1.09802 | 1.12837 | 0.99245 | 0.08382 | 1.09837 | 1.13553 |
transportation employment | density | DC | 0.51136 | 0.57754 | 1.18487 | 0.19236 | 0.51372 | 0.65285 |
financial employment | % | DC | 9.22157 | 5.26190 | 0.02767 | 0.00412 | 9.22708 | 5.29164 |
financial employment | density | DC | 0.93917 | 1.07139 | 1.01615 | 0.18628 | 0.95651 | 1.34996 |
professional employment | density | DC | 1.46243 | 1.82636 | 0.93457 | 0.16866 | 1.49461 | 2.34179 |
education employment | density | DC | 1.57018 | 1.52754 | 0.06380 | 0.02124 | 1.55729 | 1.32047 |
arts employment | % | DC | 9.11213 | 5.14143 | 0.92179 | 0.00995 | 9.11566 | 5.17936 |
arts employment | % | DFW MSA | 8.34434 | 4.73567 | 0.08629 | 0.01434 | 8.34958 | 4.82226 |
arts employment | density | DC | 1.04784 | 1.40957 | 1.01643 | 0.03869 | 1.05496 | 1.60596 |
miscellaneous employment | density | DC | 0.57024 | 0.75712 | 1.06901 | 0.07125 | 0.56892 | 0.73592 |
public admin. employment | % | DC | 2.74977 | 3.26125 | 0.43541 | 0.08752 | 2.73393 | 3.06032 |
public admin. employment | % | DFW MSA | 3.09183 | 2.73703 | 0.22569 | 0.03996 | 3.08215 | 2.56482 |
Hispanic population count | % | DFW MSA | 27.60125 | 22.07140 | 0.75887 | 0.05130 | 28.30224 | 29.05083 |
miscellaneous racial/ethnic count | % | DC | 6.67949 | 8.05657 | 0.88850 | 0.06711 | 6.81957 | 10.25678 |
miscellaneous racial/ethnic count | % | DFW MSA | 7.32560 | 7.53546 | 0.91177 | 0.04209 | 7.38822 | 8.46636 |
miscellaneous racial/ethnic count | density | DC | 1.37530 | 2.51787 | 1.01202 | 0.03669 | 1.40174 | 2.84845 |
Figure A1. Quality assessment of Table A3 specimen back-transformations. Left (a): arithmetic means scatterplot; black line denotes the linear regression trend. Middle (b): standard deviation scatterplots; black line denotes the linear regression trend, the gray lines denote 95% confidence intervals, and the red lines denote 95% prediction intervals. Right (c): |observed − back-transformed|/observed box plots.