# A Case Study Perspective on Working with ProUCL and a State Environmental Agency in Determining Background Threshold Values

*Int. J. Environ. Res. Public Health* **2015**, *12*(10), 12905-12923; https://doi.org/10.3390/ijerph121012905

## Abstract


## 1. Introduction

#### 1.1. Mandated Soil Background Threshold Limits

#### 1.2. ProUCL Software

#### 1.3. Analysis and Reporting to the State Agency

## 2. Statistical Issues

#### 2.1. Outlier Detection

#### 2.2. Estimation of Distribution Parameters in the Presence of NDs

#### 2.3. Use of Nonparametric UTL Formulas

^{th}percentile of 1000 bootstrapped UPL values was taken as an approximate UTL.
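As a rough illustration of the bootstrap step described above, the following Python sketch resamples the data with replacement, computes a UPL on each resample, and returns a chosen percentile of those UPLs. The function names, the order-statistic UPL, the `coverage` default, and the `pct` argument are assumptions made for this sketch; it is not the UPL formula or percentile level used in the analysis.

```python
import math
import random


def nonparametric_upl(sample, coverage=0.95):
    """Order-statistic UPL for one future observation (illustrative choice)."""
    xs = sorted(sample)
    n = len(xs)
    # Rank of the order statistic, capped at the sample maximum.
    rank = min(n, math.ceil(coverage * (n + 1)))
    return xs[rank - 1]


def bootstrap_utl(sample, pct, n_boot=1000, coverage=0.95, seed=0):
    """Approximate a UTL as the `pct` quantile of bootstrapped UPL values."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    upls = sorted(
        nonparametric_upl([rng.choice(sample) for _ in range(n)], coverage)
        for _ in range(n_boot)
    )
    # Empirical quantile: the smallest UPL with at least pct of values at or below it.
    idx = min(n_boot - 1, max(0, math.ceil(pct * n_boot) - 1))
    return upls[idx]


# Hypothetical concentrations, for illustration only.
data = [1.2, 0.8, 2.5, 3.1, 0.9, 1.7, 2.2, 4.0, 1.1, 0.6, 2.9, 1.5]
approx_utl = bootstrap_utl(data, pct=0.95)
```

Because the order-statistic UPL can never exceed the largest value in a resample, this variant is bounded above by the sample maximum; a parametric UPL would not have that limitation.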

## 3. Discussion

#### 3.1. The Role of the State

$n_u = 5$ and $n_d = 1$, respectively. In this case, the state agency gave the site a formula for a two-sample t-test it wanted the site to use to assess whether a difference existed in the up-gradient and down-gradient population mean concentrations, but the formula completely omitted a variance term in the denominator for the down-gradient sample. Most likely this was because the agency adapted a two-sample t-statistic from an EPA guidance document but, being unable to calculate the sample variance for the down-gradient sample since $n_d = 1$, they simply dropped the variance term from the formula. This had the effect of inappropriately magnifying the t-value by a factor of approximately 2.45 (assuming the variances are reasonably similar in the two populations), thus magnifying the estimated difference between the population means relative to the estimated variability in their differences due simply to random error, and greatly increasing the likelihood of rejecting the null hypothesis of no difference in the population means.
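The factor of roughly 2.45 can be checked with a few lines of Python. This is a minimal sketch that assumes a common variance in both populations and a denominator of the form sqrt(s_u^2/n_u + s_d^2/n_d), with sample sizes of 5 (up-gradient) and 1 (down-gradient); the function names are hypothetical and the agency's exact formula is not reproduced here.

```python
import math


def t_denominator(var_u, n_u, var_d, n_d, drop_down_gradient_term=False):
    """Standard error of the difference in means (t-statistic denominator)."""
    up = var_u / n_u
    # Dropping the down-gradient term mimics the omission described in the text.
    down = 0.0 if drop_down_gradient_term else var_d / n_d
    return math.sqrt(up + down)


def inflation_factor(n_u, n_d, var=1.0):
    """Ratio of the full denominator to the truncated one, for equal variances."""
    full = t_denominator(var, n_u, var, n_d)
    truncated = t_denominator(var, n_u, var, n_d, drop_down_gradient_term=True)
    return full / truncated


print(round(inflation_factor(5, 1), 2))  # 2.45, i.e. sqrt(6)
```

Under the equal-variance assumption the common variance cancels, so the ratio depends only on the two sample sizes: sqrt((1/5 + 1/1) / (1/5)) = sqrt(6).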

#### 3.2. The Role of the EPA

- It would facilitate side-by-side comparisons of methodologies via simulations and ensure that the methodologies used in the simulations faithfully represent those in the ProUCL software.
- It would facilitate extraction of specific results from a large number of analyses. In particular, once the code was incorporated into programming environments, such as R or Python, a user could quickly extract just the outputs that are needed for a report.
- It would allow users to better understand how calculations are performed where the manuals are lacking explanation. Having the code available would allow a sophisticated user to study the implementation and determine precisely what is being done in the ProUCL implementation of a particular method.
- It would improve analysis workflow, as having the code available would allow a user to implement analysis processes that could be replicated or modified/updated with minimal effort.
- It would promote the philosophy of reproducible research, which has gained tremendous momentum in the past couple of years. This would be accomplished by incorporating the code into the many software packages where reproducible research tools are already available. It is also very difficult to describe an analysis process that was followed for a large analysis when performed via a graphical interface, and a programmatic interface is generally necessary to ensure the analysis workflow is reproducible.
- It would reduce local environmental agencies’ tendency to view ProUCL as the sole standard for comparison. Having the code available would eventually result in its implementation in other environments, and thus ultimately push local environmental agencies to more readily consider analyses performed in other software environments.
- It would allow implementation of the ProUCL methodologies on non-Windows operating systems without having to run via virtual machines. Many software packages used for analysis, such as R and Python, are implemented across most platforms.
- It would allow the statistics community to contribute code for consideration for use in ProUCL. With the code available, statisticians would be encouraged to investigate it and compare existing ProUCL methodologies to other methodologies, including new ideas for future methodologies. This would also lead to the development of more extensive guidelines as to when various methodologies should be used.
- It would lead to faster fixing of bugs and general difficulties in ProUCL and its documentation. Any major software product has bugs, and having other statisticians evaluating and using its code base would lead to quicker identification of bugs and fixes, ultimately improving its reliability.

#### 3.3. The Role of the Statistician

## 4. Conclusions

## Conflicts of Interest

## Appendix–Some Simulation Results

1. Gamma(3, 3),
2. Gamma(7, 1), and
3. Gamma(1, 0.2).

**Figure A2.** (**a**) Original Gamma(3, 3) pdf (black) for simulated data with n = 12 with an average of three NDs, and pdf’s estimated by the EM method (blue) and the GROS method (green); (**b**) Original Gamma(7, 1) pdf (black) for simulated data with n = 12 with an average of three NDs, and pdf’s estimated by the EM method (blue) and the GROS method (green); (**c**) Original Gamma(1, 0.2) pdf (black) for simulated data with n = 12 with an average of three NDs, and pdf’s estimated by the EM method (blue) and the GROS method (green).

^{th}percentile of the original gamma distribution. Table A1 shows the results of these simulations.

**Table A1.** Simulation results showing mean distances and root mean squared deviations from the true parameter values. UTL measures are from the 95th percentile of the true distribution. UTL % exceedance is the percentage of 95%–95% UTLs that exceeded the 95th percentile of the true distribution.

| Measure | Gamma(3, 3) GROS | Gamma(3, 3) EM | Gamma(7, 1) GROS | Gamma(7, 1) EM | Gamma(1, 0.2) GROS | Gamma(1, 0.2) EM |
|---|---|---|---|---|---|---|
| Shape Mean Distance | 6.384 | −0.787 | 4.626 | −0.770 | 1.765 | −0.166 |
| Shape Root-MS-Deviation | 9.500 | 2.169 | 7.942 | 4.015 | 2.478 | 0.608 |
| Rate Mean Distance | 5.225 | −0.761 | 0.614 | −0.103 | 0.299 | −0.022 |
| Rate Root-MS-Deviation | 8.572 | 2.119 | 1.114 | 0.569 | 0.522 | 0.129 |
| UTL Mean Distance | 2.451 | 1.915 | 4.059 | 5.074 | 22.89 | 14.97 |
| UTL Root-MS-Deviation | 3.245 | 2.319 | 5.274 | 6.337 | 27.06 | 18.26 |
| UTL % Exceedance | 93.88 | 96.30 | 92.73 | 95.29 | 95.49 | 95.37 |

^{th}percentile with the UTL calculation when the distribution is not symmetric. For the symmetric distribution, the EM method’s UTL accuracy and precision are slightly lower (a factor of 1.2 of the ROS values), but the EM method achieved the 95% confidence level while the ROS method fell a little short. All of the R code used to produce Figure A2a–c and to conduct the simulations presented in Table A1, including the starting random number seed, is available as a supplement to this paper on the journal’s web site.
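The summary measures reported in Table A1 can be sketched in Python as follows. This is an illustration rather than the supplementary R code: the function names are hypothetical, "distance" is assumed to mean estimate minus true value (consistent with the signed entries in the table), and the input values are invented for the example.

```python
import math


def mean_distance(estimates, truth):
    """Average signed distance of the estimates from the true value."""
    return sum(e - truth for e in estimates) / len(estimates)


def root_ms_deviation(estimates, truth):
    """Root mean squared deviation of the estimates from the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def pct_exceedance(utls, true_p95):
    """Percentage of simulated UTLs that exceed the true 95th percentile."""
    return 100.0 * sum(u > true_p95 for u in utls) / len(utls)


# Hypothetical shape estimates from four simulated data sets, true shape = 3.
shape_estimates = [2.0, 4.0, 3.5, 2.5]
print(mean_distance(shape_estimates, 3.0))                # 0.0
print(round(root_ms_deviation(shape_estimates, 3.0), 3))  # 0.791
print(pct_exceedance([1.0, 2.0, 3.0, 4.0], 2.5))          # 50.0
```

A mean distance near zero with a large root-MS-deviation indicates an estimator that is roughly unbiased but imprecise, which is why the table reports both measures.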


© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Daniel, D.L. A Case Study Perspective on Working with ProUCL and a State Environmental Agency in Determining Background Threshold Values. *Int. J. Environ. Res. Public Health* **2015**, *12*, 12905-12923.
https://doi.org/10.3390/ijerph121012905
