The External Exposome and Life Expectancy: Formaldehyde as a Leading Predictor in U.S. Counties
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This is a generally well-written paper with a focus on building XGBoost models to identify and rank the influential factors for life expectancy. The language is fluent. A major concern is that the XGBoost model only explains the statistical relationships between variables and outputs, not cause-and-effect relationships. It is good that the authors used the term “predictor” to describe these variables. Here, I suggest a minor revision for this paper.
- A weakness I have spotted is that the description of the research gap is too abrupt. The authors claim in Lines 56-57 that no prior study has integrated atmospheric, livestock, and socioeconomic data within a unified predictive framework at the county level. This claim is strong, but weakly supported and not entirely defensible. The “gap” is framed as a lack of data integration, rather than a real scientific gap.
- Additionally, the authors argue that socioeconomic variables like educational attainment and poverty, which are included in the model, partially capture behavioral variations. However, behaviors like smoking are massive, direct confounders for respiratory mortality and life expectancy. Without including county-level behavioral data, it is difficult to determine if formaldehyde's high predictive rank is partially absorbing the variance of these omitted behavioral variables.
- Lastly, the livestock density data from the Gridded Livestock of the World (GLW) dataset was only available for the discrete years 2010, 2015, and 2020. The data for the study period (2012-2019) was generated through linear interpolation, which decreases the reliability of the model results.
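To make the interpolation concern above concrete, here is a minimal sketch of how annual values for 2012–2019 could be derived from the three GLW snapshot years. The snapshot years come from the report; the density values are hypothetical, and the use of `np.interp` is an assumption for illustration, not the authors' documented procedure.

```python
import numpy as np

# Snapshot years named in the report; the densities for this single
# county are made-up numbers used only for illustration.
known_years = np.array([2010, 2015, 2020])
known_density = np.array([10.0, 20.0, 15.0])  # hypothetical head per km^2

# Study period 2012..2019 inclusive.
study_years = np.arange(2012, 2020)

# Piecewise-linear interpolation between the snapshot years.
interpolated = np.interp(study_years, known_years, known_density)

# Flag which study years coincide with an actual snapshot.
observed_mask = np.isin(study_years, known_years)

for year, value, observed in zip(study_years, interpolated, observed_mask):
    label = "observed" if observed else "interpolated"
    print(f"{year}: {value:.1f} ({label})")
```

In this toy setup only 2015 is backed by an observation; the other seven study years are inferred, so any shock occurring between snapshots would be invisible to the model.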
Author Response
Please see the attached response to Reviewer 1.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Overall Assessment
This manuscript presents an interesting and timely question: can atmospheric exposure data, integrated within an external exposome framework, identify novel predictors of county-level life expectancy beyond classical socioeconomic determinants? The study is well-motivated, the methodology is sound, and the finding that formaldehyde exposure ranks as the second-strongest predictor is both novel and biologically plausible. However, several issues must be addressed before the manuscript can be accepted. Subject to satisfactory revision, the work has the potential to make a meaningful contribution to environmental health and exposome research.
Major Comments
- The central claim that formaldehyde operates as a distinct predictor independent of socioeconomic status is asserted but not demonstrated. The SHAP analysis shows formaldehyde ranks second but does not establish that its effect is independent of poverty or education. Industrial facilities emitting formaldehyde tend to be sited in lower-income counties; the CAMS reanalysis, at 0.75° resolution, would capture regional pollution gradients that are themselves correlated with regional economic disadvantage. The authors must provide a direct test of spatial independence, for example partial dependence plots of formaldehyde SHAP values conditioned on poverty quartile, or a formal partial correlation analysis. Without this, the claim that formaldehyde is "not simply a proxy for socioeconomic disadvantage" remains unconfirmed.
- The model does not include smoking prevalence, obesity rates, or physical inactivity. All of these are established determinants of life expectancy at the county level. The authors acknowledge this limitation but argue that education and poverty partially capture behavioral variation. This is speculative. Formaldehyde exposure is also associated with industrial employment, which correlates with smoking rates. Without direct behavioral covariates, the model cannot distinguish a genuine atmospheric effect from residual confounding by lifestyle factors. Adding at least smoking prevalence to the model and testing whether formaldehyde retains its ranking is a necessary robustness check.
- The 0.75° (~80 km) resolution of the CAMS EAC4 reanalysis is discussed in the Limitations section but its implications are more serious than acknowledged. At this resolution, a single grid cell covers hundreds of counties in the eastern United States, meaning many counties share identical atmospheric values for all CAMS-derived variables including formaldehyde. This has two consequences that go beyond simple measurement uncertainty. First, it structurally suppresses within-region variation in atmospheric predictors, making any atmospheric variable that happens to align with the regional socioeconomic gradient appear more important relative to others. Second, it raises the question of whether formaldehyde's importance reflects the spatial pattern of CAMS formaldehyde specifically, or simply the fact that it correlates with a broad north–south or urban–rural gradient already captured imperfectly by the socioeconomic variables. The authors should quantify how many counties share CAMS grid cells and should compare their CAMS-derived formaldehyde values against OMI or TROPOMI satellite retrievals, which are available at 3.5×5.5 km and would allow validation of the exposure estimates.
- The fraction of time (FoT) metric is defined as the proportion of time that concentrations exceed the 75th percentile computed across all county-year observations in the dataset. This global threshold is computed from the entire 2012–2019 panel, including test set counties. This means the threshold itself encodes information from the test set, which constitutes implicit leakage. The 75th percentile threshold should be computed using training-set observations only and then applied to the test set. The authors must clarify whether this was done. If not, the analysis should be rerun with a training-only threshold.
- The full model achieves a training R² of 0.989 and a test R² of 0.854, a gap of 0.135. While the authors attribute this to the GroupShuffleSplit validation strategy, this explanation is incomplete. GroupShuffleSplit eliminates county-level leakage but does not eliminate leakage through shared national temporal trends. If annual macroeconomic or atmospheric trends dominate both training and test counties in a given year, the model may be partly interpolating across years rather than generalizing spatially. A temporal out-of-sample test (for example, training on 2012–2016 and testing on 2017–2019) is needed to assess whether the model generalizes across time as well as across counties. This is also relevant to the ablation study, where the top 10 model achieves a higher training R² (0.949) than the top 20 model (0.940) despite having fewer features, a non-monotonic result that should be explained.
- County-level life expectancy data in the US is strongly spatially clustered, as is clearly visible in Figure 1. The study does not report any test for spatial autocorrelation in model residuals (e.g. Moran's I). If residuals are spatially autocorrelated, the assumption of independence underlying the GroupShuffleSplit validation is violated, and the reported test R² and RMSE values may be misleading. The authors should test for spatial autocorrelation in residuals and, if significant, either incorporate a spatially-weighted modelling approach or at minimum report the degree of autocorrelation as a quantified limitation.
- Livestock density data were available only for 2010, 2015, and 2020, and linear interpolation was used to generate annual values for 2012–2019. No validation of this interpolation is provided. Linear interpolation between points nine years apart is a strong assumption; non-linear dynamics in livestock populations (e.g. due to disease outbreaks, market shocks, or policy changes) would not be captured. The authors should either provide evidence that linear interpolation is appropriate (for example, by comparing interpolated values for 2015 against known county-level agricultural census data) or explicitly quantify the uncertainty this introduces into their livestock predictor estimates.
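The residual autocorrelation check requested in the comments above can be illustrated with a minimal sketch of global Moran's I. It is written in plain NumPy and assumes a hypothetical chain of six "counties" with binary adjacency weights and made-up residuals; a real analysis would build the weights matrix from county geometries, and the function names here are illustrative rather than taken from any particular library.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z mean-centred."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = values - values.mean()
    s0 = weights.sum()  # sum of all spatial weights
    return (len(values) / s0) * (z @ weights @ z) / (z @ z)

def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, weights)
    permuted = np.array(
        [morans_i(rng.permutation(values), weights) for _ in range(n_perm)]
    )
    return (np.sum(permuted >= observed) + 1) / (n_perm + 1)

# Hypothetical layout: six counties in a row, each adjacent to its neighbours.
n = 6
w = np.zeros((n, n))
idx = np.arange(n - 1)
w[idx, idx + 1] = 1.0
w[idx + 1, idx] = 1.0

# Made-up residuals: one spatially clustered pattern, one alternating pattern.
clustered = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

i_clustered = morans_i(clustered, w)      # 0.6: similar residuals sit together
i_alternating = morans_i(alternating, w)  # -1.0: neighbours always differ
p_clustered = permutation_p_value(clustered, w)
```

Positive values indicate that neighbouring counties have similar residuals, which is the situation in which a random county-level split would tend to overstate out-of-sample performance.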
Minor Comments
- The caption for Figure 4 reads 'Residual diagnostics for the top 20 model,' but the surrounding text in Section 3.1 discusses the full 43-feature model's residuals and validation. It is unclear which model Figure 4 belongs to. This must be corrected and the figure caption made consistent with the text.
- The rightmost panel in Figure 8 is labeled 'Disability Rate' in the caption, but the text at lines 248–251 discusses Disability Rate as the third variable in the SHAP plot (Figure 5), not in the dependence plots of Figure 8. The Figure 8 caption elsewhere refers to the top three predictors (Education, Formaldehyde, Poverty Rate). One of these is incorrect: either the figure itself or the caption label. This must be reconciled.
- Table 3 shows that the top 10 model has a higher training R² (0.949) than the top 20 model (0.940), which is counter-intuitive since more features should allow at least equal training fit. This is presumably an artifact of Bayesian re-optimization of hyperparameters for each feature set, but the authors do not discuss it. A brief explanation is required.
- SHAP importance rankings are reported as point estimates without any uncertainty quantification. Feature importance rankings derived from a single train/test split can be sensitive to which counties are included in training. The authors should report bootstrap confidence intervals on SHAP values across cross-validation folds, or at minimum acknowledge that the rankings are split-dependent.
- Bayesian optimization was performed with only 30 iterations over an 8-dimensional hyperparameter space (lines 175–176). For a space of this dimensionality, 30 iterations is generally considered insufficient for reliable convergence. The authors should either increase the number of iterations and re-run or provide a sensitivity analysis showing that the optimal hyperparameters are stable.
- Mean absolute error (MAE) is reported in the ablation Table 3 but is not reported in Section 3.1 for the full model. Either report MAE consistently across all results tables or remove it from the ablation table and explain why it was omitted from the main results.
- CAMS EAC4 data are provided at 0.75° × 0.75° resolution and ERA5 humidity at 0.25° × 0.25°. The manuscript does not describe how these two grids were reconciled before spatial interpolation to county boundaries. Was ERA5 resampled to 0.75°, or were both grids interpolated independently to county centroids? The procedure should be made explicit.
- The body text (lines 131–134) defines FoT variables as the proportion of time during which concentrations exceeded the 75th percentile 'across all observations,' while the Appendix note (lines 412–416) defines it as exceeding the 75th percentile 'computed from the entire 2012–2019 dataset.' These are subtly different: the former implies a single global percentile; the latter could be interpreted as a percentile computed per county. The definition should be stated precisely and consistently in both locations.
- Line 350: 'strategies that targets' → 'strategies that target'. The notation R² is used inconsistently, sometimes written as R² and elsewhere as R2.
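The ambiguity in the FoT definition noted above can be shown with a small sketch that computes the metric under both readings. The concentration values are hypothetical, and `np.percentile` with its default linear interpolation is an assumption, since the manuscript's exact percentile method is not stated in this report.

```python
import numpy as np

# Hypothetical concentrations: rows are counties, columns are time steps.
conc = np.array([
    [1.0, 2.0, 3.0, 4.0],      # a low-concentration county
    [10.0, 20.0, 30.0, 40.0],  # a high-concentration county
])

# Reading 1: one global threshold pooled over all counties and time steps.
global_threshold = np.percentile(conc, 75)
fot_global = (conc > global_threshold).mean(axis=1)

# Reading 2: a separate 75th-percentile threshold for each county.
county_threshold = np.percentile(conc, 75, axis=1, keepdims=True)
fot_per_county = (conc > county_threshold).mean(axis=1)

print("global threshold:", global_threshold)  # 22.5
print("FoT, global:", fot_global)             # [0.  0.5]
print("FoT, per county:", fot_per_county)     # [0.25 0.25]
```

Under the global reading, FoT separates the two counties; under the per-county reading, it is identical for both by construction, so the two definitions would yield very different predictors.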
Author Response
Please see the attached response to Reviewer 2.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This is a very interesting manuscript with some surprising and almost certainly important results. The design is excellent, using four major data sources to examine a machine learning framework for life expectancy. The modeling looks fine, although this is not my area of expertise.
Most of my concerns are minor ones that need explanation. Methane is not a toxic gas unless it displaces oxygen. In the IHME dataset you say "for individuals under 1 year of age". Do you mean all live births? Including peri-natal mortality here is problematic. In the CAMS dataset why was benzene not included, especially since toluene was? I would expect benzene to rival adverse effects of formaldehyde.
The formaldehyde results are striking and unexpected. The cancer evidence is strong only for relatively infrequent cancers, and the evidence for non-cancer effects has been weak. Formaldehyde, while produced from many sources, is not known to persist long in air. The data here are strong, but the reasons why the result is unexpected should be better discussed. However, there are some things in Figure 5 that are concerning and raise some questions regarding the overall results. Why is being Hispanic so positive? Why do the pig and cattle density improve life expectancy when horse density is negative? Why is low vegetation more positive than high vegetation? Most of the other findings are consistent with expected results but these are not. And in Figure 7, are the results for importance independent of plus or minus, and what do the colors signify? These are all important questions that at minimum necessitate some discussion.
This is a great manuscript, but addressing some of the other unexpected findings, beyond just the very important formaldehyde results, will improve it.
Author Response
Please see the attached response to Reviewer 3.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I want to congratulate the authors for their effort. They have responded thoroughly to the major methodological concerns raised in my review. The revised manuscript adds smoking prevalence, poverty-stratified SHAP analysis, partial correlation analysis, temporal validation, Moran’s I residual testing, SHAP ranking stability checks, and livestock sensitivity analysis. These additions substantially strengthen the manuscript and make the formaldehyde finding more credible as a predictive association. The authors also now acknowledge the main remaining limitations, especially the observational design, ecological fallacy risk, residual confounding, coarse CAMS exposure resolution, and the need for future validation against TROPOMI, OMI, and ground-based measurements. My only remaining recommendation is that the manuscript should continue to avoid causal language. The evidence supports formaldehyde as a robust and biologically plausible predictor, not yet as a proven causal determinant of county-level life expectancy. With that framing maintained, the revision is satisfactory.
Author Response
Please see the attached response to Reviewer 2.
Author Response File:
Author Response.pdf

