Abstract
Machine learning (ML) is increasingly applied to geochemical baseline estimation and anomaly detection in soils and sediments, yet the methodological conditions under which ML outperforms traditional approaches, and the preprocessing and validation decisions that most strongly determine that advantage, remain incompletely characterized across environmental and mineral exploration domains. A structured systematic scoping review of 146 records from the Web of Science Core Collection applied sequential filtering to yield 78 thematically eligible studies, from which 20 were prioritized through a composite index integrating age-adjusted citation impact, platform usage, and semantic relevance. Four cross-cutting findings emerge. First, performance gains in environmental applications were driven primarily by spatial model structure rather than algorithm selection: incorporating a spatial covariate derived from geographically weighted regression raised test-set explained variance from to for cadmium mobility prediction in a geochemically heterogeneous karst setting, a gain the source study supported with a held-out test set and a Monte Carlo analysis of sensitivity to data size. Second, isometric or centered log-ratio preprocessing was applied in the majority of mineral exploration studies (three of five classical and hybrid studies and four of five deep-learning studies) but in none of the seven environmental studies, a systematic methodological gap with direct consequences for covariate importance estimates under compositional closure.
Third, Shapley additive explanations and accumulated local effects functioned as instruments of operational value, enabling element-specific anomaly threshold derivation, training sample diagnosis, and grid-cell anomaly type classification; this evidence demonstrates that the accuracy–interpretability trade-off commonly assumed in the ML literature is not fundamental in geochemical applications but contingent on algorithm selection. Fourth, 18 of the 20 synthesized studies (90%; by study-area location, 13 in China and five in Iran) were evaluated under within-domain validation designs, so the consistently high performance metrics reported should be interpreted as interpolation estimates rather than as evidence of transferable predictive capability. Geographic diversification of training datasets and spatially explicit cross-regional validation are identified as structural prerequisites for regulatory-grade applicability.