Stratified Soil Sampling Improves Predictions of P Concentration in Surface Runoff and Tile Discharge

Abstract: Phosphorus (P) stratification in agricultural soils has been proposed to increase the risk of P loss to surface waters. Stratified soil sampling that assesses soil test P (STP) in a shallow soil horizon may improve predictions of P concentrations in surface and subsurface discharge compared to single depth agronomic soil sampling. However, the utility of stratified sampling efforts for enhancing understanding of environmental P losses remains uncertain. In this study, we examined the potential benefit of integrating stratified sampling into existing agronomic soil testing efforts for predicting P concentrations in discharge from 39 crop fields in NW Ohio, USA. Edge-of-field (EoF) dissolved reactive P (DRP) and total P (TP) flow-weighted mean concentrations in surface runoff and tile drainage were positively related to soil test P (STP) measured in both the agronomic sampling depth (0–20 cm) and shallow sampling depth (0–5 cm). Tile and surface DRP and TP were more closely related to shallow depth STP than agronomic STP, as indicated by regression models with greater coefficients of determination (R²) and lesser root-mean square errors (RMSE). A multiple regression model including the agronomic STP and P stratification ratio (P strat) provided the best model fit for DRP in surface runoff and tile drainage and TP in tile drainage. Additionally, STP often varied significantly between soil sampling events at individual sites and these differences were only partially explained by management practices, highlighting the challenge of assessing STP at the field scale. Overall, the linkages between shallow STP and P transport persisted over time across agricultural fields and incorporating stratified soil sampling approaches showed potential for improving predictions of P concentrations in surface runoff and tile drainage.


Introduction
Phosphorus (P) losses from croplands are a major driver of hypoxia and harmful and nuisance algal blooms in waterbodies worldwide [1,2]. Dissolved reactive P (DRP) is the P fraction that primarily drives these algal blooms as it is readily bioavailable, though other fractions of total P (TP) also contribute bioavailable P and so play a secondary role [3]. Phosphorus lost from croplands can be directly derived from recent fertilizers (i.e., incidental P) [4] or from P stored in soil pools (i.e., legacy soil P) [5], but recent research suggests legacy soil P is the dominant P source in the Western Lake Erie Basin (WLEB) [6]. Previous studies have shown that DRP and TP are readily transported via both the artificial subsurface tile drainage network (i.e., tile drainage) and overland surface runoff in the WLEB [7][8][9]. Thus, to improve predictions of environmental P losses it is necessary to further our understanding of how soil P pools relate to surface and subsurface edge-of-field P losses.

Soil Syst. 2020, 4, 67

Soil P is routinely measured in agricultural systems as a component of agronomic soil testing to estimate crop fertility requirements [10]. Agronomic soil testing was designed to estimate potential crop nutrient demand and yield response to fertilizer applications, so sampling depths typically target the crop's primary rooting zone. For example, in Ohio, Michigan, and Indiana, USA, the 0-20 cm depth is the typical agronomic soil testing depth and is the basis of university extension fertilizer recommendations [11]. Results of these tests are also useful for predicting the potential risk of P loss; P risk assessment tools typically use soil test P (STP) as a risk prediction factor [12][13][14], and previous studies have reported positive relationships between agronomic STP and P concentrations in surface runoff and tile drainage [15][16][17][18][19][20].
However, agronomic testing programs were not designed to monitor potential water quality impacts of soil P, so modifications to these agronomic sampling approaches hold the potential to improve soil testing for environmental purposes.
Depth of sampling is one component of agronomic soil testing that could be adapted to better understand environmental P losses. Soil P can be highly stratified with greater P concentrations in the surface layer, particularly in no-till or reduced tillage systems [21][22][23]. In the WLEB, stratification of soil P in croplands was shown to be prevalent across the Sandusky River watershed, and was hypothesized to be a contributing factor to increasing DRP concentration trends in surface waters [18]. The shallow surface soil horizon is the dominant zone of interaction between soil P and surface runoff water and thus has a disproportionate influence on surface runoff P concentrations [24][25][26][27]. In contrast, water discharged via subsurface tile drains passes through the soil matrix, thereby expanding the zone of interaction and increasing opportunities for P sorption or assimilation. However, the interaction between water and the soil matrix decreases where preferential macropore flow dominates, particularly in medium- and fine-textured soils, resulting in elevated P concentrations in tile discharge [28][29][30][31]. In these situations, tile drainage water chemistry may largely reflect the surface soil characteristics. Stratified soil sampling that quantifies soil P in the shallow zone of interaction (0-5 cm) in addition to the agronomic depth (0-20 cm) has the potential to better predict environmental P loss in surface runoff and tile drainage as compared to traditional soil sampling approaches that only quantify soil P in the agronomic depth.
The objective of this study was to determine whether a stratified soil sampling regime could explain more variability in environmental P losses than traditional agronomic depth samples. Stratified soil sampling was conducted on 39 fields distributed throughout NW Ohio, USA, that were instrumented with edge-of-field (EoF) water quality monitoring, which enabled examination of relationships between STP and EoF P losses across a broad range of soil characteristics and management regimes. The hypothesis tested was that STP from shallow (0-5 cm) soil samples will better predict DRP and TP concentrations in both surface runoff and tile drainage compared to STP from agronomic (0-20 cm) soil samples. In addition, repeated soil sampling in individual fields enabled assessments of changes in STP and P stratification ratio (P strat; 0-5 cm STP divided by 0-20 cm STP) over the course of a 3-year study period.

Experimental Sites
Edge-of-field water quality monitoring was used to assess relationships between STP and surface runoff and tile drainage P concentrations across a network of 39 crop fields in the NW quadrant of Ohio underlain with artificial subsurface tile drainage systems. Field locations and management were previously described in detail [32,33]. Soils were medium to fine textured, with drainage classification of somewhat poorly drained to very poorly drained. Fields were nearly level to gently sloping (average slope range 0.4-5.1%; mean 1.6%), and were generally representative of regional topography, soils, and management practices (i.e., nutrient management, tillage, and subsurface drainage). General soil characteristics (textural class, slope, pH, soil organic matter) of the fields in this study are presented in Table S1. Field management was performed by farmers and followed common practices in the region. Soybean (Glycine max (L.) Merr.)-corn (Zea mays L.) rotations were the primary cropping system, but winter wheat (Triticum aestivum L.), alfalfa (Medicago sativa L.), and winter cover crops (various species) were also included in a subset of fields. Tillage practices ranged from multiple tillage passes each year to long-term no tillage. Soil fertility management typically consisted of springtime N fertilizer applications to corn and wheat and fall broadcast applications of P and K fertilizers once per crop rotation. Starter fertilizers containing N, P, and K were commonly applied at crop planting. Additionally, several fields received manure applications within 2 years of this study.

Runoff Phosphorus Concentrations
Surface and subsurface water quality was monitored from outlets at the edge of fields; sampling approach and instrumentation were described in depth by Williams et al. 2016 [32]. Contributing drainage areas for surface and subsurface flow ranged from 1.1 to 18.5 ha (mean = 8.1 ± 4.6 ha). Water quality data were collected via automated measurement of flow rate at 10-min intervals and automated collection of water samples on both time- and flow-weighted bases. Tile drain flow was monitored with compound weirs (Thel-Mar, Brevard NC, USA) and bubbler modules (Teledyne ISCO, Lincoln NE, USA), while surface runoff was measured with 0.6 m H-flumes (Tracom, Alpharetta GA, USA) and bubbler modules. Water samples were collected with Teledyne ISCO (Lincoln, NE, USA) automated samplers. Event-based water samples were automatically collected on a flow-weighted basis from surface flumes for the entire period of record. For tile drainage prior to 2015, samples were automatically collected on a time-interval basis (aliquots collected every 6 h and composited for a 24 h period). Starting in 2015, the time-interval samples were supplemented with additional event-based samples collected over the rise and fall of the hydrograph to better represent the discharge events. Event sampling was triggered by increased discharge, with a 200 mL aliquot taken for each 1 mm of volumetric depth. Ten aliquots were combined into a single 2 L sample. Event sampling ceased when flow ceased (surface runoff) or when flow declined to baseline levels (tile drainage). Water samples were retrieved from the field at least weekly and were stored at 4 °C until laboratory analysis.
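The flow-paced event sampling described above can be illustrated with a minimal sketch. Only the pacing (one 200 mL aliquot per 1 mm of discharge depth) and the ten-aliquot, 2 L composite come from the text; the function name, its arguments, and the example discharge series are hypothetical and are not the samplers' actual control logic.

```python
def flow_paced_sampling(depth_increments_mm, pace_mm=1.0,
                        aliquot_ml=200, aliquots_per_sample=10):
    """Count aliquots triggered by cumulative discharge depth.

    depth_increments_mm: discharge depth (mm) per logging interval
    (hypothetical input). Returns a tuple of
    (aliquots taken, full composite samples, composite volume in mL).
    """
    cumulative_mm = 0.0
    n_aliquots = 0
    for depth in depth_increments_mm:
        cumulative_mm += depth
        # Take one aliquot each time another `pace_mm` of depth has passed.
        while cumulative_mm >= pace_mm * (n_aliquots + 1):
            n_aliquots += 1
    full_samples = n_aliquots // aliquots_per_sample
    sample_volume_ml = aliquot_ml * aliquots_per_sample
    return n_aliquots, full_samples, sample_volume_ml


# Hypothetical event: 24 logging intervals of 0.5 mm each (12 mm in total).
print(flow_paced_sampling([0.5] * 24))
```

In this hypothetical event, 12 mm of discharge triggers 12 aliquots, which fill one complete 2 L composite and leave two aliquots toward the next.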
Water samples were analyzed for both DRP and TP concentrations. Briefly, samples were split into two aliquots, with one filtered at 0.45 µm for DRP analysis, and an unfiltered sample used for TP analysis. The unfiltered sample underwent alkaline-persulfate digestion prior to TP analysis [34]. The filtered (DRP) and digested (TP) samples were analyzed for orthophosphate concentration using a flow injection analyzer (Lachat Instruments, Loveland, CO, USA) via the ascorbic acid reduction method. The resulting discrete P concentration data were converted into 10 min P concentrations by linear interpolation. The 10 min constructed P concentration values were then multiplied by the corresponding 10 min flow values to calculate 10 min P loads [35]. Resulting load data were summed into daily cumulative P loads. Daily estimates of flow and P load were summed into total flow and P load for the relevant period of each soil sampling event (described below). The total P load over the period of a given soil sampling event was then divided by total flow from that period to calculate the flow-weighted mean DRP (FWM DRP) and TP concentrations (FWM TP) associated with the soil sampling event.
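The load and flow-weighted mean calculation described above can be sketched as follows. This is an illustrative outline rather than the authors' processing code: the function names, the units (minutes, litres, mg L⁻¹), the example values, and the choice to hold the first and last measured concentrations constant outside the sampled time range are all assumptions.

```python
from bisect import bisect_right


def interp_conc(t, sample_times, sample_concs):
    """Linearly interpolate concentration at time t from discrete samples.

    Outside the sampled range the nearest measured value is used
    (an assumption; the source does not state how end points were handled).
    """
    if t <= sample_times[0]:
        return sample_concs[0]
    if t >= sample_times[-1]:
        return sample_concs[-1]
    i = bisect_right(sample_times, t)  # sample_times[i-1] <= t < sample_times[i]
    t0, t1 = sample_times[i - 1], sample_times[i]
    c0, c1 = sample_concs[i - 1], sample_concs[i]
    return c0 + (c1 - c0) * (t - t0) / (t1 - t0)


def flow_weighted_mean(flow_times, flow_volumes_l, sample_times, sample_concs):
    """Flow-weighted mean concentration = total load / total flow."""
    total_load_mg = 0.0
    total_flow_l = 0.0
    for t, volume in zip(flow_times, flow_volumes_l):
        conc = interp_conc(t, sample_times, sample_concs)  # mg/L
        total_load_mg += conc * volume                     # interval load, mg
        total_flow_l += volume
    if total_flow_l == 0:
        raise ValueError("no flow recorded in the period")
    return total_load_mg / total_flow_l


# Hypothetical 10-min record: times in minutes, volumes in litres.
times = [0, 10, 20, 30, 40]
volumes = [100, 300, 400, 150, 50]
fwm = flow_weighted_mean(times, volumes, [0, 40], [0.10, 0.30])
print(round(fwm, 4))
```

Weighting by flow means that intervals with high discharge dominate the result, which is why the flow-weighted mean can differ considerably from a simple average of the sampled concentrations.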

Soil Test Phosphorus
Stratified soil samples were collected from contributing field areas of each monitored outlet between December 2014 and December 2017. The frequency of soil sampling events within a given field depended on crop rotation, establishment of EoF water quality monitoring instrumentation, and resource availability; 14 EoF sites were sampled on three occasions, 24 sites were sampled twice, and two were sampled once. The surface runoff dataset included a total of 52 individual soil sampling events, while 86 soil sampling events were included in the tile drainage dataset. Each soil sampling event consisted of taking three to nine samples at discrete locations within a given field. Sampling locations were selected based on USDA-NRCS soil maps and local topography to ensure the sampling captured the variability in soils across the areas contributing to discharge at the field outlets. On average, one sample was collected for each 1.5 ha of contributing field area. Individual soil sampling locations were somewhat consistent from year to year, but limited precision of GPS coordinates meant subsequent samples were likely >10 m apart. Samples collected from 2014 to 2016 were taken with hand-held push probes (2 cm diameter) and the 2017 samples were collected using a hydraulic soil probe (5 cm diameter). At each location, five individual cores, distributed within ~2 m of a central point, were collected, split into 0-5 cm and 5-20 cm depths, and combined into one sample. Soil samples were air dried, ground, and analyzed for STP with Mehlich-3 extractant by the Ohio State University Service Testing and Research laboratory (Wooster, OH, USA). A simple average of the discrete STP data was used to estimate the field average STP concentration values, and within field variation in STP was characterized with the coefficient of variation (CV).
The field average STP data were used to calculate the P strat for each field according to Equation (1):

$$P_{\mathrm{strat}} = \frac{\mathrm{STP}_{0\text{-}5\ \mathrm{cm}}}{\mathrm{STP}_{0\text{-}20\ \mathrm{cm}}} \tag{1}$$

Management information was used to assign a range of dates for each soil sampling event for which the STP measurements were considered most relevant for predicting EoF runoff P concentrations. The period of relevance for a given soil sampling event could extend up to one year before and after the sampling date, which would correspond to soil sampling occurring once in a 2-year crop rotation. This period was truncated to less than 2 years if the field was subjected to either a tillage operation or a P fertilizer or manure application, as these operations were expected to alter the STP concentrations and P strat. Additionally, if a P fertilizer or manure application occurred prior to the soil sampling event, the first 2 weeks following the application were also excluded from the date range to restrict the influence of short-lived direct P fertilizer losses [4]. Thus, no P applications or tillage operations occurred during the period of relevance for the soil sampling events. Resulting lengths of the date ranges were from 159 to 673 days, with an average length of 384 days.
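As a minimal sketch of the field-level summaries described in this section, the snippet below computes the field average STP, the within-field CV, and P strat (field average 0-5 cm STP divided by field average 0-20 cm STP). The function name and the sample values are hypothetical, the 0-20 cm values are taken as given rather than derived from the split cores, and the use of the sample (n − 1) standard deviation for the CV is an assumption because the source does not specify it.

```python
from statistics import mean, stdev


def field_summary(shallow_stp, agronomic_stp):
    """Summarize one soil sampling event for one field.

    shallow_stp:   0-5 cm STP at each sampling location (mg P/kg)
    agronomic_stp: 0-20 cm STP at the same locations (mg P/kg)
    """
    mean_shallow = mean(shallow_stp)
    mean_agronomic = mean(agronomic_stp)
    return {
        "mean_shallow": mean_shallow,
        "mean_agronomic": mean_agronomic,
        # CV in percent, using the sample standard deviation (assumption).
        "cv_shallow": 100 * stdev(shallow_stp) / mean_shallow,
        "cv_agronomic": 100 * stdev(agronomic_stp) / mean_agronomic,
        # Stratification ratio computed from the field averages.
        "p_strat": mean_shallow / mean_agronomic,
    }


# Hypothetical field with three sampling locations.
summary = field_summary([60, 80, 70], [35, 45, 40])
print(summary["p_strat"], round(summary["cv_shallow"], 1))
```

A P strat above 1 indicates that STP is concentrated in the top 5 cm relative to the full agronomic depth.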

Statistical Analysis
Relationships between STP and FWM P concentrations were examined with ordinary least squares regressions. Prior to regression analysis, the FWM DRP and TP concentrations were natural log transformed to comply with normality assumptions. The goodness of fit of the resulting regressions was assessed using the coefficient of determination (R²) and root mean square error (RMSE; the standard deviation of the model residuals). The field average STP and the maximum STP value in each field were both tested as predictors of FWM DRP and TP, and regression model fits were compared. Residuals from the field average agronomic STP-FWM P concentration regressions were extracted and subsequently correlated with P strat, with Pearson's r used to assess correlation strength. Multiple linear regressions were constructed in a stepwise manner from two predictor variables: agronomic STP and P strat. These predictor variables were checked for multicollinearity. The improvement in model fit provided by the addition of P strat to the agronomic STP model was further examined by extracting residuals from the simple regressions of agronomic STP vs. FWM P concentration, as well as from the multiple linear regression. For a given observation, the magnitudes of the two model residuals were compared and the difference between residuals was analyzed for simple linear relationships to soil textural class, field average slope, agronomic STP, and average daily discharge.
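The simple regression and residual correlation steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the SAS or Sigmaplot analysis used in the study: the function names and all data values are hypothetical, and RMSE is computed as sqrt(SSE / (n − 2)), which is one common definition.

```python
import math


def fit_log_linear(stp, fwm_conc):
    """OLS fit of ln(FWM concentration) against STP.

    Returns slope, intercept, R^2, RMSE and the residuals (ln scale).
    """
    y = [math.log(c) for c in fwm_conc]
    n = len(stp)
    mean_x, mean_y = sum(stp) / n, sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in stp)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(stp, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [v - (intercept + slope * x) for x, v in zip(stp, y)]
    sse = sum(r * r for r in residuals)
    sst = sum((v - mean_y) ** 2 for v in y)
    r2 = 1 - sse / sst
    rmse = math.sqrt(sse / (n - 2))  # assumed degrees-of-freedom correction
    return slope, intercept, r2, rmse, residuals


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical data for five sampling events.
agronomic_stp = [20, 35, 50, 80, 120]        # mg P/kg
fwm_drp = [0.03, 0.05, 0.06, 0.12, 0.20]     # mg DRP/L
p_strat = [1.3, 2.1, 1.6, 2.4, 1.9]

slope, intercept, r2, rmse, resid = fit_log_linear(agronomic_stp, fwm_drp)
print(slope, r2, rmse)
print(pearson_r(resid, p_strat))  # do the residuals track stratification?
```

Because the response is natural log transformed, a fitted slope b implies that raising STP by Δ multiplies the predicted concentration by exp(b·Δ) rather than adding a fixed amount.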
In fields with more than one soil sampling event, the influence of management practices on changes in STP and P strat within the fields was examined. The effects of management factors on changes in STP and P strat (e.g., STP in soil sampling period 1 minus STP in soil sampling period 2) were tested with t-tests for three class variables (+/− P fertilizer, +/− manure, +/− tillage) and with ordinary least squares regression for the rate of P applied (only for fields that received manure or P fertilizer). Analyses were conducted in SAS v. 9.4 (SAS Institute, Cary NC, USA) and Sigmaplot (Systat Software, San Jose, CA, USA).
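A two-group comparison of STP changes, such as fields with versus without a given practice, can be sketched with a pooled-variance t statistic. The source does not state whether pooled or unequal-variance t-tests were used, so the pooled form here is an assumption; the function name and the data are hypothetical, and the p-value is omitted because it requires the t distribution.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Two-sample t statistic assuming equal variances.

    Returns (t, degrees of freedom); t is negative when the mean of
    group_a is smaller than the mean of group_b.
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / se, df


# Hypothetical changes in STP (mg P/kg) between two sampling events.
change_without_manure = [-2.0, 1.5, 0.5, 3.0, -1.0]
change_with_manure = [9.0, 14.5, 11.0, 16.0]
t_stat, df = pooled_t(change_without_manure, change_with_manure)
print(round(t_stat, 2), df)
```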

Soil Test P
Shallow depth (0-5 cm) STP averaged 61 mg P kg⁻¹ across the fields and agronomic depth (0-20 cm) STP averaged 40 mg P kg⁻¹ for the tile drainage dataset (Table 1). Soil test P of individual fields ranged from 19-202 and 12-150 mg P kg⁻¹ for the shallow and agronomic depths, respectively. Variability of STP within fields was high, with an average CV of 32-39% across all the sampling events and both depths, while 14% of the sampling events surpassed a CV of 50% (data not shown). The field average P strat was 1.88, with large variation among the studied fields (range 1.18-3.35).

Table 1. Soil test P (STP) concentrations and P stratification ratios (P strat) across soil sampling events at 39 fields, and edge-of-field discharge and dissolved reactive P (DRP) and total P (TP) flow weighted mean (FWM) concentrations during the relevant sampling windows.

Soil test P and P strat within individual fields showed a high degree of variability between subsequent soil sampling events (Table 2). The agronomic STP in a given field increased or decreased between subsequent soil sampling events by >15 mg P kg⁻¹ in 12% of cases (average STP change: 9.4 ± 1.1 mg P kg⁻¹). Shallow STP changed by >15 mg P kg⁻¹ in 23% of cases (average STP change: 13.6 ± 2.1 mg P kg⁻¹). Averaged across all fields and sampling events, the net change in STP was an increase of 1.8 ± 1.7 and 4.3 ± 3.1 mg P kg⁻¹ for agronomic and shallow depths, respectively. Furthermore, in many fields the within field variability in STP demonstrated large changes between subsequent soil sampling events; for example, the CV for agronomic STP differed by >20% in 15 of the 48 soil sampling event comparisons (average CV change: 15.1 ± 2.1% for agronomic depth, 13.6 ± 1.8% for shallow depth). Similarly, the P strat often demonstrated significant within field changes, with 21% of cases changing by >0.50 between soil sampling events (average change: 0.43 ± 0.07).
Management practices occurring between soil sampling events did not consistently explain the observed changes in STP or P strat (Table S2). Manure application between soil sampling events was the only management factor that significantly influenced STP changes, where shallow and agronomic STP were 13.3 and 10.6 mg kg⁻¹ greater, respectively, in fields with manure compared to those without (t-test; t-statistics −2.3 and −3.4, p = 0.029 and p = 0.0014 for shallow and agronomic depths, respectively). However, neither chemical P fertilizer application (t-test; t-statistics 0.41 and 1.1, p = 0.68 and p = 0.26 for shallow and agronomic depths, respectively) nor the amount of P applied between soil sampling events (regression; t-statistics 1.04 and 1.04, p = 0.3 and p = 0.3 for shallow and agronomic depths, respectively) influenced changes in STP. Similarly, changes in STP were not different in fields that were tilled between sampling events compared to fields that did not undergo a tillage operation (t-test; t-statistics −0.35 and 0.01, p = 0.7 and p = 0.9 for shallow and agronomic depths, respectively). Additionally, changes in P strat were not related to form or amount of P applied or tillage (t-tests and regression, p-values > 0.52 for all).

Surface Runoff and Tile Drainage Phosphorus Concentrations
The average FWM DRP concentration in surface runoff across all soil sampling windows was 0.19 ± 0.02 mg DRP L⁻¹, and the FWM TP concentration averaged 0.65 ± 0.05 mg TP L⁻¹ (Table 1). Phosphorus concentrations were lower in tile drainage and averaged 0.066 ± 0.008 mg DRP L⁻¹ and 0.28 ± 0.02 mg TP L⁻¹. There was a high degree of variability in FWM P concentrations between the sampling windows; for example, DRP concentrations ranged from 0.02-0.66 mg DRP L⁻¹ for surface runoff and 0.01-0.27 mg DRP L⁻¹ for tile drainage.

Relationships between STP and FWM P Concentrations
Positive relationships were observed between STP in both sampling depths and the FWM DRP concentrations in both surface runoff and tile drainage (Figure 1). For surface runoff DRP, similar regression slopes (0.015 vs. 0.016) were observed from relationships with shallow vs. agronomic STP, indicating that increases in agronomic or shallow STP resulted in similar increases in FWM DRP concentration (Table 3). For tile drainage DRP, regression slopes were also similar for relationships using both soil sampling depths (0.014 vs. 0.016). Greater variation in tile drainage DRP was explained with shallow STP (R² of 0.44 vs. 0.32) and predictions likewise improved (RMSE of 0.56 vs. 0.62) (Table 3). Comparing R² and RMSE values between surface and tile DRP models demonstrated that STP, measured from shallow and agronomic depths, was a stronger predictor of tile drainage DRP than surface DRP.

Flow-weighted mean TP concentrations in surface runoff and tile drainage were also positively related to STP in both sampling depths (Figure 2). However, the R² of the regression models indicated that less variation in TP concentrations was explained by shallow and agronomic STP compared to DRP (Table 3). As observed with DRP, the slopes of both the surface runoff and tile drainage TP regressions were similar for agronomic and shallow depth samples. More variation in surface runoff TP was explained with shallow STP compared to agronomic STP (R² 0.26 and 0.21, respectively), but prediction accuracy was similar (RMSE 0.44 and 0.45, respectively). The tile drainage TP showed similar patterns as surface runoff, with greater variation explained by shallow STP compared to agronomic STP (R² 0.19 and 0.11) and similar prediction accuracy (RMSE of 0.50 and 0.53). In general, differences between the predictive power of agronomic and shallow depth samples were lesser for TP than those observed for DRP.

Additionally, the maximum agronomic and shallow STP values in each field were also tested as predictors of EoF P concentrations (Table S3). Maximum STP at both sampling depths was positively and significantly related to DRP and TP concentrations in surface runoff and tile drainage (regressions; R² range 0.07 to 0.38, p < 0.05 for all). However, the field average shallow and agronomic STP models explained more variation and improved predictions of EoF P concentration data in all cases.

Residuals of the regression of DRP concentration against agronomic STP were positively correlated with P strat for both surface runoff (r = 0.34, p = 0.01) and tile drainage (r = 0.36, p < 0.001; Figure 3). A significant positive correlation between residual TP and P strat was also observed in tile drainage (r = 0.28, p < 0.01) but not surface runoff (Figure 4). The positive correlations indicate that fields with greater P strat tended to also have more positive residuals, i.e., runoff P concentrations were prone to underprediction by the agronomic STP. Likewise, runoff P concentrations in fields with lesser P strat were prone to overprediction by agronomic STP.

Phosphorus stratification ratio was a significant factor when added to the regressions of agronomic STP vs. EoF P concentrations for surface DRP, tile DRP, and tile TP concentrations (Table 4). Interactions between STP and P strat were not significant and so were not included in the final model. However, P strat was not significant when added to the surface TP model. Model fit and explanatory power of the two factor models were similar to or slightly better than those of the single factor shallow sample models (Table 3). In addition, comparison of residuals from the agronomic STP models to residuals from the agronomic STP + P strat models showed that the improvement in fit provided by addition of P strat was not related to soil texture class, field slope, STP, or average daily discharge for either DRP or TP (ANOVA and regressions; p > 0.05 for all; data not shown).

Table 4. Multiple linear regression results predicting edge-of-field FWM P concentrations with agronomic soil test P (STP) and P stratification ratio (P strat). Regressions were performed on natural log transformed FWM P concentrations.
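A two-predictor model of the form ln(FWM P) = b0 + b1·(agronomic STP) + b2·(P strat), without an interaction term, can be fitted with the sketch below. It is an illustration only: the function name, the data, and the coefficients used to generate the example are hypothetical and are not the values reported in Table 4, and RMSE is computed as sqrt(SSE / (n − 3)) by assumption.

```python
import math


def fit_two_predictor(x1, x2, y):
    """OLS for y = b0 + b1*x1 + b2*x2 via centered normal equations.

    Returns (b0, b1, b2, R^2, RMSE).
    """
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (v - my) for a, v in zip(x1, y))
    s2y = sum((b - m2) * (v - my) for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((v - (b0 + b1 * a + b2 * b)) ** 2
              for a, b, v in zip(x1, x2, y))
    sst = sum((v - my) ** 2 for v in y)
    r2 = 1 - sse / sst
    rmse = math.sqrt(sse / (n - 3))  # assumed degrees-of-freedom correction
    return b0, b1, b2, r2, rmse


# Hypothetical sampling events; the response is ln(FWM DRP).
stp = [20, 40, 60, 80, 100]
strat = [1.2, 2.0, 1.5, 3.0, 2.2]
ln_drp = [-3.9, -2.9, -3.0, -1.8, -2.1]
b0, b1, b2, r2, rmse = fit_two_predictor(stp, strat, ln_drp)
print(b1, b2, r2, rmse)
```

A positive b2 in such a fit would mean that, at a given agronomic STP, more strongly stratified fields are predicted to have higher concentrations, which is the pattern the residual correlations above point to.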

Soil Test P
Soil test P values and P strat of individual fields were highly dynamic over the 3 years of this study, highlighting the importance of frequent soil sampling. A large portion of the observed STP changes between sampling events was presumably due to inherent within-field spatial variability in STP. This study used field average STPs derived from multiple samples that were taken at a sampling intensity (average of 1 sample per 1.5 ha) similar to that of commercial agronomic soil sampling in the WLEB. Other studies have shown agronomic STP frequently varies dramatically at the scale of tens of meters [36][37][38], so more intensive sampling may be required to improve the accuracy of calculated field average STP. This study was not designed to quantify the relationship between sampling intensity and STP variability, so further research will be required to quantify the soil sampling intensity needed to achieve acceptable precision of field average STP values.
Changes in STP and P strat can also be caused by management activities such as tillage and P fertilizer management (e.g., P rate, fertilizer placement) [21,39], and soil sampling schemes should take into account these management activities. However, in this study we did not see a large influence of management practices on changes in STP or P strat. For example, the amount of P applied between two soil sampling events, which ranged from 0-119 kg P ha⁻¹ (Table S2), was not associated with statistically significant changes in STP, indicating that recent (i.e., within the past 3 years) P application rate was not a primary driver of STP over that period. Relatively large additions of P fertilizer are required to substantially increase STP. For instance, annual P applications of 44 kg ha⁻¹ increased STP by only 2.5 mg kg⁻¹ yr⁻¹ at a location in Iowa [40] and, in Ohio, P fertilizer rates double the crop removal rate did not substantially increase STP after 9 years at three locations [41]. It is likely that within field variability in STP was much greater than changes induced by P fertilizer applications and consequently overwhelmed observation of these changes. In contrast, manure application was associated with increases in STP in subsequent soil sampling events, likely due in part to the relatively large amounts of P (average 44 kg P ha⁻¹) added to the manured fields. Additionally, compared to chemical P fertilizer, manure may maintain a greater proportion of P in labile forms that are extracted by the Mehlich-3 extractant over a period of at least several months after application [42]. Manure application has been previously identified as a major factor driving P losses in runoff and tile drainage [43,44]. Interestingly, an earlier analysis of EoF water quality from the same fields used in this study found that P losses were significantly greater in fields that received manure applications [45].
These results highlight the importance of manure management for addressing agricultural P losses, and affirm that the connections between manure application, STP, and P losses should be a priority for future research.
Tillage operations prior to a soil sampling event were not related to changes in STP or P strat. However, it is important to note that the tillage operations used were either vertical, non-inversion tillage or shallow tillage. Such conservation tillage operations limit the mixing of surface and deeper soil horizons, and thus typically maintain significant soil stratification [46,47]. A previous study suggested that one-time inversion tillage that fully eliminated P stratification could greatly reduce EoF P losses [18]. Our results provide evidence that non-inversion tillage practices will not substantially mitigate P stratification and that more intensive inversion tillage will be required to reduce the level of P stratification.
The limited influence of management on STP changes in this study is likely due in part to the soil sampling intensity and frequency. More intensive or frequent sampling may have strengthened our ability to identify management influences, but this study used a soil sampling intensity similar to commercial farms in the region so these results may be a reasonable approximation for the strength of individual fertilization and non-inversion tillage effects on STP that could be expected from commercially collected STP data across individual crop fields.

Relationships between STP and FWM P Concentrations
Stratified soil sampling provided improved prediction of environmental P losses compared to typical agronomic soil sampling. Shallow sample STP accounted for more variability in both surface runoff and tile drainage FWM P concentrations (DRP and TP) compared to agronomic sample STP. Additionally, the best model fits for surface runoff DRP and tile drainage DRP and TP were obtained with regressions using agronomic sample STP combined with P strat . Thus, widespread testing of stratified soil samples could be used to improve identification of fields in the WLEB with increased risk of environmental P losses due to high P stratification. Agronomic soil testing is currently widely employed by producers in the WLEB [48], so stratified soil sample collection and analysis could be readily integrated into existing soil testing efforts. However, producer incentives may be required to encourage widespread implementation since stratified soil sampling increases soil testing costs and provides little direct benefit to producers.
This study provided new evidence that stratified sampling was useful for improving predictions of P loss risk for soil types and geography common in the WLEB, but previous research from small experimental plots has provided mixed evidence that stratified soil sampling can improve relationships between environmental P losses and STP. For example, a rainfall simulation experiment on four soils in Texas showed that 0-5 cm soil samples produced stronger relationships between surface runoff DRP and STP compared to 0-15 cm samples for two soil types, but on two other soils the 0-15 cm samples produced the stronger relationships [26]. In Manitoba, surface runoff DRP from snowmelt was better predicted by 0-5 cm STP than 0-15 cm STP [49]. Conversely, a rainfall simulation study in Wisconsin showed that shallow samples (0-2 cm) did not provide consistent improvements in relationships between surface runoff DRP and STP, relative to 0-15 cm samples [15]. Similarly, in pasture soils, shallow (0-2 cm) samples were not better predictors of surface runoff DRP than deeper (0-10 cm) samples [50]. Finally, sandy pasture soils under simulated rainfall showed no change in the relationship between surface runoff P concentrations and STP between several sample depths (0-2, 0-5, or 0-10 cm), but the soils in that study had relatively little P stratification [51]. The depth of agronomic sampling in this study (20 cm) likely caused greater distinction between sampling depths than in studies with shallower sampling depths, enabling differentiation between the effects of these soil depths on EoF P concentrations. Additionally, removal of EoF observations immediately following P applications may have reduced variability in EoF P concentrations and enhanced our ability to identify differences in the relationships between STP and EoF P concentrations.
Furthermore, the finely textured soils that dominate NW Ohio may favor the enhanced importance of surface soil layers to EoF water quality, as the zone of interaction with water is likely more limited than in coarsely textured soils with greater infiltration rates. Finally, this study encompassed a relatively large number of fields with a wide range of STP and P strat, which provided a sufficient range of conditions from which we were able to observe significant relationships between STP and EoF P concentrations. The predictive benefit of adding P strat to the agronomic STP models was not related to differences in soil texture, slope, or hydrologic conditions within this study, suggesting that soil P stratification was important across the full range of conditions in the studied fields. However, it should be noted that the patterns observed here may not extend to regions with differing climates, management regimes, and soil characteristics.
Concentrations of DRP were more closely related to STP and P stratification than TP concentrations were, indicating that STP is relatively more important for understanding DRP losses than TP losses. Agronomic soil tests use extractants, including the Mehlich-3 extractant used in this study, that aim to indicate the plant availability of orthophosphate over the period of a growing season, and thus can be expected to relate relatively closely to DRP concentrations in runoff [51,52]. In contrast, TP in runoff typically includes a significant portion of sediment-bound particulate P that is not measured by soil tests. Surface runoff TP loss has been shown to be closely related to sediment loss, which is controlled by multiple factors in addition to STP, such as soil erodibility, ground cover, and conservation practices [15,53]. Incorporating these factors into runoff P concentration predictions (in addition to STP) could further improve model predictive power, particularly for TP loss.
The relationships between STP and P concentrations in runoff reported in this work were based on EoF monitoring, and were unsurprisingly weaker than STP-runoff P concentration relationships previously reported from more controlled rainfall simulator studies (e.g., [15,26]). In this study, the variability in runoff P concentrations that was unexplained by STP is likely in large part a result of the EoF data collection effort occurring over the broad range of environmental conditions and management practices included in the USDA EoF network. Management and environmental factors are known to influence DRP and TP concentrations in runoff, and recent research has identified crop rotation [54], soil texture [55,56], tillage [57,58], and precipitation characteristics [59][60][61] as important factors that can influence surface runoff or tile drainage P concentrations. Regardless, this study demonstrated that the influence of STP on EoF P concentrations was readily observed despite the variability in management and environmental characteristics over time and space. Similarly, the benefits of stratified soil sampling were robust enough to be observed across a wide range of crop production scenarios. Thus, predictions of P losses, whether made using empirical relationships or process-based models, could be improved by enhanced efforts to gather and use information on soil P stratification.
An additional challenge of developing relationships at the field scale is the limited understanding of the variability in contributions to EoF water quality across a large field area with variable soil properties and topography. The accuracy of predictions of environmental P losses from heterogeneous fields can be improved through soil testing regimes that account for the disproportionate contributions of some field areas to P losses, particularly through surface runoff [62,63]. Furthermore, in soils where macropores play a role in drainage, subsurface tile drains have been shown to have direct connection with surface soils above the drains [64,65]. Additionally, small "hotspots" of high STP within fields could potentially play a disproportionate role in determining EoF P concentrations [38,66]. In this study, the maximum observed STP did not provide better predictions of EoF P concentrations than the field average STP for either surface runoff or tile discharge, but a more intensive soil sampling regime may be necessary to effectively characterize P hotspots. While this study did not take into account any spatial variation in contributions to EoF water quality across the field areas, future research should investigate the feasibility of designing targeted environmental soil testing to account for spatial differences in contributions to EoF water quality in tile-drained landscapes.
Soil P and runoff P concentrations were measured in a relatively large number of crop fields (39) in this study, yet how closely these fields represent the broader landscape of the WLEB remains an important question. A recent study of STP stratification in croplands in the WLEB presented results of over 140,000 soil samples [18]. The region-wide average STP presented in that study was similar to the average of the fields studied in this research; the agronomic (0-20 cm) STP averaged 48.1 mg kg⁻¹ compared to 44 mg kg⁻¹ in this study. Furthermore, in Baker et al. (2017) [18], 71% of agronomic STPs were <47 mg P kg⁻¹, i.e., within the state extension recommended range for "build-up" and "maintenance" of STP for corn, whereas in this study 69% of field average STPs were below that threshold. Finally, surface (0-5 cm) STP was on average 68% higher than STP in the 5-20 cm depth, resulting in an average P strat of 1.68 in Baker et al. (2017) [18], which was somewhat less than in this study (1.88). The greater P stratification in this study may be due to the high prevalence of no-till and non-inversion tillage in the fields included in the USDA-ARS EoF network [9]. However, the relatively close agreement between our study and the findings of Baker et al. (2017) [18] indicates that the relationships observed in this study should be expected to hold true across the broader WLEB.
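The stratification ratio quoted above follows directly from the layer concentrations: a surface layer with STP 68% higher than the underlying layer gives a ratio of 1.68. A minimal sketch of that arithmetic is shown below. The back-calculation of 5-20 cm STP from 0-5 cm and 0-20 cm samples assumes simple thickness-weighted mixing with equal bulk density in both layers, which is an illustrative assumption and not necessarily how the layers were sampled or calculated in this study. The function names and the example concentrations are hypothetical.

```python
def p_stratification_ratio(stp_shallow, stp_deep):
    """Ratio of shallow-layer STP to deeper-layer STP (both in mg/kg)."""
    if stp_deep <= 0:
        raise ValueError("deeper-layer STP must be positive")
    return stp_shallow / stp_deep


def deep_layer_stp(stp_agronomic, stp_shallow, shallow_cm=5.0, total_cm=20.0):
    """Back-calculate STP of the layer below the shallow sample.

    Assumes the full-depth (agronomic) value is the thickness-weighted mean
    of the shallow and deeper layers, i.e. equal bulk density in both layers.
    This is an assumption made for illustration only.
    """
    deep_cm = total_cm - shallow_cm
    return (stp_agronomic * total_cm - stp_shallow * shallow_cm) / deep_cm


# Hypothetical field: 0-20 cm STP of 40 mg/kg and 0-5 cm STP of 70 mg/kg.
stp_5_20 = deep_layer_stp(stp_agronomic=40.0, stp_shallow=70.0)
ratio = p_stratification_ratio(70.0, stp_5_20)
print(f"5-20 cm STP = {stp_5_20:.1f} mg/kg, P strat = {ratio:.2f}")
```

Under this mixing assumption, two fields with the same agronomic STP can have very different shallow STP, which is the information a single 0-20 cm sample cannot provide.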

Conclusions
Robust relationships between agronomic STP and EoF P concentrations were observed across 39 production crop fields in Ohio. Phosphorus stratification varied widely across the fields, and P concentrations in both tile discharge and surface runoff were related more closely to STP of shallow samples (0-5 cm) than to STP of the agronomic samples (0-20 cm). In fields with greater P strat, predicting EoF P concentrations from the agronomic sample STP systematically underestimated tile discharge DRP and TP concentrations and surface runoff DRP concentration. The improvement in model predictive power from using shallow sample STP rather than agronomic sample STP was greater for DRP than for TP. Additionally, both STP and P strat varied significantly within fields and were dynamic over time, highlighting the need for frequent and intensive soil sampling to accurately estimate the P status of fields and their risk of environmental P loss. Overall, our results suggest stratified soil sampling can be a readily implemented method to improve understanding of the risk of environmental P losses in the WLEB.