Implementing an Operational Framework to Develop a Streamflow Duration Assessment Method: A Case Study from the Arid West United States

Streamflow duration information underpins many management decisions. However, hydrologic data are rarely available where needed. Rapid streamflow duration assessment methods (SDAMs) classify reaches based on indicators that are measured in a single brief visit. We evaluated a proposed framework for developing SDAMs to develop an SDAM for the Arid West United States that can classify reaches as perennial, intermittent, or ephemeral. We identified 41 candidate biological, geomorphological, and hydrological indicators of streamflow duration in a literature review, evaluated them for a number of desirable criteria (e.g., defensibility and consistency), and measured 21 of them at 89 reaches with known flow durations. We selected metrics for the SDAM based on their ability to discriminate among flow duration classes in analyses of variance, as well as their importance in a random forest model to predict streamflow duration. This approach resulted in a “beta” SDAM that uses five biological indicators. It could discriminate between ephemeral and non-ephemeral reaches with 81% accuracy, but only 56% accuracy when distinguishing 3 classes. A final method will be developed following expanded data collection. This Arid West study demonstrates the effectiveness of our approach and paves the way for more efficient development of scientifically informed SDAMs.


Introduction
Streamflow duration drives biodiversity patterns and ecosystem functions in stream reaches and their adjacent riparian zones, and knowledge of a reach's streamflow duration may be used to inform a wide range of management decisions, such as determining applicable water quality standards or setting goals for restoration efforts [1]. However, accurate characterization of streamflow duration requires long-term hydrologic data, which are typically available only at a relatively small number of gaged reaches [2,3]. Hydrologic models have been developed to predict streamflow duration (e.g., [4][5][6][7][8]). However, hydrologic models typically lack the ability to distinguish between ephemeral and intermittent reaches (due to the scarcity of gaged ephemeral reaches), and models based on gages with undisturbed catchments may not apply to reaches with altered hydrology [1]. Even without these limitations, managers would still need empirical methods to validate predictions from hydrologic models at ungaged reaches. Consequently, resource managers and researchers need rapid methods to assess streamflow duration at the reach scale where long-term data are unavailable; streamflow duration assessment methods (SDAMs) meet this need.
The term "streamflow duration" describes the extent to which a stream reach has continuous surface flow over time, typically on an annual time-scale. Perennial stream reaches have uninterrupted surface flow throughout the year, except in years with extreme drought conditions; non-perennial streams cease to flow for some period of time [1,9]. The period that non-perennial streams cease to flow varies greatly depending on climate, geology and land cover; classification of streamflow duration can also depend on the spatial and temporal scale being evaluated [10]. Stream drying periods exist along a continuum, and it remains unknown how diverse or predictable non-perennial flow regimes are from one another within and among geographic regions [8]. Nevertheless, distinct flow classes are often used to describe variability in non-perennial flow regimes based on flow metrics and watershed characteristics. For example, flow metrics (e.g., mean and variability in zero-flow, low flows, and high flows) were used to classify flow regimes of stream reaches into seven non-perennial archetypes in the Arid West U.S. [8] and three non-perennial archetypes in the Ozark-Ouachita highlands [11]. Resource managers and regulators typically require clearly defined flow classifications that can be determined based on widely available and easily interpretable data or field indicators. For example, the U.S. Texas Commission on Environmental Quality simply distinguishes between intermittent streams with and without perennial pools [12].
Streamflow duration is a primary component of stream hydrological regimes, along with the magnitude, frequency, rate of change, and timing of flow or drying events. It is a fundamental driver of stream ecosystem dynamics, life-history strategies, and diversity patterns (e.g., [13][14][15][16][17][18][19]). Streamflow duration affects fluvial processes, such as the delivery of nutrients, energy and other materials, and flux with terrestrial systems [20]. Stream networks in arid regions are characterized by a high degree of spatial and temporal heterogeneity [21], and transitions between flowing, pooled, and dry phases lead to greater temporal variation in in-stream environmental conditions in non-perennial streams compared to perennial streams [22]. Correspondingly, the biodiversity and ecosystem functions supported by different reaches within a watershed vary, depending on their duration of flow [19,22,23,24].
There are multiple implications of streamflow duration for water resource management. Whether a stream is perennial or non-perennial informs implementation of state and local mandates and ordinances, such as riparian buffer requirements. Knowledge of streamflow class improves predictability for ecological assessment of streams, helps set appropriate water quality expectations and predict the provision of ecosystem services, and may inform the prioritization of restoration and protection efforts. Additionally, this information could help determine whether a stream may be subject to jurisdiction under the U.S. Clean Water Act, which encompasses several regulatory and non-regulatory programs affecting the management of water resources.
The objectives of this study are to document the development of the beta SDAM for the Arid West (AW), referred to as the SDAM AW, evaluate its performance compared to other SDAMs used in portions of the region, and identify the successes and challenges that we experienced when following the approach of Fritz and others [1].

Materials and Methods
We followed the process outlined by Fritz and others [1] in developing this SDAM.

Streamflow Duration Classes
Streamflow duration classifications are applied to a reach where streamflow duration information is needed. A reach is a section of stream or river along which similar hydrologic conditions exist (e.g., discharge, depth, velocity, or sediment transport dynamics) and consistent drivers of hydrology are evident (e.g., slope, substrate, geomorphology, or confinement). A channel is an area that is confined by banks and a bed and contains flowing water (continuously or not). Our definitions of streamflow duration classes followed those used by Nadeau [27]:

• Ephemeral reaches flow only in direct response to precipitation. Water typically flows only during and/or shortly after large precipitation events, the streambed is always above the water table, and stormwater runoff is the primary water source.
• Intermittent reaches contain sustained flowing water for only part of the year, typically during the wet season, where the streambed may be below the water table or where the snowmelt from surrounding uplands provides sustained flow. The flow may vary greatly with stormwater runoff.
• Perennial reaches contain flowing water continuously during a year of normal rainfall, often with the streambed located below the water table for most of the year. Groundwater typically supplies the baseflow for perennial reaches, but the baseflow may also be supplemented by stormwater runoff or snowmelt.

Study Area
The AW encompasses over 1.4 million km² in the western United States, covering portions of states from California to Texas and Montana. The region is defined by a combination of variables related to climatic, landcover, vegetation, and soil conditions (Figure 2) [28]. The AW includes deserts, grasslands, steppes, shrublands, and woodlands, and is characterized by relatively low rainfall (i.e., <15 inches per year) [28]. In the Mediterranean climate of coastal California, rainfall is restricted to mild and wet winter months; many stream reaches cease to flow in the dry summer [29]. In contrast, other portions of the AW are characterized by desert climates, where rainfall is less predictable. The Sonoran Desert is characterized by both a wet and mild winter, as well as summer monsoons, while the Great Basin, Mojave, and Chihuahuan Deserts regularly experience freezing winters. Snowmelt has minimal influence on stream hydrology in the Arid West, except in high-elevation areas [28]. Hydrologic models suggest that the vast majority of stream-miles in the region are non-perennial [30,31], although the relative extent of ephemeral and intermittent reaches is not well documented. Non-perennial streams dominate the region [21], and (in contrast to other regions) often occur in the middle or lower portions of watersheds in addition to headwaters [28].
Mazor et al. Page 4 Water (Basel). Author manuscript; available in PMC 2022 November 22.
Many of the largest metropolitan areas in the USA are located within the AW, and the region continues to experience rapid population growth with expanding urbanization (e.g., [32,33]). Thus, the need for an SDAM in permitting and management programs is high in this region. Within the AW, at least two SDAMs are currently in use, applicable to only specific geographic areas: the Pacific Northwest (PNW) method [27], and the New Mexico (NM) method [34]. However, prior to the current study, the rest of the region lacked any tool to assess streamflow duration. Our effort focused on the portion of the AW outside the PNW (Figure 2).
Fritz and others [1] recommend establishing a technical advisory committee composed of scientific staff from state, tribal, federal, and local agencies involved in water resources management likely to use the SDAM. At the outset of the project, we assembled a regional steering committee (RSC) consisting of technical staff at Corps Districts and USEPA Regional Offices in the AW region that manage programs where streamflow duration information is often needed (e.g., U.S. Clean Water Act programs, including permits and enforcement). RSC members were selected based on their expertise in both scientific and programmatic elements relevant to streamflow duration classification needs. The RSC served several functions in the development process, such as reviewing technical products, facilitating connections with local experts, and identifying resources such as sources of hydrologic data.

Identify Candidate Indicators
We identified potential indicators that were supported by the scientific literature (reviewed in [35]) or used in existing SDAMs developed for portions of the AW; specifically, the New Mexico SDAM (NM method) [34], and the SDAM for the PNW (PNW method) [27]. Following input from the RSC, these candidate indicators were then screened using the criteria described by Fritz and others [1], including:
• Consistency: Does the indicator consistently discriminate among flow duration classes (e.g., demonstrated in multiple studies)?
• Repeatability: Can different practitioners take similar measurements, given sufficient training and standardization?
• Defensibility: Does the indicator have a rational mechanistic relationship with flow duration, as either a response or a driver?
• Rapidness: Can the indicator be measured during a one-day reach-visit (even if subsequent lab analyses are required)?
• Objectivity: Does the indicator rely on objective (often quantitative) measures, as opposed to subjective judgments of practitioners?
• Robustness: Does human activity complicate indicator measurement or interpretation (e.g., poor water quality may affect the expression of some biological indicators)?
• Practicality: Can practitioners realistically sample the indicator with typical capacity, skills, and resources?
Indicators were included in this study (Table 1) if they met all of the above criteria or were included in the NM or PNW SDAMs to facilitate comparison among the methods [35]. At a typical site, a field crew consisting of two people could measure all indicators and complete an assessment in about an hour (not including time to identify aquatic invertebrates).

Identify Candidate Reaches
Goals in Selecting Reaches for Method Development: We had two objectives in selecting candidate reaches for the AW region covered by this study: first, to include a sufficient number of reaches in each streamflow duration class to characterize variability in indicator measurements; second, to select reaches representing the range of key natural and disturbance gradients within the region to ensure that the method would work in all conditions where assessments were needed. To support our goal of geographic representativeness, we divided the AW into five subregional strata: one stratum each for California, Arizona, and Nevada; a stratum combining New Mexico and Texas; and a stratum comprising the remaining states (i.e., Colorado, Wyoming, Utah, and Montana; Figure 2). We aimed to select 100 publicly accessible stream reaches (one assessed location per reach) with equal representation of perennial, intermittent, and ephemeral flow duration among and within the five AW subregions.
Classifying Streamflow Duration Based on Hydrologic Data: In order to determine the accuracy of our candidate indicators and the resulting beta method, it was necessary to assign each study reach an independently determined flow classification based on direct observations of stream hydrology (e.g., flow versus no flow), independent of indirect indicators of streamflow duration observed remotely or in the field (e.g., watershed size, hydrophytic vegetation, and channel structure). To screen reaches for use in method development, we first compiled a list of candidate study reaches based on existing hydrologic data records (e.g., USGS stream gages, water presence loggers, wildlife cameras, and field photos), published studies, and interviews with local experts familiar with the specific reach's hydrology. Continuous data records (e.g., daily flow loggers and wildlife cameras) were used to classify a reach as perennial if they indicated fewer than 5% zero-flow days, ephemeral if they indicated fewer than 5% flowing days, or intermittent if they indicated an intermediate number of flowing days over the period of record (which varied from a single year to many decades, depending on the data source). These criteria have been used in previous studies (e.g., [40]), and they serve to reduce the influence of extreme climatic events or rare instrumentation failure that could otherwise modify a classification [3]. Discontinuous data (e.g., field notes, field photos) were used to confirm reach flow duration if direct observations of stream hydrology indicated flowing and dry conditions at appropriate times (e.g., seasonal wet or dry periods), or to supplement flow classifications based on multiple data sources. Details about reach classification and correction of misclassified reaches are presented in Supplementary File S1a.
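The 5% thresholds for continuous records can be illustrated with a minimal Python sketch. The function name and the boolean daily-record input are assumptions made for illustration; the study's own analyses were conducted in R, and this is not the authors' code.

```python
def classify_flow_duration(daily_flowing):
    """Classify a reach from a continuous daily record.

    daily_flowing: sequence of booleans, True on days when flow was observed.
    (Hypothetical input format, chosen for illustration.)
    """
    n_days = len(daily_flowing)
    if n_days == 0:
        raise ValueError("record is empty")
    flowing_days = sum(1 for day in daily_flowing if day)
    zero_flow_days = n_days - flowing_days
    # Fewer than 5% zero-flow days over the period of record: perennial.
    if zero_flow_days / n_days < 0.05:
        return "perennial"
    # Fewer than 5% flowing days over the period of record: ephemeral.
    if flowing_days / n_days < 0.05:
        return "ephemeral"
    # Anything in between is treated as intermittent.
    return "intermittent"
```

For example, a one-year record with 10 dry days would be classified as perennial, while a record with only 10 flowing days would be classified as ephemeral.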
Classified reaches were prioritized for study inclusion based on the number and type of data sources available to determine actual streamflow duration classification. Reaches where flow duration could be determined based on multiple data sources (e.g., water presence loggers and expert knowledge) were categorized as "preferred" for study inclusion. Reaches where flow classes were determined based solely on interpretation of USGS stream gage data without consultation of a local expert were categorized as "USGS gage" reaches. Reaches identified through local expertise alone were categorized as "acceptable" and included in this study where necessary to fill data gaps in study subregions where a sufficient number of "preferred" and "USGS gage" reaches with an intermittent or ephemeral flow classification could not be identified.

Selecting Reaches for Inclusion in This Study: Once a list of classified candidate study reaches of known streamflow duration was generated as previously described, we first identified ephemeral reaches (the least represented class among our candidates, owing to the paucity of hydrologic data on ephemeral reaches) in each of the five subregions. We then added intermittent and perennial reaches in close proximity to create "clusters" of multiple reaches following the design of Nadeau and others [25] to maximize the number of study reaches that could be sampled by limiting travel time and access issues. Whenever possible, "preferred" reaches were selected before "acceptable" reaches. Backup reaches were identified for every cluster in case a reach was inaccessible. All selected study reaches were on publicly accessible property and within a 20 min walk of an access point. Most of these reaches were visited only once (typically in summer); ten percent of reaches were targeted for an additional visit in a different season to provide information about temporal variability of the indicators and consistency of the final method. This process resulted in a list of 100 target coordinates representing the downstream end of potential study reaches.

Focus-Area Studies: In addition to the reaches described above, we included additional reaches of interest to local water resource managers in two watersheds: the Santa Margarita River in California and the Hassayampa River in Arizona. Reaches in these watersheds served as focus-area studies. Each focus-area study was led by practitioners with different backgrounds, each of whom frequently needs streamflow duration information as part of their job duties. Thus, they could provide the method development team with early independent feedback on how well the method is likely to suit their needs. Each practitioner was provided with a day of training in the same protocols described below, after which they collected data from each reach during multiple repeated visits throughout the year. Apart from the greater frequency of sampling, data collection procedures at focus-area reaches were identical to procedures conducted at other reaches in this study.
Reaches within each focus area were located along a longitudinal gradient from headwaters to mainstems, without prior knowledge of flow duration. Stream Temperature, Intermittence and Conductivity loggers (STIC loggers, [41]) were installed at each focus-area study reach to enable their eventual classification. The goal in selecting these reaches was to enable study leads to test methods in a real-world application and provide feedback on challenges that became evident in their experience. Additionally, data generated from these focus-area study reaches were also included in method development at reaches where the true streamflow duration could be determined following the same approach used to determine true streamflow duration classes at other study reaches.

Data Collection
To guide data collection, we developed a protocol that described measurement of indicators that were identified in the literature review or were part of SDAMs used within the region. Specifically, we included "Level 1" indicators of the NM method [34], and all indicators of the PNW method [27]. Indicators are summarized in Table 1, and the complete protocol is provided in Supplementary File S1. A study reach was established by first approaching the target coordinates and measuring bankfull width at three locations (at the target location, 15 m upstream, and 30 m upstream). The total study reach length was then defined as 40 times the average width, but no longer than 200 m and no shorter than 40 m. If necessary, the study reach boundaries were adjusted to exclude major tributaries, improve access, or to maintain consistency with channel features that could affect streamflow duration (e.g., valley confinement, streambed substrate, proximity to a culvert or road crossing). Details about quality assurance are provided in the Quality Assurance Project Plan (Supplementary File S1).
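The reach-length rule can be written as a short calculation. The following Python sketch is illustrative only; the function name is hypothetical, and any subsequent boundary adjustments described in the protocol are not modeled.

```python
def study_reach_length(bankfull_widths_m):
    """Return the study reach length in meters.

    bankfull_widths_m: bankfull widths (m) measured at the target location,
    15 m upstream, and 30 m upstream.
    """
    if not bankfull_widths_m:
        raise ValueError("at least one width measurement is required")
    mean_width = sum(bankfull_widths_m) / len(bankfull_widths_m)
    # 40 times the average width, bounded between 40 m and 200 m.
    return min(200.0, max(40.0, 40.0 * mean_width))
```

Under this rule, widths of 2, 3, and 4 m give a 120 m reach, whereas very narrow channels default to 40 m and wide channels are capped at 200 m.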
Measured indicators were grouped into several types, briefly described as follows:

Geomorphic Indicators: Valley slope was measured with a handheld clinometer, and bankfull width was measured with measuring tape or a stadia rod. Several indicators were measured based on visual estimation following the scoring guidance in the NM method, including sinuosity, floodplain and channel dimensions (i.e., the entrenchment ratio), in-channel structure/riffle-pool sequence, the extent of sediment deposition on plants and debris on the floodplain, and the extent of substrate sorting.

Hydrologic Indicators: The extent of surface and subsurface flow, as well as the number of isolated pools, were visually estimated following the PNW method [27]. The extent of water in the channel was scored following the guidance in the NM method [34]. The presence of seeps or springs within one-half channel width of the channel was noted. The presence of hydric soils was evaluated by digging in the top 6 inches of substrate at the toe of the banks in up to 3 locations. In channels without surface water, soil moisture and texture were measured at three locations. The number of woody jams within the reach or up to 10 m outside the reach was noted. For our purposes, a woody jam must contain at least 3 large pieces of wood (>1 m long and >10 cm in diameter), span the entire width of the channel, and be in contact with the streambed such that it would disrupt the movement of water or sediment during flowing conditions.
For hydrologic indicators, we distinguished those reflecting direct measures of water presence (e.g., percent of the reach with surface flow and soil moisture) from those reflecting indirect measures (e.g., hydric soils and number of woody jams). Doing so allowed us to compare models with or without these types of indicators.

Biological Indicators: The abundance of selected biological indicators was scored following the guidance in the NM method: fish, amphibians, aquatic invertebrates (referred to as benthic macroinvertebrates in the NM method [34]), and filamentous algae. Other indicators derived from the NM method requiring subjective scoring included differences in vegetation between the riparian corridor and adjacent uplands, and the absence of upland rooted plants in the streambed. The presence of iron-oxidizing fungi or bacteria was also noted.
Aquatic invertebrates were collected for up to 15 min from at least 6 locations representing the range of microhabitats available in an assessment reach (e.g., riffles, pools, undercut banks). In dry streams, suitable microhabitats (e.g., remnant pools, under large cobbles, and stream margins) were searched for shells, cases, exuviae, and other evidence of aquatic invertebrates. Specimens were identified to the best practical level in the field (generally family), and vouchers of every taxon encountered were retained and sent to a lab to verify identifications. Up to 10 individuals per morphotaxon were counted and recorded. The presence of taxa designated as indicators of perennial flow by Nadeau [27] was noted.
Hydrophytes (i.e., plants with Facultative-Wet [FACW] or Obligate [OBL] status in the AW Regional Wetland Plant List from the US Army Corps of Engineers by Lichvar and others [36]) growing within the channel or within a half-channel-width of the channel were noted, regardless of prevalence or dominance. Taxa not included by Lichvar and others [36], such as Populus fremontii, were not considered to be hydrophytes. Where necessary, photo vouchers or specimens were collected to verify identifications.
Observations of live fish or aquatic life stages of snakes and amphibians were noted. Non-native mosquitofish (Gambusia sp., typically G. affinis) were noted separately.
Streambed cover by live or dead algal mats, liverworts, or mosses with "streamer" morphology was also estimated.

Geospatial Data: Geospatial data were collected in order to evaluate potential indicators and for use as covariates in models. Level 2 and Level 3 Omernik Ecoregions were derived by overlaying points on shapefiles downloaded from the EPA [39]. Climate data derived from PRISM were assessed for each sampling location using the PRISM package [38]. A large number of landscape-scale metrics relating to watershed characteristics (e.g., soil type, geology) were acquired from the StreamCat dataset [37] by first determining the unique identifier of the nearest stream segment in the National Hydrography Dataset Plus (NHD Plus, [42]); however, a handful of study reaches were located on stream segments that are not represented in the NHD Plus dataset, and therefore had no data available in StreamCat.

Data Analysis
All data and code used in analyses are provided in Supplementary File S1.
Calculation of Metrics: Data from indicator measurements were converted into metrics that could be used in an SDAM. For example, "number of mayflies" is a metric derived from the aquatic invertebrate indicator data. Whenever possible, metrics were expressed in continuous or ordinal formats, although binary metrics (e.g., "presence of hydrophytic plants") were also considered. This process resulted in 54 metrics derived from field-collected data (7 geomorphic metrics, 7 hydrological metrics, 40 biological metrics), and 101 metrics derived from geospatial data, of which 83 were derived from StreamCat. Most biological metrics for aquatic invertebrates were expressed as richness or abundance, both relativized to sample totals and in unrelativized (raw) forms; because our collection methods may have undercounted non-insects in streams where insects dominated, metrics that focused on non-insects (e.g., Gastropoda, Oligochaeta, and Diptera [GOLD] taxa) were only evaluated in relativized forms. The full list of analyzed metrics is presented in Supplementary File S3.

Metric Screening: As an initial data exploration step, we visualized the relationships between streamflow duration class and indicators by ordinating all 155 metrics for all samples in the dataset in a nonmetric multidimensional scaling using Gower's distance. Convex hulls were drawn around each streamflow duration class to help visualize their distributions in ordination space. The 2-axis ordination was computed using the metaMDS function in the vegan R package [43]. Correlation coefficients (Spearman's rho) were calculated between ordination axes and metric values.
Metrics were evaluated against a number of criteria to determine their suitability for inclusion in an SDAM (Table 2). We developed criteria following approaches for screening metrics in bioassessment indices (e.g., [44]), and applied them to data from initial reach visits (i.e., data from revisits were withheld from analysis). One criterion was a distribution statistic, calculated as percent dominance of the most common value (which was typically zero); all metrics had to meet this criterion. The remaining criteria were based on measures of responsiveness. Most of these measures were based on statistical comparisons of mean values at different subsets of reaches (e.g., t-statistic from a comparison of metric values at perennial and non-perennial reaches), as has been used in other studies [45][46][47]. Another responsiveness statistic was based on variable importance (specifically, mean decrease in accuracy) from a random forest model to predict streamflow duration class from all possible metrics; the model was calibrated using the default option from the randomForest function in the randomForest package in R [48]. Metrics had to meet at least one responsiveness criterion to be considered in further analyses.
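As an illustration of the distribution statistic, the following Python sketch computes percent dominance of the most common value for a single metric. The function name is hypothetical, and the screening threshold itself (given in Table 2) is not reproduced here.

```python
from collections import Counter


def percent_dominance(metric_values):
    """Percent of observations that share the single most common value."""
    if not metric_values:
        raise ValueError("no observations supplied")
    counts = Counter(metric_values)
    # most_common(1) returns [(value, count)] for the modal value.
    modal_count = counts.most_common(1)[0][1]
    return 100.0 * modal_count / len(metric_values)
```

A metric that is zero at three of five reaches would have a percent dominance of 60; a metric with very high dominance carries little information for discriminating among classes.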

Metric Selection: Once a limited number of candidate metrics had been identified by the screening process, we used the recursive feature elimination (rfe) function in the R caret package [49] to select a final set of metrics for the beta SDAM based on their importance in random forest models. Briefly, rfe is a form of stepwise selection in which complex models (i.e., those based on many metrics) are calibrated, and simpler models are then considered by calibrating new models after eliminating the least important metrics. We started with the most complex model (i.e., all candidate metrics included) and eliminated 5 variables at a time in each step based on low variable importance until a 20-variable model was identified; after this point, only one variable was eliminated in each step. The best performing model (highest accuracy in predicting streamflow duration class, as measured by Cohen's Kappa) was identified, and the simplest model (i.e., the one with the fewest variables) with a Kappa statistic within 1% of the best was selected to identify the final set of metrics. If the best-performing model selected by this approach had more than 20 variables, the 20-variable model was selected.
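The elimination schedule and the "simplest model within 1% of the best" rule can be sketched in Python as follows. This is not the caret implementation: the function names are hypothetical, the Kappa values are supplied by the caller rather than computed from fitted models, and "within 1%" is interpreted here as a relative difference from the best Kappa, which is an assumption.

```python
def subset_sizes(n_candidates, coarse_step=5, fine_from=20):
    """Model sizes visited: drop 5 metrics per step down to 20, then 1 per step."""
    sizes = []
    n = n_candidates
    while n > fine_from:
        sizes.append(n)
        n = max(fine_from, n - coarse_step)
    while n >= 1:
        sizes.append(n)
        n -= 1
    return sizes


def select_model_size(kappa_by_size, tolerance=0.01, max_size=20):
    """Pick the smallest model whose Kappa is within `tolerance` of the best.

    kappa_by_size: dict mapping number of metrics -> Cohen's Kappa.
    Assumes the best Kappa is positive and that a model of `max_size`
    metrics was evaluated.
    """
    best = max(kappa_by_size.values())
    eligible = [size for size, kappa in kappa_by_size.items()
                if kappa >= best * (1.0 - tolerance)]
    # Fall back to the 20-variable model if the choice is larger than that.
    return min(min(eligible), max_size)
```

For instance, `subset_sizes(32)` begins 32, 27, 22, 20, 19, and continues one metric at a time down to 1.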
We applied this modeling process to different subsets of the dataset, including:
• The full region-wide dataset, and
• Five separate datasets, one for each subregion shown in Figure 2.
For both of these options, the modeling process was implemented:
• With or without considering geospatial metrics; and
• With or without considering metrics based on direct measures of water presence.
Thus, we developed 24 models to explore eight approaches (1 full vs. 5 subregional models, with or without geospatial metrics, with or without metrics based on direct measures of water presence). Analyses were conducted on data from the initial reach visits alone. For each of the 24 models, data were split into 80% calibration and 20% validation datasets, stratified by the 5 subregions and 3 streamflow duration classes.
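The count of 24 models and the stratified 80/20 split can be illustrated with a short Python sketch. The subregion labels, dictionary keys, and function names are assumptions for illustration, and the rounding used within each stratum is one simple choice among several.

```python
import random
from collections import defaultdict
from itertools import product

SUBREGIONS = ["California", "Arizona", "Nevada", "New Mexico-Texas", "Other states"]
DATASETS = ["Full region"] + SUBREGIONS  # 1 region-wide + 5 subregional datasets


def model_configurations():
    """Enumerate every dataset x geospatial x direct-water combination."""
    return [
        {"dataset": d, "geospatial": g, "direct_water": w}
        for d, g, w in product(DATASETS, (True, False), (True, False))
    ]


def stratified_split(reaches, calibration_fraction=0.8, seed=0):
    """Split reaches within each (subregion, flow class) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for reach in reaches:
        strata[(reach["subregion"], reach["flow_class"])].append(reach)
    calibration, validation = [], []
    for members in strata.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Rounding per stratum; small strata may deviate from exactly 80/20.
        n_cal = round(calibration_fraction * len(shuffled))
        calibration.extend(shuffled[:n_cal])
        validation.extend(shuffled[n_cal:])
    return calibration, validation
```

Six datasets crossed with two binary options give 6 x 2 x 2 = 24 models, which collapse to 2 x 2 x 2 = 8 approaches when the five subregional datasets are treated as a single stratified approach.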

Model Calibration and Performance Evaluation: Models were fit for each of the 24 options identified in the previous step. We explored two types of models: random forest models, and models based on single classification trees (following the method of Nadeau and others [25] for the PNW). Methods based on single classification trees are easier for inexperienced practitioners to interpret and use, but they are more prone to overfitting than random forest models, which are based on large numbers of classification trees.
Random forest models were fit using the randomForest function in the randomForest package in R [48] using default parameters, except that the number of trees was set to 1500 instead of the default 500. Classification trees were fit using the rpart function in the rpart package in R [50]. Only data from the initial visit to reaches in the calibration dataset were used for model fitting.
Model performance evaluation focused on two aspects: accuracy and repeatability. Accuracy was assessed by calculating the same comparisons used to evaluate metric responsiveness during the metric screening phase (e.g., ephemeral versus at least intermittent reaches and perennial versus wet intermittent reaches; Table 2). Accuracy was measured using the initial reach-visit in both the calibration and validation datasets independently. We compared validation and calibration measures to see if models validated poorly, suggesting that they may be overfit.
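The accuracy comparisons can be sketched in Python as the fraction of reaches classified correctly, either across all three classes or after collapsing classes into a two-way comparison. The function names and class labels are illustrative assumptions rather than the authors' code.

```python
def accuracy(true_classes, predicted_classes, collapse=None):
    """Fraction of reaches where the (optionally collapsed) classes agree."""
    if len(true_classes) != len(predicted_classes) or not true_classes:
        raise ValueError("inputs must be non-empty and of equal length")
    if collapse is None:
        collapse = lambda flow_class: flow_class  # three-class comparison
    hits = sum(collapse(t) == collapse(p)
               for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)


def ephemeral_vs_at_least_intermittent(flow_class):
    """Collapse perennial and intermittent into a single class."""
    return "ephemeral" if flow_class == "ephemeral" else "at least intermittent"
```

Computing the same measure separately on the calibration and validation datasets allows the two values to be compared; a much lower validation accuracy would suggest overfitting.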
Repeatability was assessed using data from the 12 reaches that were revisited and was calculated as the percent of reaches where classifications from both visits were the same (regardless of whether the classification is correct). Due to the limited amount of data, repeatability was only assessed on a region-wide basis, and not within each subregion.
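Repeatability as defined here is a simple percent agreement between visits. A minimal Python sketch, assuming classifications are stored in dictionaries keyed by a reach identifier (a hypothetical data layout):

```python
def repeatability(first_visit, second_visit):
    """Percent of revisited reaches classified identically on both visits.

    first_visit, second_visit: dicts mapping reach ID -> predicted class.
    Only reaches present in both dicts are compared; correctness of the
    classification is not considered.
    """
    shared = first_visit.keys() & second_visit.keys()
    if not shared:
        raise ValueError("no reaches were visited twice")
    same = sum(first_visit[reach] == second_visit[reach] for reach in shared)
    return 100.0 * same / len(shared)
```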

Selection of a Final Model: A final model was selected based on its performance, as well as the advice we received from the RSC. We presented performance measures (i.e., accuracy and repeatability) and other characteristics of final models to the RSC to provide feedback and advice on selecting a final model. In particular, we asked them:
• Is a subregionally stratified approach warranted?
• Should we include geospatial metrics in the model?
• Should we include direct measures of water presence in the model?
• Should we use a single decision tree or a random forest model?

Refinement and Creation of a Final Beta Method:
After selecting a final model, we made several revisions to facilitate its use and acceptance by the management community, based on feedback and interactions with the RSC. As explained below, the RSC ultimately recommended a random forest model over a single decision tree, and many of these refinements were conducted with these types of models in mind. Performance of the refined model was re-evaluated following these modifications.

Refinement of Indicators:
The metric selection process described above identified an optimal set of metrics to use in the SDAM, but it did so without considering difficulties in measuring each metric or effort required to measure all of the metrics. For example, rfe may have selected a metric based on the total number of aquatic invertebrates, even if there was little new information provided once 20 were observed. That is, field crews might be able to cease counting aquatic invertebrates once they found 20 individuals. Thus, we explored ways to simplify metrics in order to reduce the burden on field crews and facilitate use of the method (e.g., avoid reliance on access to statistical software). We also identified metrics that could be eliminated because they were closely related to another metric in the final method. Metrics that were more complicated to measure were rejected if a simpler to measure alternative was available, and continuous metrics were converted to binary or ordinal variables based on visual interpretation of random forest partial dependence curves. Accuracy and repeatability measures were re-evaluated to ensure that overall model performance was not substantially affected by the modifications.
Increased Confidence Required for Classifications: Random forest models, when used in classification mode, traditionally make assignments based on the class that receives the highest number of votes from the "trees" in the forest. Thus, in a 3-way decision, the "winning" class could receive much less than a majority of votes (as low as 34%). The RSC believed such low-confidence classifications may be insufficient for certain uses of the SDAM, and instead recommended exploring approaches to distinguish between high- and low-confidence classifications.
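The floor on a winning plurality follows from simple counting. The sketch below computes the smallest number of votes that can strictly beat every other class, assuming that ties do not count as a win (how ties are resolved in practice is not described here). The function name is hypothetical.

```python
def min_winning_votes(n_trees, n_classes):
    """Smallest vote count that can strictly beat every other class."""
    # A winner with v votes needs each other class to hold at most v - 1:
    # v + (n_classes - 1) * (v - 1) >= n_trees.
    return -(-(n_trees + n_classes - 1) // n_classes)  # ceiling division


votes = min_winning_votes(1500, 3)  # forest size used in this study
share = votes / 1500
```

For a 1500-tree forest and three classes, this gives 501 votes, or just over one-third of the forest.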
Based on this input from the RSC, we explored increasing the minimum number of votes required to make a confident classification from 50% to 100% by increments of 2.5%. When the final model was applied to a novel test reach and a single class received a sufficient percent of votes, then the reach was classified accordingly. If none met the minimum, but the combined percent of votes for intermittent and perennial classes exceeded the minimum, then the reach was classified as At least intermittent. In all other cases, the reach was classified as Need more information. This decision framework reflects the opinion of the RSC that distinguishing between ephemeral and at least intermittent reaches was a high priority use of the SDAM, more so than distinguishing between perennial and nonperennial reaches. We calculated the percent of reaches under each of the five possible classifications with increasing minimum vote agreement thresholds, and presented the results to the RSC to select a minimum threshold.
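A literal reading of this decision rule can be sketched as below. The function and variable names are hypothetical. Treating the minimum as inclusive, and returning Need more information when more than one class ties at the minimum, are assumptions rather than details reported here.

```python
LABELS = {"E": "Ephemeral", "I": "Intermittent", "P": "Perennial"}


def classify(votes, min_share=0.5):
    """Classify a reach from random forest vote shares.

    `votes` maps "E", "I" and "P" to the proportion of trees voting for
    each class. Treating the minimum as inclusive (>=) is an assumption.
    """
    meeting = [c for c in LABELS if votes[c] >= min_share]
    if len(meeting) == 1:
        return LABELS[meeting[0]]
    # Otherwise, pool the intermittent and perennial votes.
    if votes["E"] < min_share and votes["I"] + votes["P"] >= min_share:
        return "At least intermittent"
    return "Need more information"


# Candidate minimums from 50% to 100% in steps of 2.5%
thresholds = [(500 + 25 * i) / 1000 for i in range(21)]

example = {"E": 0.45, "I": 0.30, "P": 0.25}  # hypothetical vote shares
results = {t: classify(example, t) for t in (0.5, 0.6)}
```

In this example, raising the minimum from 0.5 to 0.6 changes the outcome from At least intermittent to Need more information. This is the trade-off between confidence and precision that was presented to the RSC.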

Addition of Single Indicators:
Single indicators can supersede model classifications of Ephemeral or Need more information to At least intermittent. Single indicators provide technical benefits (i.e., improved accuracy), as well as non-technical benefits (such as greater acceptance of the SDAM, given public understanding of the role of streamflow duration in supporting wildlife), which is why they are used in most other SDAMs (e.g., [27,34,51,52]). We evaluated the following potential single indicators, based on recommendations from the RSC:
• Presence of live fish,
• Presence of live amphibians,
• Presence of any living aquatic vertebrate (fish, amphibians, or reptiles), and
• Live or dead (desiccated) algal cover on the streambed ≥10%.
We evaluated the number of instances where the change would correct a misclassification (i.e., the reach was truly intermittent or perennial), and the number of times it would introduce a misclassification (i.e., the reach was truly ephemeral).
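The tally described above can be sketched as follows, assuming each reach is stored as a record with its true class, its model classification, and a flag for the candidate indicator. The field names and example records are hypothetical.

```python
def evaluate_override(reaches, indicator):
    """Count misclassifications corrected and introduced by a single indicator."""
    corrected = introduced = 0
    for r in reaches:
        overridable = r["model"] in ("Ephemeral", "Need more information")
        if not (overridable and r[indicator]):
            continue  # the classification would not change
        if r["true"] in ("I", "P"):
            corrected += 1   # truly intermittent or perennial
        else:
            introduced += 1  # truly ephemeral
    return corrected, introduced


# Hypothetical records; "algae_ge_10pct" flags algal cover of at least 10%
reaches = [
    {"true": "I", "model": "Ephemeral", "algae_ge_10pct": True},
    {"true": "E", "model": "Ephemeral", "algae_ge_10pct": True},
    {"true": "P", "model": "Perennial", "algae_ge_10pct": True},
    {"true": "E", "model": "Need more information", "algae_ge_10pct": False},
]
corrected, introduced = evaluate_override(reaches, "algae_ge_10pct")
```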

Evaluation of the Final Beta SDAM and Comparison with Other SDAMs Used in Portions of the AW
We applied the final beta SDAM model to the dataset and calculated the same accuracy and repeatability measures described above. We investigated reaches where the beta SDAM classifications did not correspond to known streamflow duration class for that reach or resulted in classifications of Need more information.
Classifications and performance measures from the beta SDAM for the AW were compared to the PNW method [27] and the NM method [34]. Our data collection only allowed classification following the first phase of the NM method, which may result in "gray zone" classifications of tentatively intermittent and tentatively perennial. For our assessments, these results were treated as intermittent and perennial, respectively.

Application to Two Focus-Area Studies
Each reach assessed as part of the focus-area studies was classified according to the final Beta SDAM AW. These classifications and underlying indicator values were presented to the practitioners. We then asked them whether the classifications agreed with their understanding of the study area, what elements of the protocol worked well, and which presented challenges. Finally, we asked for their thoughts on whether the final method would be suitable for their programs and monitoring needs.

Identification of Candidate Indicators
We identified eight flow duration methods for temperate regions, two of which cover portions of the AW: the SDAM for the Pacific Northwest (the PNW method [27]), and the New Mexico method (the NM method [34]). Other methods that were identified covered arid regions in other parts of the world (e.g., Mediterranean Europe [53,54]), or non-arid portions of the United States (e.g., North Carolina [52], Ohio [55], Kentucky [56], Oregon [51], and temperate portions of the United States [57]; the Oregon method was a predecessor to the PNW method). From these methods, as well as the scientific literature, 12 geomorphological, 14 hydrological, and 15 biological candidate indicators were identified. Based on the initial screening, as well as precedent for their use in the PNW or NM methods, a subset of 6 geomorphic, 7 hydrologic, and 13 biological indicators was selected for further evaluation, as described by McCune and Mazor [35]. We also identified five classes of geospatial indicators to explore alongside these field-measured indicators (Table 1).

Identification of Candidate Study Reaches
Our efforts yielded 725 candidate reaches, of which 13% were "preferred" (meaning that multiple sources of hydrologic data were available). Across the region, 48% of these preferred reaches were perennial, 36% were intermittent, and 16% were ephemeral. Data from USGS stream gages were available for 87% of the preferred perennial reaches, and 67% of intermittent reaches, but only 42% of ephemeral reaches. The remaining reaches were designated as "acceptable".

Data Collection
From the list of 725 candidate reaches, 100 were targeted for a sampling campaign that spanned 2018 and 2019 (Figure 2). We initially targeted preferred reaches and reaches with USGS gages, but we included acceptable reaches in order to achieve the desired number of ephemeral and intermittent reaches in each subregion, and to allow field crews to visit multiple reaches in a single day. Overall, 36% of ephemeral reaches had "acceptable" status, in contrast with 12% of perennial reaches and 23% of intermittent reaches.
Reach-visits with incomplete data, or where crews did not sample the intended location (e.g., tributaries close to the target coordinates, but where the true streamflow duration class could not be determined) were excluded from further analysis, yielding a final dataset of 89 reaches, 12 of which were visited on two occasions (Table 3).

Metric Screening
Ordination of all 155 biological, hydrological, and geomorphic metrics using data from the first visit of all 89 reaches showed that although ephemeral and perennial reaches were distinct from each other, intermittent stream reaches were highly variable and overlapped with the other classes (Figure 3). Ephemeral reaches were tightly clustered, indicating that they are relatively homogeneous with respect to flow duration indicator metrics compared to perennial and intermittent stream reaches. Several biological and hydrological metrics were strongly correlated (i.e., rho² > 0.5) with an ordination axis, but no geomorphological or geospatial metrics were. All of these strongly correlated metrics had negative relationships with the first ordination axis (i.e., the axis that separated perennial from ephemeral reaches), indicating that higher values of these metrics were largely indicators of longer flow durations, and no metrics had higher values at reaches with shorter flow durations. Metrics related to vertebrates were positively correlated with axis 2, but none with a rho² > 0.5 (correlation coefficients are provided in Supplementary File S3). That is, there were no metrics for which high values were associated with ephemeral reaches.
The within-metric distribution and responsiveness screens shown in Table 2, applied to data from the first visit of all 89 reaches, reduced the total number of candidate metrics from 155 to 38 biological metrics, 4 geomorphological metrics, and 5 hydrologic metrics, in addition to 53 geospatial metrics. Most metrics (95%) passed the % dominance criterion, in contrast to the more restrictive responsiveness criteria. Slightly more than half the metrics were able to discriminate among three streamflow duration classes (F > 2), and a similar number could distinguish between ephemeral and at least intermittent reaches (t > 2). However, only 23% could discriminate between ephemeral and dry intermittent reaches, and just 12% (mostly invertebrate metrics, and metrics related to Ephemeroptera, Plecoptera, and Trichoptera [EPT] taxa in particular) could discriminate between perennial and flowing intermittent reaches. Screening criteria for biological, hydrological, and geomorphological metrics that passed screens are shown in Table 4, and the full list of metrics is presented in Supplementary File S3.
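To make the t > 2 responsiveness screen concrete, the sketch below computes a two-sample t-statistic for one metric and compares it with the cutoff. The use of Welch's unequal-variance form and of the absolute value are assumptions, because the exact test is defined in Table 2 rather than in this passage. The function names and metric values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    return (mean(b) - mean(a)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def is_responsive(ephemeral, at_least_intermittent, cutoff=2.0):
    """True when the metric separates the two groups beyond the cutoff."""
    return abs(welch_t(ephemeral, at_least_intermittent)) > cutoff


# Hypothetical metric values at ephemeral and at least intermittent reaches
responsive = is_responsive([0, 0, 1, 0, 1], [5, 7, 6, 8, 9])
unresponsive = is_responsive([3, 4, 5, 4, 3], [4, 3, 5, 4, 4])
```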

Metric Selection
A total of 28 metrics were selected by recursive feature elimination for at least one of the 24 model iterations described in the Metric Selection section of the Methods above. Fifteen biological metrics, seven geospatial metrics, five hydrological metrics, and one geomorphological metric were selected at least one time (Figure 4). Some of the most frequently selected metrics include those related to invertebrate abundance, hydrophytic vegetation, relative abundance of EPT taxa, and algal abundance. Soil moisture (SoilMoist_MaxScore) and water in channel (waterinchannel_score) were selected every time they were eligible (i.e., in models that allowed direct measures of water presence).

Model Calibration and Performance Evaluation
Performance measures for most models were similar, regardless of whether stratification was applied, or if geospatial metrics or direct measures of water presence were included. Accuracy in discriminating among the three flow duration classes ranged from 0.52 to 0.79 (expressed as the proportion of reaches correctly classified), whereas accuracy in discriminating ephemeral from at least intermittent reaches ranged from 0.77 to 0.88. Repeatability ranged from a low of 0.33 (i.e., only a third of reaches had the same classification on two visits) to 0.78 (Table 5). In general, the single decision tree models showed a marked decline in some performance measures when independent validation data were evaluated (Table 5, Figure 5). There was little evidence of benefit from stratified approaches implemented at subregional scales, whether evaluated across the region (Table 5) or within individual subregions.

Selection of a Final Model
The RSC was presented with results for a range of options for several factors in model construction: regional stratification, the inclusion of geospatial metrics, the inclusion of metrics that directly measure the presence of water, and the choice of a model based on single decision trees versus random forest. The RSC initially expressed a preference for whatever options provided the best performance. However, in the absence of clear differences, they made recommendations based on several non-technical concerns. For example, they advised against a sub-regionally stratified approach. The primary disadvantage of stratified approaches is that they introduced needless complexity. Several RSC members expressed skepticism that subregional stratification would be successful given the amount of available data.
The RSC expressed concerns about using geospatial metrics as indicators in the SDAM because indicators derived from NHD flowlines (e.g., StreamCat data) may not represent unmapped headwaters or other reaches where the SDAM is likely to be used [42], and because there was little clear benefit to including geospatial metrics, in terms of improved accuracy. Additionally, the RSC advised against including hydrologic metrics based on direct measures of water presence. Although such metrics might provide valuable supporting information in an assessment, including them in the SDAM could introduce circularity (i.e., using hydrologic data to classify actual flow duration in method development and to classify flow duration in applying the method), and reduce acceptance of the tool by certain communities, as previously described [1].
The RSC initially preferred single tree models over random forest models because the former are relatively transparent and easy to use in the field. However, they ultimately recommended a random forest model. The individual trees created in our calibration steps were relatively simple, and while this made them easy to use, they led to a number of outcomes the RSC felt were indefensible and would reduce acceptance of the method. For example, one of the trees we produced identified ephemeral streams based on the absence of aquatic invertebrates even though no signs of aquatic invertebrates were observed at 45% of dry intermittent streams. Such outcomes were likely the result of our relatively small dataset, which could only support the calibration of relatively simplistic trees. Random forest models avoid this scenario by incorporating a large number of trees, each with its own unique subset of "in-bag" calibration reaches, which reduces overfitting. Thus, the RSC advised us to select a random forest model (contingent on ensuring such a complex method would remain accessible to practitioners), and to reconsider a single-tree approach after additional data collection.
Based on this feedback, we selected the random forest model that did not include geospatial metrics or direct measures of water presence. This model contained six metrics, all of which were biological (Figure 6). Two of the metrics were based on riparian vegetation: the number of hydrophytic plant species (hydrophytes_present_noflag in Figure 6) and stream shading cast by riparian vegetation (PctShading). Three metrics were related to aquatic invertebrates. Two of these measures were related to abundance (i.e., TotalAbundance, which is derived from the tally of individuals collected during sampling, and bmiabund_score, a qualitative assessment of abundance derived from the NM method), and one to taxonomic composition (i.e., Richness, family-level richness). The final metric was a qualitative measure of algal abundance following the NM method.

Refinement and Creation of a Final Beta Method
Subsequent to the model refinements described below, the RSC recommended adoption of the method, henceforth called the streamflow duration assessment method for the AW (SDAM AW), as a "beta" method to be evaluated in an interim period during which additional data collection could occur, and feedback from end-users in the region could be obtained.
The accuracy and repeatability of the refined method versus the unrefined model based on metrics selected by recursive feature elimination can be evaluated by comparing the highlighted rows in Table 5.

Refinement of Indicators:
We modified the selected metrics as described below. Our intent was to reduce the time and expertise required to measure indicators without sacrificing performance of the final method. In addition, we wanted to replace continuous metrics with ordinal categories that would be easier to measure and interpret. These two simplifications (i.e., reducing the overall number of metrics, and converting continuous metrics to ordinal metrics with few categories) would enable us to create an easy-to-use table that could provide the same function as a complex statistical model, but without the need for statistical software. Following these refinements, the performance of the final method was re-evaluated.
Riparian Vegetation: Two metrics related to riparian vegetation were included in the final model: the number of hydrophytic plant species reported in the reach, and streambed shading. We selected only the former because it was a more direct measure of the plant community, whereas measures of shading may capture sources unrelated to streamflow duration, such as canyon walls or nearby structures. Furthermore, the use of hydrophytic plants in the PNW method as well as in jurisdictional wetland delineation methods meant that its inclusion would likely have a greater degree of acceptance among end-users. We converted the continuous measurement to three categories (no hydrophytic species observed, one to two species observed, or three or more species observed) based on inspection of partial dependence plots ( Figure 7). Partial dependence plots illustrate how variation in a single predictor can affect the outcome of a model when other predictor values are held constant, and they can be used to identify important change-points in the relationship between an indicator and streamflow duration outcomes.
Aquatic Invertebrate Abundance: Similarly, two metrics related to aquatic invertebrate abundance were included in the final model: a qualitative assessment based on the effort required to observe aquatic invertebrates throughout the assessment reach, and a quantitative assessment based on a tally of collected, sorted, and identified organisms. We selected the latter because the subjective assessments were difficult to standardize and could be prone to differences in expertise among practitioners. We converted the continuous measurement to three categories (no aquatic invertebrates observed, one to nineteen individuals observed, or twenty or more individuals observed) based on visual inspection of partial dependence plots (Figure 7).
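The two conversions from counts to ordinal categories described above can be sketched as follows. The cut points come from the text. The function names and category labels are hypothetical.

```python
def hydrophyte_category(n_species):
    """Bin the number of hydrophytic plant species into three categories."""
    if n_species == 0:
        return "none"
    return "1-2" if n_species <= 2 else "3+"


def invertebrate_category(n_individuals):
    """Bin the aquatic invertebrate tally into three categories."""
    if n_individuals == 0:
        return "none"
    return "1-19" if n_individuals < 20 else "20+"


# A field crew can stop counting once the top category is reached.
examples = [hydrophyte_category(n) for n in (0, 2, 3)]
tallies = [invertebrate_category(n) for n in (0, 19, 20, 250)]
```

Because the top category is open-ended, a tally of 20 and a tally of 250 are equivalent for the purposes of the method.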

Aquatic Invertebrate Composition:
One metric in the final model was related to aquatic invertebrate composition: taxonomic richness at the family level. Although family-level identifications are used in other SDAMs (e.g., [27]), many likely practitioners lack the training to generate these data. We therefore substituted this metric with a simpler-to-measure metric: the presence or absence of Ephemeroptera, Plecoptera, or Trichoptera (EPT) taxa. First, this metric only requires order-level identifications. In addition, EPT-related metrics were selected in most other models (Figure 4). Thus, this metric was a relatively simple way to retain information about the taxonomic composition of the aquatic invertebrate assemblage that was still relevant to streamflow duration assessment.
Algal Abundance: One metric of algal abundance was included in the final model: a qualitative assessment based on the effort required to observe algae throughout the assessment reach. As with aquatic invertebrates, this qualitative assessment is difficult to standardize (i.e., it is based on the level of effort required to observe algae, rather than a quantitative estimate of algal cover). Therefore, we substituted it with a more quantitative metric based on the total streambed cover of live or dead algal mats (excluding dead mats that were clearly deposited from upstream sources). The metric was originally measured in 5 categories (i.e., not detected, <2% cover, 2 to 10% cover, 10 to 40% cover, and ≥40% cover).
We reduced this to a binary measure of presence/absence. As described below, algal cover ≥10% may be used as a single indicator, so practitioners may record this indicator in three categories (i.e., not detected, <10% cover, and ≥10% cover).
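Because each refined indicator takes only a few values, the full set of indicator combinations is small enough to enumerate, which is what makes a lookup table practical. The sketch below counts the combinations under the category levels described above, treating algae as a binary presence/absence measure. The variable names are hypothetical, and excluding combinations with EPT taxa but no invertebrates is an assumption. The layout of the published table may differ.

```python
from itertools import product

HYDROPHYTES = ("none", "1-2", "3+")
INVERTEBRATES = ("none", "1-19", "20+")
EPT_PRESENT = (False, True)
ALGAE_PRESENT = (False, True)

combos = [
    c
    for c in product(HYDROPHYTES, INVERTEBRATES, EPT_PRESENT, ALGAE_PRESENT)
    # Assumption: EPT taxa cannot be present when no invertebrates were found.
    if not (c[1] == "none" and c[2])
]
n_combos = len(combos)  # 3 * 3 * 2 * 2 = 36, minus 6 impossible combinations
```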

Increased Confidence Required for Classifications:
Although increasing the minimum number of votes from a random forest model required to make a classification improved overall accuracy, it did so at the expense of being able to make precise classifications (Figure 8). For example, when 90% of the votes were required to make a classification, no reaches were classified as intermittent, whereas 28% were classified as At least intermittent. At the same time, the number of reaches where classifications could not be determined (i.e., Need more information classifications) increased to 17%. Based on these factors, the RSC recommended a minimum threshold of 50% of votes required to make a classification.

Addition of Single Indicators:
Most of the single indicators that we considered had no impact on the accuracy of the method. For example, fish were only detected at reaches that were already classified as Intermittent, Perennial, or At least intermittent.
Amphibians corrected a misclassification at one reach, but introduced a misclassification at another reach. The presence of ≥10% algal cover corrected three misclassifications without introducing additional misclassifications.
Based on these results, the RSC recommended inclusion of two single indicators: fish, because of their broad acceptance by the public as an important resource provided by perennial and intermittent streams, and algal cover ≥10%, because it improved accuracy. Amphibians were not recommended because they decreased the method's accuracy, although the RSC endorsed the reporting of amphibians as supplemental information when conducting assessments.

Evaluation of the Final Beta SDAM and Comparison with Other SDAMs Used in Portions of the AW
Because the method was based on a small number of indicators with few categories, it was possible to generate a simple table that crosswalks all possible combinations of indicator values to their outcomes (Table 6). Thus, practitioners would be able to obtain a classification without the use of statistical software to run the underlying random forest model.
Among the 89 reaches in the development dataset, the Beta SDAM AW had better success at classifying ephemeral reaches (80% correct, and 81% if revisits are counted) and perennial reaches (84% and 76%) than intermittent reaches (50% and 54%; Table 7). Two ephemeral reaches sampled under flowing conditions were both classified as Need more information, as were three of the ephemeral reaches sampled under dry conditions. Surprisingly, two dry reaches were classified as perennial. One reach was thought to be truly intermittent based on gage data (Sabino Canyon, USGS gage 0948400), but it had large permanent pools that likely contributed to the high levels of indicators observed. The other reach (a reach on Cottonwood Creek in Arizona) was thought to be truly ephemeral, and although it lacked water at the time of sampling, several hydrophytes and caddis cases were observed, as well as algal mats, damp soil, and adult amphibians. The original determination was based on a single year of data from a wildlife camera, and thus may have underestimated the extent of flows at this reach. Intermittent reaches were more likely to be correctly classified when they were flowing (64%) than when they were dry (38%). Among the 17 reaches classified as At least intermittent, about half were perennial and half were intermittent. No perennial reaches received a classification of Need more information. There was no apparent relationship between error rates and whether the original classification of a reach was "acceptable" or "preferred".
The SDAM AW's performance was comparable to that of the PNW and NM methods (Table 5, Figure 5, Supplementary File S4), and its classifications agreed with theirs more than two-thirds of the time (Table 8). While the overall rates of agreement between the beta SDAM AW and the other methods were nearly identical, there were large differences in how intermittent or perennial reaches were classified. For example, for reaches classified by the SDAM AW as perennial, the NM method agreed 91% of the time, whereas the PNW method agreed only 55% of the time; in contrast, the NM method agreed with the beta SDAM AW's classifications of intermittent only 30% of the time, whereas the PNW method agreed 65% of the time. For reaches classified by the beta SDAM AW as Need more information, the other methods were both more likely to classify these reaches as intermittent. Based on this comparability, the RSC determined that programs that had been relying on the NM or PNW method could switch to the beta SDAM AW with little consequence for the majority of reaches they are likely to encounter.
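Agreement rates of this kind are conditional on the class assigned by one method. As a sketch, they can be computed as below. The function name and the example classifications are hypothetical and do not reproduce the values in Table 8.

```python
from collections import defaultdict


def agreement_by_class(reference, other):
    """For each class assigned by `reference`, the share of those reaches
    that `other` placed in the same class."""
    totals = defaultdict(int)
    agree = defaultdict(int)
    for ref, oth in zip(reference, other):
        totals[ref] += 1
        agree[ref] += ref == oth
    return {cls: agree[cls] / totals[cls] for cls in totals}


# Hypothetical classifications of the same eight reaches by two methods
sdam_aw = ["P", "P", "P", "P", "I", "I", "I", "E"]
nm_method = ["P", "P", "P", "I", "P", "I", "P", "E"]
rates = agreement_by_class(sdam_aw, nm_method)
```

Two methods can agree at similar overall rates while differing sharply within individual classes, which is the pattern reported above for intermittent and perennial reaches.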

Application to Two Focus-Area Studies
Results from the two focus-area studies are presented in Supplementary File S5.

The Beta SDAM AW Can Support a Range of Management and Monitoring Needs
Following the process described in [1], we were able to develop an effective SDAM that may be used to classify reaches in the AW where streamflow duration information is needed. The accuracy with which the SDAM AW distinguishes ephemeral reaches from at least intermittent reaches (81%) is lower than the accuracy reported for other methods (e.g., 94% in PNW [25], 96% in NM [34]) but should still be sufficient for many applications. The finding that the beta SDAM had greater success classifying ephemeral and perennial reaches than intermittent reaches corroborates previous studies evaluating indicators and assessment methods of streamflow class [16,25,59]. The RSC wanted an unbiased method that did not consistently over- or under-estimate streamflow duration, but they prioritized the ability to discriminate between ephemeral and at least intermittent reaches over the ability to discriminate between perennial and non-perennial reaches. The performance of the beta SDAM AW indicates that it is sufficient for use in research and management applications where streamflow duration information is needed during the beta testing period.
Although methods were already available for portions of the AW [27,34], the development of this method on a large geographic scale greatly reduces uncertainty about assessing streamflow duration, particularly in areas outside the intended scope of existing methods. Nonetheless, the relatively small dataset (e.g., 89 reaches vs. 264 reaches in the PNW method), combined with the lower than desired accuracy in distinguishing perennial from intermittent streams suggests that improvements may be possible with additional data collection. Thus, the beta method we present here will be used during an interim period while data collection continues. In addition, this testing period creates an opportunity to solicit additional feedback from users, which may further improve the performance, ease of use, and acceptance of a final method.
The SDAM AW has already been adopted into a few management and monitoring programs. For example, the Arizona Department of Environmental Quality has begun collecting data with the full development protocol in order to determine where to apply ephemeral, intermittent, and perennial aquatic life uses in the Hassayampa River (see focus-area studies in Supplementary File S5). Outside of regulatory contexts, the Stormwater Monitoring Coalition of Southern California is using the method to map streams and identify reaches where future bioassessment may be warranted [60].

Strengths and Limitations of the Beta SDAM AW
The beta SDAM AW combines the strengths of both the PNW and NM methods, while minimizing their weaknesses. Like the PNW method, the SDAM AW relies on a small set of objectively measured indicators, each linked with streamflow duration. We were able to avoid including the subjectively assessed indicators of the NM method, and the SDAM AW does not include any geomorphological indicators that reflect aspects of the hydrologic regime unrelated to duration (such as magnitude or stream power). However, like the NM method, the SDAM AW can withstand a degree of error and sampling variability without greatly affecting the likelihood of obtaining a correct classification. This robustness is due to the multiple pathways through which one can arrive at the same classification, despite the low number of required indicators (Table 6). In contrast, the PNW method's decision tree has relatively few pathways to any one classification, and it is highly vulnerable to errors where indicators are missed or misidentified.
The beta SDAM AW method has room for improvement. The relatively poor ability to distinguish perennial from intermittent streams may arise because the indicators themselves do not differ strongly between the two classes of streams, at least at the level of effort we used to measure them. For example, although many studies show that aquatic invertebrate communities have significantly different composition in intermittent and perennial streams (e.g., [19,61,62,63]), these differences are less apparent when assessed at a coarser (e.g., family-level) taxonomic resolution (e.g., [64,65]). It is likely that many reaches in the AW experience drying with some frequency, even if they are apparently perennial based on available periods of record from stream gages [3]; thus, the biota of both perennial and intermittent streams may be relatively similar, as both stream types would demand life histories that are adapted to dry conditions. Although additional data collection may improve the ability to distinguish between these stream types, it is possible that, at least in the AW, the similarity of these streams with respect to ecological measures and other indicators may limit the potential for improvement.
Although the beta SDAM appears to work in the majority of settings we evaluated, a few may present challenges. For example, streams managed as flood control channels may undergo frequent maintenance to remove some or all vegetation in the assessment area. Although some biological indicators recover quickly from these disturbances, the results from assessments conducted shortly after such disturbances may be misleading. Poor water quality in streams may also affect biological indicators, notably the presence of EPT taxa. Indeed, several studies have documented the absence of these sensitive taxa in effluent-dominated rivers in the Southwest (e.g., [66][67][68]). However, upgrades to water treatment plants can lead to a recovery of mayfly taxa [69]. Consequently, the SDAM AW may fail to identify perennial systems as Perennial in situations where water quality has been severely degraded by wastewater or other types of stress such that EPT taxa are eliminated. The SDAM AW includes other biological indicators that are less affected by poor water quality, and therefore it will typically classify such streams as At least intermittent.
We need to provide better guidance on resolving ambiguous Need more information classifications. Although this classification was uncommon in our development data, it may be more common in real-world applications, which are likely to focus on borderline ephemeral/intermittent reaches, where ambiguous outcomes could result in contentious management decisions. The focus-area study in Arizona confirmed that Need more information was more common in real-world applications than expected, based on results from our development data (Supplementary File S5). To support managers needing to resolve these ambiguous classifications, we include a number of resources and supplemental information in the user manual to help them make a classification [70], such as evaluating additional information gathered during the assessment (such as the presence of aquatic vertebrates or aquatic invertebrate families that prefer long-duration flows), or gathered through desktop analysis (such as reviewing databases of historic aerial imagery, or the USACE's Antecedent Precipitation Tool [71]). However, more structured guidance would be helpful.

Indicators Used in the Beta SDAM AW Have a Strong Conceptual Link to Streamflow Duration
Biological indicators appear to be particularly well suited for streamflow duration assessment. The SDAM AW consists exclusively of biological indicators, and no other indicator types were selected in our data-driven metric selection process. Biological indicators are widely used in ecological assessments because of their ability to integrate and reflect long-term variability in conditions due to their diverse life histories [72], and it is this same quality that makes them excellent tools to measure streamflow duration and other hydrological impacts [1].
Indicators can represent responses or controls of streamflow duration [1]. Although a number of geomorphological and geospatial "control" metrics did pass the distribution and responsiveness criteria (Table 2) and were identified as candidate indicators, only biological response indicators were selected by statistical analysis for inclusion in the beta AW method. In contrast, a number of stream classification models that were developed for parts of the AW [4] or that include the AW [5,7,8,31] use geospatial predictors that characterize climatic, physiographic, geologic, and land cover controls at scales larger than an assessment reach. Measures collected remotely and/or calculated from spatially and temporally coarse datasets may not be able to capture variation at smaller scales that can be captured by reach-scale field measurements [73]. However, it is likely that physical controls of streamflow duration within stream reaches are best represented by a combination of watershed and local reach-scale factors that describe the dynamic balance between flow sources and losses from infiltration and evapotranspiration [10].
Many of our study reaches did not coincide spatially with the pour point of subcatchments. Unlike previous studies [4,5,8] that primarily used reaches with stream gages, which are often positioned near the base of a subcatchment, our study design sought to have reaches evenly distributed across ephemeral, intermittent, and perennial classes. This design meant that many of our study reaches were distant from the pour point of catchments recognized by geospatial datasets, like StreamCat [37], and the associated subcatchment data may not reflect the actual upstream characteristics. This may, in part, explain why relatively few geospatial metrics were identified as candidate indicators in our study relative to previous studies. As the spatial and temporal resolution and continuity of geospatial datasets improve through improvements in remote sensing technology and user platforms, we expect that those datasets will become more frequently utilized in streamflow duration classifications.
Although hydrologic indicators were among the 155 candidate metrics in our study (Table 1), most of them were ultimately excluded from consideration because those data were used to confirm the direct flow duration classification. For example, if a stream thought to be ephemeral had flowing water during sampling, we conducted additional investigations (such as contacting local experts) to determine if a reclassification was warranted. Direct hydrologic measures or metrics have been identified as important predictors in other AW classification models [4,5,8]. The degree of circularity in using hydrologic metrics will depend upon whether, and to what extent, those measures were used in the original streamflow duration classification.

Hydrophytic Plants
Our finding that hydrophytic plants are a good indicator of streamflow duration is well supported by several studies in the AW (e.g., [25,74–76]). Caskey and others showed a decrease in wetland plant occurrence after diversion of perennial flow along stream reaches in the Routt National Forest, CO [76]. Reynolds and Shafroth noted a number of plant species indicative of perennial versus intermittent flow regimes in high- and low-elevation streams in the Colorado Basin [77]. Although that study did not identify ephemeral streams, the authors report that the driest streams in their study were dominated by upland plants, such as sagebrush and juniper (Lindsay Reynolds, personal communication). Thus, the taxonomic composition of riparian and wetland plants appears to be a well-supported indicator of flow duration.
An advantage of riparian plants over other biological indicators of flow duration is that they are non-motile organisms, some of which have very long lifespans (i.e., decades). Therefore, they are well suited to reflect local, long-term conditions in a way that fish or invertebrates cannot. Another factor that sets them apart from other biological indicators is that they function both as responses to streamflow duration gradients, and, through their effect on evapotranspiration rates, a driver as well (e.g., [78,79]).

Aquatic Invertebrate Abundance and Composition
Numerous studies have demonstrated strong relationships between flow duration and the composition of aquatic invertebrates (e.g., [15,19,22,64,65,80]), although only a few have reported effects on abundance [25,81]. Within the AW, Bogan and others report comparable levels of abundance in perennial and intermittent headwater reaches, both of which were considerably lower than downstream perennial river reaches [62]. Despite this relatively limited evidence about the efficacy of invertebrate abundance, it is a widely used indicator in other SDAMs [35], including both the PNW and NM methods.
In contrast to abundance differences, compositional differences in aquatic invertebrate assemblages of perennial and intermittent streams are well documented in the literature. Most studies report higher richness or diversity at reaches with longer duration, particularly among EPT taxa [82,83]. Several studies identify individual taxa as indicators of perennial flow (e.g., [82,84–86]), while a few identify taxa that prefer intermittent flow (e.g., [87,88]). In general, intermittent reaches within arid regions appear to support a subset of the taxa found at perennial reaches in a region, rather than a distinct set of taxa [81], although a few intermittent specialists are known (e.g., the stonefly Mesocapnia arizonensis [89], the fishfly Neohermes filicornis [90], and several genera of Chironomidae [91]). However, our reliance on family-level identifications likely limited the influence of these taxa on our method.
Studies on the effects of flow duration on aquatic invertebrates almost exclusively focus on intermittent and perennial reaches, presumably because of the difficulty of collecting aquatic taxa from ephemeral reaches. However, one study collected diverse macroinvertebrates from ephemeral streams sampled shortly after the onset of flow [92]. Clarke and others collected macroinvertebrates from debris jams in an ephemeral stream, which retained moisture throughout the dry season [81]. In our study, no dry streams were noted as having woody jams that harbored moisture or sustained aquatic invertebrates, suggesting that the microhabitat studied by Clarke and others ([81]) may not be prevalent in the AW. Thus, it appears that aquatic invertebrates are only rarely observed in ephemeral streams sampled under typical hydrologic conditions.

Algal Indicators
Although algae may begin to grow immediately following inundation, extensive growth of mats and other forms that are easily visible typically requires multiple weeks of flow [93][94][95][96]. We did not observe evidence of algal growth in any ephemeral stream in our study, but the rapid growth rate of algae (particularly in unshaded reaches) may lead to overestimates of flow duration from this indicator if ephemeral reaches are sampled shortly after flow events. However, Robson and others note that recolonization is slowest in hydrologically isolated streams that lack perennial refugia, which likely limits the opportunity for growth of algal mats in most ephemeral streams [94].
The persistence of dead or desiccated algal mats contributed to the beta SDAM's ability to discriminate between dry ephemeral and dry intermittent stream reaches. Although many algal species are resistant to desiccation, the breakdown of algal cells begins within a few hours of exposure to air [96][97][98]. The persistence of particulate organic matter in dry streams has been well documented (e.g., [99][100][101]), although most studies focus on leaf litter rather than algal mats. Robson noted that dry algal biofilms are often visible in rocky intermittent streams in Victoria, Australia, and it is likely they are conspicuous features of intermittent streams in most arid regions of the world [102]. Although the breakdown of dried algal mats may be accelerated by terrestrial scavengers (e.g., tetrigid grasshoppers [103]), our own observations suggest that in arid climates, visible evidence of algal growth often persists throughout the dry season.

Lessons Learned about SDAM Development
This study demonstrates that the steps outlined by Fritz and others [1] support the successful development of an SDAM. Our experience in developing a beta SDAM for the AW reinforces the importance of several steps highlighted by Fritz and co-authors and brings a few new ones to light.

Engage End-Users throughout the Development Process
Engagement of representatives of diverse groups in the development of environmental assessment tools facilitates acceptance of final products, particularly if tools will be used for regulatory purposes [104], and is a key principle of open science for applied environmental research [105,106]. The RSC played a key role in guiding method development, reviewing interim products, and vetting major decisions. They provided technical insights and local knowledge, and connected us with valuable resources informing the conceptual framework of building an SDAM (indicators, study reaches, and hydrological data).
While the RSC was limited to staff from federal regulatory agencies, members provided the perspective of both end-users and implementers to the development of the beta method. An important driver behind release of a beta method prior to a final method in the process steps described by Fritz and others [1] (Figure 1) is that it provides a meaningful feedback opportunity for affected sectors and the wider end-user community. As end-users, the RSC helped assure that the beta method is relevant, rapid, user friendly, and likely to be accepted by the user community. As representatives of the implementing agencies, the RSC also is playing an important role, post-beta release, in engaging diverse groups and end-users through outreach and training during the beta period, assuring that a wide range of input is considered in creating the final method. The RSC is poised to play a similar role in the production and implementation of the final method and forms the core of a trained regional practitioner network.
The focus-area studies proffered an additional feedback opportunity through engagement with study partners: a state regulatory agency and a private consultant working on behalf of regional stormwater agencies. We received valuable insights regarding on-the-ground usability of method components (e.g., the user manual, field forms, and data management systems), level of effort for application, and consistency of results. In describing challenges and needs, these partners contributed to many improvements during the development of the beta method and helped us consider how to produce a tool that is useful for multiple management needs.

Statistical Complexity Does Not Need to Create a Barrier for End-Users
We used random forest statistical models because of their ability to handle complex, non-linear relationships that are common to ecological and hydrological data [107,108]. These characteristics make random forest and related machine learning methods increasingly popular in environmental research, yet their complexity and dependence on specialized software create challenges for communication, as well as for adoption into programs where the typical end-users have little familiarity with machine learning methods [109]. However, by converting the indicators into simple categorical metrics, we were able to create a table to crosswalk metric values to outcomes of the random forest model (Table 6). Thus, no statistical expertise is required to make use of the complex statistics underpinning the beta SDAM.
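The crosswalk idea can be illustrated with a minimal Python sketch. It assumes that every metric takes a small number of categorical levels, so that all combinations can be enumerated and the model outcome for each recorded once in a lookup table. The metric names, their levels, and the vote shares below are hypothetical placeholders rather than the values in Table 6, and `toy_vote_shares` merely stands in for a fitted random forest. The 0.5 minimum vote proportion is the value selected by the RSC.

```python
from itertools import product

# Hypothetical categorical metrics and coded levels (illustrative only;
# these are NOT the metrics or levels of the beta SDAM AW).
METRIC_LEVELS = {
    "hydrophyte_species": (0, 1, 2),
    "ept_present": (0, 1),
    "algal_cover": (0, 1),
}

# Hypothetical vote shares keyed by the sum of the coded metric values.
SHARES_BY_SCORE = {
    0: {"Ephemeral": 0.90, "Intermittent": 0.10, "Perennial": 0.00},
    1: {"Ephemeral": 0.45, "Intermittent": 0.45, "Perennial": 0.10},
    2: {"Ephemeral": 0.10, "Intermittent": 0.60, "Perennial": 0.30},
    3: {"Ephemeral": 0.00, "Intermittent": 0.45, "Perennial": 0.55},
    4: {"Ephemeral": 0.00, "Intermittent": 0.20, "Perennial": 0.80},
}


def toy_vote_shares(values):
    """Stand-in for a fitted random forest: return class vote proportions."""
    return SHARES_BY_SCORE[sum(values.values())]


def classify(shares, min_share=0.5):
    """Return the top-voted class, or an ambiguous outcome below min_share."""
    label, share = max(shares.items(), key=lambda item: item[1])
    return label if share >= min_share else "Need more information"


def build_crosswalk(levels, model, min_share=0.5):
    """Enumerate every combination of metric levels and store the outcome."""
    names = list(levels)
    table = {}
    for combo in product(*(levels[name] for name in names)):
        shares = model(dict(zip(names, combo)))
        table[combo] = classify(shares, min_share)
    return names, table


if __name__ == "__main__":
    names, table = build_crosswalk(METRIC_LEVELS, toy_vote_shares)
    print(names)
    for combo, outcome in sorted(table.items()):
        print(combo, outcome)
```

Once such a table is printed, a practitioner only needs to look up the row matching the field observations; the model itself is never run in the field.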

Poor Documentation of Ephemeral Streams Creates Major but Surmountable Challenges
Although ephemeral reaches are recognized to be widespread globally [31], very few are documented with sufficient rigor to support their use in SDAM calibration. Careful interpretation and verification are necessary when using the flow classifications reported in studies for SDAM development. Terminology and definitions used in reports or studies vary considerably [9], and a stream described as ephemeral in one study may be more appropriately described as intermittent using the definitions above. Studies may differ in how snowmelt affects classification, or how many flowing days distinguish ephemeral from intermittent streams. For example, Jaeger and Olden identified ephemeral reaches as those with <5% flowing days [40], while the threshold used by Hedman and Osterkamp was twice as high [110]. Some hydrology studies may identify reaches as ephemeral based on rigorously documented ephemeral flows (e.g., [111]), yet longer-duration flows that occur outside the study period may be undetected or unreported.
Due to the inconsistent terminology applied to non-perennial streams in the literature, evaluating hydrologic data may be the most reliable way to identify ephemeral reaches. When studies provide access to the underlying hydrologic data, these data may be re-examined and classified according to a potentially different but within-study consistent set of rules. However, long-term hydrologic data are rarely collected from ephemeral reaches, and the availability of reliable information may vary widely across the country based on state and local monitoring programs, regulatory treatment of ephemeral streams, and the density and perceived importance of ephemeral streams. In the absence of reliable data, SDAMs can use tools like baseline monitoring via loggers, as well as observations from local experts (see next section) to classify streams as ephemeral [1,9]. However, baseline monitoring may require long-term support in order to generate reliable data that can be used to identify ephemeral reaches with high confidence in light of year-to-year variability.
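To illustrate how re-examining hydrologic data under a single, consistent rule can change a reported classification, the following Python sketch classifies a daily flow record by its proportion of flowing days. It is only a sketch under stated assumptions: the function names are hypothetical, the 5% threshold follows Jaeger and Olden [40], the 10% alternative reflects a threshold twice as high, and treating a reach as perennial only when flow was recorded on every observed day is a simplifying assumption rather than a definition used in this study.

```python
def fraction_flowing(daily_flow_flags):
    """Proportion of observed days with flow; None marks a gap in the record."""
    observed = [flag for flag in daily_flow_flags if flag is not None]
    if not observed:
        raise ValueError("record contains no observed days")
    return sum(1 for flag in observed if flag) / len(observed)


def classify_reach(daily_flow_flags, ephemeral_max=0.05, perennial_min=1.0):
    """Classify a reach from daily flow/no-flow flags.

    ephemeral_max: reaches with a smaller proportion of flowing days are
        called ephemeral (0.05 follows the <5% rule cited in the text).
    perennial_min: assumed cutoff for perennial status (illustrative only).
    """
    frac = fraction_flowing(daily_flow_flags)
    if frac < ephemeral_max:
        return "ephemeral"
    if frac >= perennial_min:
        return "perennial"
    return "intermittent"


if __name__ == "__main__":
    # Hypothetical one-year logger record: 26 flowing days out of 365.
    record = [True] * 26 + [False] * 339
    print(classify_reach(record, ephemeral_max=0.05))
    print(classify_reach(record, ephemeral_max=0.10))
```

In this hypothetical record, about 7% of days have flow, so the same reach is intermittent under the 5% rule and ephemeral under the 10% rule, which is why reported classes need to be re-derived from the underlying data wherever possible.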

Make the Best Use of Local Expertise
Local expertise allowed us to fill in crucial data gaps (e.g., finding ephemeral streams in regions where they were poorly documented). Local experts came from a wide range of backgrounds, such as hydrologists, engineers, monitoring specialists, professional and non-professional (citizen) scientists, park rangers, and university researchers. Without their assistance, it may have been impossible to identify enough ephemeral reaches for method calibration.
Given the importance of local expertise, we need better ways to interpret, standardize, and make use of this data source. Our quality assurance process characterized how well local experts knew a study reach based on years of experience and recency of visitation (Supplementary File S5), but this process may not have been sufficient to gauge how well expert definitions of streamflow duration classes matched our study requirements. Few experts reported more than 5 years' experience with a reach, which may be insufficient to assess the frequency of flow events that might sustain hydrophytes in what might otherwise appear to be an ephemeral system. Given the long-term perspective required to understand streamflow duration, traditional ecological and hydrologic knowledge may be especially important in verifying streamflow duration classes (e.g., [112][113][114][115][116]).

Recognize the True Complexity of Streamflow Duration Gradients
To calibrate the SDAM, study reaches were classified into one of three categories. Streams exist along a continuum of flow duration, and the length of flow events is just one dimension of this variability in addition to seasonality, timing, predictability, and frequency [3,9,13]. Thus, each class of stream reaches encompasses a great deal of hydrologic variability, which certainly impacted our efforts to develop an SDAM. In this study, we made no effort to focus on streams that characterize "extreme" or ideal representations of their class, nor did we try to exclude borderline cases (such as reaches that exhibit intermittent flow in only wet years).
It remains an open question whether our strategy is more effective for SDAM development (because the calibration data represents the full range of hydrologic conditions found in the AW), or if we would have had greater success by focusing on non-borderline reaches. This question may be investigated with larger datasets generated in the future.

Investigate the Persistence of Indicator Expression at Reaches That Have Undergone Changes in Streamflow Duration
In general, reaches that have recently experienced a long-term change in flow duration or are in transition from one flow duration class to another should not be used for development of SDAMs. However, evaluation of such reaches could provide insight into timelines of how indicators respond to these changes. This research would allow managers and researchers to understand whether a streamflow duration assessment is providing information about present-day or historic conditions. Such questions may arise when making assessments on reaches that have undergone changes due to either natural or anthropogenic causes. Several studies have documented the slow decline of long-lived hydrophytic trees following diversions or groundwater extraction, whereas studies of other indicators (e.g., short-lived plants, invertebrates) tend to show a more rapid response (e.g., [76,117,118]). Therefore, it may be possible to identify indicators that can detect reaches that have transitioned from one flow duration class to another.

Address Challenges Created by the Dependence on Taxonomic Expertise
Feedback from both the RSC and the practitioners of the watershed studies indicated that field-based family-level identification of aquatic invertebrates would be a challenge for widespread use of an SDAM. This was true even for practitioners who had extensive backgrounds in bioassessment and experience identifying aquatic invertebrates in lab settings. We addressed this challenge by requiring only identification of three insect orders (Ephemeroptera, Plecoptera, and Trichoptera), which all practitioners agreed could be reasonably achieved through brief trainings. However, order-level identifications may not have the same ability to discriminate between intermittent and perennial reaches as genus- or species-level identifications. It is likely that better taxonomic resolution is required to detect the influence of flow duration on aquatic invertebrate assemblages [65,85,88]. These challenges may be overcome either by requiring higher taxonomic resolution, perhaps in a tiered approach following the NM method [34], or by harnessing new technology for taxonomic data generation, such as DNA barcoding (e.g., [119,120]), or automated image recognition (e.g., [121,122]). Taxonomic expertise is also required for the identification of hydrophytes, although this barrier is less substantial than for aquatic invertebrates among practitioners typically involved in wetland delineation. Notwithstanding the considerable diversity of riparian plants in the AW [123][124][125], only a small number of species accounted for the majority of hydrophyte observations in this study. For example, at least one of four willow species (Salix exigua, S. gooddingii, S. laevigata, and S. lasiolepis) was observed at 63% of intermittent or perennial reaches. Therefore, new practitioners need only develop expertise in identifying a handful of species to use the method.

Get More and Better Hydrologic Data
Development of the beta SDAM AW highlighted the need for increased spatial and temporal resolution of hydrologic data collection in the AW at non-perennial stream reaches, as anticipated by Fritz and others [1]. An analysis based on the National Hydrography Dataset at the 1:100,000 scale showed that non-perennial streams comprise approximately 59% of the total stream length in the U.S., excluding Alaska; however, intermittent and ephemeral streams are more highly concentrated in the western U.S. (e.g., 94% of Arizona's stream length was found to be non-perennial [21]). Despite a high proportion of non-perennial streams in our study region, identifying documentation of non-perennial stream classifications was a primary challenge for developing the beta SDAM. We expect to experience the same challenge in developing SDAMs for other regions of the country. We anticipate that the rising interest in mapping and modeling non-perennial streams coupled with the availability of low-cost data loggers [41,126] may increase the availability of continuous hydrologic data for characterizing ephemeral streams to support SDAM development. The AW would also be a strong candidate region to explore and test remote sensing options (e.g., daily satellite imagery) to fill in hydrologic data gaps at certain reaches, such as at large rivers in the desert with low canopy cover, potentially in conjunction with hydrologic models [127]. As described by Fritz and others, future efforts should focus not just on increasing spatial coverage of flow records but also on increasing the temporal coverage of flow records for intermittent and ephemeral streams [1]. Long-term records are required to ensure that the flow classification is not influenced by a year or two of atypical data (e.g., due to droughts or floods) and to identify changes in hydrologic regime due to anthropogenic influences (e.g., water withdrawals) or climate change. Recent advances in time-series statistical analyses have enhanced our ability to characterize streamflow duration, even when gage or logger records are interrupted by gaps created by instrument malfunction or ice jams [128,129].

Identify Positive Indicators of Ephemeral Streamflow Duration
All the indicators in the beta SDAM AW were positively correlated with streamflow duration. That is, they were present with higher frequency, abundance, or diversity at perennial or intermittent streams than at ephemeral streams. Thus, ephemeral status is inferred from lower values or the absence of these indicators. We recognize that indicators whose presence can be interpreted as positive evidence of ephemeral status could increase confidence in these classifications. However, our review of the literature and discussions with experts identified few such indicators that were practical to explore in SDAM development [35]. One potential indicator from the NM method (prevalence of upland plants in the streambed) was able to discriminate between perennial and non-perennial reaches, but not between ephemeral and intermittent reaches (Table 5), suggesting it may not be appropriate to interpret it as a positive indicator of ephemerality. It was not selected for inclusion in the model, presumably because other metrics provided greater discriminatory power. Positive indicators of ephemerality may lead to greater confidence in the acceptance of ephemeral classifications from an SDAM. Some studies suggest that terrestrial arthropods and nonhydrophytic plants may serve as useful indicators [77,130,131], but further exploration is needed to incorporate them into SDAMs.

Conclusions
This study illustrates the successful implementation of an approach to develop SDAMs described by Fritz and others [1] in the Arid West of the United States. We found that biological indicators were particularly useful because of their ability to reflect long-term patterns in streamflow duration exhibited at a site. Our final SDAM was more successful in distinguishing ephemeral from intermittent reaches than perennial from intermittent, consistent with SDAMs developed in other regions. The most substantial limitation was the scarcity of non-perennial reaches with sufficiently documented flow regimes that could be used to calibrate an SDAM. Thanks to the oversight of a regional steering committee composed of technical experts who require streamflow duration information in their management and monitoring programs, we were able to ensure that this SDAM would be accessible and easy to use by its target audience, despite the statistical complexity underlying its classifications. All data collected for this study and code used for analysis are available in Supplementary File S1.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Mazor et al. Page 39 Water (Basel). Author manuscript; available in PMC 2022 November 22.

Figure 1.
Operational framework for SDAM development. Small black arrows indicate stepwise actions within a process step, although some actions occur simultaneously and may be repeated throughout the project (e.g., outreach to local experts). The gray arrows denote that implementation actions are iterative, ideally supporting public release of an interim beta SDAM and then a final SDAM. Reproduced from Fritz and others [1].

Figure 2.
Study reaches in the AW. Colors indicate the 5 strata used for ensuring geographic representativeness of the dataset (blue: California; orange: Nevada; red: Arizona; purple: New Mexico and Texas; and green: Utah, Colorado, Wyoming, and a small portion of Montana). The gray area indicates arid portions of the Pacific Northwest, which are covered by the method of Nadeau [27] and were excluded from the present study. Black dots indicate sampling reaches.

Figure 3.
Summary of metric selection. Metrics are described in Table 4 and Supplementary File S3. Numbers in parentheses indicate the number of times the metric was selected, and the number of times the metric was eligible for selection. Strat: Stratified approaches. Unstrat: Regional (unstratified) approaches. GIS: Approaches that considered geospatial metrics. H2O: Approaches that considered direct measures of water presence. Each cell indicates if a metric was selected by recursive feature elimination applied to fitting random forest models. Black cells indicate that the metric was selected (for at least one subregional model in stratified approaches). Gray cells indicate that the metric was not selected. White cells indicate that the metric was ineligible for selection (either geospatial metrics or direct measures of water presence). Only metrics that were selected at least once are shown.

Figure 4.
Variable importance (calculated as the % mean decrease in accuracy when that variable is removed) of the selected model, an unstratified random forest model based on six biological metrics. Metric descriptions are provided in Table 4.

Figure 5.
Percent of classified reaches versus the minimum proportion of votes from a random forest model required to make a classification. The dashed line represents the minimum proportion selected by the RSC (i.e., 0.5). This figure was created using data from both calibration and validation reaches.

Table 1.
Indicators evaluated in the present study. Indicators with "NM" in the Origin column were measured following the NM method protocol [34], and indicators marked with "PNW" were measured following the PNW protocol [27]; other indicators were measured with protocols developed for this study, which are provided in Supplementary File S1, and come from sources reviewed in a study by McCune and Mazor [35]. Asterisks (*) indicate hydrologic indicators that are considered direct measures of water presence.

Indicator | Description | Origin
Sediment deposition on plants and debris | Visual estimate of the extent of evidence of sediment deposition on plants and on debris within the floodplain | NM

Hydrologic indicators
Surface and subsurface flow * | Estimates of the percent of the reach length with surface and subsurface flow | PNW
Isolated pools * | Number of pools in the channel without any connection to flowing surface water | PNW
Water in channel * | Visual estimate of the extent of surface flow in the channel | NM
Seeps and springs * | Presence/absence of springs or seeps within one-half channel width of the channel | NM
Hydric soils | Presence/absence of hydric soils within the channel, measured at up to 3 locations | NM
Soil moisture and texture * | Extent of soil saturation and texture measured at three locations in the channel |
Woody jams | Number of woody jams within the channel |

Biological indicators
Live and dead algal cover | Visual estimate of the percent of streambed covered by live or dead algal growth |
Filamentous algal abundance | Estimate of the overall abundance of filamentous algae within the channel | NM
Stream shading | Percent shade-providing cover above the streambed measured with a densiometer at three locations |
Hydrophytic plant species | Number of OBL or FACW-rated plants (as listed in [36]) growing within the channel or a half-channel width from the channel | PNW
Fish | Estimate of the overall abundance of fish (other than non-native mosquitofish) in the channel | NM
Aquatic invertebrates | Abundance and richness of aquatic invertebrate families collected from the channel | PNW
Aquatic invertebrates | Estimate of the overall abundance of aquatic invertebrates within the channel | NM
Amphibians | Estimate of the overall abundance of amphibians within the channel | NM
Mosses and liverworts | Visual estimate of the percent of streambed and banks covered by live or dead bryophytes or liverworts |
Differences in vegetation (riparian corridor) | Visual estimate of the distinctiveness of vegetation in the riparian corridor compared to surrounding upland vegetation | NM
Absence of upland rooted plants in the streambed | Visual estimate of the extent of upland rooted plants growing within the streambed | NM
Presence of iron-oxidizing fungi or bacteria | Presence of oily sheens indicative of iron-oxidizing fungi or bacteria within the assessment reach | NM
Presence of aquatic or semiaquatic snakes | Presence of aquatic or semi-aquatic snakes (e.g., most garter snake species) in the channel | PNW

Geospatial
Location and watershed characteristics | Latitude, longitude, elevation, and watershed area (watershed area retrieved from StreamCat database [37]) |
Long-term normal precipitation and temperature | 30-year normal mean annual and monthly precipitation, and 30-year normal mean, maximum, and minimum annual temperature (PRISM climate data [38]) |
Soil type | Landscape metrics related to soil (such as erodibility, hydraulic conductivity, and bulk density) calculated at the watershed and catchment scale (StreamCat database [37]) |
Geology | Landscape metrics related to geology (such as geological nitrogen content in bedrock) calculated at the watershed and catchment scale (StreamCat database [37]) |
Ecoregion | Level 2 and 3 ecoregions for the Western United States [39] |

Table 2.
Metric screening criteria. Metrics had to meet the distribution criterion and at least one responsiveness criterion to be considered in further analysis.

Criterion | Definition
Distribution criterion: % dominance of most common value <95% | Frequency of most common value (typically, zero) in the development dataset

Performance of final methods and calibrated models. SDAM AW: the final streamflow duration assessment method for the AW, including all modifications and use of single indicators. NM: the New Mexico method [34]. PNW: the Pacific Northwest method [27]. Base: Biological metrics, geomorphological metrics, and hydrologic metrics that did not directly measure the presence of water. GIS: Geospatial metrics. H2O: Hydrological metrics that directly measure the presence of surface water. Accuracy measures are proportions of correct classifications. Cal: Calibration data. Val: Validation data. PvIvE: Proportion of reaches correctly classified as perennial, intermittent, or ephemeral. EvALI: Proportion of reaches correctly classified as ephemeral or at least intermittent reaches. PvIwet: Proportion of flowing reaches correctly classified as perennial or intermittent. EvIdry: Proportion of dry reaches correctly classified as ephemeral or intermittent. EnotP: Proportion of ephemeral reaches correctly not classified as perennial. PnotE: Proportion of perennial reaches correctly not classified as ephemeral. PvNP: Proportion of reaches correctly classified as perennial or non-perennial. Repeatability: Proportion of revisited reaches with the same classification for each visit. Repeatability was not assessed for calibration and validation data separately. Because the NM and PNW methods were developed with independent data, the division of calibration and validation data is not applicable. The highlighted rows marked with "a" indicate the performance measures for the final method. The highlighted rows marked with "b" indicate the model that was selected for refinement to create the final model.
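Several of the accuracy measures defined in this caption can be expressed compactly in code. The Python sketch below is illustrative only and is not the analysis code from Supplementary File S1: the function names are hypothetical, and it assumes that every true and predicted class is coded as "P" (perennial), "I" (intermittent), or "E" (ephemeral), so outcomes such as At least intermittent or Need more information are not handled. PvIwet and EvIdry are omitted because they also require the flow state observed at each visit.

```python
def _proportion(flags):
    """Share of True values; None if there are no cases to evaluate."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def accuracy_measures(true_classes, predicted_classes):
    """Compute selected accuracy measures from paired class labels.

    Labels are assumed to be "P", "I", or "E" (an assumption of this sketch).
    """
    pairs = list(zip(true_classes, predicted_classes))
    return {
        # Exact agreement among all three classes.
        "PvIvE": _proportion(t == p for t, p in pairs),
        # Agreement after merging P and I into "at least intermittent".
        "EvALI": _proportion((t == "E") == (p == "E") for t, p in pairs),
        # Agreement after merging I and E into "non-perennial".
        "PvNP": _proportion((t == "P") == (p == "P") for t, p in pairs),
        # Ephemeral reaches that were not called perennial.
        "EnotP": _proportion(p != "P" for t, p in pairs if t == "E"),
        # Perennial reaches that were not called ephemeral.
        "PnotE": _proportion(p != "E" for t, p in pairs if t == "P"),
    }


if __name__ == "__main__":
    # Hypothetical classifications for six reaches.
    true_classes = ["P", "P", "I", "I", "E", "E"]
    predicted = ["P", "I", "I", "E", "E", "E"]
    for name, value in accuracy_measures(true_classes, predicted).items():
        print(name, round(value, 3))
```

Because the merged measures forgive confusion within the merged classes, they can never be lower than PvIvE for the same set of reaches, which is consistent with the higher accuracy reported for two-class comparisons.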