Evaluating Stream Restoration Projects: What Do We Learn from Monitoring?

Two decades since calls for stream restoration projects to be scientifically assessed, most projects are still unevaluated, and conducted evaluations yield ambiguous results. Even after these decades of investigation, do we know how to define and measure success? We systematically reviewed 26 studies of stream restoration projects that used macroinvertebrate indicators to assess the success of habitat heterogeneity restoration projects. All 26 studies were previously included in two meta-analyses that sought to assess whether restoration programs were succeeding. By contrast, our review focuses on the evaluations themselves, and asks what exactly we are measuring and learning from these evaluations. All 26 studies used taxonomic diversity, richness, or abundance of invertebrates as biological measures of success, but none presented explicit arguments why those metrics were relevant measures of success for the restoration projects. Although changes in biodiversity may reflect overall ecological condition at the regional or global scale, in the context of reach-scale habitat restoration, more abundance and diversity may not necessarily be better. While all 26 studies sought to evaluate the biotic response to habitat heterogeneity enhancement projects, about half of the studies (46%) explicitly measured habitat alteration, and 31% used visual estimates of grain size or subjectively judged 'habitat quality' from protocols ill-suited for the purpose. Although the goal of all 26 projects was to increase habitat heterogeneity, 31% of the studies either sampled only riffles or did not specify the habitats sampled. One-third of the studies (35%) used reference ecosystems to define target conditions. After 20 years of stream restoration evaluation, more work remains for the restoration community to identify appropriate measures of success and to coordinate monitoring so that evaluations are at a scale capable of detecting ecosystem change.


Introduction
With the increasing popularity of stream restoration in the US, a number of publications in the early 1990s argued for more monitoring and evaluation of projects, so that the experience gained from current projects could be used to improve future endeavors (e.g., [1][2][3][4]). These calls for more stringent evaluations of restoration success presented an opportunity to treat restoration actions as experiments to develop better understanding of the river systems and test out approaches. The need for monitoring and evaluation was echoed in subsequent works, such as post-project appraisal approaches proposed by Downs and Kondolf [3], and in detailed guidance for selecting metrics and indicators for restoration offered by Woolsey et al. [4]. Palmer et al. [5] identified "pre- and post-assessment" and public availability of data as one of five criteria for successful projects. Evaluation has become increasingly common, though many evaluations do not definitively answer whether restoration has succeeded. Many metrics have been used, but they are not always tied to project objectives, nor necessarily appropriate for measuring the changes effected by the restoration interventions. Perhaps the most universal insight from multiple evaluations of stream restoration is the importance of understanding the complexity of stream systems and their potential responses to restoration.

Challenges to Evaluation
It may be easier to call for evaluation of restoration projects than to actually carry it out. As Rutherford et al. [6] warned, "it is often unwise for managers to evaluate the bio-physical impacts of their interventions unless they do it 'properly'". Routine or casual monitoring is unlikely to demonstrate change resulting from a small-scale restoration action for multiple reasons: the restoration effect (even if successful) may be too small to produce a measurable result, the restoration project may involve multiple actions whose effects may be difficult to distinguish, baseline data may be inadequate, or the measured variables may have high natural variability [6]. Using data for the Latrobe River, Australia, Rutherford et al. [6] demonstrated that to detect a statistically significant decrease in turbidity of 10% would require 80 years of sampling. Larger reductions could require less time to document. In some cases, a properly designed and executed evaluation study could be more expensive than the restoration action itself, a reality that is difficult for many managers to accept (or at least to sell to the public). The investment needed for meaningful assessment implies that restoration evaluation may be possible only in some cases, and effective evaluation strategies may require pooling resources across multiple projects.
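The trade-off between effect size, natural variability, and sampling effort can be illustrated with a simple power calculation. The following is a minimal sketch, not the method used by Rutherford et al. [6]: it assumes independent, approximately normal samples and a two-sided z-test comparing before and after periods, and the coefficient of variation used in the example is a hypothetical value rather than Latrobe River data. The function name and parameters are illustrative.

```python
from math import ceil
from statistics import NormalDist


def samples_per_period(cv, relative_change, alpha=0.05, power=0.8):
    """Approximate number of independent samples needed in each of two
    periods (before and after) to detect a relative change in the mean.

    cv: coefficient of variation of the variable (standard deviation / mean)
    relative_change: change to detect as a fraction of the mean (0.10 = 10%)
    Uses the normal-approximation formula n = 2 * ((z_a + z_b) * cv / d)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    return ceil(2 * ((z_alpha + z_beta) * cv / relative_change) ** 2)


if __name__ == "__main__":
    hypothetical_cv = 1.0  # assumed variability, for illustration only
    for change in (0.10, 0.25, 0.50):
        n = samples_per_period(hypothetical_cv, change)
        print(f"{change:.0%} change: about {n} samples per period")
```

Because the required sample size scales with the inverse square of the change to be detected, halving the detectable change roughly quadruples the sampling effort. Real monitoring records are usually autocorrelated and seasonally variable, so a calculation of this kind is likely to understate the effort needed.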

Shortcomings of Commonly Used Evaluation Metrics
Many project proponents, funders, and (for mitigation projects) regulatory agencies have adopted linear or areal measures, such as area of riparian habitat or length of stream restored. For example, to comply with requirements from the Office of Management and Budget to quantify outputs from its ecosystem restoration projects, the US Army Corps of Engineers (the Corps) reports "acres restored" or "acres restored per $1 spent" [7]. Areal metrics do not distinguish high quality habitat (or habitat critically needed for a given species and life stage) from lower quality habitat (or habitat that was not critically needed by important species). The Corps has also undertaken efforts to develop more sophisticated metrics of restoration, designed to evaluate the likely benefits of ecosystem restoration and tradeoffs between economic and environmental benefits [8][9][10][11].
Monitoring target populations may also be ineffective as an evaluation metric. Throughout northwestern North America, habitat restoration projects are implemented to benefit anadromous salmonids, but the populations of these fish are notoriously influenced by other factors such as climate and ocean conditions. Indeed, after 80 years of constructing in-stream structures for salmonid habitat, there is no clearly documented evidence of population increases as a result [8], a sign of the challenges in using population monitoring to evaluate restoration success. As noted by Krebs [9], "Monitoring of populations is politically attractive but ecologically banal unless it is coupled with experimental work to understand the mechanisms behind system changes".

Biological Integrity Indices
One of the main applications of biological indices is as an indicator of water quality, the idea being that while periodic sampling of water chemistry could easily miss a transient pulse of pollution, the macroinvertebrate community would reflect the water quality over a long time period, with more pollution-sensitive taxa absent in streams subject to poor water quality. However, in the context of restoration, many projects have not intended to improve water quality per se, but to improve instream habitat through additions of rocks and logs, or even through complete channel reconstruction. It is not immediately clear that indices based on pollution sensitivity of macroinvertebrate taxa are appropriate to measure effectiveness of restoration projects, though they are commonly used. Habitat heterogeneity projects may also seek to reduce streambank erosion, and fine sediment is a common water-quality impairment. In such cases, use of a biological index may be warranted. However, as reviewers of these studies, we were surprised that authors frequently did not justify the appropriateness of their chosen metrics.

The Opportunity to Look Back
Although rigorous monitoring of restoration projects (multiple years of pre- and post-project evaluation, quantitative measurements of habitat conditions, etc.) remains the exception rather than the rule [10], over the past two decades there have been many careful studies of the performance of restoration projects, and these provide us with a potentially important database with which to assess the effectiveness of different metrics to evaluate success. Two relatively recent papers by Miller et al. and Palmer et al. independently reviewed multiple studies of restoration projects, all of which were intended to enhance habitat heterogeneity (i.e., complexity of pool-riffle structures, undercut banks, large wood, etc.) [12,13]. While the meta-analyses of Miller et al. and Palmer et al. sought to assess whether restoration programs were succeeding, our review focuses on the evaluations themselves, and asks what exactly we are measuring and learning. We were intrigued by these two reviews (cited more than 400 times, with individual studies collectively cited 1500 times), and by the fact that they reached divergent conclusions. We carefully analyzed all the original studies used in the two reviews. We sought to determine how the different methods and metrics used might have influenced determination of "success". We also critiqued the underlying premise that macroinvertebrate diversity and richness should be universally applicable metrics of restoration success. Although we focus our review on the evaluation of habitat heterogeneity enhancement projects (e.g., remeandering, rock and wood structures, etc.), we draw upon examples of other restoration approaches and expect that the principles and considerations will be applicable to the evaluation of many types of restoration projects.

Review of Habitat Heterogeneity Enhancement Restoration Evaluation Studies
In our systematic review of methods and metrics in the 26 studies used by Palmer et al. [11] and Miller et al. [14] (Table 1), we categorized each study according to twelve criteria representing a range of important considerations for restoration evaluation as presented in the literature (e.g., [7,15]). We documented location; extent of pre- and post-project monitoring (to control for temporal variability); sampling frequency (to control for seasonal variability); and whether the studies sampled different habitats (e.g., pool, riffle, banks) separately, in aggregate, or were restricted to certain habitat types (Table 2). We identified underlying assumptions and evaluation methods employed in each study, and considered these in light of observational and theoretical studies of food webs and species interactions. We also noted whether control (degraded, unrestored) sites were included to control for temporal variability and provide a basis for comparison, whether regional reference (e.g., nearby, best potential ecological condition, sensu Reynoldson et al. [16]) sites were included to control for spatial variability and provide a regionally appropriate standard of success, what standards of success were stated and employed, and whether studies measured habitat heterogeneity directly, visually, or not at all. Other considerations were whether potential construction impacts from the restoration project were considered, and whether the regional reference sites had comparable drainage areas and slopes (Table 2). We scrutinized the methods and designs of each individual study, to assess the likelihood that the study designs would be adequate to detect biological change.

Study | Location | Habitat metrics | Biological metrics
Friberg et al. 1998 [17] | Jutland, Denmark | D, S, V | Taxonomic richness, density, composition. Abundance of stone-dwelling species was analyzed to species level.
Gerhard & Reich 2000 [18] | Central Germany | D, F, L, S, V, W | Maximum and average number of species and density.
Jungwirth et al. 1993 [24] | Epipotamal and Melk Rivers, Austria | D, S, V | For macroinvertebrates, number of species and drifting biomass. For fish, number of species and diversity.
Roni et al. 2006 [37] | Umpqua and Coquille basins, OR, USA | L, P, S | Total abundance, taxonomic richness, relative abundance (proportion of total abundance) of FFG (shredders and collectors), orders, EPT taxa, and I-IBI.
Tullos et al. 2009 [40] | North Carolina Piedmont, USA | E, G, O, P, S | Taxonomic and trait composition, and Shannon diversity. Species indicator analysis of restored and unrestored sites.

Notes: B is braiding index, C is canopy cover or shading, D is depth, E is bank erosion, F is facies mapping, G is generic quality assessment (EPA's RBP, or Ohio's QHEI), H is hyporheic exchange, L is count or volume of wood, M is moss cover, O is organic content of substrate or leaf retention, P is pool spacing or pool area, Q is water quality, S is substrate size, V is velocity, W is width, Y is shear stress, FFG is functional feeding groups.

Criterion | Finding
Standard of success | 9 (35%) used a reference site as the standard of success for benthic macroinvertebrates. 17 (65%) compared restored to unrestored conditions (pre-restoration or control site), typically assuming increased diversity, richness, or B-IBI score is an improvement.
Construction influence assessed/discussed | 9 (35%) measured construction harm (through multiple measurements per year or multiple years, emphasizing the first year after construction and including pre-construction data or control sites).
Reference site parameters presented? | Of the 9 studies using reference stream conditions as the measure of success, 5 studies (56%) did not present any watershed attributes of the reference sites (watershed area, stream gradient, width, depth, discharge).
Land use assessed or discussed | 50% discussed the land use of the catchment while 50% did not.

Miller et al. [14] conducted a quantitative meta-analysis on 24 studies (searching the literature using keywords including: restoration, rehabilitation, stream, river, invertebrates, macroinvertebrates, habitat, heterogeneity, channel reconfiguration), differentiating replicated from unreplicated studies, and concluded that heterogeneity enhancement projects had increased macroinvertebrate richness but not diversity. Palmer et al. [11] compiled 18 such studies (12 of which overlapped with the studies analyzed by Miller et al. [14]), and rather than combining the studies to test for statistical significance, evaluated each study independently. They reported no evidence of increased stream invertebrate diversity. Both Miller et al. [14] and Palmer et al. [11] cited the lack of robustness in the studies as a potentially important limitation. In particular, Miller et al. [14] noted the "(1) low quantity and poor quality of published biotic and abiotic data; (2) lack of rigorous study designs; (3) a dearth of replicated restoration efforts within physiographically similar areas". The 26 studies reviewed by Miller et al. [14] and Palmer et al. [11] included a wide range of habitat enhancement actions and locations across the world, from Europe (14 studies), North America (10), Australia (1), and Japan (1). These studies represent genuine efforts to quantify the biological effect of the physical interventions to increase habitat complexity. However, the shortcomings of these studies were not previously systematically summarized.

Habitat Metrics
Although all studies sought to evaluate the biotic response to habitat heterogeneity enhancement projects, only 12 studies (46%) quantitatively measured habitat alteration, with depth variability being the most common metric (Table 1). Other habitat heterogeneity metrics such as wood and organic matter retention, log or pool spacing, and bank erosion were used less frequently (Table 1).
Six studies (23%) did not report any data about habitats (in effect assuming that the restoration projects had actually increased habitat heterogeneity). If the hypothesis being tested is that enhanced habitat heterogeneity through restoration caused increased macroinvertebrate diversity, then it is important to quantitatively measure habitat heterogeneity to confirm whether or not it has, in fact, been increased. This issue is further complicated by the absence of clear, agreed-upon definitions of habitat heterogeneity and of how to measure it. The importance of quantitatively and precisely assessing habitat heterogeneity was demonstrated by Laub et al. [42], who found that many unrestored (but non-channelized) urban streams had relatively high heterogeneity (measured through several specific metrics including variability in width, depth, velocity, thalweg profile, and bed sediment sorting) when compared to reference, forested sites, and that "restored" sites were often not more complex than unrestored sites. Thus, there is no a priori reason to assume that "heterogeneity enhanced" sites have more heterogeneity than unrestored urban stream sites.
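As an illustration of what a quantitative check on heterogeneity might look like, the sketch below compares the coefficient of variation of depth between two sets of soundings. The coefficient of variation is only one possible expression of "variability in depth" and is an assumption here, not necessarily the statistic used by Laub et al. [42] or by the reviewed studies. The depth values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return stdev(values) / mean(values)


# Hypothetical depth soundings (m) along evenly spaced transects.
depths_before = [0.30, 0.32, 0.29, 0.31, 0.30, 0.33]
depths_after = [0.15, 0.62, 0.28, 0.80, 0.20, 0.45]

cv_before = coefficient_of_variation(depths_before)
cv_after = coefficient_of_variation(depths_after)

print(f"Depth CV before: {cv_before:.2f}")
print(f"Depth CV after:  {cv_after:.2f}")
print("Heterogeneity increased" if cv_after > cv_before else "No increase detected")
```

A comparison of this kind tests the intermediate step in the causal chain (did the project change the habitat?) before any biological response is interpreted. The same calculation could be repeated for width, velocity, or grain size if those were surveyed.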
Eight studies (31%) used visual estimates of grain size or adopted standard monitoring protocols for habitat quality (which include visual substrate estimates), such as the EPA Rapid Bioassessment Protocol (RBP) [43], the Ohio EPA Qualitative Habitat Evaluation Index (QHEI) [44], or the Bank Erosion Hazard Index (BEHI) [45]. These habitat quality metrics were developed primarily for quick implementation at a large number of sites, and they may not be well-suited for evaluation of individual restoration projects. They may not measure at the detail needed to determine causal relationships between habitat and macroinvertebrate communities. For example, visual substrate estimates have been shown not to be repeatable, and can lead to erroneous estimates. Visual estimates of substrate used in habitat modeling studies for the Physical Habitat Simulation Model (PHABSIM) overestimated median grain size when compared to the results from the scientifically established and repeatable method of pebble counts [22][23][24]. Whitacre et al. [46] compared six rapid assessment protocols and found statistically significant differences in results for nine out of ten basic habitat attributes such as sinuosity, percent pools, and median grain size. While rapid assessments may be useful for systematic monitoring of large spatial extents, there is little basis for their use in evaluating individual, reach-scale projects.
More comprehensively, Lisle et al. [47] questioned the basic premise of using rapid assessments to evaluate impairment of gravel-bed streams based on any protocol, because any single metric is unlikely to reveal causative relations and channel condition can result from multiple pathways. Instead, Lisle et al. argued that channel condition be interpreted through the context of predictive mapping, site history, and human influence. Standard metrics cannot be a substitute for hypotheses linking causes of impairment, project objectives, and restoration actions. For example, using bank erosion as a general measure of habitat quality (as is done in the EPA's Rapid Bioassessment Protocol and other rapid assessments [19]) either assumes that excess erosion is occurring at the site or that all bank erosion is undesirable. While channel incision and bank erosion are well-documented problems [48], particularly in urban settings, it may be misguided to seek more channel stability universally. To equate "failure" with bank erosion or the displacement of an in-channel structure wrongly assumes that successful stream restoration should create fixed, "stable" streams [49], even though ecological theory suggests, and experience on many rivers shows, that bank erosion can have many benefits, such as delivering spawning gravel to the channel [50], facilitating riparian vegetation succession [30,31], and providing habitat for early successional plants or disturbance-dependent species like bank swallows and yellow-billed cuckoo [32,33]. These examples call into question the underlying premise of "restoration" projects that seek to establish persisting forms rather than restoring the dynamic processes that would, in turn, create such forms naturally.

Biological Metrics
Macroinvertebrate abundance, diversity, and composition (and indices of biological integrity derived from them) are some of the most commonly used measures of restoration success [34][35][36][37][38][39]. Of the 26 studies we reviewed, taxonomic richness of macroinvertebrates was the most common biological metric used to test the effects of habitat heterogeneity enhancements (21 studies, 81%). Abundance and density of macroinvertebrates were used in 17 studies (65%). Other diversity measures such as the Shannon index or evenness indices were used by 13 studies (50%). Composition of the macroinvertebrate community or assemblage was used in nine studies (35%). Functional measures, such as functional feeding groups or trait composition, were also used in nine studies (35%). Biological indices (such as the Benthic Index of Biotic Integrity, B-IBI) were used in six studies (23%). Although all studies used taxonomic diversity, richness, or abundance of invertebrates as biological metrics, none presented explicit arguments why those metrics were relevant measures of success; presumably these metrics were used as indicators of overall ecosystem health.
Of the 26 studies, five used the abundance of EPT (Ephemeroptera, Plecoptera, Trichoptera) taxa or the Benthic Index of Biotic Integrity (B-IBI) [51] as measures of success. The B-IBI was developed as a measure of water quality, based on the sensitivity of EPT taxa to water pollution. However, reach-scale habitat heterogeneity projects would not be expected to improve water quality per se unless (1) excessive, locally derived sediment was a limiting factor for sensitive species. Alternatively, (2) habitat heterogeneity projects could increase the abundance of pollution-sensitive species by expanding or enhancing their desired habitats. In cases where log or rock structures were designed to create pools, macroinvertebrate communities could be expected to shift towards more pool-dwelling taxa. In cases where riffles were constructed, communities may shift towards more riffle-dwelling taxa. Without clear hypotheses developed before monitoring, monitoring pollution-sensitive taxa would yield essentially uninterpretable results. An increase in pollution-sensitive taxa could indicate either an improvement in water quality (1) or a change in morphological units that benefits the sensitive taxa (2). Since the abundance of sensitive taxa is being used as an indicator of water quality and is not typically itself the restoration goal, knowing how and why changes occurred may be more important than simply documenting the change.
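Richness and diversity indices respond to different properties of a sample, which is one reason evaluations based on them can disagree. The sketch below computes taxonomic richness, the Shannon index, and Pielou's evenness for two hypothetical samples; the taxon labels and counts are invented for illustration and do not come from any of the reviewed studies.

```python
from math import log


def richness(counts):
    """Number of taxa with at least one individual."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa that are present."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values() if c > 0)


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(richness)."""
    s = richness(counts)
    return shannon(counts) / log(s) if s > 1 else 0.0


# Hypothetical samples: counts of individuals per taxon.
sample_a = {"taxon_1": 25, "taxon_2": 25, "taxon_3": 25, "taxon_4": 25}
sample_b = {"taxon_1": 95, "taxon_2": 1, "taxon_3": 1,
            "taxon_4": 1, "taxon_5": 1, "taxon_6": 1}

for name, sample in (("A", sample_a), ("B", sample_b)):
    print(f"Sample {name}: richness={richness(sample)}, "
          f"Shannon={shannon(sample):.2f}, evenness={pielou_evenness(sample):.2f}")
```

In this constructed case, sample B has higher richness than sample A but a much lower Shannon index, because one taxon dominates. Whether either outcome counts as "improvement" depends on the target condition, which the metric alone does not supply.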

Controlling for Time, Space, and Variability
Of the 26 studies we reviewed, 16 (62%) included one year of post-project monitoring (Table 2). Sixteen studies (62%) had no pre-project monitoring and one study (4%) included more than one year of pre-project monitoring. Although the projects were designed specifically to create habitat heterogeneity (e.g., pools and riffles, velocity and depth heterogeneity, etc.), eight studies (31%) either sampled only riffles or did not specify what habitats were sampled. Ten studies (38%) sampled and analyzed biota from different habitats separately, and eight (31%) used either pooled or random designs that integrated samples from all habitat types. Nine studies (35%) used reference ecosystems as the standard of success. The other studies compared the restored reach to pre-restoration conditions or to a control site where legacy impacts persisted. Prolonged monitoring is necessary to discern the influence of restoration activities on biota [2,3,52]. For example, the disturbance caused by the restoration activity itself (i.e., vegetation clearing, dewatering the channel, compaction from heavy machinery in the channel, and grading the bed and banks) may decrease or increase biotic metrics such as abundance and diversity for days or decades, depending on severity of impact and rates of recovery. Many studies suggest the need to monitor at least several years in order to allow benthos to recover and to recolonize a restored reach [14]. The disturbance caused by restoration actions may increase diversity over intermediate time scales [53], so both pre-project monitoring and reference targets would be needed to interpret whether biotic differences following restoration are due to restored conditions, or to temporary effects.
Several decades of research have established how watershed conditions and position in the river system influence channel processes and forms [54] and ecosystem characteristics [42,43]. However, local conditions such as channel geometry can vary greatly over small distances or short time periods [55] because of changes in lithology and vegetation, tributary confluences, beaver activity, climate, or land use [45,56,57]. Therefore, it is essential to quantify the range of natural variability by compiling sufficient information from historical reference periods, multiple regional reference sites (using space for time substitution), or patterns, gradients, and processes that define reference conditions. Without understanding the natural variability of a system, meaningful restoration targets will be challenging to identify and monitoring data will be hard (or impossible) to interpret.

Relationship between Study Design and Biological Improvement
Seventy-eight percent of the studies that monitored for more than one year found increased diversity or richness. By contrast, 44% of the studies that sampled for only one year found increases. Similarly, studies that evaluated habitat directly were more likely to find increases (66%) than those using visual estimates or not presenting any habitat measures (50% in both cases), possibly reflecting a relationship between study effort and the significance of results, or perhaps indicating that more thorough evaluations followed better-planned restoration efforts. Studies using multihabitat sampling for macroinvertebrates, either by pooled samples or differentiated habitats, were also most likely to find improvement (75% and 60%, respectively). By contrast, 20% of studies that sampled only one habitat (riffles) found increased diversity or richness.

Are Reach-Scale Diversity and Abundance Universal Indicators of Success?
At the regional or global scale, changes in biodiversity may reflect overall ecological condition. However, in the context of reach-scale restoration, the "more is better" diversity assumption may not be appropriate. The use of macroinvertebrate diversity as an indicator is based on research linking species diversity to habitat heterogeneity, e.g., [46][47][48]. Increased habitat heterogeneity is assumed to provide more ecological niches for members of a community [58], provide refugia that stabilize predator-prey and host-pathogen dynamics, and generally support greater diversity and a more resilient ecosystem [12,50]. However, the universal applicability of diversity as a meaningful indicator has not been established. Consider a few cases as thought experiments: an intermittent desert stream, a glacial outwash stream, and a headwater stream in old-growth forest. Human impacts could result in increased macroinvertebrate diversity (likely at the expense of native or endemic species) in all these cases.
In the intermittent desert stream example, perennialization resulting from urban development or irrigation return flow may allow new species to colonize desert washes. Richness and diversity have been strongly correlated with stream permanence in desert streams, though Feminella [59] found that 7% of species occurred only in intermittent streams, and presumably would be lost through perennialization. In the glacial outwash stream example, a warming climate may allow new species to establish, increasing diversity but threatening endemic specialists [60]. In the headwater stream example, logging of the old-growth forest could allow light penetration to the stream and increase diversity of habitats and species [61]. In all of these examples, diversity may be increased through additional generalist species, but communities of endemic specialists may be negatively affected. While all three thought experiments might be considered extreme cases, we know of no threshold for distinguishing "regular" streams (where more diversity might be an appropriate target) from these "extreme" cases, and thus find no basis for applying diversity as a universal indicator of success. Specifically addressing this issue in the context of restoration, Lepori et al. [28] found lower macroinvertebrate species richness in Swedish reference streams than in either channelized or restored streams.
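The thought experiments above can be made concrete with a small comparison of taxon lists. This is a minimal sketch using invented taxon labels (not survey data): it reports the net change in richness alongside the taxa lost and gained, which a richness total alone would conceal.

```python
def compare_assemblages(before, after):
    """Summarize turnover between two sets of taxon names."""
    return {
        "richness_change": len(after) - len(before),
        "lost": before - after,
        "gained": after - before,
    }


# Hypothetical taxa for an intermittent stream before and after perennialization.
before = {"intermittent_specialist_1", "intermittent_specialist_2", "generalist_1"}
after = {"generalist_1", "generalist_2", "generalist_3",
         "generalist_4", "generalist_5"}

summary = compare_assemblages(before, after)
print(f"Net richness change: {summary['richness_change']:+d}")
print(f"Taxa lost:   {sorted(summary['lost'])}")
print(f"Taxa gained: {sorted(summary['gained'])}")
```

Here richness rises by two taxa even though both hypothetical specialists disappear. Reporting turnover alongside totals is one way to keep an apparent "increase in diversity" from masking the loss of the taxa a project may have been meant to protect.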
Food web interactions may be far more significant than diversity or abundance for influencing populations of the top predators for whose benefit restoration projects are undertaken. For example, in the Eel River, California, Power et al. [62] found that scouring winter floods promoted trophic interactions that produced more prey for steelhead, whereas drought years (or flow regulation) favored grazing macroinvertebrates that steelhead cannot eat. In this case, we can see that it is not necessarily the abundance or diversity of species that is important to top predators, but rather which species and energy pathways become dominant. In other cases, the removal of a key species can affect the whole ecosystem, like the Amazonian fish Prochilodus mariae whose removal from the Rio Las Marias in Venezuela altered nutrient cycling, sediment structure, and diatom and macroinvertebrate assemblages [51,63]. These examples suggest that restoration monitoring (and restoration objectives themselves) require knowledge of the biophysical interactions that underpin the structure and dynamics of the target ecosystem.
Although restoration projects that aim to restore habitat heterogeneity are supported by literature indicating that richness and diversity of biota increase with increasing diversity of habitats, changing the habitat away from its natural form and function is not typically an acknowledged goal. Leps et al. [63] conducted an exemplary and extensive assessment of 44 habitat heterogeneity projects in Germany. They quantitatively assessed 10 metrics of physical habitat and 33 biological metrics including measures of abundance, diversity, EPT taxa, and functional feeding groups. Leps et al. found that the restoration projects increased habitat heterogeneity, but did not produce detectable biological responses. What does that tell us? The study by Leps et al. included no reference sites, and ultimately had no method for defining target conditions at the 44 sites from across Germany. Were all 44 sites degraded with respect to abundance, diversity, EPT taxa, and certain functional feeding groups? Considering that the 44 sites were from different regions and watersheds of different sizes, we would certainly expect target conditions to vary in significant ways. Indeed, an abundance of foundational ecological literature suggests that macroinvertebrate communities vary across the landscape and along river profiles [42,64]. We do believe that macroinvertebrates are useful indicators in many cases, but believe that the use of such indicators would be greatly improved by a priori establishment of target conditions for each site. The study by Leps et al. represents a tremendous effort in data collection and was conducted with robust methods and best practices. Ultimately, however, its results do not permit drawing conclusions about the success of habitat heterogeneity projects, because of what is probably considerable (but unmeasured) natural variability in conditions among sites. What we learn from Leps et al. and the papers of Miller et al. and Palmer et al. is that restoration evaluation may not be possible without a conceptual model that describes ecosystem functions and helps define target conditions.

Reference States and Best Practices
Appropriate regional reference sites may be challenging or impossible to find [53,54], especially for larger watersheds. Watershed position, regional climate, disturbance history, tectonic uplift rates, local geology, and many other factors will all influence stream processes and forms. However, physical controls such as channel slope, sinuosity, bed-material size, and precipitation are increasingly identifiable through automated monitoring and remote sensing [65]. That makes identification of appropriate reference sites more practical using GIS, and we expect analytical models will produce increasingly reliable predictions of watershed condition based on these physical controls. Watershed comparisons have traditionally used a "paired basin" approach, in which "treated" basins are compared to "untreated" reference basins. Paired watershed experiments suffer from at least three problems (as reviewed by Reid et al. [66]) similar to those outlined above for restoration evaluations. First, "treatment" variables are usually only qualitatively characterized (e.g., "managed vs. unmanaged" or "logged vs. unlogged"). Second, "control" treatments are never pristine. Third, even if untreated and treated watersheds have been matched with respect to aspect, area, slope, forest type, drainage density, and geological parent material, they may differ in subtle but important respects (e.g., structural orientation of bedrock, undetected ancient landslides whose scars are presently filled, and disease or fire history of vegetation). Comparisons of watershed outputs, such as total sediment yield at their mouths or changes in salmon escapement back to watersheds over the experimental period, are too noisy to reveal causality, particularly when observation records are short (less than decades). Several long-term and intensive monitoring efforts, including the US Forest Service Experimental Watersheds and the Intensively Monitored Watersheds Program, demonstrate the effort required to meaningfully quantify ecosystem change. The Caspar Creek Experimental Watershed has investigated logging practices and sediment yields for more than 50 years, finding that the sediment load doubled in response to logging, returned to initial levels 11 years after harvest, and then increased again a decade later as road crossings deteriorated during large storms [67]. Without a long-term record, assessing restoration and management actions would be confusing or misleading. Other long-term records and intensively studied watersheds provide important context on temporal and spatial variability.
Human alterations are widespread, and landscapes may still be responding to impacts from decades or centuries ago. Several reference streams are almost certainly required to adequately account for variability between streams and to test assumptions about variability in target conditions. Using an inappropriate reference stream (with different slope, drainage area, hydrologic regime, watershed position, etc.) can be misleading, and having only one reference stream is unlikely to help understand the natural range of processes and communities. Despite the challenges of using reference sites, they provide a better basis for restoration targets than generic standards, which may not be locally appropriate. Of the 26 studies we reviewed, nine used reference streams, and of those, four presented information about the reference watershed. Of those four, two accepted considerable differences between fundamental watershed characteristics for reference and restored streams: in one study, the drainage area of the reference site was five times larger than that of the restored stream.
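The screening logic described above can be illustrated with a minimal sketch. It is not a method used in any of the reviewed studies. It assumes that candidate reference streams are screened on only two physical controls (drainage area and channel slope), that a factor-of-two tolerance is acceptable, and that the target range is taken as the mean plus or minus two standard deviations of taxa richness across the retained references. The Stream class, the function names, the thresholds, and all data values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Stream:
    name: str
    drainage_area_km2: float
    slope: float      # channel slope (m/m)
    richness: float   # e.g., macroinvertebrate taxa richness


def within_factor(a: float, b: float, factor: float) -> bool:
    """True if a and b differ by no more than a multiplicative factor."""
    return max(a, b) / min(a, b) <= factor


def screen_references(restored, candidates, max_factor=2.0):
    """Keep candidates whose drainage area and slope are both within
    max_factor of the restored reach (max_factor is an assumed tolerance)."""
    return [
        c for c in candidates
        if within_factor(c.drainage_area_km2, restored.drainage_area_km2, max_factor)
        and within_factor(c.slope, restored.slope, max_factor)
    ]


def target_range(references, k=2.0):
    """Mean +/- k sample standard deviations of richness across references.
    A single reference gives no estimate of natural variability."""
    if len(references) < 2:
        raise ValueError("at least two reference streams are needed")
    values = [r.richness for r in references]
    m, s = mean(values), stdev(values)
    return (m - k * s, m + k * s)


# Hypothetical data, invented for illustration only.
restored = Stream("restored", 20.0, 0.010, 18)
candidates = [
    Stream("ref_A", 25.0, 0.012, 22),
    Stream("ref_B", 100.0, 0.011, 30),  # drainage area five times larger
    Stream("ref_C", 15.0, 0.008, 26),
    Stream("ref_D", 30.0, 0.030, 24),   # slope three times steeper
]

selected = screen_references(restored, candidates)
low, high = target_range(selected)
in_range = low <= restored.richness <= high
print([s.name for s in selected], (round(low, 2), round(high, 2)), in_range)
```

In practice the screening would draw on more controls (hydrologic regime, watershed position, bed-material size), and the tolerance would need to be justified for the region and stream type rather than fixed at an arbitrary factor.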

Conclusions
In the two decades since published calls for evaluation of stream restoration projects, most restoration projects remain unevaluated. However, enough projects have now been evaluated that we can learn something from this experience. Metrics such as area or length of channel restored will probably not go away, but they are clearly inadequate to evaluate the success of restoration. Moreover, as reflected in recent experience, such metrics can greatly distort the types of projects undertaken for mitigation, as entrepreneurs seek to maximize mitigation credits. Many projects intended to enhance ecological complexity by increasing heterogeneity of habitats have been evaluated using abundance and diversity of macroinvertebrate taxa. Interestingly, we found that the studies that conducted longer-duration or more rigorous evaluations (e.g., direct habitat measurements, multi-habitat sampling for macroinvertebrates) were more likely to detect statistically significant increases in richness or diversity. This could be simply because the effects of the restoration required better monitoring to detect, but it could also reflect a co-varying relationship between the resources available and the quality of both the restoration effort and the evaluation. Better-funded projects may be both better implemented and better evaluated, and thus more likely both to produce a real effect and to detect it.
By contrast, some evaluations may be accomplished quite easily where the conceptual model of ecosystem dynamics is clearly established. Rood and colleagues [68] convincingly argued that dams and diversions in the Truckee River basin led to the collapse of the riparian ecosystem by altering the flow regime upon which cottonwood (Populus fremontii) and willow (Salix exigua) recruitment depended. Flows intended to restore the population of an endangered fish had the collateral benefit of recruiting riparian vegetation. In a straightforward evaluation, Rood et al. correlated seedling establishment and survival of seedlings with the changed flow regime. Finally, wetland- and riparian-dependent bird species were surveyed and compared to historical observations, demonstrating that several locally rare or extirpated species had returned with increased abundance [68]. The effort required for this evaluation was not extraordinary. What makes this an exemplary model is the clarity with which the conceptual model linking physical processes to habitat and species use was presented and evaluated.
We recommend a portfolio approach, combining knowledge from regional and historical reference streams, unrestored (control) sites, analytical models, and manipulative experiments, to define relevant target conditions. Prior to restoration, project designers and evaluators should develop conceptual models (which increasingly should include analytical reference states) of their ecosystems and consider success criteria carefully, in light of predictions generated from these models. Such an approach will add to the time and effort required for evaluation, but it is probably wiser to do a good job evaluating fewer projects than a poor job attempting many superficial evaluations.

Harrison et al. 2004 [20]
Shannon diversity (H'), total abundance, and the abundances of individual taxa.