Evaluating Inter-Rater Reliability and Statistical Power of Vegetation Measures Assessing Deer Impact

: Long-term vegetation monitoring projects are often used to evaluate how plant communities change through time in response to some external inﬂuence. Here, we evaluate the efﬁcacy of vegetation monitoring to consistently detect changes in white-tailed deer browsing effects. Speciﬁcally, we compared inter-rater reliability (Cohen’s κ and Lin’s concordance correlation coefﬁcient) between two identically trained ﬁeld crews for several plant metrics used by Pennsylvania state agencies to monitor deer browsing impact. Additionally, we conducted a power analysis to determine the effect of sampling scale (1/2500th or 1/750th ha plots) on the ability to detect changes in tree seedling stem counts over time. Inter-rater reliability across sampling crews was substantial for most metrics based on direct measurements, while the observational based Deer Impact Index (DII) had only moderate inter-rater reliability. The smaller, 1/2500th ha sampling scale resulted in higher statistical power to detect changes in tree seedling stem counts due to reduced observer error. Overall, this study indicates that extensive training on plant identiﬁcation, project protocols, and consistent data collection methods can result in reliable vegetation metrics useful for tracking understory responses to white-tailed deer browsing. Smaller sampling scales and objective plant measures (i.e., seedling counts, species richness) improve inter-rater reliability over subjective measures of deer impact (i.e., DII). However, considering objective plant measures when making a subjective assessment regarding deer browsing effects may also improve DII inter-rater reliability.


Introduction
Long-term ecological monitoring is crucial for tracking how forest ecosystems change through time, especially in the context of human influence [1].The issue with long-term monitoring, however, is that there is inherent year-to-year variability associated with unmeasured conditions (i.e., noise [2]), and sampling variability (i.e., sampling error) contributes to imperfect collection methods [3,4].Long-term ecological data requires sufficient statistical power to detect change in measured conditions following some external influence, whether it be climate [5], invasion [6], disturbance [7], pest/disease [8], or Forests 2018, 9, 669 2 of 17 some other environmental variable [9].Statistical power is defined as the inverse of the probability of a type II error, or, more simply, one minus the probability of failing to reject the null hypothesis when the null hypothesis is false (a "false negative").The effectiveness of long-term monitoring programs is tied directly to specific, achievable management objectives partnered with an appropriate sampling design to address those objectives [10].
Forest inventories are commonly used by land managers to document baseline forest conditions, and, when resurveyed, evaluate how forest vegetation changes through time [11][12][13][14][15]. Nationally, the United States Department of Agriculture (USDA) Forest Service inventories permanent plots across each state annually, visiting 20% of all plots per state per year [16].Prior to 1998, data collection protocols for the Forest Service's Forest Inventory and Analysis (FIA) program were not nationally consistent, resulting in variable trends by state across several forest measures [17].Many studies have shown that inconsistent data collection, observer bias, and varying sampling methods have an effect on data analysis and interpretation [3,10,17,18].To combat these issues and improve statistical power, researchers recommend extensive training on data collection protocols, specific objectives, consistent observers, and/or quality control data assessments [18].
Over time, the Forest Service has enhanced the FIA program and expanded data collection to include soils, understory vegetation, white-tailed deer browsing impact, and other limiting factors on a subset of plots per state [19].The objective of the FIA program is to monitor changes in abundance and species composition of all US forests, and correlate those changes with other measures of ecosystem health [20].The Pennsylvania Department of Conservation and Natural Resources, Bureau of Forestry adopted similar enhanced forest inventory protocols (Continuous Forest Inventory, or CFI) in 1997 to continuously monitor 890,000 ha of public state forest land [21].Both agencies have similar missions, and work to promote the long-term viability and productivity of forested lands for future generations [22,23].In addition to sustainable forest management, the Bureau of Forestry is responsible for the protection of rare and endangered plants, and its inventory protocols more intensively sample understory vegetation at all CFI plot locations compared to FIA methods [21].The FIA protocol, however, provides more data on overstory conditions, stocking rates, and available timber [24].
In 2014, the Bureau of Forestry increased the number of vegetation monitoring plots and implemented a new monitoring program (the Vegetation Impact Protocol, or VIP) specifically designed to assess the effect of deer on understory plant communities [25,26].White-tailed deer browsing has been identified as an important driver of plant community composition throughout the white-tailed deer range [13,[27][28][29], and selective browsing by white-tailed deer has the ability to reduce understory diversity and structure in areas where deer populations are near carrying capacity [30][31][32].The VIP is modeled after CFI protocols, but VIP primarily monitors understory phyto-indicators of deer browsing impact.Ultimately, the Bureau of Forestry uses VIP data (tree seedling stem counts and the abundance/reproductive status of browse-sensitive herbaceous plants) to decide whether to enroll an area of state forest land into the deer management assistance program (DMAP).Following a DMAP permit application process and the appropriate justification by the Bureau of Forestry, the Pennsylvania Game Commission (PGC) issues additional antlerless tags for purchase by the hunting community in these select areas on public, state forest land.The Bureau of Forestry continuously monitors DMAP areas to determine when enrollment status should change due to a change in understory vegetation.
The PGC also evaluates deer effects on forests to meet agency objectives, which includes maintaining deer impact that supports sustainable forests, and implementing measures of deer impact in forests to improve effectiveness of deer management programs [33].The PGC uses FIA habitat data, including the FIA deer impact index (DII), from all private and public forested lands in Pennsylvania to make deer management decisions and set tag allocations for the upcoming hunting season for each Wildlife Management Unit (WMU) across the state [33].The integration of habitat data and deer impact assessments into both federal and state agency decision models highlights the need for consistent evaluations of habitat characteristics across different sampling scales and vegetation measures.We developed a forest inventory approach that merges FIA and VIP data protocols to intensively monitor both overstory and understory vegetation data and evaluate both FIA and VIP sampling designs.Because the Bureau of Forestry collects understory inventory data on 5 1/2500th ha subplots per sampling site, and the FIA protocol collects understory data on 4 larger, 1/750th ha subplots per site, we chose to collect data at both scales at 5 subplots for each sampling location.Our goal was to have data that were comparable to both FIA and VIP data, specifically focusing on the effect of scale on observer error and data consistency.Larger search areas may result in increased sampling errors as the amount of vegetation encountered increases [34].We also assessed deer impact using the FIA DII, a subjective measure based on observer assessments of vegetation conditions and browsing levels at a site.In general, subjective measures like the DII may be less reliable than vegetation metrics based on objectively collected data.
We compared inter-rater reliability of both categorical and continuous vegetation metrics between two identically-trained field crews separately for each scale of data collection, and expected reliability to be inversely correlated to rates of sampling (observer) error.We hypothesized that if reliability is inversely related to error, there would be greater inter-rater reliability at smaller sampling scales than larger ones (H1).For categorical variables, we specifically compared the DII, the total number of subplots at a sampling site with tree regeneration present, the total number of subplots at a sampling site with the presence of lilaceous forest herbs preferred by white-tailed deer (phyto-indicators of deer browsing effects, or indicators), and indicator species' richness between crews with Cohen's Kappa (κ).For continuous variables, we compared the total counts of all tree seedling taxa <2.54 cm diameter at breast height (DBH), total counts of all tree seedling taxa >0.3 m tall and <2.54 cm DBH, total counts of all tree seedlings paired by taxon, and total counts of all tree seedlings >0.3 m tall and <2.54 cm DBH paired by taxon at each sampling location.
To further assess the effects of sampling error on detecting changes in vegetation characteristics, we conducted a power analysis across a range of sample sizes using tree seedling data simulated from real-world parameters for each scale.Forested ecosystems are ecologically complex [2,35], and studies in forest systems often require large sample sizes (n > 30) or very specific sampling methods to detect measurable changes in forest conditions [36].Increased error rates from inaccurate counts of tree seedlings reduces the statistical power to detect changes in deer browsing pressure when these counts are used to assess deer browsing effects on tree regeneration [31,[37][38][39][40].To evaluate the effect of count errors on statistical power to detect changes in tree seedling abundance, we simulated both initial starting tree seedling counts (time 1 data) and a mean stem change to those counts (time 2 data) using means and variances derived from actual field-collected and CFI data.We added sampling error to each data set based on tree seedling counts for both crews at overlapping sampling locations using the residuals of a generalized linear model.Then, we compared whether we could detect a difference in mean stem counts between both data sets (time 1 vs. time 2) for a given sample size.Power analysis specifics are described in detail below in the Materials and Methods section and illustrated in Figure 1.We hypothesized that if sampling variance was related to sampling scale, the smaller scale of data collection would have increased statistical power due to reduced variability in stem counts between crews (H2).

Materials and Methods
Three separate field crews collected vegetation data from 26 May to 10 August 2015 as part of a larger study assessing the effects of white-tailed deer on Pennsylvania forest plant communities.We trained all crews together from 11 May to 22 May 2015 on plant identification, data collection protocols, and assessment of white-tailed deer browsing impact on plot understory vegetation.

Materials and Methods
Three separate field crews collected vegetation data from 26 May to 10 August 2015 as part of a larger study assessing the effects of white-tailed deer on Pennsylvania forest plant communities.We trained all crews together from 11 to 22 May 2015 on plant identification, data collection protocols, and assessment of white-tailed deer browsing impact on plot understory vegetation.Following training, crews independently collected data at different plot locations across three state forests: The Susquehannock State forest in the Appalachian Plateau Province of north-central Pennsylvania, and the Rothrock and Bald Eagle State forests located in the Ridge and Valley Physiographic Province of central Pennsylvania.In total, each crew visited either 68, 24, or 59 plot locations over the summer sampling season, depending on their crew assignment.Two of those crews in the Rothrock and Bald Eagle State Forests collected data at 13 overlapping sampling sites independently so that we could assess inter-rater reliability.Sampling locations for all 3 crews are shown in Figure 2.   Nested within each plot location was a network of 5 subplots oriented such that a single subplot's center was not more than 36.5 m straight-line distance from the next nearest subplot's center (Figure 3).Further nested within each subplot was a smaller survey area, or sampling zone, where crews conducted full vegetation inventories at two scales (1/2500th ha and 1/750th ha).We defined subplots as the 7.32 m radial area around a center stake in the middle of the subplot, and sampling zones as the 2.07 m radial area around a stake located 3.66 m due east (90 • azimuth) of subplot center.Within a sampling zone, a smaller 1/2500th ha microplot is nested within a larger 1/750th ha microplot.The larger, 1/750th ha microplot replicates Forest Inventory and Analysis (FIA) sampling protocols [24], and the smaller, 1/2500th ha microplot replicates the Pennsylvania Department of Conservation and Natural Resources Bureau of Forestry's sampling protocol to assess deer impact [26].Crews collected vegetation data independently for each scale at all 5 subplots at each sampling location.
subplot's center was not more than 36.5 m straight-line distance from the next nearest subplot's center (Figure 3).Further nested within each subplot was a smaller survey area, or sampling zone, where crews conducted full vegetation inventories at two scales (1/2500th ha and 1/750th ha).We defined subplots as the 7.32 m radial area around a center stake in the middle of the subplot, and sampling zones as the 2.07 m radial area around a stake located 3.66 m due east (90° azimuth) of subplot center.Within a sampling zone, a smaller 1/2500th ha microplot is nested within a larger 1/750th ha microplot.The larger, 1/750th ha microplot replicates Forest Inventory and Analysis (FIA) sampling protocols [24], and the smaller, 1/2500th ha microplot replicates the Pennsylvania Department of Conservation and Natural Resources Bureau of Forestry's sampling protocol to assess deer impact [26].Crews collected vegetation data independently for each scale at all 5 subplots at each sampling location.

Study Area Characteristics
The study area was primarily forested with even-aged (75-100-year-old) oak-hickory hardwood stands, and located within the Ridge and Valley Physiographic Province of Pennsylvania.On average, the growing season for the region was 182 days from 22nd April to 21st October [41].The Sampling zone circles represent a 1/750th ha microplot (small dashed circle; 1/300th acre; 13.5 m 2 ), and nested within each sampling zone is a second 1/2500th ha microplot (1/1000th acre; 4.05 m 2 ; not illustrated).The larger, 1/750th ha microplot replicates Forest Inventory and Analysis (FIA) sampling protocols, whereas the smaller, 1/2500th ha microplot replicates the Pennsylvania Department of Conservation and Natural Resources Bureau of Forestry's sampling protocols.

Study Area Characteristics
The study area was primarily forested with even-aged (75-100-year-old) oak-hickory hardwood stands, and located within the Ridge and Valley Physiographic Province of Pennsylvania.On average, the growing season for the region was 182 days from 22 April to 21 October [41].The climate was temperate with 104 cm of mean annual precipitation, and mean summer temperatures range from 16 • C at night to 29 • C during the day [42].Across sampling locations, the overstory was comprised mainly of oak, including red (Quercus rubra L.) and chestnut oak (Quercus montana Willd.), and the understory was dominated by ericaceous shrubs, including mountain laurel (Kalmia latifolia L.), huckleberry (Gaylussacia spp.), and blueberry (Vaccinium spp.).Elevation of all plots ranged from 400 to 700 m above sea level.within the same 10-day window (mean number of days between crew visits was 3 days, with a range of 0 to 10 days).Crews identified all understory taxa at each microplot scale, and counted total numbers of tree seedlings (any live, arborescent, woody species <2.54 cm DBH) by taxon.Crews also counted the number of ramets of phyto-indicators of deer browsing effects (Indian cucumber-root (Medeola virginiana L.), Canada mayflower (Maianthemum canadense Desf.), Trillium (Trillium spp.), false Solomon's seal (Maianthemum racemosum (L.) Link), and true Solomon's seal (Polygonatum spp.)) by taxon [43][44][45].Lastly, crews assessed deer browsing impact at a site based on FIA deer impact assessment criteria (Table 1).High-Browsing evidence common OR seedlings are rare.

5
Very High-Browsing evidence omnipresent OR forest floor bare, severe browse line.

Data Analysis
We assessed the level of inter-rater reliability between the two Rothrock and Bald Eagle survey crews across several plant metrics at both microplot scales to assess consistency in data collection.We also compared plot means of tree seedling counts >0.3 m tall and <2.54 cm diameter between the two crews to determine count sampling error, and conducted a power analysis to determine what sample size would be necessary to detect a change in mean stem counts given observer error.We evaluated the two crew's agreement for both categorical and continuous vegetation survey data that would help assess the effects of deer browsing on tree seedlings and herbaceous plants at both microplot scales.Categorical data assessed included: (1) number of microplots per plot with at least 1 (1/2500th ha scale) or 4 (1/750th ha scale) tree seedling(s) >0.3 m tall and <2.54 cm diameter (see Table A1 for full list of taxa, of which nearly all were preferred browse by white-tailed deer), (2) the number of microplots per plot with at least 1 phyto-indicator of deer browsing, (3) indicator species richness across all microplots, and (4) deer impact assessment at the plot level.Continuous variables compared were: (5) count of all tree seedlings per plot, (6) count of all tree seedlings >0.3 m tall and <2.54 cm diameter per plot, (7) count of all seedlings by taxon per plot, and (8) count of all seedlings >0.3 m tall and <2.54 cm diameter by taxon per plot.
Inter-rater reliability of categorical data was assessed using Cohen's Kappa (2 raters) in the "irr" package (version 0.84, Hamburg, Germany) [46] in R (version 3.5.1,Vienna, Austria) [47].Cohen's Kappa (κ) is a statistical measure of agreement between two independent raters that accounts for potential agreement by chance alone [48,49].Positive values of κ indicate a level agreement beyond what would be expected randomly, but interpreting exact values is subjective.Landis and Koch (1977) developed 5 categories of agreement for Cohen's κ that have been consistently reported in the medical literature: κ < 0.20 is poor, κ = 0.20-0.40 is fair, κ = 0.41-0.60 is moderate, κ = 0.61-0.80 is good, and κ > 0.80 is very good.We use these categories in our interpretation of Cohen's κ.
For continuous variables, we assessed rater agreement using Lin's concordance correlation coefficient (CCC) in the "cccrm" package (version 1.2.1, Barcelona, Spain) [50] in R (version 3.5.1,Vienna, Austria).The concordance correlation coefficient is a measure of the level of deviation from an expected 1 to 1 ratio of measurement agreement between raters [51].Like with Cohen's κ, CCC gets its origin from the medical field and interpretation of CCC values is subjective.McBride (2005) proposed 4 criteria for agreement to ensure consistency in reporting: CCC < 0.90 is poor, CCC = 0.90-0.95 is moderate, CCC = 0.96-0.99 is substantial, and CCC > 0.99 is almost perfect.We use these categories to present results from CCC.Following assessments of inter-rater reliability, we conducted a power analysis to determine the sample size required to detect varying levels of change in mean tree seedling stems given observer error.We simulated a single data set of stem counts comprised of 10,000 "plots."Mean (µ) and variance (σ 2 ) of simulated data was set equal to the µ and σ 2 of stem counts from all plots sampled in 2015 (Figure 2).The data set was generated using a negative binomial distribution for stem counts (y) for each plot (i): where and p = r/(r + µ) The first data set, the "time 1" data set, was created by randomly sampling n "plots" with replacement from the simulated data for each sample size (n).Sample sizes ranged from 20 to 100 in increments of 10, and simulated data were sampled 100 times to create a data matrix with n rows and 100 columns.We simulated the second data set, a the "time 2" data set, by duplicating the "time 1" data set and adding in a mean change in stem counts (m) simulated from a normal distribution with a constant mean (u ∆ ) and mean-dependent variance (σ 2 ).The mean-variance relationship was modeled directly from past CFI data (see Figure A1 for details on the mean-variance relationship): where Lastly, we added observer error to the "time 2" data set using the residuals from a negative binomial regression of crew 1 and crew 2 mean stem counts for each of the 13 plot locations (i) sampled by both crews.Crew 1 stem counts were considered the baseline "no error" values in the regression because both crew members were more experienced in both plant identification and project protocols (both individuals were returning members from the 2014 sampling season).Any differences in counts from crew 1 to crew 2 was attributed to crew 2 observer error using the following equation: where and error values for each simulation were calculated as: where ĉrew1 i is the predicted stem count value for crew1 i given the linear relationship between crew stem counts.For each of the 100 simulations, error values were rounded to the nearest whole number, and added to the "error" dataset to represent expected observer error.Any values less than 0 were truncated at 0. We compared output for each of the 100 simulations for both the "time 1" and "time 2" data sets using a paired t-test to determine the probability of committing a type II error (false negative).Because both datasets were initially simulated using the same values and the "time 2" data set had a known level of mean stem change applied, the failure to detect a difference mean stem counts within the 100 simulations (p > 0.05) was solely attributed to simulated observer error.We repeated this process for 5 simulated changes in mean stem counts per plot ranging from 0.5 to 2 in 0.5 increments for each different sample size (n).We calculated the power (p) to detect change as 1-probability of a type II error (total number of tests with p > 0.05) for each mean stem change and sample size combination.We repeated the entire sampling process, including generating "time 1" and "time 2" data, 100 times and report the mean level of power across all simulations.These power analysis steps are illustrated in Figure 1.

Results
Inter-rater reliability (IRR) among categorical variables was lowest for the DII (κ = 0.54), but agreement was perfect for both the number of microplots with indicator plants (κ = 1.0) and indicator richness (κ = 1.0; Figure 4a) at both sampling scales.There was no difference in IRR across scale for either indicator plant measure, but for the number of microplots with tree regeneration, IRR was higher at the 1/2500th ha scale (κ = 0.89) than the 1/750th ha scale (κ = 0.72).There was no difference in agreement classification for total counts of all tree seedlings across scale (Figure 4b); agreement was substantial for both the 1/2500th ha (CCC = 0.96) and 1/750th ha (CCC = 0.98).There were discrepancies in agreement category across the 3 remaining tree seedling counts across scale (1/2500th = substantial, 1/750th ha = near perfect), but overlapping confidence intervals indicate that IRR differences were not statistically significant.
for each different sample size ( ).We calculated the power ( ) to detect change as 1-probability of a type II error (total number of tests with > 0.05) for each mean stem change and sample size combination.We repeated the entire sampling process, including generating "time 1" and "time 2" data, 100 times and report the mean level of power across all simulations.These power analysis steps are illustrated in Figure 1.

Results
Inter-rater reliability (IRR) among categorical variables was lowest for the DII (κ = 0.54), but agreement was perfect for both the number of microplots with indicator plants (κ = 1.0) and indicator richness (κ = 1.0; Figure 4a) at both sampling scales.There was no difference in IRR across scale for either indicator plant measure, but for the number of microplots with tree regeneration, IRR was higher at the 1/2500th ha scale (κ = 0.89) than the 1/750th ha scale (κ = 0.72).There was no difference in agreement classification for total counts of all tree seedlings across scale (Figure 4b); agreement was substantial for both the 1/2500th ha (CCC = 0.96) and 1/750th ha (CCC = 0.98).There were discrepancies in agreement category across the 3 remaining tree seedling counts across scale (1/2500th = substantial, 1/750th ha = near perfect), but overlapping confidence intervals indicate that IRR differences were not statistically significant.
Power was high (≥0.90)only when the sample size was ≥40 for the largest mean stem change (2.0 stems per plot), and differences in stem counts between the time 1 and time 2 data sets were detected, on average, at a rate of 95% for the 1/2500th ha scale and a rate of 91% for the 1/750th ha scale (Figure 5).For a 1.5 mean stem change, power was comparable and equaled or exceeded 90% when the sample size was ≥50, regardless of scale.For a 1.0 stem change, power reached 90% when sample size was 60 for the 1/2500th ha scale and 80 for the 1/750th ha scale.Overall, power was moderate (<0.84) for the 0.5 mean stem change for both scales of data collection, and differences in stem counts were only detected at a rate of 84% for the 1/2500th ha scale and 76% for the 1/750th ha scale for the largest sample size (100).Generally, there was increased power to detect change in mean stem counts at the 1/2500th ha scale compared to the 1/750th ha scale, regardless of the amount of change modeled.Power was high (≥0.90)only when the sample size was ≥40 for the largest mean stem change (2.0 stems per plot), and differences in stem counts between the time 1 and time 2 data sets were detected, on average, at a rate of 95% for the 1/2500th ha scale and a rate of 91% for the 1/750th ha scale (Figure 5).For a 1.5 mean stem change, power was comparable and equaled or exceeded 90% when the sample size was ≥50, regardless of scale.For a 1.0 stem change, power reached 90% when sample size was 60 for the 1/2500th ha scale and 80 for the 1/750th ha scale.Overall, power was moderate (<0.84) for the 0.5 mean stem change for both scales of data collection, and differences in stem counts were only detected at a rate of 84% for the 1/2500th ha scale and 76% for the 1/750th ha scale for the largest sample size (100).Generally, there was increased power to detect change in mean stem counts at the 1/2500th ha scale compared to the 1/750th ha scale, regardless of the amount of change modeled.

Discussion
Vegetation monitoring programs with pre-survey training contribute to increased data consistency [18].Following training, our field crews were consistent in their evaluation of quantifiable, measurable habitat characteristics including counts of tree seedlings, counts of microplots with tree regeneration or indicator plants, and indicator richness.The same was true with

Discussion
Vegetation monitoring programs with pre-survey training contribute to increased data consistency [18].Following training, our field crews were consistent in their evaluation of quantifiable, measurable habitat characteristics including counts of tree seedlings, counts of microplots with tree regeneration or indicator plants, and indicator richness.The same was true with counts of microplots with tree regeneration at the 1/2500th ha scale, but there was a significant reduction in reliability at the 1/750th ha scale demonstrating higher rates of observer error even with extensive training on data collection protocols.Of all the categorical metrics evaluated, the DII was the least reliable (there was only a moderate level of agreement between field crews in their DII scores), suggesting that data derived from actual field measurements is inherently more dependable than subjective indices derived from general plot observations.The difference in reliability across measured and subjective categories is seen in other disciplines, in which subjective categorical indices have lower measures of inter-rater reliability [52][53][54][55][56].
Counts of tree seedlings, regardless of count type, were reliable between field crews.Agreement was substantial to almost perfect, and there was no significant difference between reliability across scale, supporting the use of tree seedling counts as a consistent way to track changes in tree regeneration over time.Furthermore, crews consistently counted tree seedlings regardless of height or taxon, indicating that both tree seedling identification and height class counts were comparable.Reliability of tree seedling counts by taxon suggests that changes in species composition and/or seedling height classes could also be monitored in addition to overall seedling counts to consistently evaluate shifts in tree regeneration as a part of a long-term monitoring program.Plant communities are complex entities, and taxa within a community rarely respond uniformly to changes in their environment [57,58].The availability of several metrics to evaluate changes in tree seedling regeneration could better capture the mechanisms behind these shifts or provide a more complete picture of tree seedling responses [30,59,60].In sum, there was substantial inter-rater reliability across all vegetation metrics yielding confidence in the consistency of data collection across field crews.Additionally, there was no support for increased reliability of vegetation survey metrics at smaller sampling scales (H1), because there was no difference in inter-rater reliability across scale for 6 of the 7 metrics evaluated.
As expected, power to detect changes in tree seedling stem counts increased with increasing sample size and magnitude of change modeled [61].Across each mean stem change, statistical power was greater at the smaller sampling scale than the larger one, signaling a positive relationship between observer error and scale.Reductions in statistical power across scale are exclusively attributed to observer error because mathematically, only the error term varied across scale in the power analysis.A reduction in statistical power signals that there was more variability in stem counts between crews at the larger scale, supporting our hypothesis (H2).Our study design does not limit survey time at each location, because limitations on survey time have been shown to increase observer error [62].Instead, the increased probability of finding more species, combined with higher cumulative counts of abundant vegetation (including seedlings), likely explains increases in observer error at the 1/750th ha sampling scale.Crews are more likely to miss or double count stems in dense vegetation, especially when seedling abundance is high.Overall, the effectiveness of the smaller 1/2500th ha plot size at detecting changes in tree seedling counts indicates that it is an adequate scale of data collection for our study.Despite studies quantifying observer error as a part of long-term vegetation monitoring or citizen science projects [3,18,63,64], the incorporation of observer error into the analysis of plant data is a relatively new technique in plant ecology [65,66].To our knowledge, despite past studies assessing the effects of plot size on ordination techniques [67], variances in basal area estimates [68], or species constancy and species richness [69,70], this is the first study to evaluate the effect of scale on power to detect changes in understory vegetation.
The power analysis suggests that a relatively large sample size (n ≥ 40) is needed to detect a modest 2.0 stem change in tree seedling counts per plot.This threshold is above the general ecological rule of thumb of 30 sampling sites for ecological studies [36,71], suggesting high inter-plot variability in stem counts across the study area.This is not surprising given the wide geographic area covered (~100 km 2 ), and the variation in habitat types encompassed by our random design [72].However, past CFI data suggests that most changes in mean stem counts between sampling intervals across all state forest lands are less than 2.0 stems per plot (Figure A1), suggesting that even larger sample sizes (n ≥ 60) may be required to detect differences in tree seedling counts over time.Due to the inherent variability in vegetation characteristics across sampling sites, the effect of observer error on detecting changes in understory vegetation is amplified.Reducing observer error and sampling at smaller scales will improve inter-rater reliability and statistical power.

Conclusions
Vegetation monitoring programs designed to intensely monitor understory vegetation data (including seedlings and herbaceous plants) require consistent data collection methods to monitor changes to understory plant diversity and abundance [62].As this study demonstrates, smaller sampling scales can improve statistical power through improved consistency in data collection, but the required sample size to detect change was high (n > 40).The use of 1/2500th ha plots and objective, reliable vegetation metrics (like tree seedling counts and the presence of phyto-indicators) suggests that the Bureau of Forestry's VIP is likely to detect changes in deer impact in areas enrolled in DMAP.However, the PGC's reliance on the FIA DII to make deer management decisions across larger geographic areas may prove problematic, because the DII is a subjective measure prone to only moderate levels of inter-rater reliability.Revising the DII to incorporate more reliable vegetation metrics may improve its inter-rater reliability, but it is more likely that objective vegetation measures (like FIA tree seedling stem counts) will better detect changes in deer impact across PGC wildlife management units.Revising decision models to incorporate direct FIA vegetation measures would help the PGC make better deer management decisions that align with their objectives.
The relatively high levels of inter-rater reliability between field crews across both categorical and continuous (count) data metrics highlight the importance of quantifiable, measurement-based data collection protocols.Alternatively, the reliability of the only subjective, observational metric (the DII) was low, raising concerns about its ability to track changes in deer browsing effects through time.Our large, landscape-level sampling areas required 10 additional sampling locations than recommended for ecological studies due to the high variability in vegetation conditions between sampling locations [71].This heterogeneity among sampling locations indicates that other studies of similar size and vegetation composition may require larger sample sizes to detect vegetation changes through time.Identifying specific project objectives and designing sampling protocols to achieve those objectives is recommended for other studies considering long-term ecological monitoring of vegetation conditions to track responses to changes in deer browsing pressure.

Figure 1 .
Figure 1.Illustration of all power analysis steps including simulations of each sample size (n = 20, 30, 40, 50, 60, 70, 80, 90, 100) and mean stem change (m = 0, 0.5, 1.0, 1.5, 2.0) scenario, beginning with data simulation and proceeding to all other steps in order following diagram arrows.All sample size and mean stem change scenarios were replicated 100 times to calculate a mean level of power for each group.

Figure 1 .
Figure 1.Illustration of all power analysis steps including simulations of each sample size (n = 20, 30, 40, 50, 60, 70, 80, 90, 100) and mean stem change (m = 0, 0.5, 1.0, 1.5, 2.0) scenario, beginning with data simulation and proceeding to all other steps in order following diagram arrows.All sample size and mean stem change scenarios were replicated 100 times to calculate a mean level of power for each group.

Forests 2018, 9 ,
x FOR PEER REVIEW 5 of 17Following training, crews independently collected data at different plot locations across three state forests: The Susquehannock State forest in the Appalachian Plateau Province of north-central Pennsylvania, and the Rothrock and Bald Eagle State forests located in the Ridge and Valley Physiographic Province of central Pennsylvania.In total, each crew visited either 68, 24, or 59 plot locations over the summer sampling season, depending on their crew assignment.Two of those crews in the Rothrock and Bald Eagle State Forests collected data at 13 overlapping sampling sites independently so that we could assess inter-rater reliability.Sampling locations for all 3 crews are shown in Figure2.

Figure 2 .
Figure 2. Map of the study area and Pennsylvania land cover (agriculture, forest, urban, and water).The study areas are highlighted by dark gray polygons (Susequhannock State Forest South, (a) bottom; Susquehannock State Forest North, (a) top; Rothrock State Forest, (b) left; Bald Eagle State Forest, (b) right), and are located on state forest land (light polygons).White circles represent plots sampled by crew 1 only (11), black circles represent plots sampled by crew 2 only (46), and white circles with black dots represent plots sampled independently by both crews (13).Black triangles represent additional plots sampled in the northern study area in 2015 (68).

Figure 2 .
Figure 2. Map of the study area and Pennsylvania land cover (agriculture, forest, urban, and water).The study areas are highlighted by dark gray polygons (Susequhannock State Forest South, (a) bottom; Susquehannock State Forest North, (a) top; Rothrock State Forest, (b) left; Bald Eagle State Forest, (b) right), and are located on state forest land (light polygons).White circles represent plots sampled by crew 1 only (11), black circles represent plots sampled by crew 2 only (46), and white circles with black dots represent plots sampled independently by both crews (13).Black triangles represent additional plots sampled in the northern study area in 2015 (68).

Figure 4 .
Figure 4. Inter-rater reliability (IRR) assessments of 4 categorical vegetation metrics (a) compared with Cohen's Kappa (κ), and 4 continuous vegetation metrics (b) compared with Lin's concordance correlation coefficient (CCC).Gray squares indicate values for the 1/2500th ha scale, black circles indicate values for the 1/750th ha scale, and the gray diamond indicates a plot-level metric.Dashed and gray lines represent 95% confidence intervals for 1/750th and 1/2500th ha point values, respectively.

Figure 4 .
Figure 4. Inter-rater reliability (IRR) assessments of 4 categorical vegetation metrics (a) compared with Cohen's Kappa (κ), and 4 continuous vegetation metrics (b) compared with Lin's concordance correlation coefficient (CCC).Gray squares indicate values for the 1/2500th ha scale, black circles indicate values for the 1/750th ha scale, and the gray diamond indicates a plot-level metric.Dashed and gray lines represent 95% confidence intervals for 1/750th and 1/2500th ha point values, respectively.

Figure 4 .
Figure 4. Inter-rater reliability (IRR) assessments of 4 categorical vegetation metrics (a) compared with Cohen's Kappa (κ), and 4 continuous vegetation metrics (b) compared with Lin's concordance correlation coefficient (CCC).Gray squares indicate values for the 1/2500th ha scale, black circles indicate values for the 1/750th ha scale, and the gray diamond indicates a plot-level metric.Dashed and gray lines represent 95% confidence intervals for 1/750th and 1/2500th ha point values, respectively.

Figure 5 .
Figure 5. Statistical power to detect mean tree seedling stem change across sample size (20 to 100) and scale (1/2500th ha and 1/750th ha) based on simulated tree seedling stem counts with added sampling (observer error).Shapes indicate different levels of mean tree seedling stem change added to initial plot stem counts (circle = 0.5, square = 1.0, diamond = 1.5, and triangle = 2.0), while color indicates scale (gray = 1/2500th ha, and black = 1/750th ha).Dashed black lines and solid gray lines represent the standard deviation of each point estimate for the 1/2500th and 1/750th ha, respectively.

Figure 5 .
Figure 5. Statistical power to detect mean tree seedling stem change across sample size (20 to 100) and scale (1/2500th ha and 1/750th ha) based on simulated tree seedling stem counts with added sampling (observer error).Shapes indicate different levels of mean tree seedling stem change added to initial plot stem counts (circle = 0.5, square = 1.0, diamond = 1.5, and triangle = 2.0), while color indicates scale (gray = 1/2500th ha, and black = 1/750th ha).Dashed black lines and solid gray lines represent the standard deviation of each point estimate for the 1/2500th and 1/750th ha, respectively.

Table 1 .
Forest Inventory and Analysis (FIA) deer impact index assessment criteria.