A Review of Operational Ensemble Forecasting Efforts in the United States Air Force

Abstract: United States Air Force (USAF) operations are greatly influenced and impacted by environmental conditions. Since 2004, USAF has researched, developed, operationalized, and refined numerical weather prediction ensembles to provide improved environmental information for mission success and safety. This article reviews how and why USAF capabilities evolved in the context of USAF requirements and limitations. The convergence of time-lagged convection-allowing ensembles with inline diagnostics, algorithms to estimate the sub-grid scale uncertainty of critical forecasting variables, and the distillation of large quantities of ensemble information into decision-relevant products has led to the acceptance of probabilistic environmental forecast information and widespread reliance on ensembles in USAF operations worldwide.


Introduction
Weather impacts United States Air Force (USAF) operations worldwide, both via threats posed to life and property and via advantages conferred (with superior understanding compared to an adversary) in the sphere of conflict.
The canonical example of the latter is the D-Day invasion [1] in June 1944 during WWII, when in the midst of an overall stormy pattern that led Nazi leaders in Europe to let down their guard, Allied forecasters found a brief window of adequate weather for attack. The Allies both reduced risks to property and safety and improved their likelihood of mission success by exploiting the opportunities gained from superior weather information.
An example of failure occurred during Operation Eagle Claw in April 1980 [2] when mission planners and meteorologists did not anticipate and account for convectively generated dust storms that could not be seen by the satellites of that era. The zero-visibility conditions were the primary factor that led to the mission being aborted, indirectly led to eight fatalities when a helicopter later crashed into a transport aircraft, and arguably cost a US President his re-election bid [3].
An unpublished study of 266 major battles from 1479 BC to 2003 AD [4] found that weather was a factor in the outcome of 36% of them, evidence that the D-Day and Operation Eagle Claw scenarios were not isolated in history. Generating superior environmental information, and also appropriately exploiting it for advantages in conflict, is therefore integral to USAF mission success.
Since the days of WWII, Numerical Weather Prediction (NWP) has provided great advancements in weather forecasting skill, but until recently weather hazard probabilities were usually inferred from deterministic NWP output by the forecaster or decision maker based on their own subjective experiences, per the US National Research Council's report "Completing the Forecast" [5].
As stated in that report: "...some operators may refuse to accept weather forecasts worded in probabilistic terms...Unfortunately, this attitude makes the forecaster - not the operator - the decision-maker." The culture of deterministic thinking for weather in the USAF is rooted deeply both in the traditions of how weather information has been presented historically, and in the preference of some decision makers to push decision responsibility to the weather forecaster. This culture is contrary to the likely benefits of a probabilistic and risk-based approach to weather for reducing costs and increasing mission effectiveness [10][11][12][13]. What is the magnitude of possible benefits that may be unrealized due to these traditions?
As of 2021 the US military overall has $1.7 trillion in weapons system assets [14] and $1 trillion in facilities [15] that have varying degrees of exposure to weather risk. As a thought experiment applying the Lazo et al. [16] estimate of 3.4% weather sensitivity for US economic activity to the combined $2.7 trillion of US military assets, roughly $92 billion could be sensitive to weather.
More tangible examples of weather exposure costs include $5 billion of damage to Tyndall Air Force Base in Florida from 2018's Hurricane Michael [17], up to $2 billion in damage to 17 F-22 aircraft undergoing maintenance at Tyndall during Michael [18], $46 million in hail damage to aircraft that were not evacuated or sheltered in advance of a hail storm at Laughlin Air Force Base in Texas in 2016 [19], and $10 million in damage to unsheltered and partially sheltered aircraft from a tornado at Offutt Air Force Base.
Among the global ensembles combined in this effort was the USN's Navy Operational Global Atmospheric Prediction System (NOGAPS [41]). Simple algorithms were developed and applied to outputs from each of the ensemble members (all equally weighted) to estimate smaller-scale, high-impact phenomena such as lightning, hail, snowfall, turbulence [42], and cloud ceilings that were relevant to USAF decision thresholds. USAF primarily evaluated reliability/attribute diagrams [43] and Brier Skill Scores [44] in addition to subjective impressions of performance in a variety of case studies.

Results
Hacker et al. [38] contains a description of the myriad regional ensemble designs that were tested. Key results included:

− Ensemble Transform Kalman Filter methods generally performed better than Perturbed Observations.
− All methods of accounting for model uncertainty improved forecast skill at least marginally.
− Model physics diversity was critical for increasing/obtaining forecast skill in variables in the planetary boundary layer.
− Utilizing Stochastic Backscatter provided the highest degree of skill in variables aloft.
− Diverse physics, parameter perturbations, and backscatter employed together led to the most skillful ensemble.

Evaluations from USAF confirmed these findings [39] and added another: that the use of GEFS for regional ensemble initial conditions (ICs) was the primary contributor to forecast underdispersion (Figure 2) for the short-term sensible weather variables that USAF was most interested in. It was hypothesized that the initial condition spread in the global ensemble was perhaps optimized for appropriate dispersion at longer lead times and larger scales, as had been observed in the European Centre for Medium-Range Weather Forecasts ensemble [45], and that a new methodology for generating initial condition diversity was needed before this capability could become operational. Regardless, as expected, the regional ensemble was found to resolve finer scale features that the global ensemble could not (Figure 3), promising superior forecast utility if the underdispersion problem could be addressed.

Figure 2. Rank histogram of 48 hour ensemble wind speed forecasts from multi-physics WRF-based ensemble systems using GEFS ICs with no data assimilation (NoDA), model parameter perturbations (MP), and Ensemble Transform Kalman Filter (ETKF) data assimilation. DGFS used GEFS ICs but with the same WRF configuration for all members, serving as a baseline for comparison.

The global ensemble was found to perform reliably and skillfully on the larger scales that it could resolve.
The technique of combining ensembles from multiple centers served both to increase membership and diversity (Figure 4), and long-range forecasts of high impact events like a blizzard during Christmas 2009 (Figure 5) were sometimes provided with a level of confidence that would never be possible solely with deterministic models, given their typical error characteristics. This result supported the initial hypothesis of Palmer and Tibaldi [46] that forecast skill could indeed be forecast using ensembles. Due to these positive results, plans were made to formally operationalize this blended ensemble into the Global Ensemble Prediction Suite (GEPS).
User training and outreach was the most successful part of the development effort. A webpage was set up with real-time forecast products, and training modules were developed and socialized. Training was most successful when one person in a forecasting unit was motivated to embrace and incorporate the new ideas into his or her practice, and thus served as an information bridge between the USAF ensemble experts and the forecasters in their unit. Numerous requests for product enhancements were fulfilled, with feedback loops established for ongoing refinements to better meet needs. A skeptic of the project set out to prove it was wasted effort by exploring the (presumed lack of) webpage visits, but instead was turned into a proponent when he or she saw its popularity. While the improved information on forecast confidence from the ensemble was certainly a part of this success, the authors believe that the key was developers utilizing "teal" practices [47] to be responsive to and collaborate with users to create mission-tailored ensemble products. Probability products for sensible weather parameters like lightning or wind gusts exceeding a key threshold were most popular, while statistical products like mean and spread were less so, similar to the findings of Evans et al. [48].
Despite these successes, the ensemble information was generally not making it all the way to the end decision; rather, forecasters were using it to refine their deterministic forecasts and to manage their time by identifying the areas that most needed their attention. Additionally, probability interpretation issues were revealed. The ensembles generally forecast probabilities on smaller spatial and temporal scales than the products forecasters were used to issuing, and forecasters would therefore misinterpret the ensemble probabilities as being too low. For example, probabilities of winds exceeding a certain threshold were output hourly from the ensemble, but warnings for strong winds were typically issued for periods covering a multi-hour event. Any exceedance during the warning period verified it, but the ensemble output covered the probability for just a one-hour output interval, not the warning interval. This unveiled a conundrum: is it better to output at frequent intervals to enable refinement of event timing, or over longer intervals so that probabilities more closely match observed relative frequencies over typical warning time intervals? This issue was explored more in later efforts. By 2008, even though the capability was a prototype, forecasters were using it to inform their operational decision making (forecaster quotes in Kuchera et al. [39]).
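The interval mismatch can be made concrete with a minimal sketch (function names are illustrative, not from the USAF system). If hourly exceedances were statistically independent, the warning-interval probability would be far higher than any single hourly value; under full dependence it would equal the hourly maximum, and the true interval probability lies somewhere between these bounds:

```python
def interval_prob_independent(hourly_probs):
    """P(at least one exceedance in the interval), assuming hourly
    exceedances are statistically independent."""
    p_none = 1.0
    for p in hourly_probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def interval_prob_dependent(hourly_probs):
    """P(at least one exceedance), assuming full dependence between
    hours: the interval probability equals the largest hourly value."""
    return max(hourly_probs)

# Six consecutive hours, each with a 10% chance of exceeding a wind threshold
hourly = [0.10] * 6
lo = interval_prob_dependent(hourly)    # 0.10
hi = interval_prob_independent(hourly)  # 1 - 0.9**6, about 0.47
```

Reading the 10% hourly values as the chance of an event during the six-hour warning period understates an interval probability that, under independence, approaches 47%, which is the misinterpretation described above.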

Methods
Following the hypothesis that GEFS ICs were under-dispersive for short term regional runs, there was a significant amount of investigation in 2009-2010 on how best to generate ICs for the mesoscale ensemble. USAF ultimately changed its configuration to use just three ICs: deterministic runs of GFS, GEM, and the United Kingdom's Unified Model (UM [49]), which had just recently been installed and run operationally by USAF, dubbed the Global Air-Land Weather Exploitation Model (GALWEM). This IC generation methodology was identical to another being developed and evaluated at the same time for the German Weather Service (DWD [50]). The choices of WRF physics packages for the ensemble members [51,52] were changed periodically, but the philosophy of combining them in unique configurations of WRF to maximize dispersion of sensible variables of interest to USAF was unwavering. For simplicity of maintenance and concern about realism, the regional ensemble did not use physics parameter perturbations or backscatter. Ensemble data assimilation, while showing promise in initial tests, had been computationally unstable during tropical cyclones and for high resolution runs and was subsequently scrapped.
These choices made the USAF regional ensemble akin to a "poor man's ensemble" [53] where rather than systematically designing an ensemble system to account for known uncertainties, the ensemble was now basically a collection of deterministic runs. While perhaps sacrificing some forecasting skill by abandoning advanced ensemble design techniques, ease of sustainment and maintenance in operations demanded a simpler design.
The most impactful event thus far in the USAF ensemble journey was the visit to the NOAA Hazardous Weather Testbed (HWT) in 2008 for their annual Spring Forecasting Experiment. Each spring, forecasters and researchers gather to jointly evaluate and discuss novel models and techniques for convective forecasting. Two critical advancements for future USAF use were in their initial evaluation stages during this experiment. The first was the use of convection-allowing (i.e., convective parameterization disabled to allow the model to explicitly simulate convective processes) ensembles at 4-km resolution for thunderstorm forecasting [54][55][56][57]. The second was the use of "inline" diagnostics calculated at each time step (~30 sec) in the model integration that captured key details about the strength and character of the convection as it evolved [58][59][60]. Both of these addressed two critical USAF needs: improved forecasting of convection (e.g., Eagle Claw) and decision-tailored information about that convection (e.g., peak gust strength, lightning/hail potential). With assistance from the HWT group, USAF rapidly imported these capabilities and began generating prototype 4-km ensemble forecasts in the US and in convectively prone areas overseas like southwest Asia.
Two key challenges to operationalize the 4-km ensembles were high computational cost and writing large output files in a timely manner. The first challenge was overcome by reducing the vertical resolution of the member runs substantially, to between 21 and 27 levels depending on the member [52]. Aligo et al. [61] found that convective precipitation forecast skill was similar between 31 and 62 levels, highlighting this as an area of potential cost savings to explore and exploit. Additionally, the WRF executable was compiled to run 20% faster at the expense of bit-for-bit reproducibility. For generating ensemble members, small variations due to computational shortcuts are actually a desirable trait.
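Why giving up bit-for-bit reproducibility is tolerable (and even useful for ensembles) follows from the non-associativity of floating-point arithmetic: reordering operations, as aggressive compiler optimizations may do, changes results only in the trailing bits. A toy demonstration in Python, standing in for the compiler reordering in the faster WRF build:

```python
# IEEE-754 double precision addition is not associative, so optimizations
# that reorder arithmetic change results in the trailing bits -- harmless
# for a single run, but not bit-for-bit reproducible.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)        # False: the two sums differ in the last bit
print(abs(a - b))    # a difference on the order of 1e-16
```

For ensemble members, such tiny perturbations act like small initial-condition noise, which is why the authors note they are actually a desirable trait.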
The large file challenge was addressed by utilizing the "quilting" option in WRF (which uses a designated number of processors to perform file writes at output times in the background while the model integration proceeds). In addition, the output variable list was reduced to the bare minimum by deriving input-heavy algorithms (e.g., winter precip type, radar reflectivity, cloud cover) as inline diagnostics, extending the techniques learned at HWT 2008 [62]. This eliminated a standard deterministic post-processor used for each member, as simpler derivations (e.g., wind gust, wind chill) not already derived as inline diagnostics, could be calculated in the ensemble post-processor when probabilistic threshold variables were calculated.
While NWP ensembles are designed to account for uncertainties that exist on resolved scales, uncertainties also exist on unresolved (sub-grid) scales, and these uncertainties/scales often produce sensible weather that greatly impacts USAF operations. To more reliably forecast sensible weather variables that were not explicitly resolved, the Weibull [63,64] distribution was selected to enable output of a tailorable distribution of possible values for each member's forecast.
Take for example an algorithm to predict snow accumulation. A deterministic algorithm (e.g., 10 cm of snow for every cm of liquid precipitation output from the model) does not account for the uncertainty inherent in predicting snow accumulation with varying ice crystal habits, melting, compaction, etc. [65]. By subjectively specifying the parameters that describe a Weibull probability distribution function (PDF), this uncertainty can be estimated and accounted for to make probabilistic predictions of exceeding specific snow accumulation thresholds (e.g., a 30% chance of seeing more than 11 cm of snow with 1 cm of liquid, a 10% chance of seeing more than 12 cm, etc.) for a single ensemble member. Most probabilistic variables in the USAF ensemble post-processor were calculated this way, as outlined in Creighton et al. [62] and Creighton [66]. Many variables were assigned Weibull parameters for a Gaussian-like distribution, while others were assigned parameters to modify the distribution shape, all subjectively based on developer experience with the variable. This approach is admittedly not rigorous, but follows from Weibull [64]: "It is believed that in such cases (distribution functions of random variables) the only way of progressing is to choose a simple function, test it and stick to it as long as none better has been found." This was the approach taken to provide a first-order estimate of sub-grid scale uncertainty for key variables. The ultimate result is that each member, rather than contributing a "yes" or "no" vote to the final ensemble probability of exceeding a threshold, instead contributes a probability that is averaged with all other member probabilities to comprise the final ensemble probability. This methodology accounts both for sub-grid scale and (as part of an ensemble) flow-dependent uncertainties and was expected to improve forecast reliability and skill.
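The snow example above can be sketched as follows. This is an illustrative reconstruction, not the operational code: the shape value and the mean-to-scale relation (scale = mean / Gamma(1 + 1/shape)) are assumptions standing in for the subjectively tuned Weibull parameters in Creighton et al. [62].

```python
import math

def weibull_exceedance(threshold, mean, shape):
    """P(X > threshold) for a Weibull distribution specified by its mean
    and shape; the scale is solved from mean = scale * Gamma(1 + 1/shape)."""
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return math.exp(-((threshold / scale) ** shape))

def ensemble_prob(member_means, threshold, shape=4.0):
    """Each member contributes a Weibull exceedance probability; the final
    ensemble probability is the simple average over members."""
    probs = [weibull_exceedance(threshold, m, shape) for m in member_means]
    return sum(probs) / len(probs)

# Hypothetical members: 0.8-1.2 cm of liquid with a 10:1 snow-to-liquid ratio
snow_means = [10.0 * q for q in (0.8, 0.9, 1.0, 1.1, 1.2)]
p11 = ensemble_prob(snow_means, threshold=11.0)  # P(snow > 11 cm)
p12 = ensemble_prob(snow_means, threshold=12.0)  # P(snow > 12 cm)
```

Each hypothetical member thus contributes a full distribution of snow outcomes rather than a single yes/no vote, and the ensemble probability is the average of the member probabilities, as described above.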
Previous work used deterministic threshold values of diagnostic parameters (e.g., CAPE, updraft helicity) to determine if a discrete severe weather event (e.g., tornado, large hail) should be forecast from a single ensemble member [59,67,68]. In reality, the probability of occurrence of a sub-grid scale weather event will not likely jump from 0 to 1 as the environmental conditions pass a diagnostic threshold value. To attempt to address this more realistically, empirically derived algorithms were created not to predict sub-grid scale event occurrence, but instead its magnitude (e.g., maximum tornado wind speed or number of lightning strikes in a given area over a period of time) using inputs from larger scale diagnostics. Then, by empirically shaping the Weibull PDF for those variables, any threshold probability value could be obtained (e.g., the probability of a tornado exceeding EF1 wind speed, the probability of more than 5 or 10 lightning strikes in a given area over a period of time) for each ensemble member. Details can be found in Creighton et al. [62].
Additionally, forecasting a very rare event like a tornado in a small area (i.e., a 4-km box) at a given time will almost always result in very low probabilities that can be difficult to apply to decision making (as was learned in user feedback during the development era). Thus, probabilities for lightning, hail, supercell, and tornado were artificially upscaled to decision-tailored ranges of occurrence within 10 and 20 nautical miles (NM). Since there was an unknown degree of dependence on the probability of occurrence from grid box to grid box in the larger area, the probability in the larger area for each member was calculated twice. First, all grid points were assumed to be completely independent (i.e., the probability of occurrence is one minus all of the probabilities of "no" occurrence in the radius multiplied together), and then they were assumed to be completely dependent (i.e., the maximum probability in the upscaling radius is the probability for the total area). These two probabilities were then averaged to obtain the final probability of occurrence for that member in the 10 or 20 NM radius. Probability values from each member were then simply averaged to calculate the total ensemble probability in the radius.
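The two-bound upscaling calculation described above reduces to a few lines; names are illustrative rather than taken from the operational post-processor:

```python
def upscaled_member_prob(grid_probs):
    """One member's probability of at least one event within the radius:
    the average of the fully independent and fully dependent estimates."""
    p_none = 1.0
    for p in grid_probs:
        p_none *= (1.0 - p)
    p_indep = 1.0 - p_none      # grid boxes assumed completely independent
    p_dep = max(grid_probs)     # grid boxes assumed completely dependent
    return 0.5 * (p_indep + p_dep)

def ensemble_upscaled_prob(member_grid_probs):
    """Final ensemble probability: simple average over member values."""
    vals = [upscaled_member_prob(g) for g in member_grid_probs]
    return sum(vals) / len(vals)

# Hypothetical member: 20 grid boxes in the radius, each at 2% probability.
# The result falls between the dependent (0.02) and independent (~0.33) bounds.
p_member = upscaled_member_prob([0.02] * 20)
```

The averaging of the two bounds is a pragmatic compromise for the unknown spatial dependence between grid boxes, exactly as the text describes.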
Parallel efforts were undertaken during this time to improve dust forecasting due to ongoing hostilities in Iraq and Afghanistan. Source region and saltation/lofting improvements highlighted in Hunt et al. [69] and LeGrand et al. [70] were combined with convection-allowing ensemble runs in an attempt both to highlight areas where convective outflows would generate severe dust storms in Iraq, and where airborne dust would flow in areas of highly variable terrain in Afghanistan.
In this era, frequent requests for new mission-specific probability thresholds highlighted that an on-demand capability was required to provide the potentially limitless combinations of variables, thresholds, and time periods that could be of interest. A prototype capability called iPEP (interactive Point Ensemble Probability, an enhancement of an existing static PEP product) was developed that consisted of a front-end interface that allowed a user to choose their variable, threshold, and time period of interest (solving the output interval dilemma discussed earlier), and a back-end that would extract the relevant variables from all ensemble members and make tailored probabilistic calculations based on the inputs. It also could generate joint probabilities of seeing all of 2 or more thresholds met, or any of 2 or more thresholds met, depending on the request. Details on the methodology including statistical correlation assumptions for multi-variable joint probability calculations can be found in Creighton [66].
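A simplified sketch of the joint-probability logic follows. The operational iPEP applied statistical correlation assumptions detailed in Creighton [66]; the version below assumes independence between variables, so it is only a bracketing approximation:

```python
def prob_all_met(probs):
    """P(every threshold met), assuming the variables are independent."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def prob_any_met(probs):
    """P(at least one threshold met), assuming independence: one minus
    the probability that none of the thresholds are met."""
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

# e.g., a 40% chance of gusts over a threshold and a 30% chance of lightning
both = prob_all_met([0.4, 0.3])     # 0.12 under independence
either = prob_any_met([0.4, 0.3])   # 0.58 under independence
```

Correlated variables (gusts and lightning from the same storm, say) would push the true "all of" probability above and the "any of" probability below these independence values, which is why the operational calculation needed correlation assumptions.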

Results
The combination of improved diversity in ICs, convection-allowing resolution, and reliable algorithms tailored to mission impacts resulted in a regional ensemble capability that was well-received by users, and showed useful skill across many variables (examples follow). Together with GEPS, this new Mesoscale Ensemble Prediction Suite (MEPS) was operationalized on 28 March 2012 [52], one of several CAM ensemble systems being operationalized around this time [71,72]. Four case studies are selected to highlight how MEPS provided environmental intelligence guidance for key events during this era.

US Tornado Outbreaks-April 2011 and 2012
In February 2011, a relocatable 4-km mesoscale ensemble covering a portion of the US had been designed using ICs from deterministic runs of global models (as outlined earlier) instead of members of GEFS. This ensemble was run in the southeastern US in advance of the deadliest day for tornadoes in 75 years [73]. The CAM ensemble was able to simulate updraft helicity tracks [58], and in conjunction with the newly developed USAF tornado algorithm [62] was able to produce hourly probabilities of seeing a tornado within a 20 NM radius (Figure 6). Analysis of this event (Table 1) found that non-zero tornado probabilities were forecast for every tornado that occurred in the model domain within 20 nautical miles of its occurrence on 27 April 2011. Hourly probabilities exceeding 20% were observed over a broad area with 42 h lead time (Figure 7), providing significant notice for decision makers to take protective action.

Figure 6. PEP bulletin from the 27 April 0600 UTC run of the 4-km regional ensemble valid for Birmingham, AL. Variables listed on the left hand side, with probabilities (percent) of exceedance for each valid time in rows to the right of the label. Yellow and red coloring designed to draw user attention to higher probabilities. Note the elevated potential for lightning, strong surface winds, large hail, and tornadoes.

[74] in the morning when a high-impact tornado event was forecast later in the day. A tornado struck the base [75] late that evening.
Iraq Dust Storm-August 2009

A maximum observed gust of 27 m/s lofted substantial dust (pink dust enhancement from Ashpole and Washington [76]), leading to zero visibility conditions in and around Balad Air Base, Iraq. An aircraft attempting to land crashed, killing Spc Michael Shane Cote Jr. and injuring 12 others. USAF had just set up an experimental 4-km ensemble domain over Iraq in August 2009 with GEFS ICs. Though still underdispersed, about half the members indicated convection and outflow winds similar to what was observed (Figure 9), while the other half failed to generate any convection (not shown). Dust was not explicitly modeled in the CAM members at this time. Operational dust forecasting tools were fed by winds from coarser, non-CAMs and thus did not show any dust potential (not shown). A solely deterministic CAM run in this case may not have captured the convection and outflow winds, given the variation in ensemble forecasts.

This case was the catalyst to use the WRF-CHEM [77] model with six dust size bins turned on [70] to evaluate CAM performance for convectively generated dust storms. Results (Figure 10) for dust with a 4-km CAM run were realistic and substantially different from the otherwise equivalent 12-km run with convective parameterization turned on. This advantage stands in contrast to the lack of surface observations in this region and the fact that convective anvils can block satellite diagnosis of dust. Thus, the ability to model both the initiation of convectively generated dust and its evolution long after the convection has ended has been a valuable modeling capability for USAF since its inception in operations in 2012, and will hopefully ensure that another Operation Eagle Claw failure [2] does not happen again.

Derecho-29 June 2012
On 29 June 2012 a derecho developed in Indiana and moved at 27 m/s east-southeast to the Atlantic coast causing 4 million power outages and 13 deaths over a 1000-km path [78]. Another 34 deaths were attributed to heat in areas where power outages were prolonged. Warm season derechos are often weakly forced [79] making them a difficult forecast challenge. A review by Furgione [78] noted that this event was "not well forecast in advance" because the primary US operational models at that time were not convection allowing. In the USAF, the 4-km MEPS in operations over CONUS had several members (but only a minority) simulating a derecho with 17 h of lead time (Figure 11). Note that the ensemble also correctly indicated a severe wind threat from another system that developed in the wake of the derecho, and had a "false alarm" derecho to the southeast of the main one that occurred. This case highlights the criticality of both having CAMs to enable the simulation of complex upscale convective growth, and also an ensemble to account for the sometimes broad envelope of possible non-linear atmospheric responses to subtle forcing.

Afghanistan Dust in Variable Terrain
While CAMs are intended to simulate convective processes more explicitly, the finer resolution is also valuable for simulating complex terrain flows. On 8 August 2011, an area of dust moved southwestward from the northern plains area of Afghanistan into the Amu Darya river basin, disrupting USAF efforts during Operation Enduring Freedom. Weather forecasters supporting those efforts recorded that many ensemble members (now using WRF-CHEM to model dust) accurately simulated this flow of dust, as indicated by the 30-50% probability of visibility less than 3 miles forecast (Figure 12). Those forecasters noted that no other capability (e.g., Barnum et al. [80]) showed any dust in this area, which they attributed to the 4-km resolution of the underlying terrain (Figure 12). As in the other cases noted earlier, an ensemble was necessary, as not all members simulated the event of interest accurately, but very fine resolution was also necessary to simulate the physical processes in the terrain. Forecasters reported that there had been four such operationally meaningful visibility-reducing events that summer that only the 4-km ensemble had been able to anticipate [81].

External Studies
During the period 2012-2015, USAF contributed its 4-km MEPS over the US to the HWT Spring Forecasting Experiment, where it was evaluated alongside other ensembles under development, all of which were shown to have useful skill in predicting severe weather [82]. Of six CAM ensembles evaluated in 2015, MEPS scored in the middle of the group [83], ranking anywhere from 2nd to 5th depending on the metric. The 4-km MEPS was also evaluated in the Hydrometeorological Testbed at the Hydrometeorological Prediction Center in early 2012, where value was noted in resolving localized snowfall and cold-air-damming details in areas of variable terrain, but a high bias in QPF and snowfall totals was also reported [84].
Adams-Selin [85] examined characteristics of each ensemble member for a strongly forced squall line, noting high sensitivity in the strength and evolution of the squall line for different planetary boundary layer physics schemes, despite the strong forcing. Ryerson [86] found that for an ensemble system designed to closely match the one used by USAF, a nighttime warm bias led to an ensemble-wide lack of near-surface cloud water in many observed fog situations, but that post-processing could largely remedy the issue. Guyer and Jirak [87] also looked at US severe weather cases in the cool season for MEPS and another CAM ensemble [88], noting that both were able to provide important details on timing and intensity of high-impact severe weather events. Clements [89] evaluated GEPS and MEPS performance from April to October 2013, noting that winds and precipitation were more skillfully forecast in the higher resolution ensembles (especially in areas of variable terrain) and that the lightning algorithms were over-forecasting lightning probabilities due to a night-time high bias at the locations tested. Homan [90] found that the GEPS ensemble mean reduced the error on forecast inputs to long-haul flight fuel loading calculations by 10% for forecast hours 12-36 as compared to the GFS, enabling less contingency fuel to be loaded, lowering the aircraft weight and leading to fuel/cost savings.

Internal Studies
An iPEP prototype (Figure 13) was evaluated by a focus group on limited test computing resources. There has been widespread recognition by forecasters (e.g., US Army operations, remotely-piloted aircraft, base asset protection, US Aviation Weather Center) and operational leaders of the benefits of precisely tailoring ensemble information to specific mission decision thresholds. Still, sufficient resources to scale up to a fully operational capability have not yet been invested by USAF. An example of using the Weibull distribution to more reliably and skillfully predict PDFs of sub-grid scale phenomena can be seen in the prediction of 10-m winds and gusts over land [62]. The shift parameter (where the PDF for the variable begins, if not at zero) is simply the maximum sustained wind from the model over the output period (i.e., the gust has to be at least the sustained wind). The shape parameter is 3, which is near-Gaussian with some right skewness. The scale parameter is the maximum sustained wind raised to the 0.75 power, which causes a slow decrease in the gust factor as the sustained wind increases, approximating the observations of Davis and Newstein [91]. Brier Skill Scores were higher and forecasts more reliable for wind gusts (25-knot threshold) using the Weibull parameters than for sustained winds (15-knot threshold) using the uniform ranks method (Figure 14) for a one-month period for the 4-km ensemble in 2011.

An example of the probability upscaling of severe weather variables can be seen in Figure 15, where the probability of one or more lightning strikes within 4 km (algorithm patterned after results from McCaul et al. [92]) was upscaled to 10 and 20 nautical miles. This methodology has not been robustly evaluated, but the probability values have seemed reasonable in subjective comparisons to forecaster-generated probabilistic forecasts from the US Storm Prediction Center (Figure 15).
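The shifted-Weibull gust scheme described above can be sketched in a few lines. This is a minimal illustration, not the operational code: wind units in knots, the function names, the simple averaging of per-member exceedance probabilities into an ensemble probability, and the 10-member values are all assumptions for the example.

```python
import math

def gust_exceedance_prob(sustained_kt: float, threshold_kt: float) -> float:
    """P(gust > threshold) from a shifted Weibull distribution.

    Parameters follow the description in the text: the shift is the member's
    maximum sustained wind over the output period, the shape is fixed at 3
    (near-Gaussian with slight right skew), and the scale is the sustained
    wind raised to the 0.75 power.  Winds are assumed to be in knots.
    """
    shift = sustained_kt          # the gust cannot be below the sustained wind
    shape = 3.0
    scale = sustained_kt ** 0.75  # gust factor shrinks as sustained wind grows
    if threshold_kt <= shift:
        return 1.0                # sustained wind already exceeds the threshold
    z = (threshold_kt - shift) / scale
    return math.exp(-(z ** shape))  # Weibull survival function

def ensemble_gust_prob(member_sustained_kt, threshold_kt=25.0):
    """Ensemble probability: average the per-member Weibull exceedances."""
    probs = [gust_exceedance_prob(w, threshold_kt) for w in member_sustained_kt]
    return sum(probs) / len(probs)

# Hypothetical 10-member sustained-wind forecasts (knots) at one grid point:
members = [14, 16, 18, 20, 22, 17, 15, 19, 21, 13]
print(round(ensemble_gust_prob(members, threshold_kt=25.0), 3))
```

Because the scale grows only as the 0.75 power of the sustained wind, stronger sustained winds yield higher gust probabilities at a fixed threshold while the implied gust factor slowly decreases, as described above.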

User Feedback
During this period, users provided significant feedback on how ensembles were benefiting their decision making [51]. For instance, one unit reported that grounded aircraft during a dust storm were costing half a million dollars per day, but that the mesoscale ensemble suggested a narrow calmer period that allowed them to be successfully evacuated. Another unit reported a 16% improvement in warning issuance and a 24% reduction in false alarms after using 4-km MEPS during their convective season as compared to the previous season. Efforts made during the initial development phase to engage and involve users in design and testing led to greater acceptance of the capabilities, and a willingness to share results (both good and bad). The CAMs proved to be useful for convective forecasting, and for USAF forecasters working in areas without radar, the CAM output was often used as a surrogate for unobservable storm characteristics even as the convective event was happening [93].

Methods
During this era, more effort was put into systematically gathering and evaluating user perceptions, with formal surveys given along with explorations of product usage by examining web server data. Surveys were given via a link on the operational products webpage and users self-selected to respond. Questions covered what missions were supported and how those missions wished to receive forecast information (deterministically, probabilistically, or a mixture), and what modeling suites, products, and variables were most useful.
A novel idea, initially proposed in May 2013 and implemented on 20 July 2015, was a "Rolling" ensemble. This methodology leverages time lagging by running just a single member every 2 h, but including all runs from the last 30 h to create ensemble products. This is attractive operationally for a number of reasons. Foremost is the ease of scheduling on the supercomputer: each 2-hourly domain gets a fixed number of processors based on how long it takes to integrate the full model forecast in 2 h. Another benefit is the ability to weather an occasional system outage without a significant impact to the full ensemble (i.e., only losing one or two runs of 16). Trending information is also useful, as the results of more recent runs can be contrasted with the older runs in the ensemble. Finally, the ensemble always contains current information given the release of new runs every two hours, although the global initial and lateral boundary conditions are only updated every six hours (note: the capability initially had 3DVAR data assimilation, which made each run truly current, but this was stopped due to resource limitations). This methodology is operational today for USAF (a similar methodology is used by UKMO [94]), with two large domains covering most of the world at 20-km resolution run to 132 h, and fixed/relocatable 4-km domains run to 72 h.
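The member bookkeeping implied by this design can be sketched as follows. The function name and the assumption that cycles fall on even UTC hours are illustrative; the 2-h cycle and 30-h window come from the description above.

```python
from datetime import datetime, timedelta

def rolling_members(now: datetime, cycle_h: int = 2, window_h: int = 30):
    """Initialization times pooled into the time-lagged ("Rolling") ensemble.

    One member is launched every `cycle_h` hours, and every run from the
    last `window_h` hours is included, so losing one cycle drops a single
    member (1 of 16) rather than degrading the whole ensemble.
    """
    # Snap "now" back to the most recent cycle (assumes cycles on even UTC hours).
    latest = now.replace(minute=0, second=0, microsecond=0)
    latest -= timedelta(hours=latest.hour % cycle_h)
    return [latest - timedelta(hours=lag) for lag in range(0, window_h + 1, cycle_h)]

inits = rolling_members(datetime(2021, 3, 15, 13, 25))
print(len(inits))           # 16 members
print(inits[0], inits[-1])  # newest and oldest initializations
```

Each new 2-hourly run simply replaces the oldest member in the window, which is what makes the ensemble products continuously refreshable.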
To respond to user requests for improved short term convective forecasting to support training, a time-lagged rapid refresh ensemble using the High Resolution Rapid Refresh [95] system for ICs was evaluated over USAF training ranges [96] during the 2017 summer. These ranges are in areas of variable terrain, and increased accuracy in forecasts of timing/location of convection and attendant threats would enable the USAF to perform more training safely. Three WRF runs with varied physics at 3 and 1-km were initialized from the HRRR 1-h, 2-h, and 3-h old forecasts each hour (details in Kuchera et al. [96]) and then combined with equivalent runs from the previous 4 h to create a 12-member time-lagged "Rolling" rapid refresh ensemble forecast to 12 h.
During this period, USAF operations implemented hybrid ETKF-4DVAR data assimilation (DA) [97] into its deterministic global GALWEM runs. The initial condition perturbations used for DA provided ICs for a 40-km resolution, 70-level global ensemble (GALWEM-GE) out to 16 days that was implemented on 4 November 2020. Design of this ensemble was patterned off the NAEFS [98] ensemble (20 members plus a control run twice daily to 384 h with half-degree output) to enable sharing and blending the ensembles together for improved operational products.

Results
Surveys and webpage product analyses were performed in 2016 [99] and 2020. One hundred surveys were completed from August to October 2020, with users reporting that 89% of supported missions wanted a combination of "yes/no" and "confidence" information, that the 4-km MEPS was a more critical forecasting tool than the 20-km MEPS or GEPS (Figure 16), and that probability threshold information was more important than the individual ensemble members or mean/spread products (Figure 16). Additionally, lightning was cited as critical by 74% of survey respondents, with no other variable reaching 50% (Figure 16). For the first 6 months of 2020, 6020 unique users who authenticated into the USAF website used an ensemble product, with 87.8 million overall product downloads. Of these, 46.9 million were for a 4-km MEPS product, and 95% of users viewed a PEP product, the most popular single product in the suites.

Figure 16. Ratings distribution for 100 survey responses received from August to October 2020 to the question "Please rate each of the following on how important it is to your efforts", where the "following" is each suite, product, and variable listed.
The initial evaluation of the Rolling ensemble came during a military exercise with South Korea [100]. The squadron providing weather support used both the test Rolling ensemble output and the operational output for the exercise, comparing and contrasting utility. The squadron commander reported that "my biggest take-away from the experiment was the ability to see trends (both geographic coverage and probabilities) over time at a specific valid time" and that he "hoped the lagged ensemble gains traction" due to his perception of its benefits [101]. To facilitate examining these trends, a web interface was developed with a double-slider bar (Figure 17), where one loop was the forecast hour and the second loop was the member of the ensemble or the ensemble products themselves (which were updated every two hours by incorporating the newest member and removing the oldest). By looping through changes in the lagged forecasts at a fixed forecast hour, trends could be discerned quickly. Additionally, the PEP bulletin was modified to denote where probabilities had either increased or decreased by more than 15% in the previous 12 h (Figure 18).

Evaluation of winds over a test domain in CONUS in spring 2015 comparing the operational to the Rolling methodology indicated similar forecasting skill, with a very small improvement for the Rolling (Figure 19). Interestingly, precipitation skill increased in the earlier forecast hours but then decreased at later hours (Figure 20), perhaps indicating that the Rolling methodology was mitigating spin-up issues [102] by including older runs in the earlier forecast hours, but then reducing skill by including forecasts where error growth was dominating at later hours.

Hepper [103] evaluated snowfall forecasts from three different members of the Rolling ensemble both with and without 3DVAR data assimilation, and found that it made little difference in the forecasts, with the model ICs and physics variations making a much more substantial impact. A weather squadron responsible for forecasting in the southeastern US reviewed 29 cases during the 2015 severe weather season and found the operational and Rolling methodologies similar and equally skillful [104]. Burns [105] found that for the 20-km MEPS, the new Rolling methodology was generally better than the previous methodology for ceiling, lightning, precipitation, and wind forecasts. All in all, the decision to implement Rolling came more from the pragmatic reasons cited earlier than from the minor forecast improvements seen in evaluations. Subsequent to implementation, Melick et al. [106] studied a tornado outbreak in the southeast US and found MEPS updraft helicity forecasts to be skillful using the fractions skill score (FSS) [107] for neighborhood areas of 40 square km and larger.

For the rapid refresh experiment [96], 22 forecaster surveys were completed. 91% rated the rapid refresh forecasts at least somewhat better than their existing tools (which included the operational 4-km MEPS), and 55% rated them significantly better. When comparing their training results to the previous convective season, one group of forecasters reported a 36% decrease in cancellations, and another group reported an average of 2 more flying hours per day. These gains could be attributed simply to changes in the weather between the two years, but the impression of the forecasters was that the rapid refresh information contributed significantly to the improvement. While the rapid refresh component of these findings is still under investigation and development, 1-km MEPS domains were implemented over the Korean Peninsula, parts of Japan and Alaska, and in the Cape Canaveral area in March 2021 using the Rolling ensemble design to improve forecasting in areas of variable terrain, but with forecasts only to 30 h instead of 72 (Figure 21).
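The fractions skill score [107] used in the Melick et al. evaluation compares neighborhood event fractions rather than point matches. A minimal sketch is below; the grid sizes, threshold, window, and `fss` helper name are illustrative, and SciPy's `uniform_filter` is assumed to be available for the box averaging.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score over a 2-D field.

    Both fields are converted to binary exceedance grids, smoothed into
    neighborhood fractions with a `window`-point box filter, and compared:
    FSS = 1 - MSE(Pf, Po) / MSE_reference.  1 is perfect, 0 is no skill.
    """
    bf = (np.asarray(forecast) >= threshold).astype(float)
    bo = (np.asarray(observed) >= threshold).astype(float)
    pf = uniform_filter(bf, size=window, mode="constant")
    po = uniform_filter(bo, size=window, mode="constant")
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

# Toy example: a forecast "storm" displaced two grid points from the observed one.
obs = np.zeros((50, 50)); obs[20:25, 20:25] = 10.0
fcst = np.zeros((50, 50)); fcst[22:27, 22:27] = 10.0
print(round(fss(fcst, obs, threshold=5.0, window=1), 2))   # point-wise: penalized
print(round(fss(fcst, obs, threshold=5.0, window=11), 2))  # neighborhood: higher
```

The displaced storm scores poorly point-by-point but well at neighborhood scale, which is why FSS is the natural verification metric for CAM fields like updraft helicity.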
A one-month GALWEM-GE evaluation was performed in November 2017, comparing GEPS alone to GEPS with GALWEM-GE included in its ensemble [99]. Slight improvements were noted for all variables evaluated, with an average 4.4% improvement in the continuous ranked probability score (CRPS) [108,109]. Another evaluation in the summer of 2020 (Figures 22 and 23) repeated the same consistent slight improvement in CRPS, informing the decision to implement later that year.

Figure 22. Continuous Ranked Probability Scores for GEPS (blue) and GEPS with GALWEM-GE included (orange) for 250 hPa wind speed as compared to the corresponding UM analysis for the period 28 July 2020 through 28 August 2020. Error bars represent the 95% confidence interval for the CRPS values.

Figure 23. Continuous Ranked Probability Scores for GEPS (blue) and GEPS with GALWEM-GE included (orange) for 700 hPa relative humidity as compared to the corresponding UM analysis for the period 28 July 2020 through 28 August 2020. Error bars represent the 95% confidence interval for the CRPS values.
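For reference, the ensemble CRPS used in such comparisons can be computed directly from its kernel (energy) form. This is a generic sketch rather than the evaluation code used in the study; the function name and member values are illustrative.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Continuous Ranked Probability Score for one ensemble forecast.

    Uses the kernel form CRPS = E|X - y| - 0.5 * E|X - X'|, where X, X'
    are independent draws from the ensemble and y is the verifying
    analysis value.  Lower is better; for a single-member "ensemble"
    it reduces to the absolute error.
    """
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))                      # accuracy term
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # spread term
    return term1 - term2

# Hypothetical 250 hPa wind-speed members (m/s) verified against an analysis of 42.0:
members = [38.5, 40.2, 41.0, 43.7, 44.1, 39.8, 42.5, 45.0]
print(round(crps_ensemble(members, 42.0), 3))
```

Averaging this score over many grid points and cases yields curves like those in Figures 22 and 23, where a few percent reduction indicates a consistent (if slight) probabilistic improvement.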

Conclusions
Nearing the 10-year mark of operational ensemble modeling at USAF, a number of lessons and themes have emerged. The convergence of CAMs with inline diagnostics, algorithms to estimate the sub-grid scale uncertainty of critical forecasting variables, and the distillation of large quantities of ensemble-generated information into decision-relevant products has led to widespread reliance on ensembles (particularly the 4-km MEPS) in USAF operations worldwide. Useful ensemble products tailored to the phenomena that drive the decisions that have to be made (e.g., probability of lightning as opposed to mean/spread of an instability diagnostic) have helped evolve the USAF culture towards acceptance of probabilistic environmental forecast information. Rationales for critical choices made in three major aspects of development are discussed in the following sections, along with a few comments on future efforts.

Ensemble Design
Du et al. [110] provides a comprehensive review of the diversity of methodologies for ensemble design. Focusing on operational CAM ensembles, two basic and contrasting approaches are currently in use. The first utilizes a single model and data assimilation system, with attempts to account for all relevant uncertainties therein. Walters et al. [111] describes some benefits of this approach, which is employed for the UKMO, DWD [50], and MetCoOp (Finland/Norway/Sweden [112]) operational CAM ensembles: "By studying the same model formulation across a range of timescales and system applications, one can learn about the rate of growth and nature of both model errors and desirable behaviours. Also, by constraining configurations to perform adequately across a wide variety of systems, scientists can be more confident that model developments seen to improve performance metrics in any one system are doing so by modelling a truer representation of the real atmosphere." This contrasts with systems that use multiple models both for ICs and for the ensemble runs themselves, as in the operational HREF [60] CAM ensemble and in the MEPS described herein for USAF. Du et al. [110] cites increased spread, bias cancellation, and reduction of the systematic errors often present in single-model systems as reasons to choose this approach. Drawbacks include unequal skill amongst members, bias clustering by model, and the increased cost of maintaining and ensuring compatibility between multiple models. Still, the multi-model approach has been pragmatic and easily adaptable to the fast-changing operational situations USAF faces, and as such has been the method of choice. From Hansen [113]: "Even if a given multimodel ensemble is unable to bound truth, if each ensemble member is consistent with its model attractor, the ensemble's distribution can provide information about the sensitivity of regions of state space to the different models making up the ensemble."
This imperfect highlighting of sensitivity has proven valuable to USAF forecasters, as systematic biases in single-model systems may not identify areas of uncertainty. When forecasters are alerted to areas with uncertainties, they can prioritize their limited time to analyze them more deeply, and ignore areas of greater certainty with confidence. Finally, USAF experience has not borne out the oft-proclaimed difficulty of maintaining a multi-model system, likely due to the robustness and versatility of the WRF modeling framework and components. Since USAF does not develop its own NWP models, many of the meaningful benefits cited by Walters et al. [111] are not applicable.

Scientific Rigor in an Operational Environment
Another aspect of the forecasting mission in the USAF is the question of scientific rigor when developing new capabilities and techniques to meet ever-changing operational requirements. Ideally, one would thoroughly evaluate new methods, but these evaluations take time and resources while a mission need goes unmet. Where is the point of diminishing returns for operational benefits with further rigor, and how should that point be determined? This question is especially pertinent when the lack of observational truth data in many areas of the world increases the cost and reduces the benefit of rigor. USAF rigor has varied from significant development and testing over a period of years for explicit dust modeling, to creating and empirically tuning a dust lofting potential index over the course of just an afternoon. A key lesson has been that, by using the knowledge and experience of scientists along with the perspectives of users, it is worth considering whether a simple, likely effective solution is available to meet a need. The employment of rigor is a cost-benefit analysis in an operational environment: the costs of more complexity or more evaluation have to be weighed against the risk of not providing something to benefit a potentially pressing (and possibly transient) operational need.

Post-Processing
One area where other organizations [114] have focused but USAF has not is statistical correction of ensemble output. One reason has been that many of the priority variables USAF forecasts do not have enough corresponding quality observations, either due to the difficulty of sensing the variable or the lack of available sensing in USAF operational areas. Another reason is that bias correction does not address the underlying cause of model bias, which almost certainly reduces the overall skill of the model simulation. Investing development resources into addressing the root causes of model bias may be a more cost-effective way to address forecast quality in the long term. There is also concern about bias correcting the inputs to the algorithms used to tailor information to decision making. Algorithms need raw, physically consistent data as inputs before any bias corrections can be made. For instance, a bias correction to low-level temperatures could create convective instability in an area where the physical model produced convection and rain-cooled, stable low-level temperatures (Figure 24) when those corrected low-level temperatures are used in the calculation of instability diagnostics. To avoid this possibility, corrections should be made only on the variable of interest, not on the inputs used to calculate the variable of interest. Finally, there is a risk that perceived improvements from bias correction are illusory because of poorly chosen metrics. Many variables, like 500 hPa height, are useful for verification of overall model performance but may not be relevant to decision-making. To evaluate the true usefulness of a bias correction, it is important to design metrics appropriately, so that statistical improvements do not "teach to the test" rather than demonstrate real value for improved decision-making. USAF is interested in furthering its post-processing efforts by enabling more decision-specific product tailoring (e.g., iPEP).
One use case is the optimization of a flight path by finding the path in space and/or time that has the lowest risk. As seen in Figures 25 and 26, each ensemble member is required to calculate this accurately, or the probability estimates could be significantly in error. Bandwidth limitations with large CAM ensemble datasets also pose a challenge, which some have proposed to solve by creating statistical summaries of the full ensemble (see Vannitsem et al. [114]). For USAF use-cases (e.g., flight paths, line-of-sight calculations, joint probabilities), those methodologies could be insufficient. Instead, computing solutions that enable tailored "what is the risk for my situation?" data extractions based on user inputs, drawing on full ensemble member datasets that retain their space/time correlations, would appear to be a better approach to the bandwidth issue.
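The need for full-member data can be illustrated with a toy version of the scenario in Figure 26 (five members, a four-waypoint path; all values are invented): convection occurs in only one member but, when it occurs, affects the whole path, so per-gridpoint probabilities alone badly overstate the joint path risk.

```python
# Toy 5-member ensemble on a 4-waypoint flight path (values invented).
# True = convection impacts that waypoint in that member. Convection
# occurs in one member only, but covers the entire path when it does.
members = [
    [True,  True,  True,  True ],   # member 1: convection fires, hits whole path
    [False, False, False, False],   # members 2-5: no convection at all
    [False, False, False, False],
    [False, False, False, False],
    [False, False, False, False],
]
n = len(members)

# Correct joint risk: the fraction of members impacting the path anywhere.
path_risk = sum(any(m) for m in members) / n          # 0.20

# Naive estimate from per-waypoint probabilities assuming independence.
# Each waypoint has p = 0.20, so 1 - 0.8**4 ~= 0.59: a large overestimate,
# because the space/time correlation between waypoints was discarded.
per_waypoint_p = [sum(m[i] for m in members) / n for i in range(4)]
p_clear = 1.0
for p in per_waypoint_p:
    p_clear *= 1.0 - p
naive_risk = 1.0 - p_clear                            # ~0.59
```

The naive calculation roughly triples the apparent risk; extracting the joint statistic from the retained member fields gives the 20% value the scenario actually implies.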

Figure 26. Five hypothetical ensemble convection forecasts, with two hypothetical flight paths. Each has a 20% chance of being impacted due to uncertainty in convection happening rather than uncertainty in location.

Future Efforts
USAF intends to continue to leverage its Rolling ensemble paradigm in operations and extend it to rapid-refresh (hourly or more frequent) capabilities. The multi-model approach to ensemble design remains preferred, and USAF will be studying and evaluating existing and emerging models both for providing initial conditions and for comprising the ensemble membership, with no specific models targeted save continuing to use WRF for the foreseeable future. USAF is also evaluating ways to participate in collaborative projects for ensemble post-processing using IMPROVER [115] and verification using the Model Evaluation Toolkit (MET) and METplus [116,117]. Finally, USAF is interested in exploring the potential of cloud computing resources to enable situationally dependent model/ensemble design that best addresses a specific forecast challenge and its importance [118] through finer/coarser resolution, more/fewer members, etc., as appropriate.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.