The Evaluation of Real-Time Hurricane Analysis and Forecast System (HAFS) Stand-Alone Regional (SAR) Model Performance for the 2019 Atlantic Hurricane Season

The next generation Hurricane Analysis and Forecast System (HAFS) has been developed recently in the National Oceanic and Atmospheric Administration (NOAA) to accelerate the improvement of tropical cyclone (TC) forecasts within the Unified Forecast System (UFS) framework. The finite-volume cubed sphere (FV3) based convection-allowing HAFS Stand-Alone Regional model (HAFS-SAR) was successfully implemented during Hurricane Forecast Improvement Project (HFIP) real-time experiments for the 2019 Atlantic TC season. HAFS-SAR has a single large 3-km horizontal resolution regional domain covering the North Atlantic basin. A total of 273 cases during the 2019 TC season are systematically evaluated against the best track and compared with three operational forecasting systems: Global Forecast System (GFS), Hurricane Weather Research and Forecasting model (HWRF), and Hurricanes in a Multi-scale Ocean-coupled Non-hydrostatic model (HMON). HAFS-SAR has the best performance in track forecasts among the models presented in this study. The intensity forecasts are improved over GFS, but show less skill compared to HWRF and HMON. The radius of gale force wind is over-predicted in HAFS-SAR, while the hurricane force wind radius has lower error than other models.


Introduction
A tropical cyclone (TC) is one of the most devastating natural disasters, often resulting in loss of life and property damage. Numerical weather prediction (NWP) developments specific to TCs have progressed significantly over the last decade. In the United States, the National Hurricane Center (NHC) makes use of both global and high-resolution regional dynamical models for TC forecasts. [...] Running both a global model and a nest domain concurrently is computationally expensive. HAFS-SAR applies the FV3 dynamic core on a regional grid with external lateral boundary conditions, and can significantly reduce the computational cost by removing the need for the parent global domain used in the HAFS-global-nest configuration. The limited-area HAFS-SAR model has been configured for the North Atlantic (NATL) basin for TC forecasts (Figure 1), using a single large domain with improved planetary boundary layer (PBL) and surface flux parameterization schemes designed and calibrated specifically for TC simulations. A resolution of 3 km is chosen for convection-allowing TC forecasts.
HAFS-SAR was successfully implemented during the 2019 Hurricane Forecast Improvement Project (HFIP) real-time experiments for the NATL TC season. This was the first time the FV3-based SAR was systematically tested specifically for TC forecasts. In this study, the performance and forecast skill of HAFS-SAR are evaluated and compared to the three NOAA operational NWP models. In Section 2, the real-time experiment is described, while the intensity, track and storm size (wind-radii) forecasts are evaluated in Section 3. The HAFS-SAR performance is further examined for two high-impact TC events during the 2019 NATL TC season in Section 3, while Section 4 summarizes the results of this study.
Atmosphere 2020, 11, x FOR PEER REVIEW 3 of 17

Figure 1. Topography (m) of the high-resolution stand-alone regional configuration (HAFS-SAR) domain.

Experiments
The FV3 dynamical core [17,18] for HAFS-SAR includes a non-hydrostatic finite-volume solver using a Lagrangian vertical coordinate [19]. HAFS-SAR has a single large domain of 2880 × 1920 grid cells. The domain is centered at 62° W, 22° N, spanning 60° from south to north and 106.5° from east to west. The North Atlantic domain covers the main development region (MDR) for TCs and extends northward as far as Newfoundland, in order to capture TC genesis and extratropical transition. HAFS-SAR is based on one of the 6 faces of the global cubed sphere C768 (~13 km grid spacing globally) and is further refined by a factor of 4. The computational grid is gnomonic and has an average cell size (defined as the square root of the cell area) of 3.2 km. The cell size gradually increases from 2.6 km around the domain edges to 3.6 km in the domain center. HAFS-SAR uses 64 vertical levels on a sigma-pressure hybrid coordinate, with the lowest model level at about 25 m above the surface and the top isobaric level at 0.2 hPa. The terrain data are interpolated from the 30-s (~1 km) USGS GMTED2010 dataset.
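As a rough check on the grid description above, the nominal spacing of a cubed-sphere grid can be estimated by dividing a quarter of the Earth's circumference (the arc spanned by one face edge) by the number of cells along that edge. The sketch below assumes a mean Earth radius of 6371 km and ignores the gnomonic stretching, so it yields only an average value; the function and variable names are illustrative and not taken from the HAFS code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def nominal_spacing_km(cells_per_face_edge, refinement=1):
    """Average grid spacing for a cubed-sphere face with optional refinement.

    One face edge spans a quarter of a great circle, so the spacing is that
    arc length divided by the (refined) number of cells along the edge.
    """
    quarter_circumference = 2.0 * math.pi * EARTH_RADIUS_KM / 4.0
    return quarter_circumference / (cells_per_face_edge * refinement)


global_dx = nominal_spacing_km(768)                  # C768 global grid
regional_dx = nominal_spacing_km(768, refinement=4)  # refined by a factor of 4
total_cells = 2880 * 1920                            # HAFS-SAR domain size

print(f"C768 spacing: {global_dx:.1f} km")
print(f"Refined spacing: {regional_dx:.1f} km")
print(f"Regional grid cells: {total_cells}")
```

The estimate lands close to the ~13 km and ~3 km values quoted in the text; the actual cell size varies across the domain because of the gnomonic projection.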
Initial and boundary conditions were interpolated from the 2019 operational global FV3-based GFS (~13 km) onto the 3-km HAFS-SAR domain. Lateral boundary conditions (LBCs) were provided every 3 h from the same global GFS forecasts. No data assimilation is performed for HAFS-SAR in this study.
The HAFS-SAR physics parameterizations include the Hybrid Eddy-Diffusivity Mass-Flux (HEDMF) PBL scheme [20]. The eddy diffusivity over water in the HEDMF PBL scheme is modified based on observations to better represent TC boundary layer processes, similar to the approach used in HWRF [21]. The exchange coefficients of momentum (Cd) and heat (Ck) in the surface flux parameterization follow the same formulas as the operational HWRF over water under strong wind conditions, in order to be more consistent with the observations [22]. The microphysics scheme is the GFDL 6-category hydrometeor scheme [23], while the land surface parameterization is the Noah land surface scheme [24]. The radiation schemes are the RRTMG longwave and shortwave parameterizations [25,26], and cumulus convection is turned off at the 3-km resolution. Finally, the SST is provided by the GFS Near-Surface Sea Temperature (NSST) scheme, which predicts the vertical profile of sea temperature between the surface and a reference level.
The 2019 real-time HAFS-SAR experiments started at 0000 UTC 12 July and ended at 0000 UTC 01 November, covering eighteen storms from TC Barry through TC Rebekah. The real-time experiments were initialized at 0000, 0600, 1200, and 1800 UTC each day and ran for 126 h. The GFDL vortex tracker [22,27] was used to generate the TC track information, including TC center locations, the maximum 10-m wind and minimum pressure intensities, and the TC wind radii of gale force wind (17.5 m s⁻¹), damaging force wind (25.7 m s⁻¹) and hurricane force wind (32.9 m s⁻¹). The model products in GRIB2 format and the track information in the format of the Automated Tropical Cyclone Forecasting System [28] (ATCF: the application to automate and optimize TC forecasts for operational centers) were delivered at 0930, 1530, 2130, and 0330 (next day) UTC every day during the TC season. A total of 273 TC cases are included. In order to evaluate the performance of the HAFS-SAR real-time experiments, the forecasted TC track, intensity, and size are verified against the best track data from the NHC and compared with the operational GFS, HWRF, and HMON systems. The models for comparison are listed in Table 1. Both the HWRF and HMON models have vortex initialization implemented, while additional inner core data assimilation is used in HWRF (Table 1). The HWRF model is coupled with the Princeton Ocean Model (POM) and the HMON model is coupled with the Hybrid Coordinate Ocean Model (HYCOM). As demonstrated in the HWRF model, both the inner core data assimilation and the ocean coupling can potentially further improve the TC track and intensity forecasts [29,30]. Data assimilation/vortex initialization and ocean coupling capabilities are being developed for HAFS-SAR. These new capabilities are expected to provide a more realistic vortex structure and ocean response for HAFS-SAR in the future.
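The three wind-radii thresholds quoted above correspond to the 34-, 50- and 64-kt levels conventionally used for TC wind radii. A minimal sketch of the unit conversion follows; the dictionary and variable names are illustrative and not part of the original study.

```python
# One knot is one nautical mile (1852 m) per hour.
KT_TO_MS = 1852.0 / 3600.0

# Conventional wind-radii thresholds in knots (labels follow the text above).
WIND_RADII_THRESHOLDS_KT = {"gale": 34, "damaging": 50, "hurricane": 64}

# Convert each threshold to m/s, rounded to one decimal place.
thresholds_ms = {
    name: round(knots * KT_TO_MS, 1)
    for name, knots in WIND_RADII_THRESHOLDS_KT.items()
}

for name, value in thresholds_ms.items():
    print(f"{name} force wind: {value} m/s")
```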

Results
The seasonal statistics of HAFS-SAR track, intensity and size forecasts are presented and compared with the GFS, HWRF, and HMON in this section.

Track Forecast
The track forecasts are verified by computing the track error, defined as the great-circle distance between the predicted TC center and the best track location. Track errors are plotted every 6 h until 48 h, and then every 24 h until 120 h, for each of the four models. The initial location errors for HWRF and HMON are smaller than those of GFS and HAFS-SAR (Figure 2a) because of the vortex relocation methods used in HWRF and HMON [22]. Both HAFS-SAR and GFS have relatively small initial track errors of 22 km, even without relocating the TC, and likely benefit from the cycled satellite data assimilation in GFS. Beginning from the 6 h forecast, the track error of HAFS-SAR grows more slowly than that of the other three models. The resulting track error of HAFS-SAR is consistently the smallest from 12 to 120 h. The 95% confidence intervals (CI) of the track error are plotted as vertical bars (Figure 2a). The error of HAFS-SAR is significantly smaller than that of the other three models during the 24-72 h forecast period, when the upper limit of the 95% CI of HAFS-SAR is always lower than the lower limits of the other three models (Figure 2a). Among the three operational models, HWRF appears to have the largest track error, while GFS was the second best performer for track forecasts.
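The track error defined above can be sketched with the haversine formula for great-circle distance. This is a minimal illustration assuming a spherical Earth with a 6371 km radius; the function names are hypothetical, and the sample coordinates are the forecast and observed TC Barry centers quoted later in the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_track_error_km(forecast_centers, best_track_centers):
    """Mean great-circle error over paired (lat, lon) forecast/observed centers."""
    errors = [
        great_circle_km(f_lat, f_lon, o_lat, o_lon)
        for (f_lat, f_lon), (o_lat, o_lon) in zip(forecast_centers, best_track_centers)
    ]
    return sum(errors) / len(errors)


# Forecast vs. observed center (west longitudes are negative).
barry_error = great_circle_km(29.1, -92.3, 29.3, -91.9)
print(f"Track error: {barry_error:.1f} km")
```

Averaging such errors over all verifying cases at a given lead time gives the kind of curve shown in Figure 2a.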
The track forecast skill is also calculated by comparing the track error of each model to that of HWRF. A positive (negative) value represents a smaller (larger) forecast error and better (worse) performance compared to HWRF. HAFS-SAR, GFS and HMON all show positive track forecast skill at most forecast hours (Figure 2b), indicating improvement over HWRF. The track forecast improvement of HAFS-SAR over HWRF is roughly 20% during the 12-120 h forecast period.
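A minimal sketch of a skill metric consistent with the sign convention described above, assuming the conventional definition as the percentage reduction in mean error relative to a baseline model (HWRF here). The exact formula used in the study is an assumption, and the function name is illustrative.

```python
def skill_relative_to_baseline(model_error, baseline_error):
    """Percentage error reduction relative to a baseline (assumed definition).

    Positive values mean the model error is smaller than the baseline error;
    negative values mean it is larger.
    """
    if baseline_error <= 0.0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical mean track errors (km) at one lead time.
print(skill_relative_to_baseline(model_error=80.0, baseline_error=100.0))
print(skill_relative_to_baseline(model_error=120.0, baseline_error=100.0))
```

Under this definition the baseline model always has zero skill relative to itself, which is why HWRF serves as the zero line in a plot such as Figure 2b.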
The total track forecast error can be further decomposed into along-track and cross-track errors to examine the contribution of each component. The along-track error of HAFS-SAR is very close to that of HMON and smaller than those of GFS and HWRF prior to 96 h (Figure 2c). HAFS-SAR shows a greater advantage in reducing the cross-track error (Figure 2d). After 24 h, the HAFS-SAR cross-track error is always the smallest of the four models out to 120 h, with the error at 72 h being only half those of HWRF and HMON, suggesting that the cross-track component contributes more to the track forecast improvement than the along-track component.
The along- and cross-track biases are illustrated in Figure 2e,f to further examine the track forecast skill. A positive (negative) along-track bias means the predicted storm moved faster (slower) in the along-track direction than the observed TC (as denoted by the best track). A positive (negative) cross-track bias indicates how far to the right (left) of the observed track the predicted storm is. The cross-track bias is very low for both HAFS-SAR and GFS prior to 72 h (Figure 2f), suggesting that HAFS-SAR closely follows the observed direction of TC motion. After 72 h, a positive cross-track bias is found for both HAFS-SAR and GFS, implying that both models move the storm to the right of the observed track. The along-track bias is negative for most of the models at most forecast lead times (Figure 2e), suggesting a slower translation speed along the observed TC direction.
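The decomposition and sign conventions described above can be sketched as follows. This is an illustration under stated assumptions, not the verification code used in the study: it uses a local flat-Earth approximation around the observed position, takes the observed direction of motion from two consecutive best-track positions, and uses hypothetical function and variable names.

```python
import math

KM_PER_DEG_LAT = 111.195  # approximate length of one degree of latitude (km)


def along_cross_track_error(obs_prev, obs_now, fcst_now):
    """Return (along_km, cross_km) for a forecast position.

    Each argument is a (lat_deg, lon_deg) pair. Positive along-track error
    means the forecast storm is ahead of the observed storm; positive
    cross-track error means it is to the right of the observed motion.
    """
    lat0 = math.radians(obs_now[0])

    def to_local_xy(point):
        # Local east/north offsets (km) relative to the current observed position.
        east = (point[1] - obs_now[1]) * KM_PER_DEG_LAT * math.cos(lat0)
        north = (point[0] - obs_now[0]) * KM_PER_DEG_LAT
        return east, north

    # Observed motion vector points from the previous to the current position.
    prev_east, prev_north = to_local_xy(obs_prev)
    move_east, move_north = -prev_east, -prev_north
    speed = math.hypot(move_east, move_north)
    if speed == 0.0:
        raise ValueError("observed storm is stationary; track direction undefined")
    unit_east, unit_north = move_east / speed, move_north / speed

    err_east, err_north = to_local_xy(fcst_now)
    along = err_east * unit_east + err_north * unit_north  # projection on motion
    cross = err_east * unit_north - err_north * unit_east  # component to the right
    return along, cross


# Observed storm moving due north; forecast center displaced to the east.
print(along_cross_track_error((20.0, -60.0), (21.0, -60.0), (21.0, -59.0)))
```

Averaging the signed components over all cases gives the biases, while averaging their absolute values gives the corresponding error magnitudes.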
In summary, HAFS-SAR performs noticeably better in track forecasts than the operational GFS, HWRF, and HMON. The improvement comes mostly from the reduction of the cross-track errors. The comparison of HAFS-SAR to GFS is of particular interest, given that both models use the same dynamical core and initial conditions. The boundary conditions for HAFS-SAR are also taken from the GFS forecasts. The high 3-km resolution likely contributes to the track forecast improvements (Figure 2 of Xue et al. [4]). Further, HAFS-SAR applies a different horizontal advection scheme than GFS and does not apply a cumulus convective parameterization scheme. Both factors may affect the large-scale environment forecast and, thus, the TC track forecasts.

Intensity Forecast
Intensity forecasts, described by the maximum 10-m wind speed, are examined for HAFS-SAR and compared to GFS, HWRF, and HMON (Figure 3). Because they lack the inner-core data assimilation and vortex initialization used in HWRF and HMON [22], both HAFS-SAR and GFS have a relatively large initial intensity error of 6.2 m s⁻¹, compared to errors of around 1.5 m s⁻¹ for HWRF and HMON (Figure 3a). The simulated TC in HAFS-SAR spins up quickly, with its intensity error decreasing to 4.6 m s⁻¹ at 6 h and to 4.1 m s⁻¹ at 12 h. The error of HAFS-SAR grows during the 12-72 h forecast period, appears to saturate by 72-96 h, and decreases during the 96-120 h period. The use of high-resolution models is expected to improve the intensity forecasts by simulating more realistic convective-scale inner-core structures [3]. Beginning at 6 h, the intensity error of HAFS-SAR is always lower than that of GFS, and is only 56% of the GFS error at 120 h. The performance of HAFS-SAR is significantly better than that of GFS from 6-24 h and at the lead time of 42 h (error bars in Figure 3a). When compared with HWRF and HMON, the intensity forecasts of the 3-km HAFS-SAR fall behind those of the finer-resolution models before day 5. At 120 h, HAFS-SAR has the lowest intensity error among all four models, which is likely also a result of its much improved track forecast at this lead time (Figure 2). The intensity forecast skill (Figure 3b) is consistent with the trend of the intensity errors. HAFS-SAR has negative forecast skill relative to HWRF except at 120 h, and also has lower skill than HMON before 72 h. However, its skill is higher than that of GFS at all forecast lead times.
The wind bias indicates whether the respective NWP model predicts a stronger (positive) or weaker (negative) storm. Most of the models under-predict the intensity, with a negative wind bias at most forecast hours (Figure 3c). The bias of HAFS-SAR is similar to that of HWRF during the 12-72 h period and becomes a positive 1.5 m s⁻¹ at 120 h, suggesting a slight over-prediction of TC intensity. The number of verifying forecasts is also lower at 120 h, which may cause a sampling bias (e.g., stronger storms last longer).
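The distinction between intensity error and wind bias can be sketched as below. The sketch assumes that the intensity error is a mean absolute error and that the bias is the mean signed difference (forecast minus observed); the text does not spell out these definitions, and the sample values are made up for illustration.

```python
def mean_absolute_error(forecast, observed):
    """Mean of |forecast - observed| over paired samples."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)


def mean_bias(forecast, observed):
    """Mean of (forecast - observed); negative means under-prediction."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)


# Hypothetical maximum 10-m winds (m/s) for three verifying cases.
forecast_vmax = [30.0, 40.0, 50.0]
observed_vmax = [35.0, 45.0, 45.0]

print(mean_absolute_error(forecast_vmax, observed_vmax))
print(mean_bias(forecast_vmax, observed_vmax))
```

Because signed differences can cancel, a small bias does not imply a small error, which is why both quantities are reported.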
The comparison of intensity errors between HAFS-SAR and the other models demonstrates the advantages and limitations of the 3-km resolution grid for TC intensity forecasts. When the grid spacing is reduced from the ~13 km of GFS to the ~3 km of HAFS-SAR, the improvement is clear, since the GFS resolution is too coarse to resolve the inner core storm structure. However, a grid spacing of 3 km is considered only marginally cloud resolving, and HAFS-SAR appears to lag behind the higher-resolution HWRF (1.5 km) and HMON (2 km) in intensity forecasts. Increasing the resolution of HAFS-SAR to 1.5-2 km has the potential to further improve the intensity forecasts for strong TC events (not shown). Inner core data assimilation, when implemented in the future, could also potentially improve the intensity forecasts.

Size Forecast
Accurate TC size forecasts can help storm surge prediction and the estimation of damaging wind areas. The maximum radial extents of gale force wind (17.5 m s⁻¹), damaging force wind (25.7 m s⁻¹), and hurricane force wind (32.9 m s⁻¹) are verified against the observed radii (i.e., the best track; Figure 4) to evaluate the TC size prediction. The initial gale and damaging force wind radii errors of HAFS-SAR are relatively small (Figure 4a,c), but increase within 6 h, leading to larger forecast errors than GFS, HWRF, and HMON at most lead times. The wind radii bias is also illustrated in Figure 4. A positive (negative) bias indicates that the model predicts a larger (smaller) storm at a specific wind radius. The gale force wind radius bias of HAFS-SAR is always positive (Figure 4b), indicating an over-prediction of the storm size, or outer radius, whereas the other models under-predict the gale force wind radii. The difference between HAFS-SAR and the other models is statistically significant from 6-96 h (error bars in Figure 4a). The damaging force wind radius bias of HAFS-SAR is comparable to that of the other models (Figure 4d). For the hurricane force wind radius, which is closer to the inner core region, HAFS-SAR performs better than the other models during the 12-120 h period, as illustrated in Figure 4e. The bias at the hurricane force wind radius is also low in the HAFS-SAR forecasts, compared to HWRF and GFS (Figure 4f).
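A minimal sketch of how a maximum radial extent and its bias might be computed, assuming wind radii are available per quadrant (NE, SE, SW, NW) and that a value of zero or None means the threshold wind is not present in that quadrant. The input layout and function names are assumptions for illustration, not the procedure documented in the study.

```python
def max_radial_extent(quadrant_radii):
    """Largest radius among the quadrants, or None if the wind is absent everywhere."""
    valid = [r for r in quadrant_radii if r is not None and r > 0]
    return max(valid) if valid else None


def radius_bias(forecast_quadrants, observed_quadrants):
    """Forecast minus observed maximum extent; positive means the storm is too large."""
    forecast_extent = max_radial_extent(forecast_quadrants)
    observed_extent = max_radial_extent(observed_quadrants)
    if forecast_extent is None or observed_extent is None:
        return None  # cannot verify when either radius is undefined
    return forecast_extent - observed_extent


# Hypothetical gale force wind radii (km) in the NE, SE, SW, NW quadrants.
forecast_r = [200, 180, 150, 170]
observed_r = [160, 150, 0, 140]
print(radius_bias(forecast_r, observed_r))
```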
The relatively large size error for the gale force wind radius in HAFS-SAR is possibly related to its horizontal advection scheme, which is more diffusive than that of GFS. The advection scheme in GFS is a fast unlimited fifth-order scheme with a built-in 2∆x filter, while HAFS-SAR uses a Piecewise Parabolic Method (PPM) scheme with an intermediate-strength monotonicity constraint because of an instability issue [31]. The scale-aware cumulus convection parameterization, which is turned off in the HAFS-SAR experiments, could also help to reduce the gale force wind radius error, as demonstrated by other experiments (not shown).

Weak vs. Strong Storms
In this section, TCs are categorized as strong or weak storms based on whether the initial maximum 10-m wind speed is above or below 25.7 m s⁻¹ (damaging force wind). Track and intensity forecast errors are examined for these two groups in order to understand the performance of HAFS-SAR for different storm intensities (Figure 5). The track error for the strong storms is similar to or slightly smaller than that for all storms (Figure 5a). For the weak storms, the track error of HAFS-SAR is larger. HAFS-SAR has an average track error of 482 km at 120 h (Figure 5c), which is much higher than the average track error of 259 km for the strong storms (Figure 5a). Further, the track error of HAFS-SAR is higher than that of GFS for weak storms, but still lower than those of HWRF and HMON. In general, GFS has the best track forecast performance for the weak storms. It should also be noted that the number of weak-storm cases is much smaller than that of strong-storm cases after 60 h.
For strong storms, the HAFS-SAR intensity error is close to that for all storms (Figure 5b). The HAFS-SAR intensity error for weak storms is, however, much lower than that for strong storms (Figure 5d). The error at 72 h is 4.6 m s⁻¹, only approximately half of the 10.3 m s⁻¹ error for strong storms. For weak storms, the intensity error of HAFS-SAR is comparable to those of HWRF and HMON during the 12-48 h forecast lead times. From 72-96 h, the intensity error of HAFS-SAR is even lower than those of HWRF and HMON (Figure 5d).
As shown above, strong storms dominate both the seasonal track and intensity forecast errors for HAFS-SAR, with more cases at later forecast lead times. HAFS-SAR shows better forecast skill for the intensity forecasts of weak storms and the track forecasts of strong storms. The results are encouraging given that high-resolution TC models (e.g., HWRF) tend to over-predict the intensity of weak storms [32].
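The strong/weak stratification used in this section can be sketched as a simple split on the initial intensity. The record layout and key name below are hypothetical, and treating a case exactly at the threshold as strong is an assumption, since the text only says "above or below".

```python
STRONG_THRESHOLD_MS = 25.7  # damaging force wind threshold from the text (m/s)


def split_by_initial_intensity(cases, threshold=STRONG_THRESHOLD_MS):
    """Split forecast cases into (strong, weak) by initial maximum 10-m wind."""
    strong = [c for c in cases if c["initial_vmax_ms"] >= threshold]
    weak = [c for c in cases if c["initial_vmax_ms"] < threshold]
    return strong, weak


# Hypothetical forecast cases; only the initial intensity matters for the split.
cases = [
    {"storm": "A", "initial_vmax_ms": 20.0},
    {"storm": "B", "initial_vmax_ms": 25.7},
    {"storm": "C", "initial_vmax_ms": 40.0},
]
strong_cases, weak_cases = split_by_initial_intensity(cases)
print(len(strong_cases), len(weak_cases))
```

Error statistics are then computed separately for each group, keeping in mind that the group sizes differ and change with lead time.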

TC Barry and TC Dorian
During the 2019 NATL TC season, two hurricanes made landfall in the United States and caused considerable damage and economic loss. The track and intensity forecasts of HAFS-SAR for TC Barry and TC Dorian are verified against the best track and compared to those of GFS, HWRF, and HMON, in order to evaluate the model performance for these two high-impact events.

TC Barry
TC Barry began as a mesoscale convective vortex over the central US, and became a tropical depression in the northern Gulf of Mexico [33]. After intensifying into a tropical storm, Barry reached category 1 hurricane status on 13 July, and made landfall at Marsh Island, Louisiana, with a very asymmetric structure. The track and intensity forecast errors for TC Barry are illustrated in Figure 6. The HAFS-SAR track forecast shows clear improvements over GFS, HWRF, and HMON within the 24-96 h forecast lead times (Figure 6a). The HAFS-SAR track forecast error is always below 74 km, while the GFS error is above 222 km after 72 h. Both HWRF and HMON have track errors over 370 km at 96 h. The composite Barry track forecasts of all cases for the four models are shown in Figure 7. GFS, HWRF, and HMON all have a rightward bias in the track forecasts (Figure 7a-c). HAFS-SAR follows the observed track closely, and predicts the Louisiana landfall location in close proximity to the observed location in multiple forecast cases (Figure 7d). HAFS-SAR also performs well for Barry in intensity forecasts, as shown in Figure 6b. The maximum wind error of HAFS-SAR is below those of other models presented here at most forecast lead times. We notice that the intensity errors of HWRF and HMON are much higher than both GFS and HAFS-SAR for Barry, which is likely due to their erroneous track forecasts particularly at longer lead times (Figure 7b,c).
Affected by both the northerly shear and dry air intrusion at mid-levels, TC Barry maintained a very asymmetric structure at landfall (Figure 6c). The rainband structure at 1200 UTC 13 July is mostly along the east and the south side of the TC center. HAFS-SAR is able to capture the convective asymmetry at the 60 h forecast lead time for the forecast initialized at 0000 UTC 11 July, with the rainband extending from the borders of Mississippi and Alabama into the Gulf of Mexico. However, the convection to the south of the TC center is over-predicted in HAFS-SAR (Figure 6d). The forecasted TC center (29.1° N, 92.3° W) was to the southwest of the observed center (29.3° N, 91.9° W), and the distance between the model and observed locations is 44 km at the valid time.

TC Dorian
The HAFS-SAR track forecast for TC Dorian outperforms those of GFS, HWRF, and HMON, as illustrated in Figure 8a. The HAFS-SAR forecast error is the lowest among all models during the 24-120 h period. The advantage of HAFS-SAR comes mostly from the cross-track component at shorter lead times (before 96 h) and from the along-track component at longer lead times (not shown). GFS, HWRF, and HMON incorrectly predicted a Florida landfall for Dorian in multiple cases (Figure 9a-c). HAFS-SAR appears to predict the recurvature along the Florida coast more accurately, with none of its cases making landfall in Florida (Figure 9d). The intensity forecast error of HAFS-SAR for Dorian is consistent with the seasonal statistics for strong storms (Figure 8b) when compared with the other models.

Conclusions
With the ongoing development of HAFS, the convection-allowing FV3-based HAFS-SAR was successfully implemented in real time during the 2019 NATL TC season. It is of great interest to examine the performance of the new regional hurricane forecast system, which uses the same FV3 dynamic core within NOAA's unified modelling framework. The intensity, track, and storm size forecasts were systematically evaluated for 273 cases and compared to both the observed (i.e., best track) data and the GFS, HWRF, and HMON forecasts.
HAFS-SAR demonstrates noticeable improvements in track forecasts compared to GFS, HWRF, and HMON at almost all forecast lead times. The track forecast skill of HAFS-SAR improves by roughly 20% over HWRF. HAFS-SAR is more effective at reducing the cross-track error than the along-track component. The cross-track bias of HAFS-SAR is close to zero prior to 72 h, and becomes a rightward bias at longer forecast lead times.
The initial intensity error (6.2 m s⁻¹) of HAFS-SAR is similar to that of GFS, due to the lack of the inner core data assimilation and/or vortex adjustment used in HWRF and HMON. The spin-up during the first 12 h reduces the HAFS-SAR intensity error to 4.1 m s⁻¹, and the improvement over GFS is consistent at longer forecast lead times. When compared to both HWRF and HMON, the HAFS-SAR intensity forecasts demonstrate less skill, likely related to the finer horizontal resolution of the operational HWRF and HMON. The intensity forecasts of HAFS-SAR are improved in additional experiments for individual storms when the horizontal resolution is increased from 3 km to 2 km and 1.5 km. The sensitivity of the intensity forecasts to vortex initialization, model physics, dynamics, and grid specification is also being further investigated.
The verification of gale force, damaging force, and hurricane force wind radii reveals that HAFS-SAR performs better than the other three models for the hurricane force wind (32.9 m s⁻¹) radius. HAFS-SAR over-predicts the size of the TC (e.g., the 17.5 m s⁻¹ gale force wind radius), compared to GFS, HWRF, and HMON. The storm size of HAFS-SAR is sensitive to the horizontal advection schemes of FV3, and the use of a less diffusive scheme is being investigated to improve the HAFS storm size forecasts.
The seasonal statistics for HAFS-SAR track and intensity forecasts were also examined for strong and weak storms, based on the initial intensity. The track forecast error is lower for strong storms, while weak storms have better intensity forecast performance. HAFS-SAR also demonstrates promising results for the track forecasts during two high-impact 2019 TC events: TC Barry and Dorian.
The successful implementation of HAFS-SAR during the 2019 HFIP real-time demonstration is an important step toward building the next-generation high-resolution TC forecasting system. The systematic evaluation of HAFS-SAR demonstrates the great potential of FV3-based convection-allowing regional models for improving TC forecasts. A new data assimilation system is under development for HAFS to further improve the initial vortex structure by assimilating TC inner core observations, which is expected to reduce the intensity forecast errors. The future development of ocean coupling for HAFS will also enhance the TC forecasting capability by introducing a realistic ocean response to TCs.