Evaluating Quantitative Precipitation Forecasts Using the 2.5 km CReSS Model for Typhoons in Taiwan: An Update through the 2015 Season

In this study, 24 h quantitative precipitation forecasts (QPFs) by a cloud-resolving model (with a grid spacing of 2.5 km) on days 1–3 for 29 typhoons in six seasons of 2010–2015 in Taiwan were examined using categorical scores and rain gauge data. The study represents an update from a previous study for 2010–2012, in order to produce more stable and robust statistics toward the high thresholds (typically with fewer sample points), which is our main focus of interest. This is important to better understand the model’s ability to predict such high-impact typhoon rainfall events. The overall threat scores (TS, defined as the ratio of the verification points correctly predicted to reach a given threshold to all points that are observed or predicted to reach that threshold, or both) were 0.28 and 0.18 for day 1 (0–24 h) QPFs, 0.25 and 0.16 for day 2 (24–48 h) QPFs, and 0.15 and 0.08 for day 3 (48–72 h) QPFs at 350 mm and 500 mm, respectively, showing improvements over 5 km models. Moreover, as found previously, the TSs were strongly dependent on event size and were higher for larger rainfall events; the corresponding TSs at 350 and 500 mm for the top 5% of events were 0.39 and 0.25 on day 1, 0.38 and 0.21 on day 2, and 0.25 and 0.12 on day 3. Thus, for the top typhoon rainfall events that have the highest potential for hazards, the model exhibits an even higher ability for QPFs based on categorical scores. Furthermore, it is shown that the model has little tendency to overpredict or underpredict rainfall for all groups of events with different rainfall magnitude across all thresholds, except for some tendency to under-forecast for the largest event group on day 3. Some issues associated with categorical statistics to be aware of are also demonstrated and discussed.


Introduction
The quantitative precipitation forecast (QPF) is one of the most challenging areas in modern numerical weather prediction (e.g., [1][2][3][4][5]), as precipitation is considered the end product of all nonlinear processes involved in the atmosphere. This is especially true for heavy and extreme rainfall events (≥200 and 500 mm in 24 h, respectively), where the responsible weather systems are mostly of mesoscale and can evolve rapidly with time (e.g., [6,7]). One such system of great importance is the tropical cyclone (TC) or At the CWB, the highest threshold verified is 500 mm (per 24 h), and the TSs of day-1 (0-24 h) QPFs were at most ~0.05 for July-September 2014 [23]. However, a few members using the Hurricane WRF (HWRF) Model [10,27] at the TTFRI performed better in 2014, as the highest TS on day 1 reached 0.16 at 500 mm [28], but decreased rapidly to 0.08 and 0.03 on days 2 and 3. Thus, while some variations exist among the TS values (due to different cases and data periods, and perhaps also methodology), the above studies overall indicate that the grid size of 5 km is not fine enough to produce rainfall as heavy as in the observations and, subsequently, to produce hits at high and extreme thresholds. Thus, the QPF performance at thresholds above 350 mm was rather limited in past studies. In addition, only a few of the above studies performed QPF evaluation at forecast ranges beyond day 1 (up to 24 or 30 h). In those that did [22,28], the TSs at most thresholds decreased considerably on day 2 and dropped further on day 3, whereas poor performance existed above ~150 mm.
Since 2010, the Cloud-Resolving Storm Simulator (CReSS) [29,30] at a grid size of 2.5 km has been used to perform routine forecasts at the National Taiwan Normal University (NTNU), Taiwan [31] (referred to as W15 hereafter), and provided to the TTFRI as a forecast member. In [31], results of 24 h QPFs within 3 days for 15 typhoons during 2010-2012 were reported. The overall TSs for all events at 350 and 500 mm were 0.26 and 0.16 on day 1, 0.21 and 0.12 on day 2, and 0.08 and 0.01 on day 3, respectively. These scores from deterministic forecasts over three seasons at least match the best results for single seasons reported above, if not better, and show that typhoon heavy-rainfall QPFs in Taiwan at high thresholds can be improved using a higher model resolution and a larger fine-grid domain.
Moreover, W15 [31] also reported a strong positive dependence of categorical scores on the observed rainfall amount, i.e., event size or magnitude. That is, a larger rainfall area meeting a given threshold results in a higher TS and a greater inferred QPF performance at that threshold. Because of this property, for the rainiest top 5% of typhoon events, the TSs at 350 and 500 mm were 0.32 and 0.34 on day 1 and 0.22 and 0.04 on day 3. Therefore, the skill scores for the top events were higher than those obtained for all events without classification (see [31,32] for details). Toward the high and extreme thresholds, the evaluation of model QPFs for large rainfall events is very important due to their high hazard potential, but such events are rare by definition. This rarity reduces the sample size toward higher thresholds and, therefore, calls for an extension of the verification period to ensure the robustness of the results.
Thus, the objectives of the present study were threefold. First, the results of W15 [31][32][33] for three seasons of 2010-2012 were updated to include three more seasons (2013-2015) and, thus, the sample size was roughly doubled. As discussed above, this is important and necessary to confirm stable results of QPFs, particularly for the top events toward the extreme thresholds. Furthermore, the results for QPFs at longer ranges of days 2-3 were also augmented. Second, a classification scheme different from that of W15 [31] was introduced to isolate increasingly larger events and to assess model QPFs for them. This method is simple and easier to implement for operational use, if needed. Third, some of the issues associated with categorical scores in model QPFs were also demonstrated and discussed using examples, such that future researchers will be more aware of these issues. In Section 2, the model, data, and methodology used in this study are described. In Section 3, a few selected examples of CReSS forecasts during 2013-2015 are presented and discussed, so that the general ability of this model to predict rainfall in Taiwan under normal conditions can be assessed. The categorical scores for all events and the top events are updated in Section 4; lastly, the summary and conclusion are given in Section 5.

The CReSS Model and Its Forecasts
As an extension of work from W15 [31], the same version and configuration of CReSS was used in this study. Thus, only a brief description is provided below, and the readers are referred to W15 [31] for more details. The CReSS model [29,30] is a cloud-resolving model suitable to simulate convective storms at high resolution with parallel computation (e.g., [34][35][36][37][38][39]). The model utilizes a terrain-following vertical coordinate based on height, and it has neither nesting nor cumulus parameterization [29]. Instead, clouds are treated fully explicitly using a bulk cold-rain scheme following the studies of [40][41][42][43][44], with a total of six species (vapor, cloud water, cloud ice, rain, snow, and graupel). Sub-grid scale processes parameterized include turbulent mixing in the planetary boundary layer [30,45] and surface radiation and momentum/energy fluxes with a substrate model [46][47][48].

Data and Methodology
Observational data used in this study were also similar to those used in W15 [31]. Mostly from the CWB, these include the best-track data, weather maps, and radar reflectivity composites. For QPF verification, hourly rainfall data from more than 400 automated rain gauges over Taiwan [56] were used. Figure 1 shows the topography of Taiwan and the locations of these gauges, which are denser in coastal plains than in the mountain interiors.
To extend the results of W15 [31] for three more typhoon seasons in a consistent manner, the methodology to select and classify verification periods also followed the same approach closely. Thus, only 24 h QPFs, either from 12:00-12:00 a.m. UTC or 12:00-12:00 p.m. UTC, covering the warning period issued by the CWB for each typhoon (land and/or sea warning), were selected as our target periods for evaluation. While all typhoons must be included, these periods were checked to confirm that the rainfall in Taiwan was at least partially caused by or linked to the TCs, using weather maps and radar/satellite loops. As a result, a total of 193 time segments from 29 typhoons were selected, as shown in Figure 2, and the 24 h QPFs on day 1 (0-24 h), day 2 (24-48 h), and day 3 (48-72 h) by the CReSS runs at 12:00 a.m. and 12:00 p.m. UTC covering these periods were evaluated. Compared to the 99 segments from 15 typhoons in W15 [31], the sample size here was nearly doubled.
Next, the 193 segments were classified into four groups (A to D) on the basis of the observed 24 h rainfall using the same criteria as W15 [31]. That is, when at least 50 gauge sites in Taiwan reached 100, 50, or 25 mm, the segment was classified as group A, B, or C, respectively. Segments that failed to reach the group C standard were classified as group D. Thus, the magnitude of the rainfall decreases from group A to D. As given in Table 2, the numbers of segments following the order A-D were 55, 39, 47, and 52, respectively; thus, they are quite comparable. The total data points were 86,016 for the 193 segments, averaging near 446 points (gauge sites) per segment. While the four groups were exclusive to each other, a top 10 group (denoted as T10) was also selected from group A as a subset for the top 10 segments (Table 2). Thus, T10 is the rainiest part of group A and represents roughly the top 5% of all samples (out of 193 segments). The above classification allowed for a proper examination of the dependence of QPF skill on rainfall magnitude, and the related results are presented in Section 4.
Table 2. List of the 29 typhoon cases, their data period, number of 24 h segments (12:00-12:00 a.m. and 12:00-12:00 p.m. UTC), and classification (group A-D or T10, in chronological order) included in this study. The 10 segments in group T10 are denoted by "T" in the classification, and this group is a subset of group A. TY Lionrock shares a period (brackets) with Namtheun, and it is counted only once. A summary of the sample size is given at the bottom. The cases in 2010-2012 were the same as W15 [31].
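The grouping rule for segments described above (at least 50 gauge sites reaching 100, 50, or 25 mm for groups A, B, or C, and group D otherwise) can be illustrated with a minimal sketch. The function and constant names are hypothetical and are not taken from the study's own software; "reached" is assumed here to mean rainfall greater than or equal to the threshold. The selection of the T10 subset is not reproduced, as its exact ranking criterion is not spelled out in this section.

```python
from typing import Sequence

# Group criteria from the text: 24 h rainfall thresholds (mm), checked from
# the rainiest group downward. The names below are illustrative assumptions.
GROUP_THRESHOLDS_MM = (("A", 100.0), ("B", 50.0), ("C", 25.0))
MIN_SITES = 50  # "at least 50 gauge sites"


def classify_segment(rain_mm: Sequence[float]) -> str:
    """Return the group (A, B, C, or D) of one 24 h segment.

    rain_mm holds the observed 24 h rainfall (mm) at each gauge site.
    """
    for group, threshold in GROUP_THRESHOLDS_MM:
        # Count sites reaching the threshold (assumed to mean >=).
        n_sites = sum(1 for r in rain_mm if r >= threshold)
        if n_sites >= MIN_SITES:
            return group
    # Segments failing the group C standard fall into group D.
    return "D"


# Hypothetical segment: 30 sites at 120 mm, 30 sites at 60 mm, rest dry.
example = [120.0] * 30 + [60.0] * 30 + [0.0] * 386
print(classify_segment(example))  # only 30 sites reach 100 mm, 60 reach 50 mm
```

Checking the thresholds from high to low keeps the four groups mutually exclusive, as stated in the text.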

Categorical Scores for Model QPFs
Again, as in W15 [31], the categorical scores based on the 2 × 2 contingency table [11][12][13][14] were employed to verify model QPFs. At any verification point, the outcome of a prediction to reach a given rainfall threshold over an accumulation period (called an event) can be one of four possibilities: hit (H, event predicted and occurred), miss (M, event occurred but not predicted), false alarm (FA, event predicted but not occurred), and correct negative (CN, event neither predicted nor occurred). By counting the number of points falling into each category among a total of N points (N = H + M + FA + CN) in the verification domain, the TS mentioned in Section 1 and the bias score (BS) can be computed as

$$\mathrm{TS} = \frac{H}{H + M + \mathrm{FA}},$$

$$\mathrm{BS} = \frac{F}{O} = \frac{H + \mathrm{FA}}{H + M}.$$

Thus, TS is the fraction of successful prediction of event occurrences (rainfall ≥ the threshold) among all events that are observed and/or predicted, where 0 ≤ TS ≤ 1 (the higher, the better). On the other hand, BS is the ratio of the number of events in model prediction (F = H + FA) to that which actually occurred (O = H + M), thus reflecting overprediction if BS > 1 and underprediction if BS < 1. The ideal value of BS is unity. Typically, at least both TS and BS need to be inspected to allow for a better understanding of how the model performs in QPFs, and this is what we do below. Here, a wide range of 24 h rainfall thresholds were used, from 0.05 to 1000 mm. Lastly, note that the TS and BS were computed at the rain-gauge sites where correct observations were available (see Figure 1), by interpolating model QPFs onto these locations as in W15 [31].
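The two scores defined above can be illustrated with a short sketch that builds the 2 × 2 contingency table at one threshold and derives TS and BS from it. The function names and the five-point example values are hypothetical and chosen only for illustration; the forecast values are assumed to have already been interpolated to the gauge sites.

```python
from typing import Sequence, Tuple


def contingency_counts(
    obs_mm: Sequence[float], fcst_mm: Sequence[float], threshold_mm: float
) -> Tuple[int, int, int, int]:
    """Count hits, misses, false alarms, and correct negatives (H, M, FA, CN)."""
    if len(obs_mm) != len(fcst_mm):
        raise ValueError("observations and forecasts must have equal length")
    h = m = fa = cn = 0
    for obs, fcst in zip(obs_mm, fcst_mm):
        occurred = obs >= threshold_mm    # event observed at this site
        predicted = fcst >= threshold_mm  # event predicted at this site
        if occurred and predicted:
            h += 1
        elif occurred:
            m += 1
        elif predicted:
            fa += 1
        else:
            cn += 1
    return h, m, fa, cn


def threat_score(h: int, m: int, fa: int) -> float:
    """TS = H / (H + M + FA); undefined (NaN) if no event is observed or predicted."""
    denom = h + m + fa
    return h / denom if denom > 0 else float("nan")


def bias_score(h: int, m: int, fa: int) -> float:
    """BS = (H + FA) / (H + M); undefined (NaN) if no event is observed."""
    observed = h + m
    return (h + fa) / observed if observed > 0 else float("nan")


# Hypothetical 24 h rainfall (mm) at five gauge sites, verified at 500 mm.
obs = [600.0, 400.0, 10.0, 0.0, 520.0]
fcst = [550.0, 510.0, 0.0, 5.0, 300.0]
h, m, fa, cn = contingency_counts(obs, fcst, 500.0)
print(f"H={h}, M={m}, FA={fa}, CN={cn}")
print(f"TS = {threat_score(h, m, fa):.2f}, BS = {bias_score(h, m, fa):.2f}")
# prints "TS = 0.33, BS = 1.00"
```

In this example, BS equals unity even though only one of the two observed events is hit, which is one reason why TS and BS need to be inspected together.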

Examples of CReSS Forecasts
A few examples of CReSS forecasts during the added period are shown and discussed in this section. These examples are from Typhoon (TY) Soulik (2013), which approached Taiwan from the east-southeast along a commonly seen type of track. From the examples, a general idea can be obtained about the model's capability in simulating typhoons and their evolution near Taiwan, and subsequently in the 24 h QPFs over Taiwan. Figure 3 depicts the track of TY Soulik (2013) and the reflectivity composites from land-based radars in Taiwan at selected times every 5-6 h during its passage over 12-13 July 2013 (left column), and compares them with the model prediction (of track and rainfall structure) made at the initial time (t0) of 12:00 p.m. UTC 10 July 2013. On the left panels of Figure 3, one can see that TY Soulik (2013) approached northern Taiwan from the southeast at a speed of close to 30 km·h⁻¹, and its center made landfall across the northernmost part of Taiwan. Despite their limited field of view at longer distances, the radar composites nonetheless indicate that the rainfall associated with Soulik was somewhat asymmetric and more to the south than the north of its center during approach (Figure 3a), and became more concentrated over the windward slopes of Taiwan (see Figure 1) during and shortly after landfall (Figure 3c,e). As Soulik moved away and the overall rainfall gradually weakened, rainbands that aligned in a northeast-southwest direction were present across Taiwan (Figure 3g). The forecast initialized at 12:00 p.m. UTC 10 July (right column, Figure 3), while not always at the same time as the radar observations shown, suggests that the CReSS model captured the evolution of TY Soulik quite well, even in a range of 48-67 h on day 3. One can see that the track was well reproduced (with a timing difference within 2 h), and the rainfall structure of the TC and around Taiwan also compared quite favorably with the radar observations, including the heavy rainfall over the windward slopes around landfall (Figure 3d,f) and the rainbands in the wake of the storm (Figure 3h).
The 24 h total rainfall distributions over Taiwan from rain-gauge observations for five segments from 12:00 p.m. UTC 10 to 12:00 p.m. UTC 15 July 2013 are shown in Figure 4a-e, and they indicate that the rainfall from TY Soulik was most concentrated from 12:00 p.m. UTC 12 to 12:00 p.m. UTC 13 July, with a peak amount of 875.5 mm (Figure 4c) over the Snow Mountain Range (SMR, cf. Figure 1). While this 24 h segment belonged to group T10 (and group A), other adjacent segments, being much less rainy, could only be classified as group C or D at most. In the second row, Figure 4f-h depict the rainfall distributions on days 1-3 from the run starting at 12:00 p.m. UTC 10 July, i.e., the one shown in Figure 3 (right column). One can see that, on the third day (48-72 h) of this run, the overall rainfall pattern predicted by the 2.5 km CReSS was very good, with a peak amount of 957.9 mm and only some minor disagreements with the observation. The results from the two runs 24 and 48 h later (with t0 at 12:00 p.m. UTC 11 and 12:00 p.m. UTC 12 July, respectively) are shown in the third and fourth rows of Figure 4; for these runs, the rainiest target period fell on day 2 and day 1, respectively. Again, the model performed quite well in its rainfall prediction for this period, with peak amounts just over 1000 mm (Figure 4j,l). For other days where it was less rainy, the agreement between forecasts and observations could not be judged as well by visual inspection. However, the QPFs for these less rainy segments carry less significance, as the most important ones should be those made for the rainiest period (i.e., 12:00 p.m. UTC 12 to 12:00 p.m. UTC 13 July) in the event of TY Soulik (2013).
Atmosphere 2021, 12, x FOR PEER REVIEW
The TS and BS across 15 thresholds from 0.05 to 1000 mm for the three experiments shown in Figure 4 (rows 2-4), i.e., those made at 12:00 p.m. UTC on 10, 11, and 12 July, are presented and examined in Figure 5. As mentioned, these three runs all resulted in a fairly good rainfall forecast for the rainiest 24 h, i.e., on day 3 of the run on 10 July (Figure 5a, blue), day 2 of the run on 11 July (Figure 5b, red), and day 1 of the run on 12 July (Figure 5c, black). These TSs were high and at least close to 0.7 up to 250 mm and above 0.4 at 750 mm. Such scores were much higher than those of the QPFs made for other 24 h segments, which decreased to zero at or before 130 mm without exception. Similarly, the BS curves were also closest to the ideal when the period from 12:00 p.m. UTC 12 to 12:00 p.m. UTC 13 July was targeted (Figure 5, right column). For other days that were less rainy, the BS tended to deviate far above or below unity. With the added information in Figure 5, including the classification group and hit rate H/N (left column), the peak 24 h rainfall amount, and the observed base rate O/N (i.e., rainfall area size) and where it reaches 10%, one can readily recognize that, in the case of TY Soulik (2013), the model QPFs could be verified to be of high quality, and they performed substantially better when the magnitude of the rainfall event during the target period was greater, i.e., with a large rainfall area at a relatively high threshold.
Another typhoon, TY Fung-Wong (2014), was also examined. It approached Taiwan from the south very slowly, and this track type is less frequent. Although the TS values were somewhat lower, the overall results for Fung-Wong (figures not shown) were similar to those obtained earlier for TY Soulik (2013). From the above discussion and the TS and BS curves shown in the examples (Figures 3-5), one can see that, in successive runs where the typhoon was captured by the model in a more-or-less similar way (i.e., no major differences in the simulations), the magnitude of the rainfall event appeared to exert a strong control on the categorical statistics (and the performance of derived model QPFs), especially in the TS. This dependence is an important aspect investigated in this study, and it is further elaborated below. As stressed by W15 [31], these examples also show that computing the scores for individual segments first and then taking the arithmetic average is problematic, as this creates biases toward the smaller and less important events. This is particularly true for the BS, which can be very unstable in small events with few points reaching a given threshold.
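The difference between averaging per-segment scores and pooling all points into one contingency table can be shown with a small numerical sketch. The two sets of counts below are hypothetical (one large event and one small event at the same threshold) and are not taken from the study; the function names are likewise illustrative.

```python
from typing import List, Tuple

# (hits, misses, false alarms) for one segment at a single threshold.
Counts = Tuple[int, int, int]


def threat_score(h: int, m: int, fa: int) -> float:
    """TS = H / (H + M + FA)."""
    return h / (h + m + fa)


def bias_score(h: int, m: int, fa: int) -> float:
    """BS = (H + FA) / (H + M)."""
    return (h + fa) / (h + m)


def mean_of_segment_scores(segments: List[Counts]) -> Tuple[float, float]:
    """Score each segment separately, then take the arithmetic average."""
    ts_values = [threat_score(*c) for c in segments]
    bs_values = [bias_score(*c) for c in segments]
    return sum(ts_values) / len(ts_values), sum(bs_values) / len(bs_values)


def pooled_scores(segments: List[Counts]) -> Tuple[float, float]:
    """Sum the counts over all segments first, then compute one TS and BS."""
    h = sum(c[0] for c in segments)
    m = sum(c[1] for c in segments)
    fa = sum(c[2] for c in segments)
    return threat_score(h, m, fa), bias_score(h, m, fa)


# Hypothetical counts: a large, well-predicted event and a small event in
# which only one site observed the threshold but four sites predicted it.
segments = [(80, 10, 10), (0, 1, 4)]

mean_ts, mean_bs = mean_of_segment_scores(segments)
pool_ts, pool_bs = pooled_scores(segments)
print(f"mean of segments: TS = {mean_ts:.2f}, BS = {mean_bs:.2f}")
print(f"pooled table:     TS = {pool_ts:.2f}, BS = {pool_bs:.2f}")
# prints TS = 0.40, BS = 2.50 for the mean and TS = 0.76, BS = 1.03 when pooled
```

In this sketch, the small event with a single observed point dominates the averaged BS, whereas pooling gives every verification point the same weight.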

Updated Results of 2010-2015
Following the methodology in Section 2, the overall TSs from CReSS QPFs starting at 12:00 a.m. and 12:00 p.m. UTC for the 193 segments and 29 typhoons (denoted as "all") and those for individual groups A-D and T10 (see Table 2) in the three ranges of days 1-3 are presented in Figure 6. For each group, the entries from all 24 h periods were combined to form one contingency table to compute the scores (so that each of the 86,016 data points carries the same weight). As in W15 [31], one can immediately see that, while each curve nearly always decreased with rainfall threshold, the TSs were higher in group A than B, higher in group B than C, etc., following the order among the four exclusive groups (in black, red, blue, and green), regardless of forecast range or lead time (Figure 6a-c). Naturally, the "all group" (gray) had TS values somewhere in between those of A and D, whereas they became closer to those from the larger events (i.e., group A) toward the high thresholds. Compared to the TSs of group A, the T10 curve (orange) was even higher, as expected. For days 1 and 2, the TSs from two earlier forecasts by the 4 km CReSS for TY Morakot (2009) at the time (from [36,57]) and targeted for the 24 h on 8 August (in UTC) are also plotted as purple dots at available thresholds. As TY Morakot (2009) was an even larger and more extreme event (over 1650 mm on 8 August), the TSs were higher. Thus, it is confirmed that rainfall area size or event magnitude (see Figure 6d) exerted a strong control on the TS, and the larger events tended to have higher TSs at the same set of rainfall thresholds for the typhoon regime in Taiwan. Recently, the same dependence in the Mei-yu regime was also confirmed [58].
For each group, the entries from all 24 h periods are combined to form one contingency table to compute the scores (so that each of the 86,016 data points carries the same weight). As in W15 [31], one can immediately see that, while each curve nearly always decreased with rainfall threshold, the TSs were higher in group A than B, higher in group B than C, etc., following the order among the four exclusive groups (in black, red, blue, and green), regardless of forecast range or lead time (Figure 6a-c). Naturally, the "all group" (gray) had TS values somewhere in between those of A and D, whereas they became closer to those from the larger events (i.e., group A) toward the high thresholds. Compared to the TSs of group A, the T10 curve (orange) was even higher as expected. In the range of day 1 and 2, the TSs from two earlier forecasts by the 4 km CReSS for TY Morakot (2009) at the time (from [36,57]) and targeted for the 24 h on 8 August (in UTC) are also plotted as purple dots at available thresholds. As TY Morakot (2009) was an even larger and more extreme event (over 1650 mm on 8 August), the TSs were higher. Thus, it is confirmed that rainfall area size or event magnitude (see Figure 6d) exerted a strong control on the TS, and the larger events tended to have higher TSs at the same set of rainfall thresholds for the typhoon regime in Taiwan. Recently, the same dependence in the Mei-yu regime was also confirmed [58]. While the overall curves (similar to "all group" here) are often the only ones examined, it is clear in Figure 6a-c that the larger events (group A, group T10, and for Morakot) had considerable higher TSs across the low and middle thresholds, as well as even the high thresholds at times (Figure 6b). For instance, the "all" curve on day 1 started from 0.73 at 0.05 mm and reached 0.34 at 250 mm and 0.18 at 500 mm, whereas the T10 curve was at 1.00, 0.50, and 0.25 at the same three thresholds (Figure 6a). 
Over a longer range involving day 2 (where all forecasts started 24 h earlier than those for day 1), the TS values remained the same or were only barely lower (Figure 6b) than those on day 1. This indicates that the model exhibited nearly the same performance on day 2 (24-48 h) as on day 1 (0-24 h), which is impressive compared to the results of 5 km models reviewed in Section 1. The decrease in performance was more visible only on day 3 (Figure 6c), where the overall TSs (all group) were 0.20 at 250 mm and 0.08 at 500 mm (0.33 and 0.12 at the same thresholds in the T10 group). Typically, the QPFs can be considered to have a certain level of skill when the TS reaches 0.2 [58]. Using this value, one can say that the 2.5 km CReSS possesses a skill above this level up to 500 mm on day 1, around 450 mm on day 2, and around 250 mm on day 3 in typhoon QPFs in Taiwan. For the rainiest events (T10, roughly the top 5% of events), these values increased to 600, 500, and 400 mm on days 1-3, respectively.
Again, we note that the TS values across the high thresholds in Figure 6 were considerably higher than those by 5 km models reviewed in Section 1, especially on days 2-3, and they were, in general, also slightly higher than those reported in W15 [31]. This latter improvement over the seasons of 2010-2012 is presumably linked to the larger model domain since 2012 (Figure 2 and Table 2) and the improving quality of IC/BCs from the NCEP GFS with time.
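The pooling of all 24 h periods into one contingency table per group can be illustrated with a minimal Python sketch. It assumes the TS definition given in words in the abstract, TS = H/(H + M + FA), and treats "reaching" a threshold as greater than or equal to it; the class name, function names, and rainfall values are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
import math


@dataclass
class ContingencyTable:
    hits: int = 0               # H: observation and forecast both reach the threshold
    misses: int = 0             # M: only the observation reaches it
    false_alarms: int = 0       # FA: only the forecast reaches it
    correct_negatives: int = 0  # CN: neither reaches it

    def add_segment(self, observed, forecast, threshold):
        """Tally one 24 h segment, point by point, into this table."""
        for obs, fcst in zip(observed, forecast):
            if obs >= threshold and fcst >= threshold:
                self.hits += 1
            elif obs >= threshold:
                self.misses += 1
            elif fcst >= threshold:
                self.false_alarms += 1
            else:
                self.correct_negatives += 1

    def threat_score(self):
        """TS = H / (H + M + FA); NaN when no point reaches the threshold."""
        denom = self.hits + self.misses + self.false_alarms
        return self.hits / denom if denom else math.nan


def pooled_threat_score(segments, threshold):
    """Combine all segments of a group into one table, then compute the TS."""
    table = ContingencyTable()
    for observed, forecast in segments:
        table.add_segment(observed, forecast, threshold)
    return table.threat_score()


# Made-up rainfall values (mm per 24 h) at a handful of gauge sites.
example_segments = [
    ([600, 300, 100, 0], [550, 400, 50, 0]),
    ([120, 10], [90, 300]),
]
ts_250 = pooled_threat_score(example_segments, 250)  # H=2, M=0, FA=1
ts_500 = pooled_threat_score(example_segments, 500)  # H=1, M=0, FA=0
```

Because the counts are summed before the ratio is taken, every verification point carries the same weight, which differs from averaging the TSs of individual segments.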

Results from a Simple Classification Scheme Using Peak Rainfall Amount
While the above results using the exclusive groups A-D in Section 4.1 are informative and clearly demonstrate the dependence of TSs on event magnitude, a different classification scheme was used in this section. Here, the classification beyond the "all" group simply used the observed peak rainfall amount in the 24 h segments to select those reaching 200, 350, 500, and 750 mm, respectively. Therefore, these groups were inclusive, and each higher class (with larger events) was a subset of the class below it. Using this simple method for classification, the results are presented in Figure 7. As indicated in the inserts, the classes with a peak rainfall reaching 200, 350, 500, and 750 mm had 98, 52, 26, and 14 segments; thus, the sample size of each class was roughly half that of the class below it.
In Figure 7, one can again see that the larger events exhibited higher TSs across the same set of thresholds, regardless of whether the range was day 1 (0-24 h), day 2 (24-48 h), or day 3 (48-72 h). Because these groups were inclusive, the differences in TSs between curves were not as large as in Figure 6. Toward the highest threshold, the "all" curves became closer and closer to those from the highest class (i.e., with a peak amount reaching 750 mm), because such big events were almost the only ones to contribute data points to the categorical statistics at these high thresholds. While the TS curves for all events ("all group") and TY Morakot (days 1-2 only) were the same as in Figure 6, those for segments with a peak amount ≥750 mm were only slightly lower than the T10 curves (at the corresponding range), since the former contained a few more time segments (14 in total). Since Figure 7 shows the TS results using a simple and intuitive classification method, it is perhaps a good way to examine the performance of model QPFs for increasingly larger events, especially as a routine practice.
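The inclusive classification by observed peak rainfall can be sketched as follows. This is only an illustration under stated assumptions: the function name, the segment identifiers, and the peak values are hypothetical, and only the four class thresholds (200, 350, 500, and 750 mm) come from the text.

```python
PEAK_CLASSES_MM = (200, 350, 500, 750)  # class thresholds quoted in the text


def inclusive_classes(segment_peaks, class_thresholds=PEAK_CLASSES_MM):
    """Map each class threshold to the segments whose observed peak reaches it.

    segment_peaks: dict of segment id -> observed peak 24 h rainfall (mm).
    A segment that qualifies for a higher class also belongs to every lower one.
    """
    return {
        thr: [sid for sid, peak in segment_peaks.items() if peak >= thr]
        for thr in class_thresholds
    }


# Made-up peak amounts (mm) for five hypothetical segments.
example_peaks = {"s1": 120, "s2": 260, "s3": 410, "s4": 530, "s5": 980}
classes = inclusive_classes(example_peaks)
class_sizes = [len(classes[thr]) for thr in PEAK_CLASSES_MM]
```

Each class is nested within the one below it, so the sample shrinks gradually instead of being split into disjoint groups that may hold few or no points at the high thresholds.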
Figure 8 shows the BS curves corresponding to the groups in Figure 7 using the inclusive classification method. Overall, these curves indicate that the BS values for the "all group" were quite stable across the thresholds within 500 mm, with slight overprediction (1 ≤ BS ≤ 1.2) on day 1 (Figure 8a) and nearly perfect values (1 ≤ BS ≤ 1.1) on days 2 and 3 (Figure 8b,c). Only toward the extreme thresholds, where the data points become fewer, did the BS values become more unstable and show some overprediction on days 1 and 3, but not much on day 2 (Figure 8). As pointed out earlier, the total number of data points was 86,016 from all segments; for example, 50,325 of them in observation reached 2.5 mm, but only 278, 33, and three points reached 500, 750, and 1000 mm, respectively. To put this in perspective, the probability of reaching 1000 mm in our study period for 29 TCs was extremely low (0.0035%). At 1000 mm, since the denominator (i.e., O = H + M) in Equation (2) is so small, one can see how a BS of 3.33 is completely understandable (Figure 8a) if the model produces a total of 10 points reaching 1000 mm. Such a BS value, however, would be interpreted as serious overprediction at a low threshold, where the data points are ample. Because the BS is unstable when the sample is small, it is not suitable for small samples (such as individual segments), as discussed, and special caution must also be exercised in the interpretation of its results.
In Figure 8, for the larger events, the overprediction on day 1 was less, the BSs were very good on day 2, and some under-forecasts occurred on day 3. This indicates that, over the longer range, there is a tendency to under-forecast the rainfall if the event turns out to be one of high accumulation. In any case, the BS curves overall indicated good model performance across the thresholds in rainfall amount, especially in the range of 24-48 h (day 2). As discussed and shown earlier in the examples, since the data points tend to be too few (or even none) toward the high thresholds in groups B-D (see Figure 6d), the BS results from the exclusive classification method are not representative (in some threshold ranges), and Figure 8 shown here is a more suitable way to evaluate the BSs.
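The arithmetic behind the 1000 mm example can be checked with a short Python sketch. It assumes the standard form of the bias score, BS = (H + FA)/(H + M), which is consistent with the denominator O = H + M cited for Equation (2); the function name and the split of the 10 forecast points into hits and false alarms are hypothetical, since the BS depends only on the two totals.

```python
import math


def bias_score(hits, misses, false_alarms):
    """BS = F / O = (H + FA) / (H + M); NaN when nothing is observed."""
    observed = hits + misses
    forecast = hits + false_alarms
    return forecast / observed if observed else math.nan


TOTAL_POINTS = 86_016      # all verification points, from the text
OBSERVED_AT_1000_MM = 3    # observed points reaching 1000 mm, from the text

# Ten forecast points against three observed points at 1000 mm.
# The hit count (1 here) is an arbitrary illustrative choice.
bs_1000 = bias_score(hits=1, misses=2, false_alarms=9)

# Observed frequency of reaching 1000 mm, in percent.
frequency_percent = 100 * OBSERVED_AT_1000_MM / TOTAL_POINTS

# One extra forecast point shifts the BS by 1/O, so the sensitivity at
# 1000 mm (O = 3) dwarfs that at 2.5 mm (O = 50,325).
step_at_1000_mm = 1 / OBSERVED_AT_1000_MM
step_at_2p5_mm = 1 / 50_325
```

With only three observed points, a single additional forecast point changes the BS by about 0.33, whereas the same change at 2.5 mm is of the order of 0.00002.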

Dependence of TS on Rainfall Area Size
From previous sections, we confirmed that the larger rainfall events from typhoons tend to possess higher TSs across a fixed set of thresholds than those from all events during the same verification period in Taiwan (Figures 6 and 7). In this section, we further investigate the influence of rain-area size on the TSs among groups A-D and T10. If the TSs for the same rain-area size (instead of at a fixed rainfall threshold) among different groups are comparable, it would imply that the above dependence results solely from the variation in rain-area size. On the other hand, if the TSs are still higher in larger groups for events with the same rain-area size, this would indicate that the model indeed possesses a higher ability to predict events of greater accumulation, which are typically under stronger forcing at the synoptic scale and the mesoscale. To do this, a procedure similar to W15 [31,32] was used. For each 24 h segment (Table 2), the observed rainfall amounts at all sites were sorted and ranked to identify a set of new thresholds that gave certain percentages of rain-gauge sites, i.e., areal coverage (O/N) in Taiwan, namely 99%, 95%, 90% to 10% (in steps of 10%), 5%, 3%, 2%, and 1%. From 99% to 1% in size, these 15 percentiles correspond to rainfall thresholds from low to high for each segment. Using this new set of thresholds (different for each segment), the numbers of H, M, FA, and CN could be obtained at each rain-area size in terms of the percentages of O/N, and the TSs could eventually be computed at these rain-area sizes from one combined 2 × 2 contingency table for each group (A-D or T10) as before. Essentially, the rainfall thresholds were standardized using the fixed rain-area sizes in the above process.
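The standardization by rain-area size can be sketched in Python as below. This is a minimal illustration, not the procedure of W15: the rounding of the site count, the handling of ties, the function names, and the synthetic rainfall values are all assumptions; only the 15 areal-coverage percentages come from the text.

```python
import math

# Areal coverages O/N (%) listed in the text, from large to small rain area.
AREA_PERCENTS = (99, 95, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 3, 2, 1)


def area_thresholds(observed, percents=AREA_PERCENTS):
    """Per-segment rainfall thresholds reached by each percentage of sites."""
    ranked = sorted(observed, reverse=True)
    n = len(ranked)
    thresholds = {}
    for p in percents:
        k = max(1, round(n * p / 100))  # number of sites that should reach it
        thresholds[p] = ranked[k - 1]   # rainfall at the k-th wettest site
    return thresholds


def pooled_ts_at_area(segments, percent):
    """TS at one rain-area size, pooled over segments with their own thresholds."""
    hits = misses = false_alarms = 0
    for observed, forecast in segments:
        thr = area_thresholds(observed, (percent,))[percent]
        for obs, fcst in zip(observed, forecast):
            if obs >= thr and fcst >= thr:
                hits += 1
            elif obs >= thr:
                misses += 1
            elif fcst >= thr:
                false_alarms += 1
    denom = hits + misses + false_alarms
    return hits / denom if denom else math.nan


# Synthetic segment: 100 sites observing 1..100 mm, forecast 5 mm too wet.
example_obs = list(range(1, 101))
example_fcst = [value + 5 for value in example_obs]
example_thresholds = area_thresholds(example_obs)
ts_10_percent = pooled_ts_at_area([(example_obs, example_fcst)], 10)
```

In this synthetic case, the 10% rain area corresponds to a threshold of 91 mm, which 10 observed and 15 forecast points reach, giving a TS of 10/15.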
After the standardization, the TSs for A-D and T10 groups are presented in Figure 9a-c. Now, the horizontal axis is the rain-area size (O/N in %) from large to small; thus, the impact of different rain-area sizes in events of different groups is eliminated. Even so, the TSs were higher in the larger accumulation groups across the thresholds of rain-area sizes in all three ranges of days 1-3, following the order of T10 ≥ A ≥ B ≥ C ≥ D almost exclusively (Figure 9a-c). The "all" curves from all segments were between those for groups B and C. Over the longer range of day 3, the differences among the groups also became smaller, especially across the middle thresholds, from roughly 80% to 40% in rain-area sizes. In Figure 9a-c, the TS results also indicate that the model is considerably more capable of predicting the T10 events toward the high thresholds (with smaller rain areas) compared to lower accumulation groups. For example, for rain areas that occupied only 10% and 2% of Taiwan (in terms of the percentages of verification points) in the T10 group, the TSs on day 1 were 0.38 and 0.19, respectively, where the mean rainfall thresholds were about 350 and 530 mm (Figure 9d). Similarly, the TSs for the same targets on day 2 were 0.35 and 0.21 (even higher than on day 1), but dropped to 0.25 and 0.10 on day 3. Even on day 3, the value of 0.25 at a threshold of 350 mm, which is the same as in Figure 6c, still suggests a certain skill level that is quite high. In Figure 9d, the mean rainfall thresholds in different groups as a function of rain-area size are shown, and they were much higher in larger groups, especially toward the high threshold (smaller rain-area size). Overall, Figure 9 indicates that the 2.5 km CReSS model is indeed more skillful in predicting larger typhoon rainfall events, and more factors than just the rain-area size are involved in this dependence. Some aspects were discussed in W15 [31,58] but are beyond the scope of the present update study.
Figure 9. (a-c) As in Figure 6a-c, but showing the TS for groups A-D, group T10, and all typhoon events as a function of observed rain-area size (%), instead of rainfall threshold, for (a) day 1, (b) day 2, and (c) day 3. (d) As in (a), but for mean threshold of the six groups (same for days 1-3). The total numbers of 24 h segments in each group are given in parentheses.

Conclusion and Summary
In this study, 24 h QPFs by the cloud-resolving 2.5 km CReSS model (initialized at 0000 and 1200 UTC) over three ranges of day 1 (0-24 h), day 2 (24-48 h), and day 3 (48-72 h) during warning periods of 29 typhoons in Taiwan in six seasons of 2010-2015 (193 24 h time segments in observation) were verified and examined using categorical skill scores. The study is an update from W15 [31][32][33] for 15 typhoons during 2010-2012 (99 segments), and the sample size was roughly doubled in order to produce more stable statistics toward the high and extreme thresholds (up to 1000 mm per 24 h) and to better understand the capability of the model to predict these high-impact rainfall events. The major conclusions are summarized below.
(i) The overall TS values of day 1 (0-24 h) QPFs for all events were 0.34, 0.28, and 0.18 at 250, 350, and 500 mm, respectively, and the corresponding scores at the three thresholds were 0.31, 0.25, and 0.16 on day 2, and 0.20, 0.15, and 0.08 on day 3. Compared to results from contemporary studies of 5 km models (often from fewer samples for a single season), the above TS values are higher and represent considerable improvement, especially toward the high thresholds and at ranges beyond day 1. In particular, the day 2 scores are only slightly lower than those of day 1, suggesting a comparable model QPF skill at 24-48 h in relation to 0-24 h.
(ii) The dependence found in W15 [31], i.e., higher TSs in larger rainfall events, was also evident in our results here, as expected, and this means a further improved ability to produce QPFs for typhoons with greater rain accumulations in Taiwan. After classification, the TSs for the T10 group (roughly top 5% of events) on day 1, again at 250, 350, and 500 mm, were 0.50, 0.39, and 0.25, respectively, while the corresponding scores were 0.49, 0.38, and 0.21 on day 2, and 0.34, 0.25, and 0.12 on day 3. Using a different and simple classification scheme based on the observed peak rainfall amount, the TSs for the top class (about top 7%, with peak rainfall ≥750 mm) were also similar or slightly lower, indicating that these results are stable and robust. Thus, for the top typhoon rainfall events that have the highest potential for hazards, the 2.5 km CReSS exhibits an improved ability to produce QPFs on the basis of categorical statistics.
(iii) The classification method based on the observed peak rainfall amount successively filters out subsets of samples with heavier rainfall, and the situations of insufficient points in samples are avoided (as much as possible) even toward the high thresholds. The resultant groups are inclusive and, thus, better suited for categorical statistics, particularly the BS.
Overall, the BSs of the 2.5 km CReSS are quite good and especially ideal on day 2, and they show stable results close to unity for all groups across all thresholds with sufficient data points. Thus, the model does not have a tendency to underpredict rainfall even toward the highest thresholds and is capable of producing extreme rainfall. For the larger events, nonetheless, there is a slight tendency to under-forecast rainfall toward the higher thresholds on day 3.
Overall, the 2.5 km CReSS herein shows improved typhoon QPFs over coarser 5 km models in Taiwan. Therefore, a further increase in model resolution, perhaps down to a grid size at the kilometer or finer scale, similar to the HWRF model [59,60], may potentially further improve the QPF performance (in categorical statistics) in Taiwan to some extent. While this question remains to be answered, some related studies are currently ongoing, and their results will be reported when available.