An Evaluation of the Performance of Sea Ice Thickness Forecasts to Support Arctic Marine Transport

: In response to declining sea ice cover, human activity in the Arctic is increasing, with access to the Arctic Ocean becoming more important for socio-economic reasons. Accurate knowledge of sea ice conditions is therefore becoming increasingly important for reducing the risk and operational cost of human activities in the Arctic. Satellite-based sea ice charting is routinely used for tactical ice management, but the marine sector does not yet make optimal use of sea ice thickness (SIT) or sea ice concentration (SIC) forecasts on weekly timescales. This is because forecasts have not achieved sufﬁcient accuracy, veriﬁcation and resolution to be used in situations where maritime safety is paramount, and assessing the suitability of forecasts can be difﬁcult because they are often not available in the appropriate format. In this paper, existing SIT forecasts currently available on the Copernicus Marine Service (CMS) or elsewhere in the public domain are evaluated for the ﬁrst time. These include the seven-day forecasts from the UK Met Ofﬁce, MET Norway, the Nansen Environmental and Remote Sensing Center (NERSC) and the European Centre for Medium-Range Weather Forecasts (ECMWF). Their forecast skills were assessed against unique in situ data from ﬁve moorings deployed between 2016 and 2019 by the Barents Sea Metocean and Ice Network (BASMIN) and Barents Sea Exploration Collaboration (BaSEC) Joint Industry Projects. Assessing these models highlights the importance of data assimilation in short-term forecasting of SIT and suggests that improved assimilation of sea ice data could increase the utility of forecasts for navigational purposes. This study also demonstrates that forecasts can achieve similar or improved correlation with observations when compared to a persistence model at a lead time of seven days, providing evidence that, when used in conjunction with sea ice charts, SIT forecasts could provide valuable information on future sea ice conditions.


Introduction
In response to declining sea ice cover, human activity in the Arctic is increasing, with access to the Arctic Ocean becoming more important for socio-economic purposes. These purposes include commercial ventures such as tourism, fishing, shipping, mineral and oil extraction, along with activities of importance to local communities such as subsistence hunting, search and rescue and community resupply.
Furthermore, with the International Maritime Organization (IMO) Polar Code [1] and the ongoing construction of new ice class ships, increased activity in all economic sectors is expected [2]. The Polar Code now mandates that 'ships shall have the ability to receive up-to-date information including ice information for safe navigation' and requires Table 1. Sea-ice forecasting systems available from the Copernicus Marine Service (CMS), Copernicus Climate Change Service (C3S) and European Centre for Medium-Range Weather Forecasts (ECMWF). Forecasts are labelled as either long-range (LR), extended-range (ER) or short-and medium-range (SMR). Three of the models in this table are evaluated later in this study: the Coupled atmosphereland-ocean-ice Data Assimilation (CPLDA) system, the Towards an Operational Prediction System for the North Atlantic and European Coastal Zones 4 (TOPAZ4) and the Nansen Environmental and Remote Sensing Center (NERSC) neXt generation Sea Ice Model (neXtSIM-F). Table adapted from Tietsche et al. (2020) [7].  There is ample research on the quality of numerical forecasts for large-scale sea ice presence at the seasonal time scale. This research is looking for potentially useful correspondence between the forecasts and observations. A consistent finding across this research is that currently operational physics-based forecast models provide skilful seasonal forecasts of large-scale sea ice presence for some regions and seasons [8][9][10][11][12]. For instance, Dirkson, Denis and Merryfield (2019) report that the C3S multi-model seasonal forecast for Arctic sea ice cover during the annual minimum in September is substantially better than simple statistical forecasts derived from past observations [13]. Forecast skill increases as lead time decreases, and a forecast issued by the C3S multi-model system at the beginning of August should be expected to be able to predict September SIC better than a statistical reference forecast throughout the Arctic Ocean domain. However, these studies often focus on targets such as the pan-Arctic sea ice volume or extent, which is of little relevance for most operational users. The predictability is often much lower for more user-relevant targets such as the seasonal time of the opening of the Northern Sea Route [14].
For sub-seasonal time scales (weeks ahead), the forecast skill for current numerical prediction systems extends to smaller scales of 10-100 km [15], although skill is highly dependent on the region and season. Moreover, verification of sea ice forecasts at these higher spatial resolutions and with higher accuracy requirements is problematic because the passive microwave observations of SIC that are widely used in the climate and forecast research community are not appropriate for such purposes.
On weather time scales (days ahead), which are the focus of this research, there are less available studies on the performance and comparison of existing operational sea ice forecast models. Nevertheless, there are some efforts to routinely monitor forecast performance of some of the CMS forecasts such as in Melsom, Palerme and Müller (2019) [16], which demonstrates explicitly and objectively how ARC-MFC forecast skill decreases for increased spatial resolutions. Some verification has also been performed for other operational ice forecasting products, including the Canadian Global Ice Ocean Prediction System (GIOPS) [17], the Regional Ice Prediction System (RIPS) from Environment Canada [18] and the US navy ice forecast system [19].
Based on the current research, the main limitations of existing systems concern the spatial resolution, data assimilation schemes, sea ice observations used for forecast initialisation and biases present in coupled atmosphere-ocean-ice systems. Probabilistic information about decametre-scale features could also provide useful information to improve such forecasts. Additionally, widespread provision of ensembles and retrospective forecasts would allow for improved forecast calibration and assessment.
The objective of this paper is to present, for the first time, the performance of the main existing SIT short-range forecasts currently on CMS or elsewhere in the public domain. These are seven-day forecasts from the UK Met Office, MET Norway, the Nansen Environmental and Remote Sensing Center (NERSC) and the European Centre for Medium-Range Weather Forecasts (ECMWF). Forecast performance is evaluated against unique in-situ data from five moorings deployed between 2016 and 2019 by the Barents Sea Metocean and Ice Network (BASMIN) and Barents Sea Exploration Collaboration (BaSEC) Joint Industry Projects. The observed SIT can vary considerably during a seven-day forecast, with the primary source of variability being ice advection. Significant thermodynamic growth of ice can also occur on these timescales, particularly in leads and areas of thin ice where growth can be of the order of 10 cm per day [20]. Assessing the viability of short-term SIT models for navigational purposes is therefore important in order to understand the extent to which this variability can be forecasted.

Observational Data
SIT observations were obtained from five moorings deployed between 2016 and 2019 by the BASMIN Joint Industry Project. The locations of these moorings are shown in Figure 1. Each mooring is equipped with an ASL Ice Profiler, an upward facing sonar that can sample sea ice drafts at a single location once per second. One-minute and ten-minute averages of sea ice draft were calculated from the raw observations in 2016 and 2017-2019, respectively, with an estimated accuracy of measurements being +/−0.05-0.10 m.
In this study, the sea ice draft was averaged to generate daily values and converted from sea ice draft measurements to SIT by dividing observations by 0.89, following the method of Rothrock, Zhang and Yu [21]. This conversion factor is applied because approximately 89% of sea ice thickness is below the surface and is measured by the ASL Ice Profiler, but there are uncertainties that surround this value. In particular, snow depth uncertainty is a large source of error in satellite and in situ measurements, both in the retrievals of freeboard and the subsequent conversion to SIT [22,23]. Due to the linear relationship between SIT and snow depth, an underestimation of snow depth can result in an underestimation of SIT. Additional uncertainty is also introduced through a lack of knowledge of snow and sea ice densities.
Mooring data were averaged to daily values so that a direct comparison could be made with forecast data that is available at hourly to daily resolutions depending on the model. This allows for a statistical comparison to be made, but the process of averaging masks the underlying spatial and temporal heterogeneity of SIT. If the probability distribution of observed SIT is skewed such that the mean and mode values differ, then the mean SIT could be unrepresentative of the typical ice thickness on a given day. This limitation to the method is exemplified in Appendix A by Figure A1a-c, which displays the distribution of sea ice draft for a day in April 2018 at three of the mooring sites. masks the underlying spatial and temporal heterogeneity of SIT. If the pro bution of observed SIT is skewed such that the mean and mode values d mean SIT could be unrepresentative of the typical ice thickness on a given d tation to the method is exemplified in Appendix A by Figure A1a-c, whic distribution of sea ice draft for a day in April 2018 at three of the mooring s

Model Data
The SIT forecasting systems evaluated in this study are the UK Met O atmosphere-land-ocean-ice Data Assimilation (CPLDA) system, the ECM hindcast system, the MET Norway Towards an Operational Prediction S North Atlantic and European Coastal Zones 4 (TOPAZ4) and the NERSC ne Sea Ice Model (neXtSIM-F). Table 2 presents information on the forcing, reso ilation, thermodynamics and dynamic schemes for each product. Table 2 values for h0, the model parameter that determines the thickness of newly f in a grid cell that previously was ice free.
Data from the different models were archived on different days, and un stated, comparisons were only made where data were available for all mode days. This means that direct comparisons between all models across every le always possible. Model data are archived more frequently for analysis field fields, and this has been used to produce multi-model comparisons of an mance in this study.
In order to analyse the forecasts at lead times up to seven days, a pers has been used as a performance baseline for comparison. This persistence duced by using SIT values at a mooring site as the predictor for SIT in the same site. Versions of the dynamical models evaluated are available on CM in the public domain, as described in the data availability statement, althoug versions may differ and generally only analysis fields are archived on CMS

Model Data
The SIT forecasting systems evaluated in this study are the UK Met Office Coupled atmosphere-land-ocean-ice Data Assimilation (CPLDA) system, the ECMWF ensemble hindcast system, the MET Norway Towards an Operational Prediction System for the North Atlantic and European Coastal Zones 4 (TOPAZ4) and the NERSC neXt generation Sea Ice Model (neXtSIM-F). Table 2 presents information on the forcing, resolution, assimilation, thermodynamics and dynamic schemes for each product. Table 2 also contains values for h 0 , the model parameter that determines the thickness of newly formed sea ice in a grid cell that previously was ice free.
Data from the different models were archived on different days, and unless otherwise stated, comparisons were only made where data were available for all models on the same days. This means that direct comparisons between all models across every lead time is not always possible. Model data are archived more frequently for analysis fields than forecast fields, and this has been used to produce multi-model comparisons of analysis performance in this study.
In order to analyse the forecasts at lead times up to seven days, a persistence model has been used as a performance baseline for comparison. This persistence model is produced by using SIT values at a mooring site as the predictor for SIT in the future at that same site. Versions of the dynamical models evaluated are available on CMS or otherwise in the public domain, as described in the data availability statement, although exact model versions may differ and generally only analysis fields are archived on CMS. Thermodynamics Zero-layer model [28] Three-layer [28] Three-layer [28] Three category model [29] Time period of data used neXtSIM-F is a fully Lagrangian sea ice model being developed at NERSC [30]. The model runs on an adaptive triangular mesh with an average side length of 10 km (with the distance from each point to the opposite side being about 7.5 km) and is interpolated to a regular 3 km grid for distribution [27]. The overall scheme of the model, including information on forcing and data assimilation, is depicted in Figure 2. SIC data are assimilated as a weighted average of the Advanced Microwave Scanning Radiometer 2 (AMSR2) and Special Sensor Microwave Imager/Sounder (SSMIS) products as described in Williams et al. (2021) [27]. The model was initialised in October 2018 using SIT from the merged CryoSat-2 and Soil Moisture and Ocean Salinity (CS2SMOS) product from AWI [31]. The neXtSIM-F model scheme [27]. Both SSMIS and AMSR2 concentration data from Ocean and Sea Ice Satellite Application Facility (OSI-SAF) are assimilated.

TOPAZ4
TOPAZ4 is a coupled ocean-sea ice data assimilation product which uses the Hybrid Coordinate Ocean Model (HYCOM) coupled to an early version of the Los Alamos Sea Ice model (CICE) with simple thermodynamics [28,32]. An ensemble Kalman Filter data as- Figure 2. The neXtSIM-F model scheme [27]. Both SSMIS and AMSR2 concentration data from Ocean and Sea Ice Satellite Application Facility (OSI-SAF) are assimilated.

TOPAZ4
TOPAZ4 is a coupled ocean-sea ice data assimilation product which uses the Hybrid Coordinate Ocean Model (HYCOM) coupled to an early version of the Los Alamos Sea Ice model (CICE) with simple thermodynamics [28,32]. An ensemble Kalman Filter data assimilation scheme assimilates SIC and ocean observations [33]. Operationally, the model is run with ten ensemble members and produces ten days of forecast interpolated to a 12.5 km polar stereographic grid. Only the ensemble average is distributed through CMS, which causes some smoothing of model fields. TOPAZ4 assimilates the Ocean and Sea Ice Satellite Application Facility (OSI-SAF) SSMIS SIC hosted on CMS, with no additional corrections. In 2017, the assimilation of thin SIT from the Soil Moisture and Ocean Salinity (SMOS) product was introduced in the winter, and since November 2019, the merged CS2SMOS ice thickness product has replaced the SMOS product. Reconciliation between SIT and SIC measurements is part of the data assimilation procedure, and SIC and SIT observations and their model counterparts enter the same cost function using their respective uncertainty measurements. Where there is inconsistency between SIT and SIC, SIC measurements prevail because of their lower uncertainties. The system uses a seven-day assimilation cycle and assimilates the gridded Sea Surface Temperature (SST), SIC and ice drift fields for the day of the analysis; and along-track Sea Level Anomaly (SLA) and in-situ temperature and salinity for the week prior to the day of the analysis.

CPLDA
The Met Office CPLDA system [34] is run operationally at the UK Met Office with coupled data assimilation. The model assimilates SIC from OSI-SAF, which is based on SSMIS SIC measurements. The sea ice component of the system uses CICE v4.1 on the ORCA025 tripolar horizontal grid, which has a resolution of approximately 13 km in the Barents Sea. Forecasts and analyses from CPLDA have been provided to CMS since 2017 on a regular 0.25 • grid.

ECMWF
Archived ECMWF ensemble perturbed hindcasts of SIT (model Cycle 46R1) were used on the ORCA025 tripolar grid, which has a spatial resolution of approximately 13 km in the Barents Sea. These hindcasts have a control member and ten perturbed members, and they extend to a lead time of 46 days. For the purposes of this study, the ensemble mean of the ten perturbed members has been used up to seven days. The model assimilates SIC from the OSTIA product, which applies some corrections to the OSI-SAF product, and is based on SSMIS SIC measurements.

Performance Metrics
In order to compare the performance of different models with observational data, daily averages of the mooring data and of the model data were interpolated to each mooring location and used to calculate correlation coefficients, mean residuals and mean accuracy. This was repeated at different lead times of the models, up to seven days.
Modified Taylor diagrams were produced to display the correlation coefficients and mean residuals between the modelled and observed SIT. These are similar to those used by Johnson et al. (2012) [35] to compare modelled SIT against a variety of data sources including moored upward looking sonars. Positive correlation coefficients in these diagrams are represented by the radial difference from the origin. Negative coefficients cannot be displayed on this diagram; however, in this study, no forecast was negatively correlated with observations. The mean residuals are scaled to an angle of (-180, 180) from the positive x-axis such that the perfect forecast lies where the outer circle meets the positive x-axis ( Figure 3). Pearson's correlation was used as the coefficient to represent correlation values in these plots, although similar results are found if this is repeated with Kendall's Tau. Forecast accuracy was calculated for all models as a percentage of days where the model SIT was within +/−0.2 m of the observed SIT, following the definition of a hit in Figure 4. related with observations. The mean residuals are scaled to an angle of (-180, 180) from the positive x-axis such that the perfect forecast lies where the outer circle meets the positive x-axis ( Figure 3). Pearson's correlation was used as the coefficient to represent correlation values in these plots, although similar results are found if this is repeated with Kendall's Tau. Forecast accuracy was calculated for all models as a percentage of days where the model SIT was within +/−0.2 m of the observed SIT, following the definition of a hit in Figure 4.

Case Study at Mooring Location M3
The model systems appear to perform poorly in 2019 at the mooring location M3, with all four models having a large number of days where the analysis field modelled SIT is high, but the observed values are zero. This overestimation of SIT is evident in the scatter plots of daily modelled and observed SIT ( Figure 5) and can be observed for all four models. By plotting different winters in separate colours, it is apparent that the models do not generally overestimate the SIT in this manner.

Case Study at Mooring Location M3
The model systems appear to perform poorly in 2019 at the mooring location M3, with all four models having a large number of days where the analysis field modelled SIT is high, but the observed values are zero. This overestimation of SIT is evident in the scatter plots of daily modelled and observed SIT ( Figure 5) and can be observed for all four models. By plotting different winters in separate colours, it is apparent that the models do not generally overestimate the SIT in this manner.   Comparing the seven-day running mean of modelled and observed SIT in 2019 suggests that poor performance could be related to an unusual temporal pattern in SIT at the mooring M3 during this period. The observed SIT increases rapidly in February and early March 2019 and then abruptly declines in late March, whilst the modelled SIT develops well but fails to decrease (Figure 6c). SIT over the same period in previous years displays a more typical evolution and is better represented by the models (Figure 6a,b). Repeating these plots at mooring locations M1 and M5 revealed that this event was localised in nature, occurred to a much lesser extent at M5 ( Figure A2) and was not observed at M1 ( Figure A3). SIT charts from April 2019 suggest that there was substantial ice in the region surrounding M3, but the presence of an area categorised as 'Close Drift Ice' rather than 'Very Close Drift Ice' on some days indicates that there may have been a localised break-up of sea ice over the mooring location (Figure 7). The relevance of this type of event is highlighted in the Section 4. Observations from M3 have been excluded from most of the remainder of the analysis in this study because the results at this location are dominated by model overestimation presented in this section.

Analysis of Correlation and Residuals
Correlation coefficients and mean residuals can be used to compare the performance of the model analysis and forecast fields interpolated to the mooring locations M1, M2, M4 and M5. These are displayed in modified Taylor Diagrams, which are described in Figure 3.

Model Analysis Fields
Forecast data from neXtSIM-F were only available from November 2018 onwards; thus, the residuals and correlations of the analysis fields have been plotted for all four model systems for the period November 2018-July 2019. At mooring locations M1 and M2, all models have correlation coefficients between 0.6 and 0.9 (Figure 8a), but the TO-PAZ4 model has a negative bias, and the ECMWF model has a positive bias. Results are similar at moorings M4 and M5; however, at these locations, CPLDA and neXtSIM-F forecasts overestimate SIT (Figure 8b).
This comparison is repeated for CPLDA, TOPAZ4 and ECMWF models for a longer period spanning December 2016-July 2019. Compared with the shorter period, results show similar performances at M1 and M2 but show a smaller overestimation of SIT by

Analysis of Correlation and Residuals
Correlation coefficients and mean residuals can be used to compare the performance of the model analysis and forecast fields interpolated to the mooring locations M1, M2, M4 and M5. These are displayed in modified Taylor Diagrams, which are described in Figure 3.

Model Analysis Fields
Forecast data from neXtSIM-F were only available from November 2018 onwards; thus, the residuals and correlations of the analysis fields have been plotted for all four model systems for the period November 2018-July 2019. At mooring locations M1 and M2, all models have correlation coefficients between 0.6 and 0.9 (Figure 8a), but the TOPAZ4 model has a negative bias, and the ECMWF model has a positive bias. Results are similar at moorings M4 and M5; however, at these locations, CPLDA and neXtSIM-F forecasts overestimate SIT (Figure 8b). This comparison is repeated for CPLDA, TOPAZ4 and ECMWF models for a longer period spanning December 2016-July 2019. Compared with the shorter period, results show similar performances at M1 and M2 but show a smaller overestimation of SIT by CPLDA at moorings M4 and M5 (Figure 9). CPLDA, TOPAZ4 and ECMWF model systems all had an average bias of less than +/−0.1 m at the mooring locations during this longer period of time.

Model Forecast Fields
Model data were more frequently archived for analysis fields than for forecast fields; thus, a full multi-model comparison of performance across all lead times was not possible. In order to evaluate the performance of the model forecasts at different lead times, correlations and residuals have been plotted for each lead day and compared to a persistence model.
The CPLDA system has a correlation coefficient that drops off gradually with lead time at the different mooring sites (Figure 10a,b). The bias varies depending on the mooring site and appears to increase with lead time, which could indicate that ice is growing too quickly in winter and spring months or is melting too slowly in the melt season. For the ECMWF model system, the relationship between correlation and lead time is less clear, and bias does not vary greatly with lead time (Figure 10c,d). The results for the TOPAZ4 and neXtSIM-F models also reflect this, and there is generally only a weak relationship between model performance and lead time in the correlation and residual plots (Figure

Model Forecast Fields
Model data were more frequently archived for analysis fields than for forecast fields; thus, a full multi-model comparison of performance across all lead times was not possible. In order to evaluate the performance of the model forecasts at different lead times, correlations and residuals have been plotted for each lead day and compared to a persistence model.
The CPLDA system has a correlation coefficient that drops off gradually with lead time at the different mooring sites (Figure 10a,b). The bias varies depending on the mooring site and appears to increase with lead time, which could indicate that ice is growing too quickly in winter and spring months or is melting too slowly in the melt season. For the ECMWF model system, the relationship between correlation and lead time is less clear, and bias does not vary greatly with lead time (Figure 10c,d). The results for the TOPAZ4 and neXtSIM-F models also reflect this, and there is generally only a weak relationship between model performance and lead time in the correlation and residual plots ( Figure 11).
As one would expect, the persistence model has a very low bias, and its correlation coefficients decrease with increasing lead time. The correlation coefficients for the persistence model generally decay faster than that of the dynamical models, and at mooring location M1, all four models have a better correlation with observations at a lead time of seven days than the persistence model.  The ratio of the day seven to day zero forecast performance can provide information on how model performance varies with lead time and is provided as a ratio of correlations and residuals for all models in Figure 12. For the CPLDA system, the correlation coefficients at a lead time of seven days are all less than at day zero, but this ratio is a lot closer to 1 for other model systems. This indicates that model correlation coefficients at day seven are largely dependent on analysis field correlations.

Forecast Accuracy
Accuracy, defined as the percentage of days in which the modelled SIT is within +/−0.2 m of observed SIT, can be plotted as a function of lead time for the models at the different mooring sites. Figure 13 shows the accuracy of CPLDA, TOPAZ4 and ECMWF model systems averaged over mooring sites M1, M2, M4 and M5 between 2017 and 2019. Note that in order to produce this comparison across different lead times, data from different days have been used to produce accuracy values for each model. The period of data of January 2017-July 2019 has been selected because the three models compared all have data at a constant frequency and with no gaps during this time.
Generally, the CPLDA model displays a slight decrease in accuracy from 85% to 80% as the lead time is increased to seven days. The TOPAZ4 and ECWMF models have accuracies of 77-83% and 74-78%, respectively, and for both systems, the accuracy does not have an obvious dependency on lead time.

Forecast Accuracy
Accuracy, defined as the percentage of days in which the modelled SIT is within +/−0.2 m of observed SIT, can be plotted as a function of lead time for the models at the different mooring sites. Figure 13 shows the accuracy of CPLDA, TOPAZ4 and ECMWF model systems averaged over mooring sites M1, M2, M4 and M5 between 2017 and 2019. Note that in order to produce this comparison across different lead times, data from different days have been used to produce accuracy values for each model. The period of data of January 2017-July 2019 has been selected because the three models compared all have data at a constant frequency and with no gaps during this time.
Generally, the CPLDA model displays a slight decrease in accuracy from 85% to 80% as the lead time is increased to seven days. The TOPAZ4 and ECWMF models have accuracies of 77-83% and 74-78%, respectively, and for both systems, the accuracy does not have an obvious dependency on lead time.

Analysis Field Variability
Scatter plots can be used to further explore the extent to which models accurately represent SIT compared to observational data. Splitting the data by year can also inform us about how much the annual variability of SIT impacts the models' predictive capability. Data have been shown at mooring locations M1 and M5 because M1 is representative of the M1/M2 region, and M5 is representative of the M4/M5 region. Figure 14a,b show that CPLDA slightly underestimates SIT at mooring M1 and slightly overestimates at mooring M5, with SIT being overestimated the most in 2019. On average, ECMWF hindcasts have very little bias in the scatter plots, but the distribution of points highlights some overestimation on days where the observed SIT was low ( Figure  14c,d).
TOPAZ4 generally underestimates SIT at both mooring sites (Figure 15a,b) and nextSIM-F performs well at M1 but overestimates SIT at M5 (Figure 15c,d). The model seems to perform well on average at M1 but overestimates SIT at the M5 mooring. For the neXtSIM-F system there is a large variability in the modelled values compared to the observational ones, which could be a result of the model resolving high resolution structures in the sea ice; these structures can be observed in the example model field (Figure 16a) when compared to examples of fields from TOPAZ4 and CPLDA (Figure 16b,c).
The value of h0 has also been plotted for each model, where h0 is the model parameter that provides the initial thickness of ice that has formed in a previously ice-free grid cell. Previous work has found that there is a relationship between the value of h0 and ice thickness in the LIM2 model, with annual mean model SIT increasing with the value of h0 [37].
The ECMWF product appears to display an overestimation of SIT for thin ice between 0 and 0.5 m, with many of the modelled values for SIT being near the h0 value of 0.6 m. It could be the case that, during ice formation, new grid cell values of 0.6 m contribute to the

Analysis Field Variability
Scatter plots can be used to further explore the extent to which models accurately represent SIT compared to observational data. Splitting the data by year can also inform us about how much the annual variability of SIT impacts the models' predictive capability. Data have been shown at mooring locations M1 and M5 because M1 is representative of the M1/M2 region, and M5 is representative of the M4/M5 region. Figure 14a,b show that CPLDA slightly underestimates SIT at mooring M1 and slightly overestimates at mooring M5, with SIT being overestimated the most in 2019. On average, ECMWF hindcasts have very little bias in the scatter plots, but the distribution of points highlights some overestimation on days where the observed SIT was low (Figure 14c,d).
TOPAZ4 generally underestimates SIT at both mooring sites (Figure 15a,b) and nextSIM-F performs well at M1 but overestimates SIT at M5 (Figure 15c,d). The model seems to perform well on average at M1 but overestimates SIT at the M5 mooring. For the neXtSIM-F system there is a large variability in the modelled values compared to the observational ones, which could be a result of the model resolving high resolution structures in the sea ice; these structures can be observed in the example model field (Figure 16a) when compared to examples of fields from TOPAZ4 and CPLDA (Figure 16b,c).
The value of h 0 has also been plotted for each model, where h 0 is the model parameter that provides the initial thickness of ice that has formed in a previously ice-free grid cell. Previous work has found that there is a relationship between the value of h 0 and ice thickness in the LIM2 model, with annual mean model SIT increasing with the value of h 0 [37]. The ECMWF product appears to display an overestimation of SIT for thin ice between 0 and 0.5 m, with many of the modelled values for SIT being near the h 0 value of 0.6 m. It could be the case that, during ice formation, new grid cell values of 0.6 m contribute to the overestimation that we have observed. In a similar manner, the underestimation seen for the TOPAZ4 product could be in part due to the lower model h 0 parameter value of 0.1 m. overestimation that we have observed. In a similar manner, the underestimation seen for the TOPAZ4 product could be in part due to the lower model h0 parameter value of 0.1 m.

Discussion
In this study, we have compared four publicly available SIT forecasts against each other and against a persistence model at five mooring sites in the Barents Sea.
In Section 3.1, we present a case study at mooring location M3 which describes an event in spring 2019 in which observed SIT at M3 decreased rapidly, and the four model systems failed to replicate this, resulting in poor performances in both analysis and forecast fields. SIT models display thickness values as cell-averaged means and are unable to explicitly resolve subgrid-scale processes such as leads that form in the ice. This presents a challenge for usability in shipping because, from a navigational perspective, the subkilometre SIT is important for safety and route optimisation. Validation of SIT models against mooring data particularly highlights this challenge, as localised breakup of sea ice over a mooring site can cause models to perform poorly when compared to cell-averaged means. The evidence presented in our case study suggests that, at their current resolution, these forecast models cannot provide reliable information for tactical ice management during operations where maritime safety is of importance. Developing forecast systems

Discussion
In this study, we have compared four publicly available SIT forecasts against each other and against a persistence model at five mooring sites in the Barents Sea.
In Section 3.1, we present a case study at mooring location M3 which describes an event in spring 2019 in which observed SIT at M3 decreased rapidly, and the four model systems failed to replicate this, resulting in poor performances in both analysis and forecast fields. SIT models display thickness values as cell-averaged means and are unable to explicitly resolve subgrid-scale processes such as leads that form in the ice. This presents a challenge for usability in shipping because, from a navigational perspective, the subkilometre SIT is important for safety and route optimisation. Validation of SIT models against mooring data particularly highlights this challenge, as localised breakup of sea ice over a mooring site can cause models to perform poorly when compared to cell-averaged means. The evidence presented in our case study suggests that, at their current resolution, these forecast models cannot provide reliable information for tactical ice management during operations where maritime safety is of importance. Developing forecast systems with improved resolution will enhance the ability to model localised ice conditions, but it is also important that high-resolution data are assimilated into the forecasts so that modelled SIT does not drift from observed conditions. Assimilating SIT could help prevent large discrepancies between modelled and observed SIT in cases where ice is present (e.g., Figure A2c), but in situations in which the modelled ice edge is inaccurate, assimilation of higher resolution SIC data would provide more substantial improvements for forecasting because of the better sensitivity of SIC measurements. By analysing the model performance more generally, we found that the models correctly forecasted SIT out to seven days for approximately 75-85% of days. We found that correlation coefficients between models and observations at different lead times vary in the range of 0.5-0.9 depending on the mooring site and model. Comparing the model performances to that of a persistence model provided us with a benchmark performance to compare the models against. Model performance decreased only gradually with lead time, which resulted in the seven-day forecasts of the model systems performing as well as or better than that of the persistence model. This indicates that these dynamical model systems can be useful for predicting mean sea ice conditions in regions where ice is variable on daily to weekly timescales. Such information could be useful for short-term planning purposes to monitor how the large-scale ice thickness in a region is predicted to change, but in practice the low resolution of the models remains a limiting factor in usability.
A previous evaluation of the CPLDA product found that its sea ice has a tendency to melt too much in summer months [34], which could result in a negative bias at larger lead times in the summer. We did not observe such a negative bias in our results, and in fact found that bias generally increased slightly with lead time; this difference could have arisen because performance metrics in this study were calculated as an average across different seasons.
The TOPAZ4 product appears to underestimate SIT, which is consistent with the results of an evaluation of the corresponding reanalysis which found that there is a SIT bias of approximately −1.1 m when compared to data from ICESat and IceBridge, as well as sea ice draft data [38,39]. In this study, we have found that bias appears to be of the order of −0.1 m, and the TOPAZ4 system correctly predicts SIT to within +/−0.2 m on 75-85% of days analysed, across lead times of up to seven days. The neXtSIM-F product, which uses TOPAZ4 ocean forcing but with a more advanced and higher resolution representation of sea ice, appears to no longer underestimate SIT, although it subsequently overestimated SIT at mooring locations M4 and M5 in 2019.
We have considered the possible influence of h 0 on performance, where h 0 is the model parameter that defines the thickness of newly formed sea ice. It has been demonstrated that the tuning of h 0 can have a significant impact on modelled SIT [37], and in this study, we have seen evidence that the choice of h 0 could be causing a bias in SIT forecasts at the mooring sites.
In order to understand the relationship between performance and model lead time, we plotted the bias for each model system at different lead times up to and including the seven-day forecasts. These plots indicate that, rather than being a function of lead time, the bias of the forecasts is largely dependent on the initial SIT bias of the analysis field, which could be caused by errors in the SIT or SIC. On these same plots, the correlation coefficients between modelled and observed SIT seem to decrease with lead time, but the magnitude of decrease appears to be small compared to the variation in analysis field correlation coefficients across mooring sites. This highlights the importance of having an accurate analysis field, which can be achieved by assimilating SIT data into the models. Regular assimilation of SIT into forecasts could therefore increase the performance of the models and should be considered an important step in the development of SIT forecasting products for arctic navigation.
It is important to note that because this study only covers up to four years of observational data, individual events can have a large impact on the overall performance metrics for a model system. In particular, it is hard to draw conclusions about the performance of the neXtSIM-F product because only one year of model data has been used for this study. We suggest that similar observational comparisons are made using longer periods of data and using the newly released version of neXtSIM-F.
To extend this study, we would recommend that future studies make use of the ensemble nature of the ECMWF SIT product. Ensemble members provide numerous realisations of ice conditions, the spread of which can be used as a measure of the forecast uncertainty, as currently applied in seasonal ice forecasting [12]. Whilst this does not improve the overall performance of the model, providing a measure of uncertainty alongside shortterm forecasts could be useful for navigational planning purposes, although further work is necessary to explore the extent to which observations fall within the ensemble range of uncertainty. We would also recommend evaluating the performance of the ECMWF product on longer timescales because forecast data are available for lead times up to 46 days, although we note that our study has found that model error does not seem to increase appreciably within lead times up to seven days. Future work could also explore the possibility of evaluating the performance of multi-model ensemble forecasts. Batté et al. (2020) [12] have provided evidence that multi-model ensembles can be effective in predicting sea ice edge when compared to forecasts from single model systems, where sea ice edge was defined as the edge containing cells with 15% or more SIC.
All models evaluated in this study assimilate the SSMIS product, which due to the nature of passive microwave data, has uncertainties which are unsuitable for many maritime activities. Further studies are also therefore required to investigate the benefits of improving SIC assimilation into forecasting products to improve their usability for navigational purposes. Funding: This research was funded by the SEDNA project which has received funding from the European Union's H2020 research and innovation programme under grant agreement No 723526.