Evaluation of Structured and Unstructured Models for Application in Operational Ocean Forecasting in Nearshore Waters

: The oceanography sub-initiative of water levels and currents, temperature and salinity, and the trajectories of surface drifters, but the computational cost of FVCOM was significantly less than that of NEMO.


Factors Considered for Evaluation
The evaluation process includes the selection of the study area, the requirements for the model setup, and the metrics for evaluating the models. The evaluation was designed to objectively (quantitatively) assess the accuracy and efficiency of NEMO and FVCOM for operational forecasting at port scales for parameters of interest to the OPP applications. Here, the term accuracy pertains to the models' ability to reproduce the observations, and efficiency refers to the computer resources that are required to run the models in an operational context. The objectives of OPP are to improve electronic navigation and to predict oil spill drift, thus the parameters of interest for the evaluation were water level, currents, water temperature and salinity (density), and surface drifter trajectories. As with any experimental design, it is important to consider the ability of the evaluation to detect contrasts in the models and to force the strongest contrast possible. Thus, the evaluation was designed to challenge the models with respect to accuracy and efficiency. Consequently, the evaluation helped gain a better understanding of the strengths and weaknesses of the models, and although the evaluation focused on one port, the results can be reasonably expected to extend to other ports.
Of the six OPP pilot ports, one was selected for the evaluation process. The selection of the study area was based on (1) the regional oceanography being sufficiently complex to include key dynamic processes; (2) the availability of forcing data (from atmosphere, rivers, and open ocean) to drive the port models; and (3) the availability of sufficient observational data to assess the predictive accuracy of the models with respect to the parameters of interest. The parameters of interest at Canadian ports are typically influenced by complicated coastlines, bathymetry, tides, river runoff, the open ocean, and, in some cases, the presence of sea-ice. Due to the urgent timeline of the OPP (i.e., the evaluation was to be completed within the first year of OPP), the explicit inclusion of sea-ice and the simulation of the intertidal zone (with a wetting/drying scheme) were not required, but to ensure that FVCOM was used to its full potential, the use of the wetting/drying scheme was encouraged. At the time of the evaluation, a wetting/drying scheme was not available in NEMO. Forecasting the surface waves at port scales was not considered for the current phase of the OPP.
Various aspects of the model setup were built into the process of evaluation, including spatial resolution, the inclusion of dynamic processes, forcing fields, the duration of the simulation, computational cost, and the variables, frequencies and format of the model output. The required horizontal grid spacing of the models was 100 m or less to resolve the horizontal gradients of currents and the presence of eddies due to nonlinear processes. The required vertical grid spacing was 1 m or less near the surface, to resolve the currents in the upper layer that are important for navigation and oil spill drift. The models included the full baroclinic dynamics to simulate the variations due to surface momentum and buoyancy fluxes, river runoff, and open ocean forcing into the port. The models were subjected to the same forcing, which included atmospheric forcing from the operational weather forecasting system, large-scale oceanic forcing from the operational regional ice-ocean forecasting system, and available tidal and river runoff. At the current stage of the OPP, the port models did not include any data assimilation, partly due to the lack of sufficient real time observational data. A common time frame for the simulation was determined by the available observations. The duration of the simulation was 15 months to allow for a spin-up of the models and the evaluation of a full annual cycle. For the proper evaluation of model efficiency, both models were run on the same high-performance computer facility of the Government of Canada. For operational applications, the required run-time of the models was 0.5 h (or less) for a 48 h simulation. The frequency of the model output was minimally 0.5 h for the proper evaluation of tidal variations.
The metrics for the model data comparison were defined based on the existing expertise of the team, and through expert consulting and literature research. The metrics were defined for the tidal and non-tidal components of sea level and currents, vertical profiles of water temperature and salinity, time series of sea surface temperature (SST) at fixed stations, and the trajectories of surface drift. The quantitative metrics are statistically robust to measure the discrepancy between the model solution and observational data. Prior to performing the statistical comparison, the model results were extracted from the grid node nearest to the locations of the observations. Time series analysis also included a comparison of the energy spectrum. Because the domains of both models cover the area beyond the port, the evaluation was carried out for the "inner harbour" (port) and "outer harbour" separately, with an emphasis put on the inner harbour. In addition to the quantitative evaluation, the models were evaluated qualitatively based on known features of the regional oceanography, e.g., the presence of the river plume, tidal fronts, eddies, etc.
Finally, when possible, the models were compared to existing operational products that are used in Canada. This includes the Scotia-Fundy-Maine WebTide (hereafter referred to as WebTide) solutions for tidal elevation and currents [26], and the regional ice-ocean prediction system (RIOPS) that covers the North Atlantic, Arctic and North Pacific with 1/12-degree resolution [8]. Note that neither WebTide or RIOPS were developed for near-shore operational applications and are not high-resolution port solutions, but were the existing operational products for the area at the time of this evaluation.

Results
This section describes how the evaluation process, outlined in Section 2, was applied to the NEMO and FVCOM models for the Port of Saint John. Samples of the evaluation results are based on the model solutions near the end of the evaluation process (December 2017). The evaluation identified some deficiencies or errors in both models, and the subsequent work beyond this evaluation process led to improved model solutions, e.g., documented in Paquin et al. for NEMO [25].

Regional Oceanography
The Port of Saint John was selected as the study site, in part because of its complicated regional oceanography. The port is in Saint John Harbour (hereafter SJH) located in the Bay of Fundy (Figure 1). The area presents tides among the largest in the world, with a maximum range reaching 16 m in the upper Bay of Fundy [27]. The tidal range in SJH can exceed 8 m at spring tide [28,29]. SJH receives significant freshwater influx from the Saint John River (hereafter SJR) which ranges from about 5 of 26 500 m 3 s −1 during summer low-water conditions [29] to a maximum of about 10,000 m 3 s −1 during the spring freshet [30]. The combination of tidal flow and river runoff generates sharp density fronts which propagate through the vicinity of the port. Due to the large tidal amplitudes in the SJH and throughout the Bay of Fundy, there are significant inter-tidal zones (tidal flats) in the area. The local geography also leads to a complex flow system in the estuary. As SJR discharges into the harbour, water flows over a shallow sill into the Reversing Falls (see inset in Figure 1), a gorge in which rapids usually reverse in direction. Above the sill, the river expands into a river system that receives fresh water input from a large watershed that includes many lakes, hereafter called the river estuary. The Reversing Falls sill, between the harbour and the river estuary, constrains the upstream propagation of tides into the river estuary [28]. The flow conditions in the Reversing Falls change dramatically over the course of a tidal cycle, with the flow alternating between upstream and downstream. Typically, during the flood tide, the sea level in the harbour rises above the water level in the river estuary and there is a strong flow of saline water into the river system [31]. During the ebb tide, the sea level in the harbour drops to a lower level than in the river estuary and a mix of river and sea water flows back out into the harbour [31]. During freshet conditions, if river levels are sufficiently high, the river discharge dominates the flow through the Reversing Falls, such that there is no reversal of the flow. In the river estuary, a two layer system of fresher surface water and brackish bottom water, with a salinity of up to 20 ppt [28], is established in Long Reach (shown in Figure 1) and is present for most of the year, except during the spring and fall freshets when it may be flushed away [28,32]). In winter, the ice is present on the river estuary, but only thin ice appears near the shore in the harbour [28]. geography also leads to a complex flow system in the estuary. As SJR discharges into the harbour, water flows over a shallow sill into the Reversing Falls (see inset in Figure 1), a gorge in which rapids usually reverse in direction. Above the sill, the river expands into a river system that receives fresh water input from a large watershed that includes many lakes, hereafter called the river estuary. The Reversing Falls sill, between the harbour and the river estuary, constrains the upstream propagation of tides into the river estuary [28]. The flow conditions in the Reversing Falls change dramatically over the course of a tidal cycle, with the flow alternating between upstream and downstream. Typically, during the flood tide, the sea level in the harbour rises above the water level in the river estuary and there is a strong flow of saline water into the river system [31]. During the ebb tide, the sea level in the harbour drops to a lower level than in the river estuary and a mix of river and sea water flows back out into the harbour [31]. During freshet conditions, if river levels are sufficiently high, the river discharge dominates the flow through the Reversing Falls, such that there is no reversal of the flow. In the river estuary, a two layer system of fresher surface water and brackish bottom water, with a salinity of up to 20 ppt [28], is established in Long Reach (shown in Figure 1) and is present for most of the year, except during the spring and fall freshets when it may be flushed away [28,32]). In winter, the ice is present on the river estuary, but only thin ice appears near the shore in the harbour [28].

Available Ocean Observations
A reason for choosing the Port of Saint John as the study area was the quantity and quality of available ocean observations during a one-year window, from May 1 2015 to April 30 2016, which was subsequently set as the time period for the model simulation and evaluation. The types and locations of the available observations are shown in Figure 1. The map also divides the study area into three sub-areas: the inner harbour, the outer harbour, and the greater Bay of Fundy. Table 1 provides a summary of the observations, including the variables that were measured by each instrument and the number of observations in each of the three sub-areas. Note that extensive effort was put into collecting and carrying out the quality control on the observational data by various DFO projects.

Available Ocean Observations
A reason for choosing the Port of Saint John as the study area was the quantity and quality of available ocean observations during a one-year window, from 1 May 2015 to 30 April 2016, which was subsequently set as the time period for the model simulation and evaluation. The types and locations of the available observations are shown in Figure 1. The map also divides the study area into three sub-areas: the inner harbour, the outer harbour, and the greater Bay of Fundy. Table 1 provides a summary of the observations, including the variables that were measured by each instrument and the number of observations in each of the three sub-areas. Note that extensive effort was put into collecting and carrying out the quality control on the observational data by various DFO projects. All tide gauge and tide station data were obtained from the Canadian Hydrographic Service (CHS). The "constituent only" stations provided the harmonic constants for sets of tidal constituents based on the analysis of past water level time series observations, but did not provide observational time series during the evaluation period. The time series of water levels were available from two real-time tide gauges in the study area: one in the inner SJH and one at Yarmouth. The SJH gauge was critical for evaluating the tidal and non-tidal components of the modelled water level in the inner harbour. The Yarmouth gauge, located near the mouth of the Bay of Fundy, was used to assess the lateral open boundary condition and large-scale performance of the models. DFO provided observations from 11 moored Acoustic Doppler Current Profilers (ADCPs) that were deployed between August 2015 and February 2016: six in the inner harbour and five in the outer harbour. The ADCPs were either moored a few metres off the bottom or within tripod bottom mounts and recorded the velocity in 1 m bins. To ensure confidence in the observations, a minimum of 75% ping return cut-off was used to define the valid observations. The ADCPs were also set to record the bottom pressure. The bottom pressure data were converted into variations of water level. The "inverse barometric" effect, accounting for the water level variations due to surface atmospheric pressure, was computed from the forcing data and added to the water level derived from the bottom pressure measurement.
Conductivity-temperature-depth (CTD) profiles were also provided by the DFO and were used to evaluate the modelled temperature and salinity fields. There were 80 CTD casts collected at 35 stations in and around SJH between June 2015 and April 2016. Although it is difficult to fully evaluate the large-scale models with point measurements, the locations and timings of the CTD casts made it possible to evaluate the seasonal changes in both the water properties and the depth of the mixed layer at representative locations in the three sub-areas. A SmartAtlantic [33] buoy located just outside the SJH provided the only time series measurement of the sea surface temperature. The buoy data, provided by the Fisheries and Marine Institute of the Memorial University of Newfoundland, were essential for evaluating the seasonal cycle of SST. Monthly mean SST and SSS (sea surface salinity) over the Bay of Fundy were evaluated in a qualitative manner to identify any large-scale discrepancies between the two models.
Lastly, the DFO deployed 134 surface drifters near SJH (in the outer harbour) between July 2015 and January 2016. Four different types of surface drifters were used: Seimac Accurate Surface Tracker barrel-shaped drifters (hereafter referred to as Barrel), MetOcean iSphere spherical drifters (hereafter referred to as Sphere), MetOcean CODE/Davis drifters (hereafter referred to as Davis), and Oceanetic Surface Circulation Tracker drifters (hereafter referred to as Sponge). The drifters provided records of drift trajectories over short periods of time (typically only one day). They were used to evaluate the drifter simulations based on the modelled surface current and the influence of surface winds (described in Section 3.7), and more generally, the models' ability to predict drift, which is one of the primary applications of the models under OPP.

Setup of NEMO and FVCOM Models
The models are based on NEMO version 3.6 and FVCOM version 3.2.1. Both models solve the 3D equations controlling the variations of ocean currents, water levels, temperature and salinity. The hydrostatic and Boussinesq approximations are applied. The minimum water temperature was set to freezing temperature. This is a simplified approach to account for sea-ice formation in winter, without turning on the sea-ice modules in either model.
The setup of the model domains considered the large-scale ocean forecasting system for providing the lateral open boundary condition (OBC). At the time as the evaluation, the available large-scale model was RIOPS which has a nominal horizontal resolution of 1/12-degree longitude/latitude, about 7.5 km in the study area. The finer coastal ocean prediction system for the east coast of Canada, at the nominal resolution of 1/36-degree, was still in the planning stages.
To take the OBC from RIOPS, FVCOM had the advantage of using a single unstructured-grid with variable horizontal resolution. Based on previous experience, the FVCOM domain was set as shown in Figure 2a, which encompasses the Bay of Fundy and Gulf of Maine, and extends offshore to include the Scotian Shelf and the shelf break. The horizontal cell size ranges from 14 km offshore to 48 m in SJH ( Figure 2b). There are 21 geometrically spaced vertical sigma-levels resulting in layer thicknesses ranging from centimeters at the surface, to hundreds of meters at depth in the shelf break area.

Setup of NEMO and FVCOM Models
The models are based on NEMO version 3.6 and FVCOM version 3.2.1. Both models solve the 3D equations controlling the variations of ocean currents, water levels, temperature and salinity. The hydrostatic and Boussinesq approximations are applied. The minimum water temperature was set to freezing temperature. This is a simplified approach to account for sea-ice formation in winter, without turning on the sea-ice modules in either model.
The setup of the model domains considered the large-scale ocean forecasting system for providing the lateral open boundary condition (OBC). At the time as the evaluation, the available large-scale model was RIOPS which has a nominal horizontal resolution of 1/12-degree longitude/latitude, about 7.5 km in the study area. The finer coastal ocean prediction system for the east coast of Canada, at the nominal resolution of 1/36-degree, was still in the planning stages.
To take the OBC from RIOPS, FVCOM had the advantage of using a single unstructured-grid with variable horizontal resolution. Based on previous experience, the FVCOM domain was set as shown in Figure 2a, which encompasses the Bay of Fundy and Gulf of Maine, and extends offshore to include the Scotian Shelf and the shelf break. The horizontal cell size ranges from 14 km offshore to 48 m in SJH ( Figure 2b). There are 21 geometrically spaced vertical sigma-levels resulting in layer thicknesses ranging from centimeters at the surface, to hundreds of meters at depth in the shelf break area. The structured-grid of NEMO does not have the flexibility to vary the horizontal grid size greatly with a single design. Instead, the approach taken was to develop three configurations (shown in Figure 2a) that are connected with one-way nesting. The grids of the three components are all aligned with the tri-polar ORCA grids created by the DRAKKAR Group [34], [25]. The outer component covers the Bay of Fundy and the western Scotian Shelf with a nominal horizontal resolution of 1/36degree, hence referred to as BoFSS1/36. The intermediate component covers the Bay of Fundy with an approximate 500 m resolution (hereafter BoF500). The inner component covers the inner and outer harbour with an approximate 100 m resolution (hereafter SJAP100). Note that the Saint John river estuary system, including the Reversing Falls, is included in SJAP100, roughly represented in BoF500, but totally excluded in BoFSS1/36. The one-way nesting is achieved by the larger domain component The structured-grid of NEMO does not have the flexibility to vary the horizontal grid size greatly with a single design. Instead, the approach taken was to develop three configurations (shown in Figure 2a) that are connected with one-way nesting. The grids of the three components are all aligned with the tri-polar ORCA grids created by the DRAKKAR Group [25,34]. The outer component covers the Bay of Fundy and the western Scotian Shelf with a nominal horizontal resolution of 1/36-degree, hence referred to as BoFSS1/36. The intermediate component covers the Bay of Fundy with an approximate 500 m resolution (hereafter BoF500). The inner component covers the inner and outer harbour with an approximate 100 m resolution (hereafter SJAP100). Note that the Saint John river estuary system, including the Reversing Falls, is included in SJAP100, roughly represented in BoF500, but totally excluded in BoFSS1/36. The one-way nesting is achieved by the larger domain component providing the OBC forcing to the next level smaller domain component successively, i.e., from RIOPS to BoFSS1/36, from BoFSS1/36 to BoF500, and from BoF500 to SJAP100. NEMO uses z-levels in the vertical space, with "bottom partial cells" for the accurate representation of the varying bathymetry, and the "variable volume level" scheme [35] to allow for the stretching and compression of the level thickness according to the changing water levels. Regarding the set-up of vertical levels, BoFSS1/36 uses the same 50-level setup as RIOPS but only has 28 active levels because its domain does not reach the deep ocean; BoF500 and SJAP100 have the same setup with 41 and 35 active levels, respectively, due to the difference in the maximum water depth between the two domains. The finest vertical resolution is near the surface, 1 m for all three components.
A high-resolution coastline dataset provided by the CHS was used to define the coastline represented in FVCOM and each component of NEMO. The model bathymetry within the Bay of Fundy and part of the Gulf of Maine was interpolated from a high-resolution dataset constructed from CHS and the Ocean Mapping Group, University of New Brunswick (OMG) data sources. Over the area not covered by the CHS/OMG data, NEMO used the global high-resolution SRTM30 bathymetry product [36] and FVCOM used bathymetry from the Scotia-Fundy-Maine grid of WebTide. As a wetting/drying scheme is unavailable in NEMO version 3.6, several modifications were made to the coastline and bathymetry, mainly in the upper Bay of Fundy and in the SJH which has large intertidal areas. Part of the intertidal zone was excluded, and part had the water depth increased to avoid drying and to maintain numerical stability. The modifications ensured the good representation of the M2 tides in and near the SJH [25].
Except for the river estuary system, which was initialized with the same data as the NEMO BoF500 simulation, the FVCOM model was initialized with the 3D temperature and salinity from the RIOPS solution on 1 February 2015, and was ramped up from rest (zero sea level and velocity) over a period of 36 h. The outer component of NEMO (BoFSS1/36) was initialized with the RIOPS solution on 1 February 2015, including 3D temperature and salinity, sea levels, and currents. BoF500 was initialized with the BoFSS1/36 solution on 1 April 2015, but with the temperature and salinity in the river estuary system taken from the solution of a separate 1-year simulation. Finally, SJAP100 is initialized on 10 April 2015, from the BoF500 solution. The simulations prior to 1 May 2015 (the start of the evaluation duration), are treated as the spin-up of the models. For FVCOM, the model spin-up occurs between 1 February and 30 April 2015. This spin-up period is considered sufficient for the winter conditions of the study area [25].
Input data for the lateral OBC included tidal and non-tidal components. For the FVCOM model and the BoFSS1/36 of NEMO, the non-tidal water levels for the OBC, at hourly intervals, were obtained from the de-tided RIOPS solution. The NEMO OBC also included non-tidal depth-averaged velocity from RIOPS. De-tiding is necessary because the RIOPS tidal solution contains significant errors in the Bay of Fundy. The input for OBC also included the daily averaged 3D temperature and salinity (without de-tiding) from RIOPS. The tidal water levels (and depth-averaged velocity for NEMO only) were computed from the harmonics of five major tidal constituents (M2, N2, S2, K1 and O1) from the WebTide solution [26]. Other tidal constituents will be included in the future development and operational application of the models. The BoFSS1/36 obtained a sufficiently accurate tidal solution near the open boundary of BoF500, and hence, BoF500 used both the tidal and non-tidal forcing for OBC from the solution of the BoFSS1/36, at 0.5 h intervals. Consequently, the SJAP100 took the full OBC forcing from the BoF500, also at 0.5 h intervals. Regarding the setup of the OBC, NEMO applied a "radiation" scheme for the barotropic (depth-averaged) current normal to the open boundary [37], and a "flow relaxation" scheme for temperature, salinity and the baroclinic current over a "relaxation zone" [38]. FVCOM enforced the water level, temperature and salinity at the open boundary nodes. A sponge layer was implemented at the open boundary with a radius of 25 km. Within the sponge layer, the longitudinal and latitudinal velocities, u and v, were reduced by a factor of u/(1+ Cspg u2) and v/(1+ Cspg v2). The friction coefficient, Cspg, was specified to be 0.002 at the open boundary nodes and linearly decreases to zero over the radius of the sponge layer.
The Saint John River runoff was included in FVCOM, and the BoF500 and SJAP100 components of NEMO. For both the NEMO and FVCOM configurations, the open boundary of the river was placed upstream of the Long Reach and below the Spoon Island (see Figure 2). This location is close to the upstream extent of the salt wedge intrusion and limits the inclusion of complicated river morphology.
The river forcing was not included in the BoFSS1/36, but this did not cause a significant error in the BoFSS1/36 solution at the lateral open boundary of the BoF500. Although both models have the ability to include river discharge, suitable input data were not available, and thus the entry of the SJR was treated as an open boundary forced with the observed water levels at the Oak Point station in the river estuary system (Figure 2). A constant value of 0.7 m was subtracted from the observed Oak Point water levels prior to using them for OBC forcing. This constant is the estimated difference in the reference levels between the Oak Point station and the model [21].
At the sea surface, the models were forced using the forecast fields from the High-Resolution Deterministic Prediction System of the CCMEP, with a horizontal resolution of 2.5 km [39]. The forecast variables included hourly winds at 10 m height, air temperature and specific humidity at 2 m height, sea-level air pressure, precipitation, and surface incoming longwave and shortwave radiation. In NEMO, the surface wind stress, the sensible and latent heat fluxes, and the rate of evaporation were calculated using the forecast variables and the bulk formulae of the Coordinated Ice-Ocean Reference Experiments [40]. In FVCOM, the corresponding calculations were based on the Coupled Ocean-Atmosphere Response Experiment (COARE) version 3.0 algorithm [41], which was modified to use the specific humidity (instead of the relative humidity) as the input. Note that for the solutions provided for evaluation, the forcing of surface freshwater flux, due to evaporation and precipitation, was included in NEMO but not in FVCOM. According to the evaluation results (presented in the following sections), this difference was not a main factor for the difference in the model solutions.
Finally, both NEMO and FVCOM include parametrizations for unresolved sub-grid processes. The parameterizations used by the two models are similar but with some subtle differences, and are tuned separately to improve the performance of the models. NEMO uses a partial slip scheme for velocity near the lateral solid boundary, and the variable horizontal mixing for momentum computed with the scheme of Smagorinsky [42,43]. FVCOM specifies a zero normal component of the velocity on the lateral solid boundary and also uses the Smagorinsky eddy parameterization for the horizontal diffusion [44]. Both models adopt high-order closure schemes to compute the vertical mixing for tracers and momentum [45]: the k-ε configuration of the generic length scale scheme for NEMO and the 2.5 level scheme of Mellor-Yamada for FVCOM. Both models adopt the quadratic law for bottom drag parameterization but with different tuning of the drag coefficient to maintain the numerical stability and improve the tidal solutions. For NEMO, the drag coefficient has background values of 2.5 × 10 −3 , 4 × 10 −3 and 5 × 10 −3 for the BoFSS1/36, BoF500 and SJAP100 components, respectively. In BoF500 and SJAP100, the drag coefficient was locally increased in the upper Bay of Fundy and in the Reversing Falls [46]. In FVCOM, the coefficient Cd is calculated by fitting a logarithmic bottom boundary layer to the model at a specified height zab above the bottom with a bottom roughness scale of z0, Cd = max(k2/ln(zab/z0)2,Cdmin) where k is the von Karman constant, z0 = 0.001, and Cdmin = 0.02. To obtain stable runs and simulate the strong dissipative effects of the rapids in the Reversing Falls, the sponge layer feature of FVCOM was used to reduce the horizontal currents in the Falls and approximately 1 km downstream using a damping coefficient of 0.015 m and a sponge radius of 500 m.

Evaluation of the Tidal Water Level and Currents
The tides dominate the variations of water levels and currents in the BoF and are a major concern for navigational safety in SJH. The tidal currents have a direct impact on the drift of spilled and other hazardous materials in the water.
The evaluation of the tidal water level and currents was based on their harmonic constants using the following metrics: Difference in amplitude and phase: Percent difference in amplitude: Vector difference for water level: Vector difference for currents: with: In the above equations, the subscript "o" denotes the observations and "m" denotes the model results.
All the variables are defined for a single constituent. In Equations (1) and (2), X represents either the amplitude or phase. In Equation (3), A and ϕ denote the amplitude and phase of water level, respectively [26]. In Equations (4a) and (4b), Q is the 2D tidal flow (u, v) represented in complex form (with i = √ −1), t is the time and T is the period of a specific tidal constituent with frequency ω. The other symbols are related to the tidal current ellipse: M is the length of the semi-major axis, ε is the eccentricity (defined as the ratio of the minor axis to the major axis), θ is the orientation, and ψ is the phase of the complex current [46].
For an area like the BoF, where there is significant spatial variability in the tides, percent difference is a more useful metric than the absolute difference. Vector difference (Equations (3) and (4)) combines the errors in amplitude and phase into one metric. Vector difference was the primary metric for evaluating the models' overall skill in reproducing the tides, but evaluating the amplitude and phase separately was useful for understanding the root of the errors. Harmonic analysis on the water level and currents used the t_tide package for [47] with the Rayleigh criteria equal to 1. For stations with water level observations during 1 May 2015-30 April 2016, which include the real-time gauges and the moored ADCPs, the model output was extracted for the same time period as the observations. We also obtained harmonic constants derived from historical observations at the "constituent-only" stations. At these stations, the harmonic constants of the model solutions were obtained by performing the analysis on the full 1 year model output, regardless of the length of observations that were used to compute the observed constituents. For tidal currents, the model output was interpolated to the observed depths. Constituents (major and minor axis, inclination and phase) were computed for each depth bin of the observations. The evaluation focused on selected depths: the "surface", and 5, 10 and 20 m from the surface. Here, the "surface" is the uppermost bin with >75% ping return for the entire time series.
The polar plots in Figure 3 visually report the tidal water level metrics for each station in the inner harbour ( Figure 3a) and outer harbour (Figure 3b), for the M2 constituent. The tidal ellipses in Figure 3 are representative examples of the M2 current at the surface from one station in the inner harbour ( Figure 3c) and one station in the outer harbour (Figure 3d). Table 2 summarizes the evaluation results for the M2 constituent, for several selected metrics: D% (amplitude), D (phase) and Dv for water levels; and D (major axis), D (phase), and Dv for currents. The mean and the standard deviation of the mean for each metric, across the stations in the inner harbour and outer harbour, separately, are listed. Note that the means and standard deviations were computed using the absolute value of the metric so that the values reflect the average error in the model, regardless of whether the error was positive or negative. Similar tables were produced for other tidal constituents, but, for brevity, they are not presented here. For the tidal water levels, according to Dv, FVCOM performed the best overall, followed by WebTide, NEMO, and then RIOPS. The performance of the models was not significantly different between the inner harbour and the outer harbour. Compared to WebTide, FVCOM had a slightly larger error in amplitude but a smaller error in phase. The larger error in NEMO was due to an error in the input files that was identified toward the end of the evaluation and corrected afterward [25]. RIOPS significantly underperformed compared to the other models, mainly due to the coarse spatial resolution that cannot capture the resonance characteristics of the M2 tide in the Bay of Fundy and uncalibrated tides in general.
For tidal currents, according to the Dv metric, FVCOM and NEMO performed similarly in the inner harbour, but FVCOM performed slightly better than NEMO in the outer harbour. FVCOM had smaller errors in the phase in the inner harbour, and smaller errors in the major axis in the outer harbour. Both FVCOM and NEMO significantly outperformed WebTide, mainly because the WebTide does not include the baroclinic dynamics. The errors for the RIOPS currents were not quantified but were expected to be significantly larger.     For the tidal water levels, according to Dv, FVCOM performed the best overall, followed by WebTide, NEMO, and then RIOPS. The performance of the models was not significantly different between the inner harbour and the outer harbour. Compared to WebTide, FVCOM had a slightly larger error in amplitude but a smaller error in phase. The larger error in NEMO was due to an error in the input files that was identified toward the end of the evaluation and corrected afterward [25]. RIOPS significantly underperformed compared to the other models, mainly due to the coarse spatial resolution that cannot capture the resonance characteristics of the M2 tide in the Bay of Fundy and uncalibrated tides in general.
For tidal currents, according to the Dv metric, FVCOM and NEMO performed similarly in the inner harbour, but FVCOM performed slightly better than NEMO in the outer harbour. FVCOM had smaller errors in the phase in the inner harbour, and smaller errors in the major axis in the outer harbour. Both FVCOM and NEMO significantly outperformed WebTide, mainly because the WebTide does not include the baroclinic dynamics. The errors for the RIOPS currents were not quantified but were expected to be significantly larger.

Evaluation of the Non-Tidal (Residual) Water Level and Currents
The non-tidal (or residual) component of the water level is important for predicting extreme water level events like storm surges. Residual currents dominate the net (after averaging out tidal excursions) evolution of the drift trajectory and contribute to extreme events of concern for navigational safety.
The residual water level and currents were computed by removing the tidal component (as determined by t_tide) from the observations and model output. An additional filter was applied to remove the energy in the tidal period bands (22-28 h, 11-14 h, 5-7 h). For the currents, all energy at periods < 7 h was removed.
The following metrics were used to evaluate the residual water level and the currents at each station (and depth for currents): Mean bias: Root mean square error: Relative average error: Correlation: Skill: In the above equations, X denotes the time series of the water level or a velocity component, and the overbar denotes its time mean. The other notations are consistent with those used in Equations (1)-(4). In conjunction with the above metrics, we also compared the power spectra of the time series.  Figure 4d,e, respectively. NEMO and FVCOM obtained consistent results for the two freshet events, including the deficiencies after the peak water levels. That is, the modelled water levels were lower than those observed after the peak of the fall freshet, and higher than that observed after the peak of the spring freshet. The causes of these discrepancies were not further explored. The three models (NEMO, FVCOM and RIOPS) all captured the main characteristics of the observed variations of residual water level over a full year period, including similar spectral distributions.    Table 3 summarizes the results as the mean and standard deviation across all stations (and the depths for currents) for each metric. For the residual water level, the metrics were computed on the demeaned times series. The mean bias for the water level across all the stations (including the ADCP measurements) was not computed. NEMO, FVCOM and RIOPS showed very similar values for the three metrics (RMSE, RAE and correlation) listed. There are also no significant differences in the skills between the inner and outer harbours. The correlation of both models with the observed water level was high; on average, 0.74 in the inner harbour, and 0.78 in the outer harbour, but as much as 0.9 at some stations. The metrics showed no significant difference in model performance between the inner harbour and the outer harbour, but the spectra from the Saint John tide gauge indicates that, in the inner harbour, the NEMO model had better representation of the energy in the 5-10 day period band, and the FVCOM captured more of the high-frequency (<6 h) energy. Table 3. The evaluation metrics for the residual water level and currents (see definitions in Equations (5)- (8) and in text), averaged for the stations in the inner and outer SJH, separately. Numbers in brackets are the standard deviations across the stations. RMSE, RAE, R and Skill are defined in Equations (6)- (9). Note that for the water levels, the mean bias is not computed and the other metrics are computed for the demeaned time series. Water level metrics are computed for NEMO, FVCOM and RIOPS. Current metrics are computed for NEMO and FVCOM only.  Figure 5 presents examples of the observed and modelled residual current at the surface from representative ADCPs in the inner harbour (a,b) and the outer harbour (c,d). As reflected in the figures, both NEMO and FVCOM captured the timing of large-scale events but underestimated the magnitude of the currents in the inner harbour. The magnitude of the currents was better predicted by the models in the outer harbour. The metrics in Table 3 indicate that both NEMO and FVCOM showed a better performance in the inner harbour (with average correlations in u an v between 0.42 and 0.53, respectively) compared to the outer harbour (average correlation between 0.15 and 0.32). We did not evaluate the models against RIOPS because its coarse horizontal resolution poorly represents the spatial variations of currents in SJH.

Evaluation of the Temperature and Salinity Fields
Variations in the water temperature and salinity cause changes in density, density-driven currents, the stratification of the water column, and the chemistry (which subsequently influences the fate and behavior of spilled oil) of the water.
Variations in the temperature and salinity in SJH were first evaluated qualitatively for three cases: low water, spring freshet and fall freshet. Salinity variations are mainly attributed to the input of freshwater from the Saint John River and the mixing of freshwater and seawater in the harbour. Because NEMO and FVCOM were forced by the observed water level at Oak Point station (in the river estuary system), we first qualitatively diagnosed the freshwater flux (river runoff) across a section a few kilometers downstream of Oak Point and determined that the diagnosed fluxes were similar in both models. Figure 6 presents the surface density fields from NEMO and FVCOM, as well as satellite-derived images of suspended particulate matter (provided by E. Devred, DFO), during

Evaluation of the Temperature and Salinity Fields
Variations in the water temperature and salinity cause changes in density, density-driven currents, the stratification of the water column, and the chemistry (which subsequently influences the fate and behavior of spilled oil) of the water.
Variations in the temperature and salinity in SJH were first evaluated qualitatively for three cases: low water, spring freshet and fall freshet. Salinity variations are mainly attributed to the input of freshwater from the Saint John River and the mixing of freshwater and seawater in the harbour. Because NEMO and FVCOM were forced by the observed water level at Oak Point station (in the river estuary system), we first qualitatively diagnosed the freshwater flux (river runoff) across a section a few kilometers downstream of Oak Point and determined that the diagnosed fluxes were similar in both models. Figure 6 presents the surface density fields from NEMO and FVCOM, as well as satellite-derived images of suspended particulate matter (provided by E. Devred, DFO), during two different phases of the tide. These figures show that the models obtained a similar extent of the river plume, consistent with the interpretation of satellite images.  The SmartAtlantic Buoy provided the only time series measurement of SST in SJH and was essential for evaluating the annual cycle of the SST. Figure 7 presents the comparison of the observed and modelled SST from the NEMO and FVCOM. The mean bias, RMSE, RAE, correlation and skill (as described in equations [5][6][7][8][9] were computed for the time series of SST from NEMO, FVCOM and RIOPS, and are presented in Table 4. Because the temperature fields from RIOPS were only available as daily means, the metrics for RIOPS were computed using the daily mean of the observations. Both NEMO and FVCOM captured the observed seasonal cycle as well as the high-frequency variations, but there was a warm bias in the FVCOM model. FVCOM performed slightly better than NEMO in terms of correlation (0.96 for FVCOM and 0.94 for NEMO), and both models outperformed RIOPS across all metrics except mean bias. Figure 8 compares the observed vertical profiles of temperature and salinity from the CTD casts with the model simulations at two representative stations: one in the inner harbour (a-b), and one in the outer harbour (c-d). The temperature and salinity profiles fluctuate significantly with the phase of the tide in SJH, particularly for salinity due to the interaction of river runoff, tidal flow, and mixing. The observed profiles were compared to the nearest model output in time (t) and space. The figures also include shaded areas that represent the profiles within t − 3 hours and t + 3 hours which show how the modelled profiles varied over a 6 hour period. As expected, the figures show more variation in the modelled T and S profiles in the inner harbour than in the outer harbour. For the qualitative evaluation of the profiles, the model results were "acceptable" if the observed profile fell within the shaded area. Generally speaking, NEMO showed a better representation of the vertical stratification. FVCOM obtained more uniform temperature and salinity profiles, probably due to too strong vertical mixing that needs to be tuned. The SmartAtlantic Buoy provided the only time series measurement of SST in SJH and was essential for evaluating the annual cycle of the SST. Figure 7 presents the comparison of the observed and modelled SST from the NEMO and FVCOM. The mean bias, RMSE, RAE, correlation and skill (as described in Equations (5)-(9)) were computed for the time series of SST from NEMO, FVCOM and RIOPS, and are presented in Table 4. Because the temperature fields from RIOPS were only available as daily means, the metrics for RIOPS were computed using the daily mean of the observations. Both NEMO and FVCOM captured the observed seasonal cycle as well as the high-frequency variations, but there was a warm bias in the FVCOM model. FVCOM performed slightly better than NEMO in terms of correlation (0.96 for FVCOM and 0.94 for NEMO), and both models outperformed RIOPS across all metrics except mean bias. Figure 8 compares the observed vertical profiles of temperature and salinity from the CTD casts with the model simulations at two representative stations: one in the inner harbour (a,b), and one in the outer harbour (c,d). The temperature and salinity profiles fluctuate significantly with the phase of the tide in SJH, particularly for salinity due to the interaction of river runoff, tidal flow, and mixing. The observed profiles were compared to the nearest model output in time (t) and space. The figures also include shaded areas that represent the profiles within t − 3 h and t + 3 h which show how the modelled profiles varied over a 6 h period. As expected, the figures show more variation in the modelled T and S profiles in the inner harbour than in the outer harbour. For the qualitative evaluation of the profiles, the model results were "acceptable" if the observed profile fell within the shaded area. Generally speaking, NEMO showed a better representation of the vertical stratification. FVCOM obtained more uniform temperature and salinity profiles, probably due to too strong vertical mixing that needs to be tuned. The mean bias, RMSE, and RAE were computed for each profile, and the correlation and skill were computed using all the profile data in the inner harbour and outer harbour, separately. Table 3 summarizes the results, listing the mean and standard deviation across all stations in the inner harbour and outer harbour, separately, for mean bias, RMSE, and RAE. According to the correlation metric, NEMO and FVCOM performed better in the inner harbour than the outer harbour. With respect to salinity, the summarized metrics indicate that the errors for NEMO were lower than FVCOM, but a t-test shows no significant difference. The correlation metric indicates that NEMO had a slight advantage over FVCOM in the inner harbour, while the opposite was true in the outer harbour.
The high-variability of temperature and salinity in Saint John Harbour, particularly when close to the mouth of the river, was difficult to evaluate with point measurements, but nonetheless, the analysis revealed the presence of a warm/salty bias in the FVCOM model. The warm/salty bias in FVCOM has since been corrected and was due in small part to a technical issue in the implementation of the COARE 3.0 algorithm that was identified toward the end of the evaluation, and in large part due to the placement of the outer boundary extending past the shelf break. That being said, both models performed better than RIOPS due to the inclusion of the freshwater flux from the Saint John River. profiles). See the definitions of the metrics in equations 5-9 and in the text. The mean bias (D), RMSE and RAE are averaged for the stations in the inner and outer SJH. Numbers in brackets are the standard deviations across the stations. Correlation and skill were computed using all the profile data in the inner and outer harbour, separately.   The mean bias, RMSE, and RAE were computed for each profile, and the correlation and skill were computed using all the profile data in the inner harbour and outer harbour, separately. Table 3 summarizes the results, listing the mean and standard deviation across all stations in the inner harbour and outer harbour, separately, for mean bias, RMSE, and RAE. According to the correlation metric, NEMO and FVCOM performed better in the inner harbour than the outer harbour. With respect to salinity, the summarized metrics indicate that the errors for NEMO were lower than FVCOM, but a t-test shows no significant difference. The correlation metric indicates that NEMO had a slight advantage over FVCOM in the inner harbour, while the opposite was true in the outer harbour. Table 4. Evaluation metrics for the time series of the SST (compared to the observations from the SmartAtlantic (SA) Buoy), and the water temperature and salinity (compared to the observed CTD profiles). See the definitions of the metrics in Equations (5)- (9) and in the text. The mean bias (D), RMSE and RAE are averaged for the stations in the inner and outer SJH. Numbers in brackets are the standard deviations across the stations. Correlation and skill were computed using all the profile data in the inner and outer harbour, separately. The high-variability of temperature and salinity in Saint John Harbour, particularly when close to the mouth of the river, was difficult to evaluate with point measurements, but nonetheless, the analysis revealed the presence of a warm/salty bias in the FVCOM model. The warm/salty bias in FVCOM has since been corrected and was due in small part to a technical issue in the implementation of the COARE 3.0 algorithm that was identified toward the end of the evaluation, and in large part due to the placement of the outer boundary extending past the shelf break. That being said, both models performed better than RIOPS due to the inclusion of the freshwater flux from the Saint John River.

Evaluation of Drifter Simulations
Improving drift prediction is one of the primary objectives of the oceanography component of the OPP. Hence, our evaluation process includes a simulation of the surface drift trajectories and a comparison with the observed trajectories of the four types of surface drifters deployed in the SJH.
Drifter simulations were performed using the Canadian Oil Spill Modelling Suite (COSMoS) [48]. COSMoS is a Lagrangian displacement model that solves the equations of motion for drift using surface currents and surface wind. It takes surface currents from the results of models on the structured or unstructured native grids. Note that currents are computed at the centre of the FVCOM grid elements so that the solutions had to be interpolated to the nodes. The computation of the drifter trajectories uses a fourth order Runge-Kutta scheme. The simulations were run with surface currents from the NEMO (SJAP100 and BoF500), FVCOM, and WebTide. A simulated drifter was "released" at the same time and location of each observed drifter, and the simulation was terminated at the end of the observed drifter release. The drift evaluation was restricted to the smallest common domain, which is the SJAP100 domain. Even though BoF500 has a lower resolution in this area, the drift simulations were completed using currents from both SJAP100 and BoF500 to help understand the influence of increased resolution on drift prediction accuracy.
The inclusion of surface wind in the drifter simulations is to account for supplemental wind effects on the drifting object that are not included in the surface currents. However, the true wind effects may differ for different types of drifters. Simulations were run with 0% and 3% of the wind speed (10 m winds from HRDPS) to evaluate the effect. The results consistently showed that adding 3% of wind to the drifter simulations slightly improved the overall skill of the NEMO-based trajectories, but degraded the FVCOM-based solutions. Thus, the results presented here focus on the solutions with no added wind effect.
The first aspect of the drift evaluation was to qualitatively compare the distributions of surface velocities derived from the observed and modelled drifter trajectories. Drifter velocity is approximated as a forward Euler difference between the successive positions and reported in along-shore (241 • T positive) and cross-shore (151 • T positive) components. Note that without including the wind effect, the "drifter velocity" from the models are very close to the surface currents. Figure 9a-e shows the distributions of drifter velocities for the Davis drifters, but the results were similar for all drifter types. The mean and RMS speeds derived from these distributions were within a few percent of each other for all cases, and the shapes of the distributions were generally similar. However, the distribution of the observed drift velocities appears to have heavier tails than the modelled drift velocity distributions, which is consistent with the underprediction of currents noted in Section 3.5. As expected, the distributions based on WebTide were more bi-modal (consistent with ebb and flood tide) in the along-shore direction and more peaked around 0 in the cross-shore direction. They showed no mean speed, and underpredicted the variability in the drift speeds. Such consequences of WebTide having no baroclinic flow was evident here, as well as across all the evaluations of the drifter trajectories (panels d, i, n, s, w, aa of Figure 9). Therefore, the WebTide-based results are not discussed further in this paper.
The second aspect of the drift evaluation was to compare the absolute dispersion of the observed and modelled drifters by plotting distributions of along-shore and cross-shore drifter displacements as a function of time lag, τ. Mathematically: where ∆X is the change in along-shore or cross-shore displacement during time lag τ. To account for all the possible displacements in the area sampled by the drifters, all non-overlapping segments of drifter tracks were used in the analysis. This is equivalent to assuming homogeneous and stationary absolute dispersion statistics. Figure 9f-o shows the dispersion of the observed and modelled Davis drifters for up to 24 h. Displacements in the along-and cross-shore directions (y axis) are binned at hourly intervals (x axis) and the density of the observations in each bin is plotted to visualize the temporal evolution of the drifter 'cloud'. The statistics of along-shore dispersion display a clear 'tidal' character, that is, the expansion and contraction of the 'displacement cloud' on a~12 h cycle. This is superimposed on a mean down-coast (along-shore positive) drift. Oscillations were not as evident in the cross-shore displacements. These characteristics were consistent across all four drifter types.
Generally, the distributions of displacement produced by the NEMO and the FVCOM were similar to the observed statistics, though extreme displacements vary slightly between models and observations. Both FVCOM and NEMO retained the general shape of the tracks, but on average the modelled along-shore displacements were lower than the observed values, indicating that the motion was characteristically similar but occurred closer to Saint John in the models than was observed. In the cross-shore direction, both the FVCOM and the NEMO predicted a larger mean displacement than was observed, indicating that the modelled drifters travelled further offshore than the observed ones. Separation rate is defined as the relative rate of change between the modelled drifter position and the observed drifter position . This is a basic metric for trajectory model performance that allows for the simple interpretation of the quality of future forecasts. It is visualized here by plotting the separation between the modelled and observed trajectories as a function of time ( Figure  9p-s). The 0FVCOM-based trajectory predictions showed the slowest separation rate, though the results varied based on the drifter type. The FVCOM performed best when predicting the tracks of the CODE/Davis drifters, which suggests that the modelled currents averaged over the upper meter of the ocean correspond best with currents at 0.5 m depth (the approximate centroid of the subsurface Figure 9. Results of the drifter analysis for the simulated drifters using the FVCOM (left column), SJAP100 (second from left), BoF500 (middle), WebTide (second from right), and observations where applicable (right column). Distributions of along-shore (blue) and cross-shore (red) drifter velocity (for the Davis drifters) are shown in the first row (a-e). Absolute dispersion in the along-/cross-shore directions (for the Davis drifters) is shown in the second and third row (f-o) with the mean drift superimposed on the density of the simulated/observed positions in each direction. Separation rate between the simulated drifters and the corresponding observations is shown in the fourth row (p-s) for all four drifter types. Instantaneous skill corresponding to the various model predictions is shown in the fifth row (t-w) for all drifter types, and the cumulative skill for the prediction of the Davis drifter trajectories is shown in the sixth row (x-aa), with the mean cumulative skill as a function of time is shown as a solid blue line.
The third aspect of the drift evaluation was to quantitatively assess the simulations using the following metrics: Instantaneous skill: Cumulative skill: Separation rate is defined as the relative rate of change between the modelled drifter position → x mod and the observed drifter position → x obs . This is a basic metric for trajectory model performance that allows for the simple interpretation of the quality of future forecasts. It is visualized here by plotting the separation between the modelled and observed trajectories as a function of time (Figure 9p-s). The 0FVCOM-based trajectory predictions showed the slowest separation rate, though the results varied based on the drifter type. The FVCOM performed best when predicting the tracks of the CODE/Davis drifters, which suggests that the modelled currents averaged over the upper meter of the ocean correspond best with currents at 0.5 m depth (the approximate centroid of the subsurface portion of the CODE/Davis drifters). Separation rates from SJAP100 and BoF500 were mostly similar, but BoF500 performed slightly better at longer time scales.
Instantaneous skill (adapted from Molcard et al. [49]) aims to answer the question, 'To what extent can the model help us locate a drifting object given a last known position and associated time?'. It defines the model's skill as a function of absolute separation from the initial location at an instant in time, denoted by the ith discrete time interval. In Equation (12), d i is the separation between the observed and modelled trajectories at a given point in time (denoted by subscript i), and S i is the linear distance between the observed drifters' position and its deployment location. The normalization by S i ensures that skill is assigned based on the accuracy of the prediction relative to the magnitude of the observed displacement. For example, a prediction that is within 500 m of the corresponding observation is assigned a higher skill if the observed drifter has travelled 5 km than if it has travelled 1 km. This implies that the model skill can increase with time if Si increases more rapidly than d i . This behaviour is clearly evident in the results for Sponge drifters, which exhibited a large down-coast displacement. A non-zero instantaneous skill score indicates that the model improves our knowledge of the drifting object's position from the guess that it is still at the last known position. Therefore, all non-zero skill scores indicate that the model is useful, with higher skill scores indicating a narrower margin of error. FVCOM simulations had the highest instantaneous skill score, and perform best when simulating the tracks of the CODE/Davis drifters.
Cumulative skill (taken from Liu and Weisberg, [50]) answers the question, 'How well can the observed trajectory be reproduced when all the aspects of the trajectory evolution are considered?'. It quantifies the skill of the model using cumulative sums to account for the model's ability to reproduce the magnitude, direction, and timing of drift events, and the range of spatial scales in the flow that are encountered by a drifter along its trajectory. In Equation (13), d i is as defined above, and D obs,i is the distance travelled by the observed drifter during the time i. Cumulative skill is a demanding test of the model's ability to reproduce the timing, magnitude, and direction of drift events, and any skill score above zero was considered an indication of useful model predictions. Figure 9x-aa shows the cumulative skill scores for the prediction of Davis drifter trajectories. Upon close inspection, the FVCOM model slightly outperforms SJAP100 with the solutions indicating non-zero skill for the forecasts up to (and slightly exceeding) 6 h for all drifter types. Again, the FVCOM's increased performance for the CODE/Davis drifters is evident in the cumulative skill scores (Figure 9x-y). We note that both models showed no cumulative skill for >50% of the observed trajectories, however, this might be expected for a highly dynamic region such as SJH where small differences in predictions near the start of the trajectory can lead to substantial deviations at later times.
Interestingly, across all metrics, simulations using BoF500 outperformed the SJAP100 configuration, and according to the cumulative skill scores, BoF500 produced the longest skillful forecast overall-a significant non-zero skill up to 19 h for the CODE/Davis drifters (Figure 9z). This suggests that increased spatial resolution does not necessarily improve the representation of the currents in the uppermost layer of the water column, and a smoothed solution may provide better trajectory predictions. This is related to the unconstrained turbulence in the higher resolution configuration, explained further in Jacobs et al. [51].

Evaluation of Model Efficiency
The evaluation of model efficiency included a comparison of (1) the number of cores and run-time required to run the models, and (2) the scalability of the models which describes the model's ability to run faster on more cores. The performance of the two models can be directly compared because both were run on the same high-performance computing platform of the Government of Canada. Both the NEMO and FVCOM models for Saint John Harbour met the operational requirement (a 48 h simulation in less than 30 min), but the FVCOM model had much less computational cost overall. For a 48 h simulation, the FVCOM used eight slots (24 cores per slot) with a run-time of 19.5 min; and the three-level nested NEMO model used 31 slots with a run-time of 29.1 min. The high computational cost of NEMO was mainly due to the 100 m configuration (SJAP100). Figure 10 shows that the run time of SJAP100 rapidly decreased when the number of slots increased from 10 to 18, and then gradually decreased with an increasing number of slots. For the FVCOM, the rapid decrease in run time occurred as the number of slots increased from 5 to 8. The scalability curve for the 500 m configuration of the NEMO (BoF500) is closer to that of the FVCOM. The requirements on the number of computer slots are related to the difference in the numbers of computed nodes (grids) among different models: the FVCOM has 56,635 nodes and 108,301 elements, while SJAP100, BoF500 and BoFSS1/36 have 466,718, 161,670 and 39,600 grid points, respectively. This demonstrated a major computational efficiency in using the unstructured grid FVCOM over the structured NEMO for nearshore applications. Despite having a higher resolution in SJH, the FVCOM ran faster than SJAP100. configuration (SJAP100). Figure 10 shows that the run time of SJAP100 rapidly decreased when the number of slots increased from 10 to 18, and then gradually decreased with an increasing number of slots. For the FVCOM, the rapid decrease in run time occurred as the number of slots increased from 5 to 8. The scalability curve for the 500 m configuration of the NEMO (BoF500) is closer to that of the FVCOM. The requirements on the number of computer slots are related to the difference in the numbers of computed nodes (grids) among different models: the FVCOM has 56,635 nodes and 108,301 elements, while SJAP100, BoF500 and BoFSS1/36 have 466,718, 161,670 and 39,600 grid points, respectively. This demonstrated a major computational efficiency in using the unstructured grid FVCOM over the structured NEMO for nearshore applications. Despite having a higher resolution in SJH, the FVCOM ran faster than SJAP100.

Conclusions
This paper represents a significant collaborative effort that took place during the first year of a major Canadian research and development program, the Oceans Protection Plan. The oceanography sub-initiative of the OPP was tasked to develop port-scale models to fit into the multi-scale operational ocean prediction systems in Canada, i.e., to extend the prediction capability from global, basin and coastal scales to nearshore waters. The work included three components: (1) the development of an evaluation process, (2) the development of model configurations with two stateof-the-art open source community models, and (3) the evaluation of the performances of the two model configurations including the comparison of each with the existing operational models. Note

Conclusions
This paper represents a significant collaborative effort that took place during the first year of a major Canadian research and development program, the Oceans Protection Plan. The oceanography sub-initiative of the OPP was tasked to develop port-scale models to fit into the multi-scale operational ocean prediction systems in Canada, i.e., to extend the prediction capability from global, basin and coastal scales to nearshore waters. The work included three components: (1) the development of an evaluation process, (2) the development of model configurations with two state-of-the-art open source community models, and (3) the evaluation of the performances of the two model configurations including the comparison of each with the existing operational models. Note that the second and third components were carried out simultaneously, and both model configurations were constantly improved and fine-tuned with the guidance of the evaluation.
The evaluation process includes (1) the selection of the study area, (2) the requirements for the model setup, and (3) the metrics for evaluating the models. The selection of the study area was specific to the nature of the OPP project in that one of the six OPP pilot ports was selected, but the choice of port enabled a general assessment of the models for near-shore port-scale applications. The requirements for the model setup were quite general for evaluating the performance of different source models under the same setting, but by ensuring that both models were configured using the same sources of input data for the model setup and forcing, the differences in the model performances can be mostly attributed to the differences in grid structure, model numerics, and software technology. Thus, the results of this study are valuable for further improvements of the models. The metrics can be used for evaluating any model. They were selected to assess the various aspects of the model results, and collectively, they were able to detect minor contrasts between the two models.
The chosen study area, Saint John Harbour in the Bay of Fundy, features the presence of strong tides, significant river runoff and a narrow tidal-river channel (Reversing Falls). The complicated regional oceanography posed challenges to both the NEMO and FVCOM configurations with the full baroclinic dynamics included. The common challenge was to ensure that the models ran stably, particularly in the narrow channel with strong currents and spatial gradients of temperature and salinity. For this purpose, both models included local treatment of bottom drag parameterization. Previous configurations of the FVCOM model for the Saint John Harbour area did not include atmospheric forcing. Adding atmospheric forcing, particularly the surface heat-flux, was a challenge, but the issue was later attributed to the previously mentioned bug in the code. A challenge for NEMO was the lack of experience in creating a multiple-level nested configuration for nearshore waters. Despite the urgent timeline, both configurations were created and improved during the first year of OPP.
The evaluation metrics were defined based on the existing expertise of the team, through expert consulting and literature research, as well as the available ocean observational data. Evaluation with these metrics led to continuous improvements of both models, that is, errors in model settings (bathymetry, model parameters and forcing) were constantly identified and corrected. The evaluation results presented in Sections 3.4-3.8 were based on the model results obtained toward the end of the first year of OPP. In terms of model accuracy, compared to the observational data, NEMO and the FVCOM achieved comparable metrics for the tidal and non-tidal components of sea level and currents, seasonal variation of the sea surface temperature, vertical profiles of water temperature and salinity, and the trajectories of surface drift. Note that for the evaluation of the drifter trajectories, the impacts of winds were not fully considered. This aspect needs to be included in future work. A major difference identified was that the FVCOM required less computer resources and ran faster than the NEMO. One possible solution to reduce the required number of the computer slots is to reduce the domain size of the finest model configuration (SJAP100), and using the two-way nesting approach to ensure the dynamic interaction across different configurations are properly simulated. Initial testing on two-way nested configuration was achieved after the completion of the evaluation in the first year (results not shown). The results of the model efficiency evaluation suggest that both the NEMO and FVCOM meshes could be further refined, but in doing so, the NEMO model would not meet the operational requirement for runtime due to the limitations of the model numerics and computer efficiency.
Finally, the evaluation of the models was limited by the existing observational data. With increasing resolution, the models are reaching the limits to which the ocean is observed in terms of small scale, rapidly varying, and nonlinear processes. To support the envisaged future of port-scale operational e-navigation systems, we need to reassess what observational capacities are required to support this, both from a monitoring point of view (i.e., real-time and high spatial resolution data), as well as for model development (e.g., delayed-mode data such as moorings).