Accurate Measurements of Forest Soil Water Content Using FDR Sensors Require Empirical In Situ (Re)Calibration

: Monitoring volumetric soil water content ( θ v ) is the key for assessing water availability and nutrient ﬂuxes. This study evaluated the empirical accuracy of θ v measurements using standard and in situ calibrated frequency domain reﬂectometers (FDR) with gravimetric water content and bulk density measurements of 1512 samples gathered from 15 proﬁles across 5 ICP Forests level II intensive monitoring plots. The predicted θ v , calibrated with standard functions, predominantly underestimated the real water content. The measurement error exceeded the threshold of 0.03 m 3 m − 3 in 93% of all soil layers. Layer speciﬁc calibration removed bias and reduced the overall prediction error with a factor up to 2.8. A simple linear regression often provided the best calibration model; temperature correction was helpful in speciﬁc cases. To adequately remove bias in our study plots, a calibration dataset of up to 24 monthly observations was required for topsoils (whereas 12 observations sufﬁced for subsoils). Based on estimated precision errors, 3 sensors per soil layer proved to be sufﬁcient, while up to 16 sensors are needed to meet the required accuracy in organic topsoils. Validating FDR sensor outputs using in situ gravimetric measurements is essential for quality control and assurance of long term θ v monitoring and for improving site speciﬁc instrumentalization.


Introduction
Volumetric soil water content (θ v ) is a key variable in hydrological, climatological, and environmental studies and its monitoring is essential to investigate soil-water-plant relationships and nutrient fluxes. Under a changing climate, the need for accurate spatiotemporal soil moisture data in all types of soils and land-use is even increasing.
Inaccurate soil moisture data has several potential implications: it may lead to the development of unreliable indicators for soil water availability or drought, it will introduce noise and systematic errors in water balance models at various spatial scales; and it could lead to incorrect nutrient budget calculations. Hence, there is a great need for affordable, highly accurate soil water measurement systems operating at a low cost for a long time.
For decades, soil moisture sensors have been widely used for their ability to provide nondestructive continuous data [1] and their predictive performance has been evaluated in many studies [1][2][3][4][5][6][7][8][9][10]. Sensors are often grouped in electromagnetic (EM) methods, such as time domain reflectometry (TDR) and frequency domain reflectometry (FDR), time domain transmission (TDT), capacitance and impedance sensors [6], and also neutron moisture meters (NMN) [11] and heat-pulse probes (thermal sensors) [12] have been applied for soil water content measurements at the point scale. The conventional reference method both in the field and in the lab is destructive gravimetric sampling, preferably in steel cylinders in order to determine gravimetric water content and bulk density on the same undisturbed soil sample. The Campbell CS616 sensor, we evaluated in this study, is an FDR type sensor widely recognized as a robust probe suitable for long-term monitoring, less accurate than TDR but more cost-effective [13]. FDR differs from conventional TDR in that it uses a specific voltage level of the signal reflected from the end of the waveguide to trigger the next pulse instead of analyzing the entire waveform as in TDR. Its installation, operation and performance are well described by Ruelle and Laurent [13]. The rise time of the reflected pulse of FDR sensors changes with the soil bulk electrical conductivity, clay type (mineralogy and specific surface area) and content, soil temperature, and organic matter content [10,11]. This limitation makes site-specific calibrations of such sensors necessary [14]. Therefore, efforts have been made for soil specific calibration of CS616 sensors: from coarse textured soils [1], clay-loam soils [5], silty clays [15] to soils with high-charge clays [3,7] or a range of soil textures [10] including saline soils [14].
Various ways of sensor calibrations are performed. The most straightforward one-step comparison is relating the FDR output period as an indirect method with a direct measure of θ v (standard gravimetric method). FDR response may also be compared with the response of other indirect methods assumed to be more accurate and precise, for instance NMN [3] or TDR [10]. Alternatively, a two-step calibration procedure can be used. In a first step, the relationship between sensor response and permittivity is determined for each sensor (i.e., a sensor-specific calibration). In a second step, site-specific relationships between permittivity and soil water content can be established using empirical or semi-empirical models [16] with a limited number of measurements on soil samples, preferably using the highly accurate TDR method (soil specific calibration) [6]. Lab calibration is often carried out on artificial or real volumetric soil samples in order to control most factors affecting the readings. Only few calibration studies are conducted directly in the field, since taking field samples is destructive, labor intensive, slow, not timely, and costly [11,17]. Most published calibration studies are performed on arable soils [1,4,17], which are generally more homogenous due to cultivation compared to forest soils characterized by greater and more variable organic matter contents, bulk densities and stoniness. For forestry, soil water is of primary concern as an essential medium enabling movement of nutrients towards and within trees, strongly impacting their growth, yield and vitality over decades. Therefore, research sites in forests are often equipped with TDR or FDR soil moisture sensors for continuous water budget monitoring [18], although few calibration studies evaluated the trueness and precision of these sensors. Findings in forest soils were reported by Czarnomski et al. [9] using a CS615 sensor, a former version of CS616, and Udawatta et al. [15] tested the water dynamics with the CS616 for an agroforestry practice. To our knowledge, this study is the first attempt to systematically perform in situ recalibration and validation of FDR sensors in European forest soils, and at various depths.
Direct measurement of reference gravimetric θ v , with good practice, can be accurate to less than 0.01 m 3 m −3 [11]. For most agricultural applications, Hignett et al. [11] recommended the sensor measurement accuracy for measuring θ v to be less than 0.02 m 3 m −3 , while Varble and Chávez [4] applied an acceptable mean bias less than 0.02 m 3 m −3 and RMSE < 0.035 m 3 m −3 . The latter performance criteria were copied by Dong et al. [1] to evaluate their CS616 FDR sensors in agricultural fields.
For forest soils, the Expert Panel on Meteorology operating under the International Cooperative Program (ICP) Forests (Working Group on Effects, UNECE AIR convention) is recommending in its current manual [19] a minimum acceptable accuracy for θ v measurements of 0.03 m 3 m −3 . In this study, we evaluated whether this accuracy level is achievable at single FDR sensor level using the standard calibration functions proposed by the manufacturer [20] or by recalibration functions using the direct field approach.
Hence, the research questions of this study are: (1) Does in situ recalibration substantially improve the accuracy of soil moisture measurements compared to standard calibration? (2) Is calibration required for each sampling location and depth to achieve the given data quality objectives? (3) Which soil characteristics have the largest impact on the bias and precision error? (4) How many reference measurements over which period are needed for adequate calibration of the FDR sensors? The practical outcome of this study is a calibration dataset for each plot and adequate layer specific calibration functions for our long-term monitoring plots.

Study Sites and Soil Characteristics
All measurements were conducted on five intensive monitoring plots of the International Co-operative Programme on Assessment and Monitoring of Air Pollution Effects on Forests (ICP Forests), in Flanders (North of Belgium) and monitored since the beginning of the 90s (Table A1). Flanders has a moderate Atlantic climate with a mean annual precipitation (MAP) of 837 mm and a mean temperature (MAT) of 11 • C (Reference period 1991-2020). Soil climate is classified as Mesic for soil temperature and Udic for soil moisture regime [21]. Two plots are installed in evergreen coniferous forest: Pinus nigra subsp. laricio (Poiret) Maire as an introduced species in Ravels (RAV), and Pinus sylvestris L. in Brasschaat (BRA). These even-aged Pinus stands (both around 91 years old) are typically growing on the predominantly sandy soils (Arenosols and Podzols) of the Campine region. The soil profile in RAV is well-drained, while in BRA the infiltration of water is locally slowed down by clay lenses at 100-190 cm depth. Three other plots are installed in deciduous broadleaved forest: Fagus sylvatica L. in Wijnendale (WIJ) and Hoeilaart (HOE) and a mixture of Quercus robur L. and F. sylvatica L. in Gontrode (GON). According to the World Reference Base for Soil Resources [22][23][24], the soil of WIJ is classified as an Umbrisol, GON as a Planosol and HOE as a Retisol, formerly classified as Albeluvisol. The associated qualifiers are listed in Table A1, as well as the parent material of the soils. Moder humus systems [25] were found in GON and HOE, whereas the other sites qualified as Mor systems (Table A1). All sites encompass circular plots with an area of 0.25 ha.
Average physico-chemical soil characteristics of the fixed depth soil layers are provided in Table 1. Soil textural fractions (clay, silt, and sand), acidity (pH-H 2 O), electrical conductivity (EC), total organic carbon (TOC), total nitrogen (TN) and cation exchang capacity (CEC) are determined on a sample composite of 24 sampling points, while bulk density (ρb) is the average of 5 undisturbed samples of each layer and the mass of the organic layer (OLM) based on at least 3 replicate measurements. These analytical results are stored in the ICP Forests Level II aggregated soil database [26]. For all soil variables methods were applied according to the ICP Forests soil manual [27]. For the distinct soil layers of the three profiles (A, B, C) per site, soil water retention curves (SWRCs) were fitted according to the Van Genuchten model [28] based on water retention measurements (Sand/Kaolin box and membrane pressure plates from Eijkelkamp equipment, The Netherlands) of undisturbed samples taken at the time of FDR sensor installation in each layer. Table 2 lists the parameters residual water content (θ r ) , saturated water content (θ s ), air entry (α) and pore size distribution (n). From these SWRCs we derived field capacity (θ fc ) for sandy soils (WIJ, RAV and BRA) at a potential of −10 kPa and for the silt loam to clay soils (GON and HOE) at −31.6 kPa. Permanent wilting point θ pwp was determined at −1584 kPa. The difference between both predicts available water capacity. Table 2. Soil water retention and soil moisture characteristics.

Van Genuchten Parameters of Modeled Soil Water Retention Curve
Soil Moisture Characteristics The volumetric soil water content is measured at four depths using Campbell CS616 soil water reflectometers (FDRs). Each FDR sensor consists of two stainless steel rods (300 mm long, 3.2 mm diameter, 32 mm spacing) connected to a printed circuit board. A shielded four-conductor cable is connected to the circuit board to supply power, enable the probe, and monitor the pulse output. The circuit board is encapsulated in epoxy. Highspeed electronic components on the circuit board are configured as a bistable multivibrator. The output of the multivibrator is connected to the probe rods which act as a wave guide. The travel time of the signal on the probe rods depends on the dielectric permittivity of the material surrounding the rods and the dielectric permittivity depends on the water content [20]. Digital circuitry scales the multivibrator output to an appropriate frequency for measurement with a datalogger (in our case a Campbell Scientific CR1000). The sensor output is essentially a 0.7 V square wave with frequency dependent on water content. The probe output period average (PA) ranges from about 14 microseconds with rods in air to about 42 µs with the rods completely immersed in tap water. A calibration function, which needs to be monotonically increasing, converts output period (PA) to volumetric water content (θ FDR ) predicted by the sensor. The probe output period (PA) was measured by a special CS616 data logger instruction and stored in the loggers' memory as raw data.
Campbell Ltd. [20] provides standard manufacturers' calibration functions, valid for all soils with a bulk electrical conductivity < 500 µS/cm and a soil bulk density < 1550 kg m −3 : where θ FDR being the measured volumetric water content expressed as a fraction (m 3 m −3 ), PA the period average (in µs) as uncorrected sensor output. The manufacturer reports an error in measured θ FDR caused by the temperature dependency of the CS616 sensor, and provides the following correction equation to be applied to the uncorrected sensor output: where PA c is the temperature corrected period (in µs), ST the soil temperature ( • C) at the time and depth of the PA measurement. The coefficients of the standard linear (LM) and quadratic model (QM) given in (1) and (2) are also applied when using PA c instead of PA. Soil temperature was measured using BetaTherm 100K6A Thermistor (model T107, Campbell) with a measurement range between −35 • C and +50 • C and with an interchangeability error of <±0.2 • C within the 0-60 • C range. According to the manufacturer and in a "worst case" situation, all errors add to an accuracy of ±0.4 • C within the range −24 • C to 48 • C and maximally ±0.9 • C within the range −38 • C to 53 • C.
Three dedicated profile pits were dug in spring 2010 and soil material was layerwise temporarily on-site stored till the pit was ready to be closed again, layerwise with exactly the same soil material. Prior to installation of the sensors, a soil profile description of the pedogenetic horizons was performed and soil core samples were taken at the fixed depth intervals 0-10, 10-20, 20-40 and 40-80 cm, yielding the data in Table 2.
CS616 sensors were carefully introduced horizontally in the wall of these soil pits ( Figure 1). A fork hole with a dummy and level was applied to assure good contact of the rods with the soil following the guidelines of [13]. Through the horizontal installation of the 30 cm rods in order to determine vertical water fluxes, soil water content is 'averaged' over at least this length at the specific depth reducing small scale soil variation and capturing possible vertical preferential flow. Ruelle et al. [13] state that a horizontal buried probe at a particular depth will give the mean soil water content at that depth ± approximately 1.5 cm. Sensors were inserted at 4 depth intervals: 0-5 cm, 5-20 cm, 20-40 cm and 40-80 cm maximally overlapping the sampling depths of the long-term Level II soil monitoring program ( Table 2) in order to relate physico-chemical data with time series of soil water dynamics for each layer.
rods with the soil following the guidelines of [13]. Through the horizontal installation of the 30 cm rods in order to determine vertical water fluxes, soil water content is 'averaged' over at least this length at the specific depth reducing small scale soil variation and capturing possible vertical preferential flow. Ruelle et al. [13] state that a horizontal buried probe at a particular depth will give the mean soil water content at that depth ± approximately 1.5 cm. Sensors were inserted at 4 depth intervals: 0-5 cm, 5-20 cm, 20-40 cm and 40-80 cm maximally overlapping the sampling depths of the long-term Level II soil monitoring program (Table 2) in order to relate physico-chemical data with time series of soil water dynamics for each layer. At each site, the CR1000 datalogger was situated in the middle of the triangle formed by the profile pits A, B, and C ( Figure 1), limiting the cable length between CS616 probe and logger to maximum 10 m. The T107 temperature sensors were installed at Pit A only (at all four depths) and at least laterally 10 cm away from the CS616 probe to avoid interference. The soil temperature readings of pit A were applied for possible temperature correction in the other pits. After installation of all sensors, the wiring holes were carefully covered with soil and the profile refilled layerwise. All sensors have continued operation since 2010, whereas some of the central CR1000 data loggers required periodical revision or replacement.
The accuracy specifications mentioned by the manufacturer for the θFDR measurement using the CS616 probes is based on laboratory measurements in a variety of soils and over the water content range air-dry to saturated. The soils were typically sandy loam and coarser. The CS616 accuracy is ±0.025 m 3 m −3 using standard calibration At each site, the CR1000 datalogger was situated in the middle of the triangle formed by the profile pits A, B, and C ( Figure 1), limiting the cable length between CS616 probe and logger to maximum 10 m. The T107 temperature sensors were installed at Pit A only (at all four depths) and at least laterally 10 cm away from the CS616 probe to avoid interference. The soil temperature readings of pit A were applied for possible temperature correction in the other pits. After installation of all sensors, the wiring holes were carefully covered with soil and the profile refilled layerwise. All sensors have continued operation since 2010, whereas some of the central CR1000 data loggers required periodical revision or replacement.
The accuracy specifications mentioned by the manufacturer for the θ FDR measurement using the CS616 probes is based on laboratory measurements in a variety of soils and over the water content range air-dry to saturated. The soils were typically sandy loam and coarser. The CS616 accuracy is ±0.025 m 3 m −3 using standard calibration with bulk electrical conductivity ≤ 500 µS cm −1 and soil bulk density ≤ 1550 kg m −3 in a measurement range from 0 to 0.50 m 3 m −3 . Resolution (i.e., the minimum change in the dielectric permittivity that is reliably detectable) is better than 0.001 m 3 m −3 . Precision or repeatability of the θ FDR measurements in the same soil material is also better than 0.001 m 3 m −3 . Probe-to-probe variability ranges from ±0.005 m 3 m −3 in dry soil to ±0.015 m 3 m −3 in a typical saturated soil [20].

Soil Sampling and Analysis
For in situ calibration of the FDR sensors, monthly sampling during the period March 2015-March 2017 for bulk density and gravimetric soil moisture was performed using an Edelman-and Riverside auger and Kopecky steel cylinders of 100 cc (5.3 cm diameter; 5 cm of height). Since none of the soils are stony nor contain coarse fragments, this method is adequate for both the assessment of bulk density, gravimetric and volumetric water content. These core samples were taken using a closed ring holder (Eijkelkamp, Giesbeek, The Netherlands) loaded with the steel cylinder. The exact sampling depths are listed in Table A2. On the WIJ and GON plots the forest floor was >5 cm thick and FDR sensors were installed in the forest floor (OFH layer) of the B and C profiles. Hence, volumetric sampling was carried out in OFH layers as well.
Immediately after sampling and recording of the ring ID, the Kopecky cylinders were sealed with plastic covers (top and bottom) to prevent moisture loss, stored in an extra plastic bag and transported in a cooled box to the lab. Monthly sampling of the three soil profiles per site (4 sampling depths) was performed clockwise in a circle with a radius of 0.5 m from the reference point located 2 m north of the sensors (Figure 1). The sampled vertical profiles were about 25 cm apart from each-other. So, all core samples were taken between 1 and 3 m from the FDR sensors. This distance was required in order to avoid direct damage or interference with the active FDR and soil temperature sensors (and cabling), or indirectly by disturbing the local soil water distribution or preferential flow over the profile by monthly coring and sample extraction. We hereby assumed that the soil water content in the same soil layer was similar~2 m away from the FDR sensor.
Following sampling, the borehole was carefully refilled and a bamboo stick was left in the last sampled borehole as an indicator for the next sampling. Care was taken not to disturb (compact) the soil too much when sampling the soil, especially on the future sampling locations. The second-year sampling was carried out in a circle with a radius of 1 m from the reference point ( Figure 1, green circle).
The core samples were processed in the laboratory. Their wet/moist mass was determined prior to drying in a ventilated oven at 105 • C until constant mass, which was maximally one week for all soils. Then dry bulk density (ρ bKop ) was determined as oven dry mass per unit volume and gravimetric moisture content quantified. Volumetric water content was computed by multiplying gravimetric moisture content with the ρ bKop of the sample. Sampling characteristics for all 5 sites are listed in Table 3. Overall, 1512 cores were processed to compile ρ bKop and θ v values for the calibration dataset.

Calibration Dataset
For all five sites, a calibration dataset was assembled with the attributes listed in Table 4. Table 4. Attributes of the calibration dataset.

Attribute Format/Unit Description
TimeStamp YYYY-mm-dd hh-mm-ss Nearest FDR recording time to field sampling event (in UTC + 1) Period average raw sensor output for each profile i and depth j Fine earth bulk density of profile i and depth j θ m i, j g kg −1 Soil water content by mass (gravimetric) Soil water content by volume (volumetric) Logger data V, • C Logger diagnostics: record ID, battery voltage, logger temperature After final calibration we added the corresponding recalibrated FDR measurements to the datasets (Downloadable from Supplementary material).

Statistical Evaluation
All statistical processing, evaluations and graphics were performed in the R environment, using R version 4.1.0 [29,30] in R Studio v1.3.1073. For time series manipulation and analysis, we used package xts v0.12.1. Evaluation of models was performed both for their relative and absolute quality. Relative model selection among candidate models was evaluated using the Akaike Information Criterion (AIC) corrected for small sample sizes (AICc) using the AICcmodavg package v. 2.3-1 [31]. The AICc weights for each model are given and the evidence ratios (ER) quantifying the amount of support in favor of a model relative to a competing model. Absolute quality of models was judged using a set of complementary validation indices of their predictive quality by comparing their predicted P i with observed (reference) values O i (Equations (4)-(8)). R 2 p is the prediction coefficient of determination informing the explained variance of the model (4), MPE is the mean prediction error or bias (5), SDPE is the standard deviation of the prediction error or noise (precision) (6), RMPSE is the overall model prediction error or model accuracy (7) and is the general closeness of agreement between the FDR output measurement (θ FDR ) and the reference observations (θ v ). RMSPE is the square root of the mean square prediction error (MSPE). For large n, MSPE equals the sum of MPE 2 (bias) and SDPE 2 (noise) as shown in Equation (8). With effective calibration the MPE (bias) is removed and the precision component of accuracy dominates.
The revised index of agreement (dr) is dimensionless and bounded by [−1, 1] with 1 indicating perfect agreement [32]. Principal component analysis was applied using the R prcomp function, with data centered and scaled. The ggbiplot v.0.55 package was used to generate biplots using ggplot2 in order to plot both the position of each sample in terms of the principal components together with the initial variables mapped as vectors. In order to evaluate the importance of soil variables on the RMSPE, a generalized boosted regression tree model was built using the R package gbm v.2.1.8 with a gaussian distribution and following parameters: n.trees = 10,000, shrinkage = 0.01 and an interaction.depth = 4. The relative influence of each predictor, i.e., the reduction of squared error attributable to each variable, was used to select the most important ones for explaining RMSPE. Possible interaction was evaluated using the Friedman's H-statistic which is on the scale of [0, 1] with higher values indicating larger interaction effects.

Validation of FDR Measurements
The summary statistics of the reference dataset used for calibration are presented in Table A3. On the plots WIJ and GON, the upper FDR sensors were located in the forest floor layers (OFH) of B and C profiles, while in mineral topsoil (Ah layer) for the A profile. Hence, bulk samples for reference θ v were taken in the same OFH layers (n = 47 in WIJ and n = 50 in GON) and the remaining in M01 layer of the A profiles. Local spatial variation in OFH thickness caused some reference samples to be more mineral than organic and vice versa. Therefore, we considered reference samples having a ρ bKop < 850 kg m −3 (threshold derived from [29]) as part of OFH layers, and above this threshold as M01. On RAV and HOE, two reference Kopecky samples qualified as OFH samples, whereas none in BRA topsoil samples.
Over the total calibration period raw FDR sensor output ranges from 20.5 to 34.1 µs in the OFH layers, and from 17 to 41.9 µs in mineral layers, covering the full range reported by the manufacturer [20]. Note that on the plots WIJ and GON, neither significant differences in period means nor ranges were found between the FDRs in the OFH and those in the mineral layer (M01) for the same site. However, the observed mean period was lower on the sandy site WIJ compared to the heavier textured GON site (Table 1). Overall, the PA values are lower in the sandy soils (WIJ, RAV and BRA) than in the silty and clayey soils of GON and HOE. Within sites, the mean PA values differ significantly among depth layers as shown by the 95% confidence interval of the means, which can be attributed both to specific soil layer properties and a possible sensor effect.
Each sites' soil temperature was sensed in the A profile only. Mean annual soil temperature was roughly above 10 • C on the sandy sites, and slightly lower on the heavier textured soils, but this difference is neither significant between sites nor between soil layers within the sites for the reference dataset. However, the range is wider in the topsoil layers, especially in the forest floor (OFH) which may warm up to 25 • C. In subsoil layers, the amplitude of the temperature fluctuations is reduced due to the well-known damping effect. Soil bulk density, known for its significant effect on dielectric permittivity, is significantly smaller in the OFH layers compared to the mineral layers (M01-M48). In the mineral soil profile of each site, ρ b generally increases with depth. Lowest ρ b values are found in GON and highest in HOE (Retisol), but also elevated values in the Arenosols (RAV and BRA).
At the WIJ and GON sites, the mean gravimetric soil water content (θ m ) in the OFH layers is more than three times θ m in the mineral soil, clearly showing that forest floor layers can hold more water than their dry mass compared to mineral layers generally storing less than 50% of their mass. On all sites, mean θ m significantly differs by layer with higher mean moisture contents in the 0-10 cm topsoil (M01) due to elevated contents of organic matter. Depending on the depth of the groundwater, perched water table or subsurface flows, mean θ m can rise again in the M48 layer. Volumetric soil water content (θ v ) is greatest in the OFH up to 0.62 m 3 m −3 while maximally 0.52 m 3 m −3 in M01 layers. Again, significant differences in means are found by mineral layers within sites and between sites, justifying the need for multiple moisture sensors. The observed means and ranges of θ v indicate clearly that RAV and BRA are the driest plots (0.05-0.06 m 3 m −3 in dry periods), WIJ can store more soil water in its organic-rich sandy soil compared to HOE in the silt-loam soil, while highest mean θ v values are observed in GON (0.38-0.47 m 3 m −3 ) due to the perched water table above the clayey substrate (Table 1).
For all plots, layer specific graphs are presented in S1 to S5 showing the temporal variation of sampled soil water content in the three profiles, the corresponding FDR response (Period) and their linear relationships. These graphs clearly show variable types of relationships among layers, and within layers (especially topsoil layers) distinct linear relationships between the profiles. For WIJ ( Figure A1 and Figure S1) calibration will be difficult in the topsoil due to high scatter of calibration data (mix of M01 and OFH layers) and limited slope of the linear models. For deeper mineral layers of the WIJ plot with a more limited range in water content, adequate calibration is more feasible, and similar slopes indicate higher correlation between profiles even suggesting that one single calibration function per layer would be feasible. Other hard to calibrate profiles are profile A of layer 1 in RAV and profiles B and C in GON layer 4. Overall, these graphs indicate that layer and profile specific calibration is preferred using reference measurements taken close to the sensors.

Calibration Testing
Based on the calibration dataset, a two-step testing approach was adopted. First, we validated the standard models suggested by the manufacturer by simply relating the measured θ v at a specific moment with the observed output period of the FDR sensor at that moment, for each of the 60 soil layers. In Figure 2a these observed data-pairs are plotted along with the standard function for the M12 layer of profile B on the HOE site. Clearly the standard function underestimates the measured θ v (negative bias of −0.0534 m 3 m −3 , data in Table S5). In the second step, we recalibrated the model in order to remove the bias, and used this recalibrated model for predicting θ FDR using the continuous FDR output period values. Note that with calibration the accuracy improved (RMSPE decreased from 0.057 to 0.019 m 3 m −3 ) and the index of agreement substantially increased from 0.67 to 0.91. Figure 2b illustrates that the monthly measured real soil water content is much better predicted when the recalibrated function is applied.  Table 2).
The standard response remains permanently below field capacity whereas the recalibrated response during winter is situated between θs and θfc as expected. In summertime standard response drops below θpwp in contrast to recalibrated response and the calibration measurements.  Table 2).
The standard response remains permanently below field capacity whereas the recalibrated response during winter is situated between θ s and θ fc as expected. In summertime standard response drops below θ pwp in contrast to recalibrated response and the calibration measurements.
Note that the calibration measurements on the HOE plot started in February when soils were above field capacity, then followed a drying phase until September and a fastrewetting phase in October/November. To improve calibration for the HOE plot, more calibration pairs would be beneficial for θ v between 0.25 and 0.35 m 3 m −3 , roughly situated between May and the end of November.
For each plot, layer and profile, graphs as in Figure 2b were produced to illustrate all calibration results of this study ( Figures A1-A5), with indication of measured θ fc and θ pwp .

Validation of Standard Functions
For all plots, profiles and layers, standard calibration functions (Equations (1) and (2)) were applied and the coefficients of determination (R 2 ) and predictive quality indices are listed in Tables S1-S5.
The standard linear (LM) and quadratic models (QM) generally explained the measured topsoil water content rather poorly, indicated by R 2 as low as 0.005 in OFH layers of the WIJ site and explaining just 10 to 50% of the variance in mineral soil layers. With increasing depth, the standard calibration functions approximate the measured water contents slightly better. Figure 3 displays the positive or negative bias (MPE) and total prediction error (RMSPE) of the standard linear and quadratic functions when predicting the measured θ v .
In general, the quadratic standard model generates a greater bias and total prediction error compared to the linear model. A paired t-test of all profiles and layers reveals that when the standard LM is used the bias is 0.015-0.019 m 3 m −3 smaller compared to the QM for all plots but GON (Table 5). On the latter plot a positive bias (i.e., overestimation of θ v ) was observed attributed to the clayey subsoil layers ( Figure 3). As expected, the random error (SDPE) does not differ significantly between standard LM and QM. The resulting overall prediction error is significantly smaller when using the standard LM, and its index of agreement is significantly greater.
On the WIJ, RAV, BRA, and HOE plots a predominantly negative bias was found, especially in the topsoil: up to 0.16 m 3 m −3 in mineral (M01) and up to 0.32 m 3 m −3 of θ v in OFH layers was underestimated (Tables S1-S5).
For almost all layers the allowable measurement error threshold for θ v of 0.03 m 3 m −3 is exceeded except for two M48 layers in RAV, and two M48 layers in HOE. So, for only 4 out of 60 layers the (linear) standard function predicts θ v according to the required level of accuracy.
The mean total prediction error is greatest for the GON plot (RMSPE = 0.17 m 3 m −3 ) and smallest for the RAV plot (0.07 m 3 m −3 ). The RMSPE is similar for BRA and HOE. The greater overall prediction error on the sites WIJ and GON is mainly due to the large prediction errors of the OFH layers ( Figure 3). If we omit the values for OFH, the mean RMSPE for WIJ becomes 0.09 m 3 m −3 whereas it remains high (0.17 m 3 m −3 ) for GON here due to the high estimation errors in the clayey substrates. In general, the quadratic standard model generates a greater bias and total prediction error compared to the linear model. A paired t-test of all profiles and layers reveals that when the standard LM is used the bias is 0.015-0.019 m 3 m −3 smaller compared to the QM for all plots but GON (Table 5). On the latter plot a positive bias (i.e., overestimation of θv) was observed attributed to the clayey subsoil layers ( Figure 3). As expected, the random error (SDPE) does not differ significantly between standard LM and QM. The resulting overall prediction error is significantly smaller when using the standard LM, and its index of agreement is significantly greater.  Prior to the use of the standard LM and QM, a temperature correction (TC) of the PA signal was performed using Equation (3), and compared to the same model without TC (Table 6). Results show that TC reduces the MPE (bias) significantly for both model types, but has a small to insignificant impact on the precision error, except for WIJ.
For the WIJ plot, overall accuracy as indicated by RMSPE and index of agreement improved significantly, but for the other plots the improvement was negligible. Hence, temperature correction for the standard functions may help in reducing prediction bias. When evaluating the overall RMSPE of standard models with temperature correction, only the predicted θ v of the M48 layers in RAV was not exceeding the allowable error of 0.03 m 3 m −3 , so for only 2 out of 60 layers water content was predicted accurately.

Validation of In Situ Recalibrated Functions
For each layer four candidate recalibration models (LM and QM, with and without temperature correction, denoted LMTC and QMTC), are compared relatively to one another and the best and second-best models were selected based on lowest AICc value (Table A4). For 32 out of 60 layers, the simple LM proved to be the most parsimonious model and was supported over the LMTC or QM. Using the AICc, a LMTC was selected as the best model for 23% of all layers, more frequently than the QM (13% of all layers) and QMTC (10% of all layers). Evidence ratio's (ER) between the best and second-best model ranged from 1.01 (equally well) to 22.8 times better. Generally, the best model in the topsoil had no temperature correction which is remarkable for the layer where variation in soil temperature is greatest.
By definition, recalibration of the standard functions eliminates bias so that overall prediction error is mainly determined by the SDPE (precision) composed of errors attributed to (short distance) spatial variation, sampling errors and random measurement error. Figure 4 shows the calibration gain of the overall prediction error for the best standard (i.e., with highest d r ) and recalibrated functions for all plots.
Hence, compared to the standard functions, recalibration lowered the overall prediction error with a factor 1.8 (RAV) to 2.8 (GON). When comparing plots, the mean RMSPE is greatest at the GON plot, still double the allowable error of 0.03 m 3 m −3 while on HOE the mean RMSPE equals the threshold while the means of the other plots are in between. For GON, not a single layer meets the threshold, whereas in HOE only three layers are exceeding the threshold when using the best recalibrated model.
Apparently, accurate prediction of θ v is most challenging for topsoil layers, whether they are organic (OFH) or mineral (M01) or most likely organo-mineral, with organic contents and bulk densities in between pure organic and mineral materials ( Figure A1). However, some subsoil layers, such as M48 in GON show high prediction errors. From Figure A4 it is clear that the recalibrated response for the B and C profiles in the M48 layer shows almost no variation in contrast to highly variable gravimetric measurements.
We carefully checked whether all quadratic calibration models (QM and QMTC) selected as best models based on AICc (Table A4) were monotonically increasing functions (i.e., increasing output period should physically relate to higher soil water content). Five functions (WIJ-B-M12, RAV-C-M12, GON-B-M12, GON-C-M12, and GON-C-M48) did not meet this requirement. For example, in the GON-C-M12 layer, the QMTC calibration function increased until a temperature corrected period of 31 µs and decreased afterwards ( Figure 5). We replaced this function by the LM (selected as second best by AICc in Table A4), hereby explaining even less variance (27% to 12%) but with a more plausible calibration function, though a greater prediction error.
For the heavy-clay subsoil M48 layer in GON C profile, no adequate monotonic function could be constructed, meaning that no calibration of that FDR sensor in a layer with~47% clay was feasible with the given calibration dataset ( Figure A4).

Contributors to Bias
The principal component analysis (PCA) biplot in Figure 6 is based on the estimated bias (MPE) of the standard LM along with the soil properties of each layer of the 5 sites (Tables 1 and 2). The first 2 principal components (PC1 and PC2) explain 65% of the variance, PC3 and PC4 add 12 and 9% and these are mainly determined by EC and pH-H2O, respectively. The factor loadings of PC1 are dominated positively by ρbKop, and negatively by CEC, θs, TOC and TN. Bias is mainly contributing to PC2, strongly associated with Clay content and soil depth, all having negative factor loadings. Indeed, in topsoils (M01 and OFH) we mostly found a negative bias (underestimation of real θv) while a positive bias was generally found in subsoils, especially when they are clay rich. The impact of depth justifies separated calibration by depth layer.

Contributors to Bias
The principal component analysis (PCA) biplot in Figure 6 is based on the estimated bias (MPE) of the standard LM along with the soil properties of each layer of the 5 sites (Tables 1 and 2). The first 2 principal components (PC1 and PC2) explain 65% of the variance, PC3 and PC4 add 12 and 9% and these are mainly determined by EC and pH-H 2 O, respectively. The factor loadings of PC1 are dominated positively by ρ bKop , and negatively by CEC, θ s , TOC and TN. Bias is mainly contributing to PC2, strongly associated with Clay content and soil depth, all having negative factor loadings. Indeed, in topsoils (M01 and OFH) we mostly found a negative bias (underestimation of real θ v ) while a positive bias was generally found in subsoils, especially when they are clay rich. The impact of depth justifies separated calibration by depth layer.

Contributors to Overall Error
When the bias is eliminated by in situ calibration the overall prediction error (RMSPE) is dominated by the precision error. In Figure 7 the PCA clearly indicates that on PC1 the RMSPE is closely and positively associated with TOC and TN and negatively

Contributors to Overall Error
When the bias is eliminated by in situ calibration the overall prediction error (RMSPE) is dominated by the precision error. In Figure 7 the PCA clearly indicates that on PC1 the RMSPE is closely and positively associated with TOC and TN and negatively correlated with ρ bKop and Depth. The lower the ρ bKop , the higher RMSPE, and this is extremely visible in the OFH layers, which obviously have a high TOC and TN content. PC2, explaining still 22% of the variation, is mainly determined by texture fractions Clay, Silt and Sand, but is rather uncorrelated with RMSPE. Conversely, Clay and Silt are associated with greater values for θ fc and θ pwp .

Contributors to Overall Error
When the bias is eliminated by in situ calibration the overall prediction error (RMSPE) is dominated by the precision error. In Figure 7 the PCA clearly indicates that on PC1 the RMSPE is closely and positively associated with TOC and TN and negatively correlated with ρbKop and Depth. The lower the ρbKop, the higher RMSPE, and this is extremely visible in the OFH layers, which obviously have a high TOC and TN content. PC2, explaining still 22% of the variation, is mainly determined by texture fractions Clay, Silt and Sand, but is rather uncorrelated with RMSPE. Conversely, Clay and Silt are associated with greater values for Ɵfc and Ɵpwp.  The PCA clearly shows that BRA and HOE are more homogenous sites in terms of soil properties as indicated by smaller ellipses, and independently from the fact that they are texturally very different from each-other, have low overall prediction errors. Conversely, WIJ and GON have high variation in soil properties and thick OFH layers leading to high overall RMSPE especially for the organic layers low in ρ b .
When we build a generalized boosted regression tree model to predict RMSPE based on the layer specific soil properties (Tables 1 and 2), the relative influence of the predictors is given in Table 7. Two predictor sets were applied, Set 1 using ρ bKop (Table A3), and Set 2 using ρ b as listed in Table 1.
Upon using the first set, bulk density in the sampling rings is clearly the major predictor and has higher influence than depth, θ s , clay fraction, and all other soil variables. When replacing ρ bKop by ρ b , relative influence decreases and the saturated soil water content takes over as the major predictor. Since θ s is inversely correlated with ρ bKop (Figure 7) this is no surprise. Samples with greater θ s and consequently lower ρ b have more macropores, allowing air to enter and influencing the dielectric permittivity and PA output. Two-way interactions between predictors as indicated by Friedman's H-statistic are all below H = 0.11 (between depth and θ r ) and therefore negligible.

Improving the Prediction Errors in Organo-Mineral Topsoil Layers
The statistical evaluations clearly indicate that accurate prediction of θ FDR is most challenging for the topsoil layers due to high natural variability of ρ b and organic carbon in the organo-mineral layers.
For the plots WIJ and GON where the FDR sensors are installed in the OFH layers (B and C profiles), accurate calibration is jeopardized by the high variability of the reference samples taken each month. This can be seen from the scatterplot of the WIJ site in Figure 8a. With high PA values (PA > 31 µs) of the sensor, θ v may range from 0.31 to 0.73 m 3 m −3 and ρ b from 300 kg m −3 (pure organic matrix) to 1310 kg m −3 (mineral matrix). It is also clear that the lower the ρ b (and hence more organic the reference sample), the more soil moisture (θ v ) may be stored, often more than 0.5 m 3 m −3 compared to mineral samples (ρ b > 850 kg m −3 ) holding less than 0.5 m 3 m −3 (see θ s values in Table 2). Based on AICc criterion, QM, LMTC nor QMTC models performed better than the LM. However, when varying different thresholds for ρb, the best linear calibration model was found for organo-mineral samples with a ρb < 600 kg m −3 , and parameters: ƟFDR = 0.39962 + 0.008767 PA, with R² = 0.276, p = 0.005 and RMSPE = 0.0417 m 3 m −3 . The selected samples (n = 24) and calibration function are indicated in Figure 8a. Only one-third of the total number of calibration samples was selected, but this number resembles the 25 samples used for calibration of each individual profile.
Note that the overall prediction error lowered substantially from 0.1126 to 0.0417 m 3 m −3 . The resulting recalibrated response for the B and C profile can be seen in Figure A1. The moisture content remains high in the OFH layers compared to the A profile or underlying mineral layers of that plot.
The situation is somewhat different for the GON site ( Figure Figure 8b. Note that this calibration function is not performing better than the originally recalibrated one for profile B (RMSPE was 0.0553 m 3 m −3 while the revised function yields 0.0591 m 3 m −3 but better for the OFH layer in profile C that had a RMSPE of 0.0797 m 3 m −3 ). Since the sensor is effectively located in organo-mineral layers, we replaced both by the new function (Table A5). The resulting time series for GON is presented in Figure A4. Here the seasonality of the moisture content of the OFH is much more pronounced than in WIJ, with expected higher moisture levels than in the M01 layer. Note that the Ɵfc was estimated for mineral samples, not for the OFH layer. Selecting from each WIJ profile (max 25 calibration pairs) only the organo-mineral samples by their ρ b value (i.e., ρ b < 850 kg m −3 ) yields too few samples for adequate calibration per profile. Therefore, we pooled all topsoil samples from the three profiles of the WIJ plot (n = 75 samples). When calibrating a LM the slope of the model was not significantly different from 0 (Figure 8a). When subselecting organo-mineral samples with ρ b < 850 kg m −3 , a LM based on n = 47 paired observations was calibrated: θ FDR = 0.320295 + 0.00971 PA, with R 2 = 0.119, p = 0.0099 and RMSPE = 0.0745 m 3 m −3 .
Based on AICc criterion, QM, LMTC nor QMTC models performed better than the LM. However, when varying different thresholds for ρ b , the best linear calibration model was found for organo-mineral samples with a ρ b < 600 kg m −3 , and parameters: θ FDR = 0.39962 + 0.008767 PA, with R 2 = 0.276, p = 0.005 and RMSPE = 0.0417 m 3 m −3 . The selected samples (n = 24) and calibration function are indicated in Figure 8a. Only one-third of the total number of calibration samples was selected, but this number resembles the 25 samples used for calibration of each individual profile.
Note that the overall prediction error lowered substantially from 0.1126 to 0.0417 m 3 m −3 . The resulting recalibrated response for the B and C profile can be seen in Figure A1. The moisture content remains high in the OFH layers compared to the A profile or underlying mineral layers of that plot.
The situation is somewhat different for the GON site ( Figure Figure 8b. Note that this calibration function is not performing better than the originally recalibrated one for profile B (RMSPE was 0.0553 m 3 m −3 while the revised function yields 0.0591 m 3 m −3 but better for the OFH layer in profile C that had a RMSPE of 0.0797 m 3 m −3 ). Since the sensor is effectively located in organo-mineral layers, we replaced both by the new function (Table A5). The resulting time series for GON is presented in Figure A4. Here the seasonality of the moisture content of the OFH is much more pronounced than in WIJ, with expected higher moisture levels than in the M01 layer. Note that the θ fc was estimated for mineral samples, not for the OFH layer.

Requirements for Optimal Calibration
The practical question is raised how many field samples are required for adequate calibration of the FDR sensors. Figure 9 shows a simulation for the HOE site with the evolution of the overall prediction error (RMSPE in black), the bias (MPE in blue) and precision (SDPE in red) relative to an increasing number of calibration months, beginning from a three months period. Starting when soils are near field capacity (in February/March) extra calibration data is gained during spring and summer months when soils are gradually drying out. This substantially decreases the bias until August (i.e., eight months). Since the bias is eliminated by then, the overall error is converging with the precision error during this first calibration period.
We considered the bias as sufficiently eliminated when the MPE is continuously below a 0.01 m 3 m −3 threshold (i.e., the maximal error of the gravimetric reference measurement). The number of months required for each plot, profile and layer is given in Table 8. It is clear that for all plots a longer calibration period is needed to remove the bias in the topsoil (19−24 monthly observations) compared to subsoil layers, although some special layers also require almost two years of calibration.
The precision error starts to decline substantially from 9 till 14 observation months, when soils are rewetting until field capacity in March, but raises again during the drying phase until driest conditions in August/September and declines again in the rewetting phase afterwards. For the M12 layers of the HOE site, the precision (and overall) error drops below the 0.03 m 3 m −3 threshold after more than 22 observation months, but for the M01 layers in HOE and on the other sites much more than 25 observation months are required or will exceed a reasonable calibration period (see RMSPE Table A5). In that case, due to the high precision error (noise), reducing the prediction error will only be possible by increasing the number of sensors for these specific layers.   /OFH 21  20  3  19  23  21  19  23  17  20  22  24  21  21  22  M12  11  13  6  17  20  11  17  20  12  17  17  20  13  11  21  M24  12  19  8  10  19  10  10  19  22  14  20  20  8  5  22  M48  10  6  19  8  12  9  10  3  22  14  20  23  7  9  8 For those layers where the sensor's precision error after calibration still exceeds the   21  20  3  19  23  21  19  23  17  20  22  24  21  21  22  M12  11  13  6  17  20  11  17  20  12  17  17  20  13  11  21  M24  12  19  8  10  19  10  10  19  22  14  20  20  8  5  22  M48  10  6  19  8  12  9  10  3  22  14  20  23  7  9  8 For those layers where the sensor's precision error after calibration still exceeds the threshold of 0.03 m 3 m −3 , and assuming that the bias is near-zero and the observed precision error is of the same magnitude for other individual CS616 sensors used under similar (soil) conditions, the average measurement error can be reduced by the square root of the number of sensors installed in parallel in the given layer, following the classical central limit theorem.
For each layer, we therefore estimated the number of sensors required to reduce the maximum SDPE found in the three profiles of that layer (data from three individual sensors) to a measurement error below 0.03 m 3 m −3 . So, the most difficult calibration environment per layer will determine the number of sensors needed for averaging soil moisture measurements of that layer at the plot level (Table 9).  Table 9 suggests that the current number of three sensors is sufficient for all layers of the sites BRA and HOE. However, more sensors are required for the topsoils up to 20 cm of depth for WIJ and RAV, and also for the subsoil layers in GON, with up to 10 sensors in the clayey substrate if the required measurement accuracy has to be met.

Final Calibration Functions
The best performing calibrated functions with eliminated bias for each site and layer are presented in Table A5. These functions are applied to the respective FDR sensors of the LII plots to construct the θ FDR time series and may be considered the end-product of this study. For the 60 layers examined, a linear model (LM) was the best choice for 58% of these layers and 27% of them a LM with temperature correction. Quadratic models seemed more appropriate for 10% (QM) and 5% (QMTC) of all layers. The resulting time series of all calibrated models for the 2015-2017 period is given in Figures A1-A5.
On average, calibration functions explained 56% (median 61%) of the total variance of the calibration dataset with a maximum of 96%, but also with no variance explained at all (R 2 = 0.003) in the special case of extremely clayey substrates. The RMSPE ranges from 0.012 to 0.117 m 3 m −3 with a mean of 0.043 m 3 m −3 and median 0.041 m 3 m −3 . Hence, regardless of specific soil conditions, measurements of one sensor per layer will not suffice to predict θ v better than the 0.03 m 3 m −3 accuracy threshold.

Attainable Level of Predictive Accuracy
One of the fundamental questions of this study is the level of predictive accuracy, as indicated by the RMSPE, one can achieve using the CS616 sensor for θ v measurement in various soil types compared to reference gravimetric sampling. Hignett et al. [11] estimated the RMSPE of the reference method to maximally 0.01 m 3 m −3 . The user manual [20] of the CS616 mentions that the factory standard calibration is accurate to ±0.025 m 3 m -3 in the θ v range of 0−0.50 m 3 m -3 valid for soils with low bulk electrical conductivity ≤500 µS cm −1 and bulk density less than 1550 kg m −3 . This conforms with all forest soil layers under study.
In Table 10 we compiled RMSPE values obtained with standard and revised functions in various studies for coarse to fine textured soils and organic layers using the CS616 sensors. From all mentioned studies it is clear that recalibration or correction for the specific soils significantly reduces the overall prediction error. The standard function provided by the manufacturer [20] generally performs best for sandy samples [1,10], but as clay content increases, larger prediction errors are found and soil-specific recalibration is absolutely required [1,3,4,6,7,9,11]. Most recalibrations or corrections are performed in the laboratory, since field-based calibration is challenging to perform for many users [10]. The RMSPE quality criteria set by [4] for a mean bias ± 0.02 m 3 m -3 and RMSE < 0.035 m 3 m -3 or a RMSE < 0.30 m 3 m -3 for forest soils by [18] are rarely met, even after (re)calibration or correction.
Furthermore, revised calibration functions developed under controlled but limited laboratory conditions, are not always appropriate for field conditions [1,5]. In 2008, an international group of soil water instrumentation experts came to the conclusion that "all soil moisture sensors must be field calibrated in order to obtain reasonable accuracy" except conventional time domain reflectometry (TDR), which is accurate to ±0.02 m 3 m -3 in most soils [11].
Most studies mentioned in Table 10 collected (large) volumetric field samples in containers and transported them to the lab in order to test a number of moisture sensors and compare the results using standard and revised functions. For practical reasons they calibrate their sensors during a drying phase [3] under controlled temperature and humidity conditions. In our in situ calibration approach we sampled over two years two drying phases and two rewetting phases (Figure 2b), accounting also for the real soil temperature variations associated with the moment and depth of sampling.
On the other hand, since gravimetric soil sampling is destructive and we cannot sample close (i.e., less than 1 m) to the sensors to avoid interference, we add spatial and sampling errors to our validation measurements. This set-up requirement is the major limitation of our in situ calibration approach, certainly since many forest soils tend to have a high small-scale variability of moisture influencing soil parameters. Most reference studies reported that the factory-based standard function underestimates the real moisture content, but under-or overestimation seems dependent on the soil texture. However, Varble and Chávez [4] found in the laboratory that the factory calibrated CS616 sensors overestimated θ v by an average bias of 0.10 m 3 m −3 in sandy clay loam, 0.03 m 3 m −3 in loamy sand, and 0.24 m 3 m −3 in clay loam. This is contrasting to [1] where negative biases and much lower values were found for the same texture classes. In our study we found that standard functions generally underestimated sand layers with MPE = −0.042 m 3 m −3 and loamy sand with −0.053 m 3 m −3 which is greater than these studies, but this might be related to our in situ sampling.
Dong et al. [1] corrected the standard functions with correction equations but the improvement was quite small: 0.011 m 3 m -3 for sand to 0.002 m 3 m -3 in loamy sand and 0.027 m 3 m -3 in sandy clay loam. We believe that recalibration of the models is more effective than applying correction equations on poorly performing factory-based functions, and that only calibrations using in situ measurements can lower the effective bias substantially.
The standard functions (Equations (1) and (2)) provided by Campbell are found inadequate for clay and clay loam soils (Table 10), often overestimating θ v. Varble and Chávez [4] reported the average RMSE of the clay loam samples (n = 65) as 0.289 m 3 m −3 compared to 0.21 m 3 m −3 for clays in our study. Using laboratory-based recalibration, they succeeded to eliminate bias and reduce the resulting RMSE below 0.021 m 3 m −3 for the clay loam soil, an error reduction of 93%. In our study RMSE after in situ recalibration is still 0.076 m 3 m −3 for clay samples, an error reduction of 63% (Table 10).
Veldkamp and O'Brien [7] tested heavy clay soils (60−80% clay) and found standard calibrated CS616 sensors to underestimate water content up to 0.15 m 3 m −3 . This is in contrast with our findings that clay soils are overestimated with the standard function, but clay mineralogy clearly differed between our studies. Instead of recalibrating the models for the clayey soils [7] developed a 3-phase mixing model for topsoil and subsoil separately and reduced RMSE for these clay soils to maximum 0.012 m 3 m −3 . (error reduction of 93%). This 3-phase mixing model, using a physical basis for the derivation of the calibration function, might be a good alternative for recalibration functions, especially for soils rich in clay and organic matter.
Vaz et al. [10] evaluated the CS616 among eight commercially available electromagnetic water content sensors in seven well-characterized soils (Table 10). They merely compared the sensor types and did not optimize the calibration functions. Instead, they compared the standard QM (Equation (2)) with a calibration function developed by [15], which performed much better for most of their soil samples with error reductions between 31% and 73%.
Logsdon [3] who compared laboratory and field calibrations of the CS616 sensor found that "field calibrations would be recommended over laboratory calibrations for the CS616 sensor, at least if they are used for field monitoring" and [11] recommended to develop a procedure to routinely calibrate each new sensor or method, preferably in the field and for each distinct soil horizon where it is to be used. They argue that this process will not only produce a more accurate, site-specific calibration, but will also help identify problems with installation, measurement and technique. We fully agree with this conclusion.

FDR Response to Soil Properties
It is well known that the FDR response is affected by soil bulk density, clay and organic matter content, voids and also salts (electrical conductivity). Therefore, we believe it is best that each sensor is calibrated in situ for the specific soil horizon it is located in, since the unique combination of all these factors in each horizon is hard to simulate in the lab, let alone their interactions.
Soil bulk density (ρ b ) is one of the most spatially variable soil properties, especially in the topsoil of forest soils [33]. Moreover, if ρ b is not determined from the same sample as the mass basis water content, there will be an error in the calculation of θ v [11] and this may seriously affect the quality of any calibration or validation dataset.
Bulk density in forest soils is strongly and inversely related with organic matter content, which is its best predictor. The more organic the soil, the lower ρ b , the higher the meso-and macroporosity and the more water can be stored in the soil. In fine textured soils, such as clays, their volume will change as they dry, so their ρ b may not be a constant and the relationship between FDR response and ρ b may not be constant as well. Besides, if the soils shrink, this might change the contact of the rods with the soil which will affect measurement accuracy.
The response of the CS616 to changing water content has been shown to change for some soils when bulk density exceeds 1500 kg m −3 , which is seldom the case for the studied soils. This response to changing water content is still well behaved, but the slope will decrease with increasing bulk density [20]. Most laboratory calibrations in small or larger containers try to mimic the average soil ρ b of the field. Since ρ b is strongly determining the RMSPE as we found in our study, lab conditions will presumably underestimate real intrinsic variation in field conditions and thus underestimate real RMSPE.
A big advantage of the CS616 is its larger measurement volume-about 3740 cm 3 compared to other EM sensors mostly below 1000 cm 3 [10]-which results in less spatial variability than probes with a smaller measurement volume [3]. We expect effects from ρ b and other soil properties to be averaged over the 30 cm long rods.
The amount of organic matter and clay in a soil can alter the response of dielectricdependent methods to changes in water content. This is apparent when mechanistic models are used to describe this measurement methodology, but less clear by using empirical calibration models [11]. Standard calibration coefficients are given for the CS616 for mineral soils with clay content less than 30% [20]. Consequently, higher clay contents require specific calibration but this calibration is also influenced by conductivity, compaction (ρ b ) and temperature. Therefore [10] suggested that manufacturers supply calibration relationships for a general group of mineral soils (covering a wide range of textures) and specifically for sandy and clayey soils. Udawatta et al. [15] suggested that future research should focus on the effects of clay mineralogy on FDR response.
We found that fitting calibration functions for topsoil layers and especially for OFH layers as part of the forest floor is quite complex and difficult. This is partly due to the high intrinsic heterogeneity of the organo-mineral topsoils in forests both for the organic matter content and the resulting ρ b.
Since most studies on FDR are conducted in arable soils, we found no papers dealing with FDR measurements in forest floor layers. For TDR, Schaap et al. [34] developed an adequate linear calibration function for forest floor media. They report negligible effects of decomposition, residual water and temperature on the calibration parameters, but a strong effect of shrinkage of the organic material both influencing θ v and TDR reflection times, which may cause errors up to 0.02 m 3 m −3 if not accounted for. We expect the same effects with FDR instruments. More research of FDR performance in forest floors and organo-mineral soils is definitely needed.
It is clear from Table 10 and Figure A1 that standard functions are inadequate for predicting θ v in forest floors but also in organo-mineral topsoils (Ah layers or M01). Problems start when taking the calibration samples for sensors located in forest floors of variable thickness (5-9 cm in our study) if the steel cylinders are already 5 cm in height. Impossible to avoid sampling of mineral soil material. Similarly, sampling M01 without collecting forest floor material is also tentative. In our study we applied a ρ b threshold to discriminate real forest floor samples from those contaminated with mineral soil, but this is a surrogate solution. Anyway, it helped us to lower RMSPE from 0.22 to 0.05 m 3 m −3 which is surprisingly similar to the RMSPE of 0.06 m 3 m −3 found by [10] for organic layers, using the calibration function of [15]: θ FDR = −0.283 + 0.0182PA + 0.0002PA 2 .
Vaz et al. [10] concluded that for the organic samples, FDR sensor outputs are in general lower than for mineral soils. This is expected due to the low ρ b. and high porosity of organic materials. This statement is confirmed by the increasing slope in Figure 8b compared to all samples (including mineral).
Accurate sensor readings depend on the absence of air gaps between the CS616 rods and soil [13]. Their study showed that 50% of the sensed volume is concentrated in the first 6 mm around and between the rods. So, hence heterogeneity of materials or air gaps should be avoided. However, since forest soils are full of life, bioturbation may cause tunnels or gaps around sensor rods affecting the readings. In addition, cracking or shrinking processes in clayey soils, such as in GON, might have disturbed sensor readings.
Campbell Ltd. [20] indicates that soils exhibiting bulk electrical conductivity (BEC) > 500 µS cm -1 will require a specific calibration. All soils in this study have an EC value well below 500µS/cm (Table 1), so this was no issue in our study.
Metallic soil components such as ironstone, or sesquioxide rich Bs horizons may affect FDR sensor readings [11] but we found no studies confirming that. Similarly, large knowledge gaps exist for possible effects of soil biota (roots, mycorrhiza) affecting FDR measurements, since these are transporting water. What if roots align with or cross the rods of the CS616, which makes sense if these sensors are installed for decades. Kang et al. [35] studied plant root growth effects in horticultural substrates. They found that various root size indicators (root fresh weight, root dry weight, and root water content) all had a negative effect on the estimation of θ v by FDR sensors and that this effect should be considered in research applications.
Several authors recommended considering soil temperature in the calibration process of the CS616, since they found that diurnal fluctuations in soil temperature influenced the sensor readings [3,4].
Varble et al. [4] suggested making sensor readings during periods that the soil temperature is similar (for example, every early in the morning). Our calibration dataset was compiled with observations just before or after noon, so a possible diurnal effect could not be observed in our study. However, the recalibrated functions are applied to diurnal sensor outputs and could therefore induce extra prediction error for data at night. Presumably this possible error will be greater in the topsoil than in the subsoil due to temperature damping with increasing depth.
We found that temperature correction applied in standard functions may significantly reduce bias, but had a limited effect on precision error. So, a possible bias may be introduced due to temperature effects and it is worthwhile to quantify its magnitude, preferably in the field. The manufacturer mentions possible changes in water content due to temperature effects of up to 0.005 m 3 m -3 • C -1 in clayey soils. The magnitude of the temperature sensitivity changes with water content, so the temperature correction assumes that both the water content and temperature do not vary over the length of the probe's rods [20]. With recalibration we found also that models including temperature correction were rarely supported (using AICs) as best models in the topsoil, but rather in (some) subsoil layers. This may suggest that there is a risk of overfitting the calibration function by including temperature correction given the low number of replicates.
A similar conclusion was drawn by [13]. They found that temperature compensation was not satisfactory for most of the layers they calibrated. It may be necessary for soils with high EC (>500 µS/cm) or electrically lossy clays (large surface area and CEC).

Spatio-Temporal Variation of Soil Water Content
We observed a quite high variability among the three gravimetric reference measurements in each layer ( Figures S1-S5), especially in the topsoil. A more stable calibration value would have been obtained by 4 reference measurements around each sensor or 12 measurements per layer. The number of calibration samples per layer is ideally derived from autocorrelation lengths based on spatio-temporal variogram studies of soil moisture measurements in forest soils. However, Huisman et al. [36] state that determination of correlation length from experimental variograms is highly uncertain even in the case of 100+ soil moisture measurements. Instead, they recommend ground penetrating radar measurements to assess spatial water content variation over time, but this technique might have its limitations in forest soils due to the presence of a forest floor and permanent root systems. Vereecken et al. [37] pointed out that many studies have shown that the correlation scale itself varies as a function of soil moisture content and increases as the soil is drying. So, there is a temporal dimension as well to account for. All of these studies indicate that the temporal variation of the correlation length is controlled by an interplay of several forcings, including throughfall, evapotranspiration and structural elements such as topography and soil properties and deterministic modelling using all these factors is difficult. From the three profiles studied per plot ( Figures A1-A5 and Figures S1-S5), we found indications for inhomogeneous soil properties affecting soil moisture measurements and the principal components analysis confirmed that ρ b , TOC and clay content are important controls. However, intrinsic heterogeneity due to micro-topography, drainage ditches, location of trees and ground vegetation patches could also play a role affecting soil moisture measurements. A dedicated multi-annual soil moisture monitoring on a highly instrumented site on one of the most variable plots (e.g., GON) would help to better understand how spatio-temporal variation of soil water content can be explained.

Conclusions
On all sites, in situ recalibration removed the bias of standard functions up to 0.16 m 3 m −3 in topsoils and reduced the overall prediction error by a factor of 1.8 to 2.8. Standard functions performed very poorly in organic and organo-mineral topsoil layers and clayey substrates. Definitely, recalibration is required for each location where an FDR sensor is installed and new layer-specific empirical calibration functions have to be developed for each site in order to meet the data quality objectives (bias less than 0.01 m 3 m −3 and RMSPE < 0.03 m 3 m −3 ).
However, average single FDR sensor accuracy for the studied forest soils is 0.043 m 3 m −3 so for most layers multiple sensors are required to predict average θ v with an accuracy better than 0.03 m 3 m −3 at the plot scale. Operators should be aware of the level of effective field accuracy of their sensors.
Prediction bias is strongly related with clay content and soil depth and therefore there is a need for depth specific empirical calibration functions or mechanistic functions that incorporate clay content and depth variables. Precision error is most positively related to carbon content (organic matter) and negatively with soil bulk density. Especially organo-mineral topsoil layers showing high ρ b variation are hard to calibrate in situ due to imprecise field measurements. The current number of sensors at GON is too low to reduce the total measurement error and this might thwart future application of water balance models for environmental modeling purposes.
For most sites, more than 19 monthly calibrations are required to lower the bias consistently below 0.01 m 3 m −3 , which is in fact a 2 drying and 1 rewetting phase if calibration started when soils are near field capacity. We recommend intensifying the calibration sampling during the growing season, for instance by two-weekly sampling instead of monthly for a better spread of calibration points between field capacity and permanent wilting point.
Finally, we found that two to three FDR sensors per layer for the Flemish ICP Forests Level II study plots are adequate to reach an acceptable measurement accuracy in the homogeneous sand and silt loam sites, but up to 10 to 16 sensors may be required in heterogeneous topsoil or clayey subsoil layers of the other sites to account for their high intrinsic variability, otherwise hydrological modeling will be tentative.
Future research should focus on development of dedicated calibration functions for forest floors and other organic layers followed by an adequate in situ validation.

Acknowledgments:
We especially thank Frederic Vermeiren and Eddy Smesman for the installation and initial calibration/testing of the FDR sensors at the five sites and starting up the monitoring in 2010. Technical support for this study by Koen Willems, Koen Vervaet and Mathieu Pieters for the monthly field sampling and measurements of ρ b and θ v (reference dataset), and by Yvan De Bodt for FDR data logger read-outs and data assembly is greatly appreciated and acknowledged. We also thank the Flemish Agency for Forest and Nature (ANB) for access, use and maintaining the forest sites for long term monitoring and research. Last but not least we want to thank three anonymous reviewers for their constructive comments resulting in an improvement of this manuscript.