Wildfires vegetation recovery through satellite remote sensing and Functional Data Analysis

In recent years wildfires have caused havoc across the world, especially aggravated in certain regions, due to climate change. Remote sensing has become a powerful tool for monitoring fires, as well as for measuring their effects on vegetation over the following years. We aim to explain the dynamics of wildfires' effects on a vegetation index (previously estimated by causal inference through synthetic controls) from pre-wildfire available information (mainly proceeding from satellites). For this purpose, we use regression models from Functional Data Analysis, where wildfire effects are considered functional responses, depending on elapsed time after each wildfire, while pre-wildfire information acts as scalar covariates. Our main findings show that vegetation recovery after wildfires is a slow process, affected by many pre-wildfire conditions, among which the richness and diversity of vegetation is one of the best predictors for the recovery.

critical role for assessing the impact of wildfires and learning to coexist with these events [Moritz et al., 2014]. We use Functional Data Analysis (FDA, Ramsay and Silverman [2005]) to analyze wildfire dynamics from remote sensing data. This work is part of the growing literature on FDA for remote sensing data (see, e.g., Acar-Denizli et al. [2018], Militino et al. [2019] or Sugianto et al. [2019] among others).
Information from Remote Sensing provides a very important temporal component that allows studying and quantifying the dynamical evolution of the effects of wildfires and recoveries over time. Precisely, [Serra-Burriel et al., 2020] use various sources of remote sensing data, combined with synthetic controls for assessing the vegetation impacts of wildfires over time. In this study, we analyze these recoveries processes as functional data. Each observation measures over several years how the vegetation evolves in a specific region that suffered from a large wildfire, and it represents the decrease or loss of vegetation (that will be defined in the next sections) from each wildfire, as a function of time t, starting at the time of the wildfire up until 7 years after the wildfire, showing the recovery of vegetation from these events.
Hence, the aim of this study is to explain the effects of wildfires on vegetation from remote sensing (satellite) images through FDA, as an alternative approach to classical regression methodologies used to study the effects of wildfires. Classical models usually summarize the whole recovery by comparing few periods of time, pre-and post-wildfire [Engel and Abella, 2011]. We take advantage of remote sensing technologies and modern statistical tools to answer questions like the following: i) What are the effects of wildfires on different kinds of environments? ii) Do the wildfire effects evolution depend on the vegetation of the burned area? iii) Can we explain recoveries of vegetation from wildfires using pre-wildfire observable covariates?
This study is focused on medium to large wildfires (≥ 1000 acres, or 404 hectares) in California throughout a time-span of two decades . We explain the recoveries of vegetation from wildfires using pre-wildfire vegetation conditions and other characteristics of the affected lands using FDA. One of the main advantages from this methodology is that we can use the whole recovery process as a function of time.
Previous studies use differences between pre-and post-wildfire occurrence, showing relative difference between values over fixed time periods, or comparisons of few wildfires (e.g. a dozen wildfires [Bright et al., 2019], 3 or 5 years after the event [Casady et al., 2010, Steiner et al., 2020). This results in raster maps of differences between few time periods, gaining insights on the exact locations where vegetation has decreased. However, this approach lacks the temporal nature of the problem, as vegetation changes over time in a continuous manner.
In order to estimate the dynamical causal effects of wildfires, causal inference through synthetic controls was used in [Serra-Burriel et al., 2020].
This methodology comes from the combination of Econometrics and Political Science, and it consists on the estimation of a hypothetical scenario (a counterfactual) with the absence of a wildfire (the intervention). Thus, in the present case, health vegetation indices were estimated in places where there were wildfires, as if the wildfires had not happened, using a Generalized Synthetic Control (GSC) methodology [Xu, 2017]. Then, the wildfire effect was estimated as the difference between the observed indices and the estimated counterfactuals. Usually the size of the wildfire effect decreases over time so we also refer as wildfire recovery to the wildfire effect as a function of time.
We use seasonality adjustment techniques to extract the trend of the wildfire effects estimated in the previous study. Then proceed to regress these effects, measured over time, using Functional Regression Models.
More precisely, we regress functional responses on scalar covariates. This results in estimated coefficients changing over time that provide insights into different questions, as the ones stated above. This paper is structured as follows. First, we introduce the used data and their pre-processing, as well as the algorithms used to obtain the outcomes to be predicted. Next, we explain the methodology that will be used in this study. Then, we show the attained results and summarize the key findings derived from this study. Last, we discuss the potential impact of these results and conclude with final notes.

Data Gathering
The study area of this paper is California over the time span 1996-2016.
There are three main data sources used for this study. First, perimeters from large wildfires (≥ 404 hectares) were obtained from the Monitoring Trends in Burn Severity (MTBS) program [Eidenshink et al., 2007] conducted by the United States Geological Services (USGS). Second, the Normalized Difference Vegetation Index (NDVI) Surface-Reflectances coming from several Landsat satellites was derived and aggregated using Google Earth Engine platform (GEE) over the areas of interest, as well as meteorological conditions over the areas of interest, that were obtained from GridMET [Abatzoglou, 2013] during the observed time span. Third, we use the results from a previous analysis in [Serra-Burriel et al., 2020], where the effects of wildfires were estimated using the above two mentioned data sources. Details on these data sources are expanded below.

Wildfires Data
Perimeters from large wildfires (≥ 404 hectares) that occurred over the considered time span were obtained from MTBS [Eidenshink et al., 2007] program, as it provides a consistent source of wildfire perimeters for this period. Additionally, only perimeters of wildfires that didn't overlap each other over the time period studied have been considered, because the synthetic control methodology used in [Serra-Burriel et al., 2020] is not able to deal with units that experiment more than one intervention (multiple wildfires, in this case). After pruning the wildfires that either occurred too early and thus don't have enough pre-wildfire periods (at least 5 years) to estimate the counterfactual vegetation, and the wildfires that do not have enough follow-up years after the wildfire (at least 7 years), we end up with 243 wildfires. Figure 1 shows the perimeters of the burned areas.
As an example, the upper right corner of Figure 1 shows the perimeter from a 2008 wildfire in the Mendocino County, officially named MEU LIGHTNING COMPLEX (MIDDLE). This fire burned 2087 acres, and the predominant land cover was evergreen forest. We have chosen this wildfire as an example because it corresponds to the modal median for 2008 (the deepest function in 2008 according to the modal depth [Cuevas et al., 2007]) and 2008 was the year with the largest amount of wildfires.
Moreover, several spatial covariates were obtained from MTBS: latitudinal and longitudinal centroid of the polygons, the year that the fire occurred, the month when it started, and the acres or size (in acres) of the burned areas. Lastly, another covariate indicating the average elevation of the burned areas was obtained from the National Elevation Dataset (NED) from the USGS. Table 1 shows a summary of the used covariates in this study.

Satellite Data
The NDVI is one of the Landsat Surface Reflectance Derived Specral Indices (LSR-DSI). For each pixel in a satellite image, it is defined as where Red is the spectral reflectance measurement in the red band of the spectrum (centred near 0.66 µm), and NIR measures the reflectance in the near-infrared band (centred near 0.87 µm). Both, Red and NIR, are codified as 256 grey levels. Therefore the values of NDVI are always between −1 and 1, but in general they are non-negative. Large values of NDVI are associated with high contents of live green vegetation.
For instance, Figure 2 shows the NDVI (in red, averaged over pixels) for the MEU LIGHTNING COMPLEX (MIDDLE) wildfire example. This area was covered mainly by evergreen forest, having large NDVI values before the wildfire (they oscillate around 0.75).
We use the GEE platform to obtain the NDVI for images provided by three Landsat satellites (LT5, LT7 and LO8) masking clouds, shadows and snow pixels and removing pixels from water bodies such as lakes, reservoirs, rivers and creeks, as we already did in a previous study [Serra-Burriel et al., 2020]. The Landsat satellites provide a consistent source of 30m per pixel resolution, with a frequency of 16 days (approximately 26 observations per year). All the pixels within a burned region are aggregated by taking the average of each spectral index. In this way a time series of NDVI values is obtained for each region of interest. Further details can be found in [Serra-Burriel et al., 2020].
Given that our main goal is predicting wildfires effects using prewildfire observable covariates, two additional explanatory variables were created from the spectral indices data. The average and standard deviation of the NDVI for 5 years of pre-wildfire periods were computed for all observations. These two variables work as proxies for the type of vegeta-

GlobCover
Avg NDVI 5 years Average of the NDVI for the 5 years of pre-wildfire periods (averaged over pixels).

LANDSAT
Std NDVI 5 years Standard deviation of the NDVI for the 5 years of pre-wildfire periods (averaged over pixels).

LANDSAT
Burning Index Burning index, a proxy for fire weather hazard, as defined in the NFDRS System (averaged over pixels).

GridMET
Maximum Temperature Maximum Temperature in Kelvin degrees (averaged over pixels).

GridMET
Rain Daily precipitation in mm total (averaged over pixels).
GridMET Solar Radiation Solar Radiation in W/m 2 (averaged over pixels).
GridMET tion, e.g. larger NDVI values usually show forested areas, whereas lower values of the average of NDVI and larger standard deviations (associated with strong cyclical patterns) indicate grasslands or shrublands types of vegetation.
In addition, climatological covariates or weather conditions were obtained using GEE from GridMET [Abatzoglou, 2013]. These were also aggregated on the regions of interest, taking averages over the regions of interest on all the pre-wildfire available periods (from 1990 until the period where each wildfire occurs). This dataset has a resolution of 4km per pixel and contains the maximum and minimum temperature (in Kelvin degrees), precipitation accumulation (in daily milimetres), downward surface shortwave radiation (in W/m 2 ), and burning index from the National Fire Danger Rating System (NFDRS, [Schlobohm and Brain, 2002]).

Effects of Wildfires Data
The main contribution of Serra-Burriel et al.  The extracted trend is shown in dark green. This figure shows 5 previous years and 7 years after the wildfire, as this is the inclusion criteria in this study.
have an impact on seasonal cycles of vegetation in later years. In order to have more conclusive results than the descriptive ones found in Serra-Burriel et al. [2020], statistical models must be proposed and estimated.
A promising possibility is considering regression models with functional response (the estimated wildfire effects as functions of the time elapsed after the wildfire) and explanatory variables such as geographical location, burn severity, size of the burned area, and land cover/vegetation type.
This constitutes the main contribution of the present project.
The wildfire effects over time estimated in Serra-Burriel et al. [2020] for 7 years post-wildfire and for each of the 243 wildfires that meet our inclusion criteria, are the base from which we construct the functional dataset that will be analyzed in this study. We perform one last step to preprocess the data, that is the trend extraction as explained in Section 3.1.

Methods
NDVI time series usually present seasonality, as vegetation changes throughout the seasons of the year. This is especially evident for some types of vegetation, such as grasslands or shrublands. Therefore, we expect postwildfire NDVI time series of both, the observed and the estimated counterfactual vegetation indices, to present seasonal components. These seasonal components will have different amplitudes, since the burned region will present distinct seasonal patterns during the recovery. Therefore, the difference between the burned region NDVI and the counterfactual NDVI will presumably present a changing seasonal pattern.
Note that, when aligning all the timings of the wildfires, the seasonal pattern of each particular wildfire will present a different phase, as the timings throughout the year of wildfires are different: some wildfires occur on summer periods as opposed to the ones that occur during early spring.
Hence, before aligning the recoveries for all wildfires, to conform a unique functional dataset with no mismatches in the phases of seasonality we need to extract the seasonal pattern of each wildfire separately.
In addition, several aspects of the remote sensed data can produce measurement error. Even though pixels that captured clouds were not included at the timing of aggregating multispectral data to measure vegetation, other types of noise could have potentially leaked in the data.
To reduce the amount of noise and extract recoveries of vegetation from wildfires, it is suitable for this analysis to smooth the data.
Therefore, for each time series, we perform a LOESS decomposition, that will simultaneously remove the individual seasonality from the time series, as well as remove noise from the remotely sensed data.

Representation of Data
Once the effects for each wildfire are obtained, we decompose the time series into its structural components. Trend extraction of univariate data is a wide field of study [Alexandrov et al., 2012], where the classical de- where y(t) is the outcome observed y at time t = t0 + j/26 for j ∈ One method commonly used in many fields for time series decomposition is the Seasonal-Trend decomposition procedure using LOESS [Cleveland et al., 1990], that is based on local polynomial fitting. This procedure presents several advantages, such as the flexibility on the trend and seasonal components extraction or the ability to decompose series with missing values.
We use the LOESS implementation from the Python library statsmodels      It can be proven that the principal functions are the eigen-functions corresponding to the largest q eigenvalues of the sampling covariance operator, that is, Tĉ (s, t)gj(s)ds = λjgj(t), for all t ∈ T , j = 1, . . . , q, with λ1 ≥ · · · ≥ λq. Moreover the score of the i-th functional data on the j-th principal function is ψij = T (yi(t) −ȳ(t))gj(t)dt.

Functional Regression Models
Analogous to classical regression models, Functional Regression Models (FRM) regress outcomes based on covariates when using functions as either the outcomes or regressors. Hence, FRM take advantage of the nature of time changing variables, either parametrically or non-parametrically.
To do so, it can use the functional representation of both regressors and/or outcomes.

Function-on-Scalar Regression
In this research we use the function-on-scalar regression methodology (see, e.g., [Ramsay and Silverman, 2005], Kokoszka and Reimherr [2017], or Goldsmith et al. [2015]) as it allows us to understand the relation between the observed outcome over time, with respect to the fixed covariates observed. Let (X, Y ) be a pair of random variables, where Y is functional and X = (X1, . . . , X k ) is a random vector of dimension k. The linear function-on-scalar regression model for Y given X = (xi1, . . . , x ik ) is stated as where Yi(t) is the functional response over time t ∈ T for the observation In order to allow the function-on-scalar regression model to admit richer covariate terms, Scheipl et al. [2015] introduced the functional additive mixed model (where functional covariates are also allowed). As an example, the following equation shows a function-on-scalar additive regression model with terms of different types: where β0(t) is the functional intercept, β1 is constant over time, s2(xi2) is a smooth function of the covariate, β3(t)xi3 is the same kind of covariate-coefficient relation from equation (1), γ4(t, xi4) is a smooth function depending on t and xi4, and finally εi(t) is the i-th error function. Variable selection is less developed for the function-on-scalar additive model than for the linear function-on-scalar model.

Results
In this section we present the main results from this study, showing how the characteristics of the vegetation and land cover previous to the wildfire, as well as the prior weather conditions to the wildfire, affect the vegetation recovery patterns.
We start summarizing the functional dataset containing the 243 wildfire recoveries. Their mean function is represented in Figure 5, jointly with the complete dataset. The mean wildfire effect on NDVI is always negative for the 7 year period after the wildfire, and the absolute value of this negative effect is monotonically decreasing over time, going from −0.0856 at time 0 to −0.0418 seven years later, in terms of lost NDVI points, with a global average of −0.0567. In average, the burned areas are progressively recovering 0.0438 NDVI points after wildfires (approximately 10% of the range of the functional data set values, see Figure 5).
It is also noticeable that, on average, it takes more than 7 years for a com- Next, FPCA has been applied to find the main modes of variation of the studied functional data around the average. Fig 6 shows  The main goal of this study is to quantify the influence that different pre-wildfire conditions (geographical region, climatological conditions, or vegetation types) of the burned areas have on wildfire effects over the subsequent years post-wildfire. In order to achieve this goal, function-onscalar additive models (of the type from equation 2) are fitted using the function pffr from the library refund Goldsmith et al. [2020] in R. The list of potential covariates to be included in this model is given in Table   1.
As far as we know, the variable selection problem for the function-on-  (2). Therefore we have developed a heuristic model building strategy, which we describe below.
To select the way in which we introduce each covariate to the functionon-scalar additive model, five different univariate models have been fitted for each covariate separately. Exceptions were made for three pairs of covariates (longitude and latitude, average and standard deviation of NDVI during 5 years pre-wildfire, and landcover and landcover entropy) that have been included together additively in these 5 single models, because both variables in each pair are jointly summarizing the same characteristic (geographic location, NDVI, and land cover). Table 2 shows the results   (2)), in terms of the percentage of observed variability explained (100 times the adjusted R 2 ).
The columns in Table 2 correspond to different types of models, and the rows to the variable (or to the pair of variables) used as regressors in the models. In each row, the complexity of the models increases from left to right: in the first two models, the terms depend only on the explanatory variable (linearly first, then non-parametrically), while in the other three models it depends on both, the covariate and the time index (in the third column, the term is linear in the covariate and nonparametric in time, the fourth model includes the second and third models terms additively, and finally the fifth model is nonparametric simultaneously in the covariate and the time index). In general, the models including a nonparametric term in the covariates have larger percentages of explained variability (columns 2, 4 and 5, which show an even performance) than those that are linear in the covariates (columns 1 and 3). Additionally, the inclusion of time dependent coefficients β(t) (column 3) does not represent a large improvement with respect to the standard linear term (column 1).
Therefore, for each row, a model has been selected according to a balance between explanatory power and model simplicity: a simpler model is preferred to a more complex one, if the difference in percentage of explained variability is less than 1%. At each row, the selected model is marked in bold.
Observe that the best univariate (or bivariate) fits in Table 2 correspond to the models having average and standard deviation of NDVI for the 5 previous years to the wildfires as covariates (almost 47% of explained variability), followed by those including rain (30%) or maximum temperature (around 28%) as explanatory variables.
Despite we do not delve any further into the results of these simple models (further comments on individual covariates effect on the response will be made below), we are going to build a multiple function-on-scalar additive model. Rather than delving further into the results of these simple models, we are going to build an additive multiple function scalar model, which in turn will provide further insights on the effect of individual covariates on the response.
We then proceed to fit a full model (using again the function pffr in refund), which includes the terms selected in Table 2 in the response, strongly improving the best model included in Table 2 (46.98%). Tables 3 and 4 indicate that all the terms included in the model are highly significant. This fact and the large percentage of explained variability suggest that this is an adequate model.
We describe first the results for the parametric part of the model ( Table 3), which only includes the covariate Landcover (a factor with 4 levels) with constant effects over time. The reference level for this factor is Evergreen forest. Table 3 shows that burned areas having had Grassland/Herbacious as dominant land cover experiment larger decrement in    We move our attention now to non-parametrically estimated terms, using the information contained in Table 4  The estimation of the function β0(t) in model (2) is labeled Intercept(t) in Figure 7 (upper panel). Except for a vertical shift, it is approximately equal to the mean function (see Figure 5). The vertical shift should be equal to the estimated Intercept in Table 3 if there were no factor covariates in the model. In our case, however, this Intercept is referred to the level Evergreen forest of the factor Landcover.
There are three covariates (Avg NDVI 5 years before, Std NDVI 5 years before, and Rain) that contribute with two terms (βj(t)xj and sj (xj) follows that the contribution of the term sj(xj) is much larger than that of the term βj(t)xj for this explanatory variable. The nonparametric term sj(xj) indicates that larger values of NDVI vegetation tend to suffer more from wildfires. For instance, in average, an area with pre-wildfire NDVI value equal to the mean plus one standard deviation loses 0.1 NDVI points more than another area with pre-wildfire NDVI value one standard deviation below the average. For these two fictitious areas, the effect of the term βj(t)xj is to add or subtract, respectively, the estimated coefficient βj(t). Then the area with NDVI values over the mean will have a larger decrease in NDVI the first one and a half yeas, but its recovery will be faster than in the area with previous lower NDVI values.
For Std NDVI 5 years before (standard deviation of NDVI over the 5 years before the wildfire), the relative relevance of the term βj(t)xj is also much smaller than that of the term sj(xj): their ranges are 0.03 and 0.25, respectively. The functional coefficient βj(t), negative for all t, is decreasing the first two years and almost constant from then on (with an approximate value of −0.04 NDVI points). The term sj(xj) in this case is an increasing function on the standard deviation of pre-wildfire NDVI values, indicating that vegetation diversity (large values of Std NDVI 5 years before) is a protecting factor against wildfire effects. Combining both terms, the difference in loss of NDVI points between two areas with values of Std NDVI 5 years before one standard deviation over and below the mean, respectively, for t larger than two years is (βj(t) + s(1)) − (βj(t) + s(1)) ≈ (−0.04 + 0.10) − (0.04 − 0.03) = 0.05.
For t smaller than 2 years, the differences between these two areas are smaller than 0.05 and increasing in t.
For the explanatory variable Rain, the term βj(t)xj is even less important than in the two previous cases (the range of βj(t) is smaller than 0.02 NDVI points, and it is almost constant from two years after the wildfire).
On the other hand, the term sj(xj), that has an approximate range of 0.17, grows rapidly at low values of the variable Rain (smaller than 0.3 times the standard deviation below the mean, approximately) and then it is almost constant or slightly increasing. We conclude that moderate or large precipitations seem to help recover or protect against the wildfire effects.
The remaining 10 explanatory variables contribute to the full additive function-on-scalar model only with a nonparametric term sj(xj) that remains constant over time after the wildfire. The estimations of these terms are represented in Figure 8.

Conclusions and Discussion
The functional regression methodology has shown to be an effective way to study and explain vegetation recovery from wildfires, using pre-wildfire The most important lessons we draw from this model are the following.
In average, the recovery process after a wildfire is slow and takes more than 7 years (the time span used in this study). Each particular wildfire is a combination of a unique set of conditions that alter vegetation and ecosystems in a different manner, and it seems that all of them have an effect on the wildfire recovery process. The main risk conditions for a given area from suffering larger wildfire effects are, in this order, to have a rich and homogeneous vegetation (large and uniform NDVI, dominance of grassland, herbaceous vegetation or evergreen forest as land cover), to present a low precipitation regime, to have a large elevation over the sea level, to have low burning index, to have large maximum temperatures, and to be located in the South or West of California.
The convenience of studying outcomes changing over time, together with the estimation of the effect of several kinds of conditions pre-and post-wildfire, makes functional regression models to be a perfect methodology for this kind of studies. Previous studies use standard multiple regression models to compare absolute values of spectral indices, or comparisons of geolocated rasters such that these can include the spatial component of wildfires. However, giving estimates of the effect of these characteristics on the recovery pattern of vegetation from wildfires will allow environmental scientists and land management entities to study the characteristics that need more preservation.
It is important to notice that this methodology has only been implemented over the recoveries estimated from [Serra-Burriel et al., 2020].
Nevertheless, this could be applied in many other research areas and fields, benefiting from the temporal component that this methodology includes, as everything is observed and measured over time. Expanding the study area to other fire-prone regions around the world, and increasing the timespan observed after wildfires (e.g. 15 years after each fire) would probably allow to observe full recoveries from wildfires. However, this remains outside the scope of this work.
This study tries to close the gap between satellite remote sensing and evaluation of wildfires' effects over time. It must be noted that gathering and pre-processing data, usually coming from different sources, is a crucial and highly sophisticated task when dealing with remote sensing data.
Functional Data Analysis, and functional regression in particular, is an advanced statistical methodology well suited to analyze such rich data sets.