COVID-19 Data Imputation by Multiple Function-on-Function Principal Component Regression

: The aim of this paper is the imputation of missing data of COVID-19 hospitalized and intensive care curves in several Spanish regions. Taking into account that the curves of cases, deceases and recovered people are completely observed, a function-on-function regression model is proposed to estimate the missing values of the functional responses associated with hospitalized and intensive care curves. The estimation of the functional coefﬁcient model in terms of principal components’ regression with the completely observed data provides a prediction equation for the imputation of the unobserved data for the response. An application with data from the ﬁrst wave of COVID-19 in Spain is developed after properly homogenizing, registering and smoothing the data in a common interval so that the observed curves become comparable. Finally, Canonical Correlation Analysis is performed on the functional principal components to interpret the relationship between hospital occupancy rate and illness response variables.


Introduction
The virus SARS-CoV-2 has been the main global concern ever since its start, at the end of 2019 in China. Its rapid propagation has put all areas of society on alert, not only the field of medicine. Nevertheless, a year and half after the beginning of the pandemic, the virus incidence has not seemed to decrease and the number of deaths continues its upward trend throughout the world. To obtain some idea of extremely negative impact of the pandemic, Coronavirus Disease (COVID-19) has caused a total of 2,780,266 deaths over the planet as of 28 March 2021, according to the real-time database developed by Johns Hopkins University [1]. Another crucial topic derived from the illness is the economic crisis which has devastated all countries. For instance, the unemployment rate is up 5.1% in last three months of 2020 in the UK, according to official data.
In order to combat this terrible situation, there is a great need to understand the development of the pandemic. Knowing its behaviour will enable correct decision making to mitigate the spread of the virus and to restore people's daily lives as soon as possible. To do so, the scientific community is focusing all its efforts on developing new techniques, capable of modelling and predicting the evolution of COVID-19. The main variables of interest that gauge how the epidemiological situation stands in a country are the number of positive, recovered and deceased cases. Another important indicator is the number of people who are hospitalized or in intensive care units. From a mathematical perspective, many authors have already attempted to tackle these variables from different statistical perspectives. A new Bayesian indicator is introduced in [2] to forecast the beginning of a new wave. In [3], semi-empirical models based on the logistic map are considered in order to predict the variables in different phases of the pandemic in Spain. Likewise, Ref. [4] apply SIR models to analyse the trend of the disease over the world and, more specifically, in India. These variables are also addressed from the time series design by considering quasi-Poisson regression and two-piece scale mixture normal distribution when there is a lack of symmetry in the error's distribution in [5,6], respectively. Additionally, Ref. [7] make an exhaustive comparison of five deep learning methods to forecast the number of new cases and recovered cases in Italy, Spain, France, China, the USA and Australia. Regarding the role of the environmental conditions in the evolution of the illness, Ref. [8] study whether the number of cases in China is connected with the daily average temperature and relative humidity through a generalized additive model. On this point, Ref. [9] show how the choice of the spatio-temporal model may affect the relationship between the spread of the virus and certain environmental conditions. Information theory metrics are also used to understand how time series associated with the pandemic are interconnected or causally related to each other [10]. In addition, how the incubation period distribution could vary by age and gender is investigated in [11]. On the other hand, a new family of distributions is introduced in [12] to model daily cases and deaths in Egypt and Saudi Arabia.
Taking the nature of the variables of interest into account, an approach based on Functional Data Analysis (FDA) is proposed in the current paper for data imputation. FDA is a modern branch of statistics that aims to analyse the information coming from curves or functions that evolve over time, space or other continuous arguments. Under this definition, it is clear that the number of COVID-19 positive, recovered, deceased, hospitalized and intensive care cases come from the observation of functional variables. FDA is usually applied in many areas of knowledge, such as Biosciences, Environment, Economy, Chemometrics and Electronics. A detailed review of the most important FDA methodologies, applications and computational aspects can be seen in books [13][14][15][16][17]. In this regard, some works have been developed, focused on revealing complex patterns of COVID-19 illness from an FDA viewpoint. Functional Principal Component Analysis (FPCA) and functional time series approaches based on dynamic FPCA are applied in [18] to explain variability and predict COVID-19 confirmed and death cases in the United States. On the other hand, a new Varimax rotation approach to FPCA is introduced in [19] to better interpret the main modes of variability in COVID-19 confirmed cases in the first wave in Spain. Time-varying FDA methods, to model the cumulative COVID-19 curves of cases by pooling data across countries, are applied in [20]. A multivariate FDA approach was also considered for spatio-temporal prediction of COVID-19 mortality counts in Spain [21].
All statistical models require complete and high-quality data to be able to provide accurate predictions, but, unfortunately, neither of these aspects are normally fulfilled during a pandemic. In the first wave of COVID-19 in Spain, a change in the way of recording data in some Autonomous Communities produced incomplete data in hospitalized and intensive care curves. In this paper, a functional linear regression model is proposed for the imputation of these missing curves, so that complete data are available to estimate the predictive models with guarantees. Although there are many works related to the imputation of multivariate data [22,23], there is a lot to be done in the functional framework. A novel approach for multiple imputation based on functional mixed effects models was proposed by [24] in a longitudinal data context. Different solutions to scalar-onfunction regression with missing observations in the response are considered in [25][26][27][28][29]. Additionally, an extension to multiple functional regression imputation that handles both scalar and functional response variables related to EEG data is proposed in [30]. Likewise, different FDA imputation methods under sparse and irregular functional data settings are performed in [31]. The extension of the function-on-function linear regression (FFLR) model [32][33][34] to the case of multiple functional predictors is proposed in this paper to estimate the curves of hospitalized and intensive care people (functional responses) from the curves of confirmed, deceased and recovered cases (functional predictors).
In addition to this introduction, the manuscript scheme consists of a description of the data where the process of the homogenization, registration and smoothing of the sample curves is detailed in order to make them comparable (Section 2). The theoretical framework of multiple function-on-function linear regression and the imputation procedure based on principal components regression appear in Section 3. An application to COVID-19 data in the Spanish Autonomous Communities during the first wave of the pandemic is developed in Section 4. Finally, Section 5 contains a discussion about the results obtained throughout this paper.

Data Homogenization, Registration and Smoothing
Spain is organized administratively in autonomous communities (ACs) or territorial governments that have transferred health affairs. This territorial organization consists of 17 ACs plus two autonomous cities (Ceuta and Melilla) located on the African continent, and which have been excluded from the analyses presented here because they have no exclusive competences in the organization of health care assumed by the Spanish government. The 17 ACs are, in alphabetical order: Andalucía, Aragón, Asturias, Islas Baleares, Islas Canarias, Cantabria, Castilla La Mancha, Castilla León, Cataluña, Extremadura, Galicia, Madrid, Murcia, Navarra, País Vasco, La Rioja and Valencia. The population is highly variable between the different ACs. While Madrid, Catalunya and Andalucía have more than six, seven and eight million inhabitants, respectively (6,663,394, 7,675,217 and 8,414,240), La Rioja has approximately three hundred thousand inhabitants (316,798).
The first wave of the COVID-19 pandemic in Spain occurred between 2 February and 27 April 2020. In those early days of the pandemic, Spanish authorities published daily and accumulated data of the evolution of the pandemic in Spain, based on the information communicated by the different ACs. Specifically, the data, published daily, correspond to the following variables: number of confirmed (positive) cases, hospitalized people, people in intensive care units (ICUs), recovered people and deceased persons. The observed data for some of the ACs can be seen in Figure 1. The problem that arose, and gave rise to this work, is that some ACs (Castilla La Mancha, Castilla León, Madrid and Galicia) modified the recording of the data associated with people in ICU and hospitalized people from a specific day (see Figure 2). The mathematical action against COVID-19 of the Spanish Mathematics Committee (http://matematicas.uclm.es/cemat/covid19/) (accessed on: 26 May 2021) called for the development of a meta-predictor (collaborative prediction) based on the predictions from different models/algorithms, contributed by interested researchers, which builds optimized combinations of them, disaggregated by ACs. Therefore, the imputation of the missing hospitalized and ICU data is fundamental to building forecasting models to provide optimal predictions of the evolution of the pandemic through these variables. In order to solve this problem, a functional regression model is proposed in this paper to estimate the expected form of the missing accumulated data of ICU admissions and hospitalizations from the observed accumulated data of cases, deaths and recoveries.
From this point, the time evolution of COVID-19 cases, deceases, recoveries, hospitalizations and ICU admission will be considered as functional variables that will be denoted as X 1 (t), X 2 (t), X 3 (t), Y 1 (t) and Y 2 (t), respectively (the X-variables will be considered as predictors and the Y-variables as responses in the functional regression models). The observed data are the number of daily cumulative informed values of these five functional variables for the seventeen ACs in Spain from 20 February 2020 to 27 April 2020 (see data source (https://cnecovid.isciii.es/covid19/#documentacion-y-datos) (accessed on: 26 May 2021)). Then, a random sample of curves {(x ij (t), y ik (t)) : i = 1 . . . , 17; j = 1, 2, 3; k = 1, 2} which are observed daily are available.    Before carrying out a functional analysis of the data, it is necessary to complete a data registration, given the absence of uniformity in the publication of the observations. This means that, in the same functional variable, the first day with available data and the number of discrete observations in each AC are different. For example, in Andalucía, the first recorded data of hospitalized persons were from 10 March, in which 32 hospitalized people were registered and, for this variable, there were 49 discrete observations in this AC. On the other hand, in Madrid, the first recorded data of hospitalized persons were from 12 March, in which 1304 hospitalized people were registered, and the number of discrete observations in this AC was 47. However, the curves of positive cases recorded 62 and 63 discrete observations in Andalucía and Madrid, respectively. In addition, the population size of each AC could influence the adjustments of the proposed models, as larger numbers of cases were observed in larger communities, and the different population sizes of the different ACs makes it impossible to compare data between them. In order to avoid both problems, the number of cases per 10,000 inhabitants is considered, and the first observation for each curve corresponds to the day that first exceeds the maximum of the first reported values, discarding the previous ones.
After data homogenization, the period of observation and the number of discrete observations of each functional variable for each AC continue to differ. An important constraint of FDA methods is that all sample curves of a functional variable must be observed in the same domain. Classic solutions to this problem are based on the registration of all curves in a common interval (see [13]). In this paper, we propose registering all curves in the interval [0, 1] by applying the FDA methodologies to the synchronized curves defined by where [T ij−start , T] and [T ik−start , T] are the observed domains for the i−th predictor and the k−th response curves, respectively (i = 1, . . . , 17; j = 1, 2, 3; k = 1, 2). From now on, and by abuse of a notation that helps to simplify the exposition, x ij and y ik will represent the registered curves.

From Discrete Daily Observations to Curves
Although the functional data are sets of curves, their true functional form is unknown and the recorded data are observations of each curve at a finite collection of timepoints. Then, the first step in FDA is to reconstruct the functional form of the curves from the observed discrete data.
There are different approaches to the processing of functional data, among which we can highlight the classic ones based on a basis representation of the curves [13] and the ones based on local-polynomial regression [16]. In this paper, basis expansions of the curves are considered by assuming that each of the functional variables (X 1 , X 2 , X 3 ; Y 1 , Y 2 ) generating the sample curves, are smooth stochastic processes with trajectories in the space In what follows, the basis expansion approach is illustrated for a random sample of a functional variable, defined on a general interval T. In our dataset, this procedure must be performed on each of the five considered functional variables for which the type and dimension of the basis could be different.
Consider a random of sample curves {x i (t) : i = 1, . . . , n; t ∈ T} from a functional variable X with values in L 2 (T), and let us assume that noisy observations x ik are available for each curve at a set of time knots t i1 , t i2 , . . . , t im i ∈ T, that is, Let us also suppose that the sample curves belong to a finite-dimensional space generated by a basis of functions {φ 1 (t), . . . , φ p (t)}. Therefore, each curve of the functional data set admits a basis representation in the form The functional form of each curve is then determined by the vector of its basis coefficients a i = (a i1 , . . . , a ip ) , which can be estimated in different ways, with least squares approximation being the most common method, providing the following estimation: The type of basis must be selected according to the characteristics of the curves in the functional dataset. The most common basis are B-splines and trigonometric functions (see, for example, [13]). The former generates spaces of spline functions, piecewise polynomial functions that are smoothly joined and have good local behaviour. The latter provides suitable spaces for periodic functions. Many other bases were used in practice, such as bases of wavelets which are more appropriate for curves with discontinuities and sharp spikes. An application of wavelet approximation from sample curves of lupus and stress level was developed in [32]. A robust estimation of the mean function, together with a simultaneous confidence band, based on polynomial spline estimation, is developed in [35].
In this paper, a basis of cubic B-splines of dimension ten with equally spaced knots was used to approximate the five samples of curves of COVID-19 from their daily discrete data. Least squares approximation was performed on each curve to estimate the basis coefficients. The cubic regression splines of all curves considered here can be seen in Figure 3.

Functional Linear Regression Imputation with Missing Values in the Response
Motivated by the imputation of the missing curves of COVID-19 hospitalized and intensive care people, a functional linear regression model with functional response and several functional predictors is proposed in this paper. The general formulation of this multiple function-on-function linear regression (MFFLR) model and its estimation in terms of functional principal components regression are summarized in this section.

Multiple Function-on-Function Linear Model
The multiple function-on-function linear model allows for the estimation of a functional response Y from a vector of J functional predictor variables denoted by X = (X 1 , . . . , X J ) . Let us consider a random sample from (X, . , x iJ ) , and let us assume that all functional variables take values on the Hilbert space L 2 (T) of the squared integrable functions on the interval T, with the usual inner product defined by Then, the functional linear model is formulated as where α(t) is the intercept function, β j (s, t) are the J coefficient functions and i (t) are independent functional errors. Model (2) can be written in matrix form as . This expression considers that all functional variables are defined in the same interval T, but this is not a restriction and the model can be easily generalized for different domains in each of the functional variables. The estimation of this model is an ill-posed problem that is usually solved by least-squares penalized approaches and basis expansion of functional parameters and/or sample curves [13]. Some of the basis expansion approaches reduce the model to a multivariate linear model for the matrix of response basis coefficients in terms of the matrix of predictors' basis coefficients. The main problem is that this multivariate model is affected by high multicollinearity, which causes inaccurate estimation of the parameters. Despite the good predictive ability of the model, this fact makes its interpretation more difficult. The most-studied solutions avoid the need for cross-validation to estimate the penalty parameter by reducing the problem to linear regression on uncorrelated predictor variables. Approaches based on functional PCA [36][37][38][39][40][41] and functional Partial Least Squares (PLS) [42][43][44][45][46][47] were widely studied in the literature in the context of different functional regression models.
In this paper, a principal components' regression approach is considered. This can be seen as an extension of the principal components' prediction models developed in [48,49] to predict a functional variable in a future interval of time from its evolution in the past. In the so-called PCP models, the functional response and the functional predictor correspond to the same functional variable, but were observed in different time periods. In the present approach, truncated principal component decompositions of the functional response and the functional predictors turn the functional linear model into a multivariate linear model in terms of a reduced set of response and predictor principal components.

Functional Principal Component Regression
Let us consider the principal component decompositions of both the response and the predictor functional variables, given by where the principal components scores are given by with the weight functions f x j l and f y l being the eigenfunctions of the sample covariance operators of x ij (t) and y i (t), respectively. The principal components scores are centered uncorrelated scalar variables with maximum variance given by the eigenvalues associated with their weight functions: Var(ξ Theoretical and asymptotic properties of FPCA for Hilbert-valued random functions were studied in [50][51][52][53][54]. In the case of a basis expansion for each functional variable (see Equation (1)), each functional PCA is equivalent to the multivariate PCA of the matrix AΨ 1/2 , with A = (a ij ) being the n × p matrix of basis coefficients and Ψ being the p × p matrix of inner products between basis functions, Ψ = (Ψ ij ) =< φ i , φ j >, i, j = 1, ..., p. The vector of basis coefficients of the the l−th PC weight function f l (t) is given by where v l is the l−th eigenvector of the sample covariance matrix of AΨ 1/2 (see [55] for a detailed study).
The principal component decompositions given in Equation (3) with the functional coefficients given by β j (s, t) = . By truncating each principal component decomposition, the following principal component multiple function-on-function linear regression (PC-MFFLR) model for the functional response is obtained withb x j kl being the linear least-squared estimation of the regression coefficients b kl . Different selection model approaches were developed to select the optimum PCs of each predictor variable (subsets L kj ), to be considered in Model (6) when it comes to estimating the first J PCs of the response variable. It is well known that PCs are ordered according to their explained variability and that the most explanatory components of the predictor variable might not be the most correlated with the response variable. In the case of the simple function-on-function linear model with only one predictor, a procedure that selects pairs of response/predictor PCs based on both explained variability and correlation was developed in [32]. A supervised version of FPCA that estimates the PCs by considering the correlation of the functional predictor and response variable was developed for the scalar-on-function regression model [56]. The usual selection models procedures based on stepwise and best subset regression, combined with cross-validation, can be adapted to this functional regression context.

Imputation of Missing Response Curves
Let us consider that all the predictor variables X j are completely observed and only the response variable Y has missing values. Let us assume, without loss of generality, that in the sample, the first n values of the response are observed and the last m values are missing. That means that there are n complete observed curves for all variables and m incomplete observations that are missing values for the response.
In order to estimate the missing response curves, the parameters b kl in Model (5) are estimated with the complete n sample curves of response and predictors. Then, the missing response curves {y miss i (s) : i = n + 1, . . . , n + m} are estimated by computing the principal component scores of predictors {ξ x j il : i = n + 1, . . . , n + m, l = 1, . . . , n − 1} given by the Expression (4), and substituting them in the Equation (6). Then, the estimated PC-MFFLR model can be used to predict new response values Y on a test sample and to provide an accurate interpretation of the relationship between the predictor and the response variables.
If the objective is to predict the response variable in a future interval, a regression model of type (6) could be estimated to predict the response variable Y(s) in the future interval of amplitude k, denoted by [T, T + k], in terms of the predictor variables (X 1 (t), . . . , X J (t), Y(t)) in the past interval of time [0, T]. In the case of the COVID-19 data, the parameter k must be selected by taking the average number of days it takes for a person to develop severe symptoms and need to be admitted to the hospital into account.

Covid-19 Application Results
Let us remember that the main aim of this paper is the imputation of hospitalized and intensive care curves for those ACs with missing data. To do this, multiple functionon-function linear regression approaches are developed here. In addition, a canonical correlation analysis (CCA) is performed to interpret the relationship between variables related with hospital occupation (hospitalized and intensive care people) and illness response (positive, deceased and recovered people). The computational results were obtained with the free software R ('fda' and 'yacca' R-packages for FPCA and CCA, respectively).

Data Imputation
The imputation problem is solved by applying a multiple function-on-function linear regression for each of the responses Y 1 (t) (hospitalized) and Y 2 (t) (intensive care) from the three functional predictors X 1 (t) (sick), X 2 (t) (deceased) and X 3 (t) (recovered). Both functional regression models are estimated from the data of the thirteen ACs with complete data (training sample). Then, the predictions for the four ACs with missing data (Castilla La Mancha, Castilla León, Galicia and Madrid) are used for data imputation.
The first step is the estimation of the functional PCs for each of the five functional predictors. As a result, the first PC explained almost all the variability in the five predictors (99.32%, 98.73%, 97.97%, 98.59%, 96.37% for X 1 , X 2 , X 3 , Y 1 , Y 2 , respectively). Figures 4 and 5 show the weight functions associated to each first PC, and the perturbations of the sample mean curves obtained, by adding and subtracting a multiple of them. In order to obtain weight functions and PC scores which are much easier to interpret, two new functional Varimax rotation approaches were introduced in [19] with application for COVID-19 confirmed people.
After obtaining these functional principal components' analyses, we consider a training sample composed of all the ACs except Castilla La Mancha, Castilla León, Galicia and Madrid, which will be considered as the prediction sample.
Taking into account that the first component of X 1 (t), X 2 (t) and X 3 (t) were revealed to be highly and significantly correlated with the first components of Y 1 (t) and Y 2 (t); meanwhile, the other cross-correlations between PCs were not significant, and the functionon function linear regression models were reduced to the following linear models for the first PC of the response in terms of the first PC of each of the predictorŝ These models allow for the accurate estimation of the first component of Y 1 (t) and Y 2 (t) from the first components of X 1 (t), X 2 (t) and X 3 (t) with a determination coefficient of R 2 = 0.9249 and R 2 = 0.7443, respectively. Finally, the Karhunen-Loève expansion, in terms of the predictor principal components, provides the following prediction equation for Y 1 (t) and Y 2 (t) y ik (t) = y k (t) + ξ y k i1 f y k 1 (t), k = 1, 2; i = 1, . . . , 17. (7) In order to evaluate the prediction ability of these models, the square root of the mean squared errors between observed and predicted curves are calculated by the expression These results can be seen in Table 1, where it can be observed that the predictions for ICU admission curves are more accurate.  Some of the observed and estimated training curves can be seen in Figures 6 and 7 next to confidence bands for the predicted curves. These confidence bands are obtained by pointwise confidence intervals, computed for each fixed timepoint t p as follows   Finally, the expected curves provided by the regression models in Equation (7) for hospitalizations and ICU admissions in the badly recorded ACs, next to their confidence bands and observed curves, are drawn in Figures 8 and 9. The prediction of the missing curves using the PC-MFFLR considered models provides a pointwise estimation of hospitalizations and ICU admissions that corrects the inaccurate reported data. These pointwise predictions, next to their anomalous values for the first and last days of the first wave of COVID-19 in Castilla La Mancha, Castilla León, Madrid and Galicia, can be seen in Table 2.

Canonical Correlation Analysis
Once the missing data were imputed and complete curves are available for the 17 ACs, the relationship between the variables related to the number of people admitted to hospitals (hospitalized and ICU people) and the ones affected by the disease (sick, deceased and recovered) can be studied. Canonical Correlation Analysis (CCA) was applied on the two sets of first principal components associated with these functional variables to explore this relationship without necessarily distinguishing between independent and dependent variables. The analysis makes sense because the correlations between PCs in the two groups are very high, suggesting that the variables are not linearly independent.
In agreement with the above, the first principal component of each functional variable is selected to carry out the analysis. Thus, the dataset consists of a sample of the seventeen Spanish ACs in an attempt to determine which factors influence in the hospital occupancy rate. The two groups of variables are, on the one hand, Hospital occupancy rate (HOR) formed by the first PC of hospitalized people and of ICU people (ξ ). The estimates of the squared canonical correlations between the two canonical variables for each pair appear in Table 3, next to the outcomes associated with the Barlett's test for testing the null hypothesis that the two canonical variate pairs are uncorrelated. As a result, it can be concluded that both canonical pairs are significantly correlated and dependent on each other (there is a relationship between the two sets of variables).
Note that the squared canonical correlations represent, for each pair, the percentage of variance in one canonical variate explained by the variation in the other one, but say nothing about the extent to which the canonical variates themselves account for variation in the original variables. Then, around 95.4% of the variation in the first canonical variate for HOR (U 1 ) is described by the variation in the first canonical variate for IR (V 1 ), and almost 71% of the variation in U 2 is explained by V 2 . This fact suggests that both canonical correlations are important. Figure 10 displays how the values of the canonical variates are spread over the plane. The linear relation in each pair is clearly visible in these scatterplots. Likewise, it is possible to draw conclusions about which ACs behave similarly during the first wave. The results are in accordance with multiple studies about the COVID-19 pandemic in Spain (see, for example, [19]).  Additionally, the estimated canonical coefficients (loadings) for the HOR and RI variables are in Tables 4 and 5, respectively. The magnitudes of these coefficients give the contributions of the individual variables to the corresponding canonical variable. Hence, the canonical variables are determined as follows Once the raw canonical coefficients have been estimated, the following step is to interpret each canonical component. For that purpose, the squared correlations between the variables in each group and the canonical components are computed in Table 6 and 7 for the HOR and IR groups, respectively. These parameters indicate the fraction of HOR and IR variance associated with each of their components separately. Let us observe that, for the second canonical variables (U 2 , V 2 ), none of the correlations are large, so this pair provides very little information about the variables. Regarding the first canonical variate pair (U 1 , V 1 ), all the correlations with the variables are uniformly high. This means that U 1 and V 1 are an overall measure of HOR and IR variables, respectively, with U 1 being highly correlated with hospitalizations and V 1 more correlated with positive cases and deceased people. These outcomes expose that the level of saturation in the hospitals are particularly determined by the number of hospitalized people; meanwhile, the response to the pandemic is governed by the number of positive cases and deaths. Despite the fact that the number of people in UCI and the number of recovered people play an important role in the canonical variates, their contribution is smaller.  Finally, a canonical redundancy analysis was performed in order to study the percentage of variance of one group of variables which is accounted for by the other (in the usual least squares sense). The results of this analysis can be seen in Table 8, and the correlations between each set of variables and the opposite group of canonical variates in Tables 6 and 7. Table 8 shows that both components of the first canonical pair are a good overall predictor of the opposite set of variables, since the explained proportions of variance for HOR and IR are 0.864 and 0.839, respectively. Nevertheless, despite the significant correlation for the second pair, these variables do not account for a great amount of variability. This statement is corroborated by the squared correlations displayed in Table 6 and 7. These measures indicate that the first canonical variate of the IR group has an outstanding predictive power for the number of hospitalized (95.09%) and a considerable influence for the number of people in ICU (77.81%) as well. Similar interpretations are reached for the first canonical variable of HOR, which is a superb predictor of the number of cases and deaths (92.13% and 88.40%, respectively), and to lesser extent, of the number of recuperated (71.39%). The second canonical variables add virtually nothing given that the fraction of variance in each variable set attributable to the other group through the respective canonical variates barely overcomes 10% of the total variability.

Conclusions
The current economic and sanitary crisis provoked by the virus SARS-CoV-2 has taken up most of the planet's attention since the World Health Organization declared a worldwide emergency state in the middle of March 2020. In order to control the propagation of the virus, the scientific community is immersed in the development of statistical models that enable governments to control the behaviour of the pandemic and to mitigate the devastating effects of the COVID-19 illness. Thus, it is essential to build powerful models to guarantee accurate predictions. Taking into account the nature of the variables of interest (for instance, number of positive cases, deceases, recovered, hospitalised and people in intensive care units), a wide variety of models have been tackled by considering Functional Data Analysis methodologies. Nevertheless, the good performance of these models depends on the quality of the data, which is not always as good as one might expect, especially in periods of pandemic, where the data are usually incomplete. On this matter, an extension of function-on-function linear regression is proposed for the imputation of missing values in the response, where the functional coefficients are estimated by means of principal components regression. The motivation for this work is to forecast the curves of hospitalized and intensive care people (functional responses) from the curves of positive cases, deaths and recoveries (functional predictors) for several Spanish Autonomous Communities that changed the means of recording data related to hospital occupancy rate. The imputation of these curves is made once the linear model is estimated, with a training sample composed by the remainder of communities that did not modify their way of registering data. The performance of the model is outstanding for the training sample, since the observed and predicted curves are very similar for both functional responses.
Regarding the prediction sample, the obtained forecasts can be considered as an imputation of what should have been the real behaviour of these curves in the observation period if the mode of data communication did not change. It can be observed that the model captures the trend of the curves up to the change. Additionally, once the missing data were imputed, a canonical correlation analysis was carried out in order to study the possible relationship between the two groups of variables: hospital occupancy rate (number of hospitalized people and ICU admissions) and illness response (number of positive cases, deaths and recovered people). The first principal component score of each variable was selected to form the canonical analysis, since only the first principal component explains almost all the variability in the five functional variables. After an exhaustive analysis, both sets of variables were shown to be highly correlated with each other and, moreover, each of the first canonical variables is a good overall predictor of the opposite group of variables. At this point, the variables with more predictive power are the number of hospitalizations, positives and deceases. In sum, the present document introduces a new mechanism for the imputation of missing at random functional response curves and shows the relationship among interesting functional variables associated with the COVID-19 pandemic.
Author Contributions: All authors contributed equally in developing the methodology, implementing the proposed estimation algorithms with the software R, performing the simulation study and application, as well as in writing this manuscript. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: