## 1. Introduction

COVID-19, the disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), first appeared in Wuhan, China, and the first cases were notified to WHO on 31 December 2019 [1,2]. Beginning in Wuhan as an epidemic, it then spread very quickly and was characterized as a pandemic on 11 March 2020 [1]. Symptoms of this disease include fever, shortness of breath, and cough. A non-negligible proportion of infected individuals may develop severe forms of the symptoms, leading to their transfer to intensive care units and, in some cases, death; see e.g., Guan et al. [3] and Wei et al. [4]. Both symptomatic and asymptomatic individuals can be infectious [4,5,6], which makes the control of the disease particularly challenging.

The virus is characterized by its rapid progression among individuals, most often exponential in the first phase, but also by a marked heterogeneity across populations and geographic areas [7,8,9]. The number of reported cases worldwide exceeded 3 million as of 3 May 2020 [10]. The heterogeneity of the number of cases and of the severity across age groups, especially for children and elderly people, has aroused the interest of several researchers [11,12,13,14,15]. Indeed, several studies have shown that the severity of the disease increases with the age and co-morbidity of hospitalized patients (see e.g., To et al. [15] and Zhou et al. [8]). Wu et al. [16] have shown that the risk of developing symptoms increases by $4\%$ per year in adults aged between 30 and 60 years, while Davies et al. [17] found that there is a strong correlation between chronological age and the likelihood of developing symptoms. Since completely asymptomatic individuals can also be contagious, a higher probability of developing symptoms does not necessarily imply greater infectiousness: Zou et al. [6] found that, in some cases, the viral load in asymptomatic patients was similar to that in symptomatic patients. Moreover, while adults are more likely to develop symptoms, Jones et al. [18] found that the viral loads in infected children do not differ significantly from those of adults.

These findings suggest that a study of the dynamics of inter-generational spread is fundamental to better understand the spread of the coronavirus and, most importantly, to efficiently fight the COVID-19 pandemic. To this end, the distribution of contacts between age groups in society (work, school, home, and other locations) is an important factor to take into account when modeling the spread of the epidemic. To account for these facts, some mathematical models have been developed [13,14,17,19,20]. In Ayoub et al. [19] the authors studied the dependence of the COVID-19 epidemic on the demographic structures in several countries but did not focus on the contact distribution of the populations. In [13,14,17,20] the focus is on the social contact patterns with respect to chronological age, using the contact matrices provided in Prem et al. [21]. While Ayoub et al. [19], Chikina and Pegden [20] and Davies et al. [17] included the example of Japan in their studies, their approach is significantly different from ours. Indeed, Ayoub et al. [19] use a complex mathematical model to discuss the influence of the age structure on the infection in a variety of countries, mostly through the basic reproduction number ${\mathcal{R}}_{0}$. They use parameter values from the literature and from another study by the same group of authors [22], where the parameter identification is done by a nonlinear least-squares minimization. Chikina and Pegden [20] use an age-structured model to investigate age-targeted mitigation strategies. They rely on parameter values from the literature and do discuss using age-structured temporal series to fit their model. Finally, Davies et al. [17] also discuss age-related effects in the control of the COVID-19 epidemic, and use statistical inference to fit an age-structured SIR variant to data; the model is then used to discuss the efficiency of different control strategies. We provide a new, explicit computational solution for the parameter identification of an age-structured model. The model is based on the SIUR model developed in Liu et al. [23], which accounts for a differentiated infectiousness for reported and unreported cases (contrary to, for instance, other SIR-type models). In particular, our method is significantly different from nonlinear least-squares minimization and does not involve statistical inference.
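To make the compartment structure concrete, here is a minimal sketch of a generic SIUR-type system without age structure, integrated with an explicit Euler scheme. The specific form of the equations is an assumption for illustration: it is not taken from this text, and the authors' model (Section 3) may differ. The function name `simulate_siur`, the choice that only the I and U compartments transmit, and all parameter values are likewise hypothetical.

```python
def simulate_siur(s0, i0, tau, nu, f, eta, days, dt=0.01):
    """Euler integration of an assumed SIUR-type system.

    S: susceptible, I: infectious without symptoms yet,
    R: reported symptomatic, U: unreported infectious.
    Assumed equations (illustrative only):
        S' = -tau * S * (I + U)
        I' =  tau * S * (I + U) - nu * I
        R' =  f * nu * I - eta * R
        U' =  (1 - f) * nu * I - eta * U
    Returns one (S, I, R, U, cumulative_reported) tuple per day.
    """
    s, i, r, u = float(s0), float(i0), 0.0, 0.0
    cumulative_reported = 0.0
    steps_per_day = round(1 / dt)
    history = []
    for step in range(days * steps_per_day + 1):
        if step % steps_per_day == 0:
            history.append((s, i, r, u, cumulative_reported))
        new_infections = tau * s * (i + u)   # only I and U transmit here
        leaving_i = nu * i                   # flow out of I
        ds = -new_infections
        di = new_infections - leaving_i
        dr = f * leaving_i - eta * r         # fraction f becomes reported
        du = (1.0 - f) * leaving_i - eta * u # the rest stays unreported
        cumulative_reported += dt * f * leaving_i
        s, i, r, u = s + dt * ds, i + dt * di, r + dt * dr, u + dt * du
    return history


# Hypothetical parameter values, chosen only to produce a plausible curve.
history = simulate_siur(s0=1_000_000, i0=10, tau=4e-7,
                        nu=1 / 7, f=0.8, eta=1 / 7, days=60)
```

In this sketch the two symptomatic compartments differ in infectiousness in the simplest possible way: reported individuals are assumed not to transmit at all, while unreported ones transmit at the same rate as the I compartment.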

In this article we focus on an epidemic model with unreported infectious symptomatic patients (i.e., with mild or no symptoms). Our goal is to investigate the age-structured data of the COVID-19 outbreak in Japan. In Section 2 we present the age-structured data and in Section 3 the mathematical models (with and without age structure). One of the difficulties in fitting the model to the data is that the growth rate of the epidemic is different in each age class, which led us to adapt our earlier method presented in Liu et al. [23]. The new method is presented in Appendix A. In Section 4 we present the comparison of the model with the data. In the last section we discuss our results.

## 2. Data

Patient data in Japan have been made public since the early stages of the epidemic, starting with the quarantine of the Diamond Princess in the port of Yokohama. We used data from the website covid19japan.com (https://covid19japan.com, accessed 6 May 2020), which is based on reports from national and regional authorities. Patients are labeled “confirmed” when they test positive for COVID-19 by PCR. Interestingly, the age class of the patient is provided for 13,660 out of 13,970 confirmed patients (97.8% of the confirmed population) as of 29 April. The age distribution of the infected population is represented in Figure 1, compared to the total population per age class (data from the Statistics Bureau of Japan estimate for 1 October 2019). In Figure 2 we plot the number of reported cases per 10,000 people of the same age class (i.e., the number of infected patients divided by the population of the age class, times 10,000). Both datasets are given in Table 1 and a statistical summary is provided in Table 2. Note that the high proportion of confirmed patients aged 20–60 years may indicate that the severity of the disease is lower for those age classes than for older patients, and therefore that the disease transmits more easily in those age classes because of a higher number of asymptomatic individuals. Elderly infected individuals might transmit less because they are identified more easily. The cumulative number of deaths (Figure 3) is another argument in favor of this explanation. We also reconstructed the time evolution of the reported cases in Figure 4 and Figure 5. Note that the steepest curves are precisely those of the 20–60-year-olds, probably because they are economically active and therefore have a high contact rate with the population.
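The normalization used for Figure 2 (reported cases per 10,000 people of the same age class) can be sketched as follows. The helper name `cases_per_10k` and the two age classes with their counts are hypothetical and serve only to illustrate the arithmetic; they are not the values of Table 1.

```python
def cases_per_10k(cases, population):
    """Reported cases per 10,000 people, for each age class.

    cases and population are dicts keyed by age class.
    """
    return {
        age_class: 10_000 * n_cases / population[age_class]
        for age_class, n_cases in cases.items()
    }


# Hypothetical counts for illustration only (not the data of Table 1).
cases = {"20-29": 2_000, "70-79": 1_000}
population = {"20-29": 12_000_000, "70-79": 16_000_000}

rates = cases_per_10k(cases, population)
for age_class, rate in rates.items():
    print(f"{age_class}: {rate:.3f} reported cases per 10,000")
```

Dividing by the population of each class removes the effect of the age pyramid, so that classes of very different sizes can be compared directly.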

## 5. Discussion

The recent COVID-19 pandemic has led many local governments to enforce drastic control measures in an effort to stop its progression. Those control measures were often taken in a state of emergency and without any real visibility concerning the later development of the epidemic, to prevent the collapse of the health systems under the pressure of severe cases. This is precisely where mathematical models can help: they give a clearer view of the possible future of the pandemic, provided that the particularities of the pathogen under consideration are correctly identified. In the case of COVID-19, one of the features of the pathogen which makes it particularly dangerous is the existence of a large contingent of unidentified infectious individuals who spread the disease unnoticed. This makes non-intensive containment strategies such as quarantine and contact-tracing relatively inefficient, and also makes predictions by mathematical models particularly challenging.

Early attempts to reconstruct the epidemic by using SIUR models were performed in Liu et al. [23,26,27,28], who used them to fit the behavior of the epidemic in many countries by including undetected cases in the mathematical model. Here we extend our modeling effort by adding the time series of deaths into the equation. In Section 4 we present an additional fit of the number of disease-induced deaths coming from symptomatic (reported) individuals (see Figure 10). To fit the data properly, we had to reduce the average length of stay in the R-compartment to 6 days, meaning that death induced by the disease should occur on average faster than recovery. A shorter period between infection and death (compared to remission) has also been observed, for instance, by Verity et al. [7].

The major improvement in this article is to combine our earlier SIUR model with chronological age. Early results using age-structured SIR models were obtained by Kucharski et al. [33], but no unreported individuals were considered and no comparison with age-structured data was performed. In this article we provide a new method to fit the model to the data. The method extends our previous method for the SIUR model without age (see Appendix A).

The data presented in Section 2 suggest that chronological age plays a very important role in the expression of the symptoms. Most of the reported patients are between 20 and 60 years old (see Figure 1), while most of the deceased are between 60 and 90 years old (see Figure 3). This suggests that the symptoms associated with COVID-19 infection are more severe in elderly patients, which has been reported in the literature several times (see e.g., Lu et al. [12], Zhou et al. [8]). In particular, the probability of being asymptomatic (our parameter $f$) should in fact depend on the age class.

Indeed, the best match for our model (see Figure 11) was obtained under the assumption that the proportion of symptomatic individuals among the infected increases with the age of the patient. This linear dependency of $f$ as a function of age is consistent with the observations of Wu et al. [16] that the severity of the symptoms increases linearly with age. As a consequence, unreported cases are a majority for young age classes (less than 50 years) and become a minority for older age classes (more than 50 years), see Figure 12. Moreover, our model reveals that the policies used by the government to reduce contacts between individuals have strongly heterogeneous effects depending on the age classes. Plotting the transmission matrix at different times (see Figure 13) shows that younger age classes react more quickly and more efficiently than older classes. This may be due to the fact that the number of contacts in a typical day is higher among younger individuals. As a consequence, we predict that one month after the effective start of public measures, the new transmissions will almost exclusively occur in elderly classes. The observation that younger age classes play a major role in the transmission of the disease has been highlighted several times in the literature, see e.g., Davies et al. [17], Cao et al. [11], Kucharski et al. [33] for the COVID-19 epidemic, but also Mossong et al. [34] in a more general context.
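The age dependence described above can be illustrated with a small sketch. It assumes that $f$ denotes the fraction of infected individuals who become reported (symptomatic) cases and that it varies linearly with age. The function name `symptomatic_fraction` and the endpoint values 0.1 and 0.9 are hypothetical, chosen only so that the line crosses 0.5 at 50 years, as in the qualitative description; they are not the fitted values of the model.

```python
def symptomatic_fraction(age, f_young=0.1, f_old=0.9,
                         age_min=0.0, age_max=100.0):
    """Assumed linear profile of f with age, clipped to the given age range.

    f_young and f_old are hypothetical endpoint values, not fitted ones.
    """
    t = (age - age_min) / (age_max - age_min)
    t = min(max(t, 0.0), 1.0)  # keep ages outside the range at the endpoints
    return f_young + (f_old - f_young) * t


# Mid-points of a few illustrative age classes.
for age in (5, 25, 45, 55, 75, 95):
    f = symptomatic_fraction(age)
    majority = "unreported" if f < 0.5 else "reported"
    print(f"age {age:2d}: f = {f:.2f}, majority of cases {majority}")
```

With these illustrative endpoints, unreported cases form the majority below 50 years and the minority above, which is the pattern summarized in Figure 12.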

We developed a new model for age-structured epidemics and provided a new and efficient method to identify the parameters of this model based on observed data. Our method differs significantly from the existing nonlinear least-squares and statistical inference methods, and we believe that it produces high-quality results. Moreover, we only use the initial phase of the epidemic for the identification of the epidemiological parameters, which shows that the model itself is consistent with the observed phenomenon and argues against overfitting. Yet our study could be improved in several directions. We only use reported cases which were confirmed by PCR tests; the number of tests performed could therefore introduce a bias in the observed data, and hence in our results. We are currently working on integrating this number of tests into our model. We use a phenomenological model to describe how the population changes its number of contacts in response to the mitigation measures imposed by the government. This response could probably be described more precisely by investigating the mitigation strategies in terms of social networks. Nevertheless, we believe that our study offers a precise and robust mathematical method which adds to the existing literature.