Trace the History of HIV and Predict Its Future through Genetic Sequences

Traditional methods of quantifying epidemic spread are based on surveillance data. The most widely used surveillance data are normally incidence data from case reports and hospital records, which are normally susceptible to human error, and sometimes, they even can be seriously error-prone and incomplete when collected during a destructive epidemic. In this manuscript, we introduce a new method to study the spread of infectious disease. We gave an example of how to use this method to predict the virus spreading using the HIV gene sequences data of China. First, we applied Bayesian inference to gene sequences of two main subtypes of the HIV virus to infer the effective reproduction number (GRe(t)) to trace the history of HIV transmission. Second, a dynamic model was established to forecast the spread of HIV medication resistance in the future and also obtain its effective reproduction number (MRe(t)). Through fitting the two effective reproduction numbers obtained from the two separate ways above, some crucial parameters for the dynamic model were obtained. Simply raising the treatment rate has no impact on lowering the infection rate, according to the dynamics model research, but would instead increase the rate of medication resistance. The negative relationship between the prevalence of HIV and the survivorship of infected individuals following treatment may be to blame for this. Reducing the MSM population’s number of sexual partners is a more efficient strategy to reduce transmission per the sensitivity analysis.


Introduction
AIDS (Acquired Immunodeficiency Syndrome) is a disease caused by HIV (Human Immunodeficiency Virus) and has become a disease that seriously threatens human health in the world today. In order to alleviate the AIDS epidemic in China, the Chinese government began to implement the "Four Frees and One Care" HIV/AIDS free treatment policy in 2003 [1,2]. This policy has greatly improved the patient's survival rate and the patient's life quality. On the other hand, the high mutation rate of HIV virus has led to the generation of HIV-resistant strains [3]. The spread of drug-resistant strains causes a waste of national medical resources.
We want to know the effect of the free treatment policies, especially the drug resistance situation of HIV/AIDS by dynamic models. It is hard to avoid parameter acquisition when performing prediction with dynamics models. Some parameters are general parameters, which are common to both China and abroad. We can assign these values by searching relative literature. Some might be greatly different due to different countries or different treatment strategies, so they need be fitted by monitoring data. However, such areas happen to lack monitoring data except for some gene sequences of HIV patients being detected before treatment. In this manuscript, we will show how to combine the gene sequences of HIV and the dynamic model to study the dynamics of HIV spreading.
HIV is an RNA virus. RNA viruses have a short reproduction time and a high mutation rate, which can easily lead to a large number of genetic mutations during the transmission process. Therefore, during the spread of HIV, relevant signals will be left in the sampled virus sequence. In recent years, researchers have begun to conduct retrospective studies on the spread of the HIV virus on the genetic level. Most of these studies are based on genetic sequences to build molecular networks of transmission using Bayesian evolution analysis [4][5][6][7][8][9][10][11][12]. For example, in 2008, Lewis et al. used the Bayesian framework to conduct a phylogenetic analysis of 2126 HIV gene sequences in London, and they constructed a network for HIV transmission, revealing that the local HIV/AIDS epidemic can be traced back to the late 1990s [13]. In 2018, Zhao et al. constructed a molecular network of the spread of MSM in Beijing based on the HIV gene sequence, and they proposed the need to strengthen education and intervention for people with potential high risk. These findings have important implications for the parameterization of epidemiological models and the design of intervention strategies.
Traditional dynamic models can be used to predict the spread of virus in populations [14][15][16][17][18][19][20][21][22], and the Bayesian evolution analysis of virus separated from the infected people can trace the historical dynamics of virus transmission [13,23,24]. However, the work connecting the gene sequences and the dynamic model has not yet been seen. This manuscript will use Anhui province as an example, combine genetic sequence molecular evolution analysis and system dynamics models to conduct an in-depth research on the HIV/AIDS epidemic in the content of the HIV-free treatment policy in China, and thus predict the drug resistance spreading in the next few years.

Gene Sequence Analysis
To trace the history of transmission dynamics of HIV, we sampled totally 360 HIV-pol gene sequences (CRF 01AE (150), CRF 07BC (190), CRF 08BC (9) and CRF 5501B (11)) from 360 HIV patients who were infected with HIV through sexual behavior in Anhui Province. This data set is sampled between October 2017 and September 2018. Considering that the proportions of the CRF 08BC and CRF 5501B are too small, we only analyzed the subtypes CRF 01AE and CRF 07BC, comprising a total of 340 HIV virus gene sequences. We applied Mega for sequence alignment. The total length of the pol gene that we studied was 1075 bp.
The selection of a nucleotide substitution model is an important step during the phylogenetic analysis. Common nucleotide substitution models include the HKY (+Γ + I) model, GTR (+Γ + I) model, and so on [25]. These different substitution models determined different evolutionary distances and different phylogenetic trees [26]. Here, we used software Phylosuite to select the most suitable substitution model according to the Akaike Information Criterion (AIC) [27], where AIC was defined as: AIC = 2k − 2ln(L), in which k is the number of corresponding parameters and L is the likelihood value. Method-ofmoments was used to estimate the AIC method. The model with lower AIC value is the more suitable model. After calculating the AIC score, the HKY+Γ+I model and the uncorrelated exponential relaxed molecular clock are determined finally.

Bayesian Inference
The birth death skyline (BDSKY) model in BEAST2 is a random description of population change, which allows the extraction of virus transmission dynamics information from a phylogenetic tree [11]. This model is established on Bayesian inferences, which can estimate the effective reproduction number of HIV/AIDS transmission from gene sequences. The Bayesian approach allows us to continuously update our estimate of a set of parameters, Θ, as data become available. P(Θ|data) = P(data|Θ) · P(Θ) P(data) . P(Θ), the prior distribution, represents our prior estimates about the true value of Θ. P(data|Θ) is the likelihood distribution. It is also often written as L(Θ|data), which means the probability of observing the data given Θ. For the method to work, it is necessary to calculate the likelihood distribution for all possible values of Θ. P(data) is the model evidence, and it is the same for all possible hypotheses (values of Θ) being considered. P(Θ|data) is the posterior distribution and represents our updated estimate of the value of Θ given the observed data.
The main objective of Bayesian inference is to calculate the posterior distribution of our parameters using our prior beliefs updated with our likelihood. From the posterior distribution, we can determine the most likely values of Θ given the observed data. Since we are usually only interested in relative probabilities of different hypotheses, P(data) can be left out of the calculation, and we write the model form of Bayes' theorem as where ∝ means "proportional to".

The Effective Reproduction Number G R e (t) Infered from Gene Sequences
Since we will study two different subscript gene sequences, we use j = 07BC for the subtype CRF 07BC and j = 01AE for the subtype CRF 01AE. We will deduce the propagation dynamics of the two subtypes separately. For each subtype j, the BDSKY process can conclude three parameters: transmission rate λ j , removal rate γ j and sampling probability ψ j . This process allows individuals to "birth" (λ j ) or "death" (γ j and ψ j ) at any point in time. A "birth event" corresponds to an individual's infection, and a "death event" corresponds to an infected individual becoming non-infected (i.e., removed from the infected compartment because of individual death, successful treatment, or individual behavior change). Then, the effective reproduction number R e (t) derived from the gene sequences can be defined as [11] j G R e (t) = λ j (t)/(µ j (t) + ψ j (t)), in which the left-hand subscript G denotes that the effective reproduction number is derived from the gene sequences. For estimating j G R e (t) of subtype j = 07BC or 01AE, the Bayes' theorem that we use is In order to catch the characteristics of transmission in different historical stages, we divided the period from the origin of each subtype HIV sequences to the sampling time of the last sample into a total of six segments. Then, from each subtype gene sequence of j= 07BC and 01AE, we can infer the effective reproduction number j G R e (t) during each segment t, t = 1, 2, . . . , 6 through Equation (1). Hence, j G R e (t) is a piece wise function for each of the two subtypes, which is a constant during the same time segment t. It refers to the average number of new infections caused by an infected person at a certain time point t during the outbreak. We use the Markov Chain Monte Carlo (MCMC) method to achieve Bayesian phylogeny inference.

A Nested Model and Its Four Submodels
To predict the prevalence of HIV drug-resistant strains, we considered the transmission dynamics of HIV/AIDS through sexual route, including gay, straight and bisexual. The whole population was divided into four subgroups: heterosexual women who only have sex with men (marked as w), heterosexual men who only have sex with women (marked with mw), homosexual men who only have sex with men (marked with mm), and bisexual men who have sex with both women and men (marked as b). The population was divided into susceptible people S, infected people I (infected but not receiving treatment), treated people T, and drug-resistant people R according to the infection. Considering that the HIV infection process has obvious stage characteristics, according to the level of CD4 cells in the patient's body, we divide the infected people into three stages: the first infection stage I 1 (CD4 count ≥ 500 cells/mm 3 ); the second infection stage I 2 (200 cells/mm 3 < CD4 counts < 500 cells/mm 3 ); and the third stage of infection I 3 (AIDS stage, CD4 count ≤ 200 cells/mm 3 ). Similarly, T i and R i , i = 1, 2, 3 can be explained. The details of the model can be found in Figure 1. When the superscript g in the transmission process diagram Figure 1 is taken as w, mw, mm and b, it represents the four types of subgroups. See Appendix A.1 for the corresponding dynamic model.  Figure 1. Transmission process diagram. The superscript g can be w, mw, mm and b, respectively, which represents the four types of subgroup populations. Parameter a is an input of susceptible individuals; θ i , i = 1, 2, 3 are the disease progression rates in infection stage 1, 2 and 3, respectively; δ i , i = 1, 2, 3 are the treatment rates of patients in stage 1, 2 and 3, respectively; η is the percentage of patients who give up HAART; γ is rate of drug resistance after HAART; ρ is the percentage of patients who are no longer resistant.
The standard treatment of HIV/AIDS in China has gone through four periods: the period before 2002 featured no treatment; the period from 2003 to 2011 was when the treatment was started when the CD4 count was less than 200 cells/mm 3 ; the period from 2012 to 2015 was when the treatment was started when the CD4 count was less than 500 cells/mm 3 and the period after 2015 was when the treatment was started soon after discovery. We call these periods I Due to different prevention and control policies being held in different historical stages, the dynamics of HIV transmission have different stage characteristics. So, our model ( Figure 1) in fact is a nested model. For the four different historical treatment periods I, II, III and IV, we can simplify the model to be four submodels by letting some parameters be zero based on its history characters; see Table 1 for the four different submodels. According to the "next-generation operator" method in the literature [28], we can obtain the basic reproduction number k M R 0 for each submodel of corresponding historical period k = I, II, III and IV (see Appendix A.2). Here, the left-hand subscript M means that the basic reproduction number k M R 0 was obtained from k = I, II, III or IV, one of the four models respectively. Table 1. Submodels of different historical periods and corresponding assumptions.

Submodel Assumptions
Historical Period For a general dynamic model of an infectious disease, the basic reproduction number R 0 is the expected number of infections directly generated by one case given that all individuals are equally susceptible. As the infection spreads, the susceptibility of the population decreases. The effective reproduction number, R e (t), is related to the basic reproduction number R 0 , by R e (t) = R 0 · S(t)/N(t), where S(t)/N(t) is the average susceptibility of the population. R e (t) is often used as an indicator of the effectiveness of interventions, such as social distancing measures, to contain the spread of a virus. If R e (t) is greater than 1.0, the infection is growing at an exponential rate. If R e (t) is at 1.0, the spread is sustained at a linear rate. If R e (t) is less than 1.0, the infection is spreading at a slower pace and will eventually die out.
Specific to our study on HIV/AIDs spreading during four different historical periods k = I ( where the basic reproduction number k M R 0 was obtained from Section 2.2.1 and S k (t) and N k (t) represent the susceptible population and the total target population of each historical period k = I, II, III and IV, respectively [29].

Tracing Back the Dynamic History of HIV/AIDS through Bayesian Inference
The MCMC converged after 20 million iterations for both subtypes. We discarded the first 2 million iterations. The phylogenetic tree and model parameters were sampled every 1000 iterations. All parameter estimates showed the effective sample size (ESS) of more than 200. Some results are shown in Table 2. For each data set and substitution model, we analyze them with an uncorrelated exponential relaxed molecular clock and a strict clock model. In the study of virus evolution, we used nucleotide substitution models including the HKY (+Γ + I) model and GTR (+Γ + I) model [25]. The most appropriate combination was selected according to the Akaike Information Criterion (AIC) [27], where AIC was defined as AIC = 2k − 2ln(L), in which k is the number of corresponding parameters and L is the likelihood value. Methodof-moments was used to estimate the AIC method. Finally, the uncorrelated exponential relaxed molecular clock and HKY+Γ+I model is determined by calculating the AIC value. The important epidemiological parameter estimation and their 95% HPD of subtype CRF 01AE and CRF 07BC are shown in Table 2.
The results in Table 2 show that the most recent common ancestor (MRCA) to the 150 CRF 01AE subtype gene sequences was around 1982, and the 95% highest posterior density (HPD) is between 1973 and 1990. Correspondingly, the MRCA to these 190 CRF 07BC subtype gene sequences was relatively late, around 1992 (95% HPD: 1976-1998). Most importantly, we obtained the effective reproduction number j G R e (t) over time segment t from the two subtypes, respectively, in which t = 1, 2, . . . , 6 represents the time segment. The effective reproduction number j G R e (t) of the two subtypes showed some similar transmission characteristics in terms of time: both of them were in a relatively stable state in the early stage, kept around 1.6 with slight fluctuations, and showed a slight downward trend from 2008 to 2012. However, things changed a lot around 2012: for some reason, the effective reproduction number increased rapidly and then remained at about 2.2.
When we study the history of HIV transmission dynamics in the entire population through the two effective reproduction numbers j G R e (t), j = 01AE, 07BC(t), j = 1, 2, we will have to integrate them together. The two effective reproduction numbers originated at different times but ended at the same time. We took the common historical stage of them. So, the research period was from 1992 to 2018. During this common historical period, we denote the average of the effective reproduction number for the entire population as G R e (t), which changes along with time t.
In order to figure out the formula for G R e (t), we used the proportion of the number of sequences of each subtype in the total gene sequence as its weight in the effective reproduction number, since subtypes widely disseminated in the population should have more contribution for G R e (t). Let M j be the sequence number of gene subtype j (j = 01AE, 07BC). Then, the weighted average effective reproduction number of the entire population at time t can be described by Formula (3).
Use R software to visualize G R e (t); then, the result can be found in Figure 2. Figure 2 shows the median line of G R e (t) and its 95% HPD interval from 1992 to 2018. Before 2011, we can see that the epidemic showed a relatively stable transmission trend through sexual route. The value of G R e (t) remained stable and slightly decreased, with a median value of about 1.33 (95% HPD: 0.94-1.8). However, from 2011 to 2014, G R e (t) rose rapidly to 2.13 (95% HPD: 1.32-2.83) for some reason. After 2015, G R e (t) remained stable, with its median value maintained at approximately 2.20 (95% HPD: 1.67-2.85). Some explanations will be given later on the possibility of G R e (t) rising in this period. Re t Figure 2. The weighted average of the effective reproduction number G R e (t) and its 95% HPD interval.

Parameter Estimations
As we all know, an important step in the prediction of infectious disease spreading is to obtain (or estimate from data) parameters values. For most of the general parameters we treated them as constants from references. Details of the values and references can be found in Appendix A.4. However, there are still some epidemiological related parameters that we are not sure of and need to be estimated from data.
Unlike novel coronavirus patients, HIV-infected people do not have obvious infection symptoms. Many HIV infected people do not know that they have been infected for a long time. This prevents the government from obtaining information about people living with HIV as early as possible. Moreover, the Chinese government's screening efforts for HIV-infected people are far less than those for infected novel coronavirus. As a result, recorded HIV infection data, if available, may not be very accurate and may not be suitable for parameter fitting. In Anhui province, which we studied, there is not much surveillance data on HIV/AIDS that can be used. One form of reliable data available are 340 genetic sequences of HIV-infected people taken before treatment.
In the previous subsections, we have extrapolated the dynamics of HIV transmission ( G R e (t)) from 1992 to 2018 using the 340 genetic sequences. In order to use these gene data to estimate unknown parameters of the model (   (2). Obviously, k M R e (t) is functions of parameters and time t. Since the two effect reproduction numbers obtained through different methods should be the same at the same time t, i.e., M R e (t) = G R e (t) , we fit them at the same historical period. In this way, we explored the humanistic and statistical features in different historical stages. We used MCMC to realize this. The algorithm ran for 2.00 million iterations, and we adapted the proposal distribution after 2 million iterations using Geweke's method [30] to assess convergence. The fitted curves of the two effective reproduction numbers for four historical periods I, II, III and IV are shown in Figure A1 in Appendix A.4. The fitted parameters obtained from the above four periods are shown in Table 3. The value of these parameters will be referred to as baseline parameter values in the following research.

HIV/AIDS Epidemic Trend under Baseline Parameters
We first studied the dynamic process of the total number of infected people and the proportion carrying drug-resistant strains over time from 2015 to 2025 in the entire population of Anhui province under baseline parameters ( Figure 3). The results show that the total number of HIV/AIDS infections during this period appears to be an upward trend, and the total number of surviving infections will increase to 33,260 by 2025 (95% confidence interval (CI): 28,096-40,090, Figure 3a). The number of new infections is also increasing and will reach 4522 (95% CI: 3366-6130) by 2025. According to the Anhui Provincial Health and Family Planning Commission, from 2017 to 2019, the cumulative reports of HIV/AIDS surviving patients whose current address is Anhui province are 18619 [31], 17183 [32] and 19604 [33], respectively. They are indicated by red dots in Figure 3a. Considering that the detection rate of HIV infections in Anhui province has not reached 100%, the number of reported cases should be less than the actual number of HIV infections. Based on the reported HIV-positive people and the HIV testing rate [31] in the corresponding year, we have reason to say that our predicted results are basically consistent with the actual situation.  Regarding the subpopulations, although the HIV infection rate among heterosexual men has declined and tends to be stable, the rates in other populations have shown steady upward trends. By 2025, the HIV-positive rate among heterosexual women will reach 0.25% (95% CI: 0.21-0.28%), while the HIV-positive rate among heterosexual men will be lower, only about half of the positive rate of women (0.12%, 95% CI: 0.10-0.13%). In contrast, men who have sex with men belong to high-risk groups, and their HIV positive rate is dozens of times that of heterosexual men: the HIV infection rate of bisexual men is 2.44% (95% CI: 1.88-3.12%), while that of homosexual men is much higher, reaching 3.62% (95% CI: 2.81-4.64%) by 2025.
Considering that large-scale treatment for HIV-infected people in China has lasted for quite a long time, we are concerned that drug-resistant strains (secondary drug resistance) will appear in the patients due to poor medication compliance, which will lead to the wide spread of drug-resistant strains (primary drug resistance) in the population. Unsurprisingly, the proportion of HIV-positive people carrying drug-resistant strains in Anhui province has shown a rapid upward trend ( Figure 3b) and will basically reach the highest point of 9.88% by 2025, which corresponds to 3282 (95% CI: 2862-3812) in terms of quantity. The proportion of primary drug-resistant patients in the total newly infected population will also increase over time. By 2025, the number of primary drug-resistant patients will account for 9.63% of total patients carrying drug-resistant strains (95% CI: 6.56-14.2%). In addition, the proportion of primary drug-resistant patients among total new infections first showed an upward trend, then tended to be stable each year, and finally will increase to 7.01% (95% CI: 6.61-7.44%) by 2025.

Influence of Intervention Measures on HIV/AIDS Epidemic Trend
In order to implement the "Healthy China 2030" program and deepen the medical and health system reformation, China's "13th Five-Year Plan for Combating and Prevention of AIDS" in 2017 set the "90-90-90" goal. That is, the proportion of infected persons detected by testing should be above 90%; the proportion of diagnosed infected persons receiving antiviral treatment should be above 90%; the treatment success rate of infected persons receiving antiviral treatment should be above 90% [34] by 2020. In the baseline parameters, the treatment rates of the three different infection stages were 50.19% (95% CI: 21.55%, 78.83%), 75.28% (95% CI: 60.86%, 89.71%) and 74.94% (95% CI: 60.57%, 89.30%). These treatment rates are still far from the second "90%". Next, we consider what will happen if the treatment rate increases to the target state in 2020.
The total HIV infection rate and drug resistance of Anhui province from 2020 to 2025 under the "90%" treatment rate are shown in Figure 4, which is a comparison chart between the baseline and the target situation. The results show that the total infection rate still keeps increasing (the green curve in Figure 4a) even under the "90%" treatment rate, which is only slightly lower than the baseline parameters (the red curve in Figure 4a). Even more surprising, the total drug resistance rate after intensified treatment will increase more rapidly than the baseline situation, reaching 10.92% (95% CI: 10.69-11.12%) by 2025, which is seen as the red curve in Figure 4b. In addition, the proportion of primary drug resistance in new infections has also shown an increasing trend compared with baseline parameters.
In short, when we increase the treatment rate to 90%, the total infection rate can be reduced only by 0.017% by 2025, but in turn, the total drug resistance rate and the proportion of primary drug resistance in new infections will increase 1.06% and 1.12%, respectively. What prevents ART treatment from playing an effective role in the epidemic? Is the current drug treatment effect not good enough? If the new first-line drugs can improve the treatment effect (including reducing the infectivity and prolonging the lifespan), how much will the epidemic be improved?
To answer these questions, we assume the following two situations: (a) the new drug can only prolong the lifespan after treatment by 20% (it can survive for 24 years after extension); (b) the new drug can not only prolong the life after treatment by 20% but also reduce the infection rate by 20%. The results are shown in Figure 5. We were surprised to find that assumption (a) will make the epidemic worse instead (Figure 5a, green line). This may be because patients will have more risky sexual behaviors due to longer lifespan. However, assumption (b) will slow down the trend (Figure 5a, red line). In terms of drug resistance, the drug resistance rate will decrease slightly under assumption (a) (Figure 5b, green line) and increase under assumption (b) (Figure 5b, red line). The main reason may be that treatment only suppresses the non-resistant strains, which may instead give drug-resistant strains more space for transmission. This result suggests that prolonging the lifespan of the patients is a double-edged sword. When developing new drugs, we should not only consider how to prolong the life span of patients but also consider how to reduce the infectivity. This conclusion will continue to be discussed in the sensitivity analysis.

Sensitivity Analysis
In previous studies, we unexpectedly found that merely increasing the treatment rate to the target rate (90% treatment rate) could not lead to a significant improvement in HIV epidemic (Figure 5a). In order to find the reason, we did some sensitivity analysis. We find that only when shortening each infectious stage's life span to less than 25% of its baseline values can the epidemic be controlled. Details can be found in Appendix A.5. In the process of new drug development, if we cannot effectively reduce the infection rate while continuing to pursue prolonging the life-span of patients after treatment, the epidemic will keep rising as a consequence. We speculate that this may be one of the reasons that the effective reproduction R g e has been increasing since 2012 (Figure 2). In other words, it is difficult to control the epidemic only by developing new drugs in a short period. So, what are the key factors to control the HIV/AIDS epidemic?
To answer this question, we conducted a sensitivity analysis on the number of sexual partners in different populations. An interesting fact is that changing the number of heterosexual people's sexual partners has little impact on the epidemic, and it almost can be negligible while comparing with the two subpopulations of homosexual sexual behavior who contributed a significant effect to the epidemic. As long as the number of bisexual men's annual sex partners is controlled below 6, and meanwhile, the number of homosexual men's annual sex partners is controlled below 4, the epidemic can be controlled. Details can be found in Appendix A.6.

Discussion and Conclusions
Generally speaking, predictive models for infectious diseases generally cannot avoid the problem of parameter estimation. After all, not all parameter values are easy to obtain. Even if some foreign literature mentions the value range of certain parameters, there may still exist differences due to different races. To some degree, parameter values determine the credibility of predictions. Facing the complex history of HIV treatment in China, different historical stages may have different values even for the same parameter. We innovatively carried out data mining on virus gene sequence detected from the infected people, so as to fit some parameter values which are difficult to obtain by social surveys or laboratory tests, making our prediction results more reliable. However, other non-Bayesian research methods are also good choices for the mathematical modeling of infectious diseases with a large number of reported data [35].
The idea of combining gene sequences with dynamic models came from the absence of monitoring data when doing model fitting. Unlike novel coronavirus patients, HIVinfected people do not have obvious infection symptoms. Many HIV-infected people do not know that they have been infected for a long time. This prevents the government from obtaining information about people living with HIV as early as possible. Moreover, the Chinese government's screening efforts for HIV-infected people are far less than those for individuals infected with novel coronavirus. As a result, recorded HIV infection data, if available, may not be very accurate and may not be suitable for parameter fitting. In Anhui province, which we studied, there is not much surveillance data on HIV/AIDS that can be used. The only reliable data available for use are 340 genetic sequences of HIV-infected people taken before treatment. HIV is an RNA virus. This gives it the ability to mutate faster, and these mutations show up in the entire genetic sequence of the virus from generation to generation. In other words, the evolutionary history of HIV is hidden in its genetic sequences. Since both gene sequences and dynamic models can describe the transmission dynamics of viruses, it should be possible to combine them for the purpose of learning from each other. This is our motivation to finish this manuscript.
What is more, most dynamic models generally only consider heterosexuality and homosexuality when classifying the target population, and they will not further subdivide homosexual people. Our study divided the MSM population into homosexual men and bisexual men to analyze the importance of bisexual men in the spread of HIV/AIDS. These bisexual men may be biological and psychological bisexuals; also, it may be that homosexuals are forced to marry women under the pressure of public opinion to cover up their particular sexual orientation. We generally think that such people are a bridge that links HIV/AIDS from high-risk groups to the general population. Thus, cutting off this bridge will slow down the HIV epidemic. However, the result of our model shows an opposite result: if bisexual men with the unchanged number of sexual partners disconnect from men, the epidemic will slow down a lot; otherwise, it will lead to an increase in the epidemic. For more details, please read Appendix A.7.
In "China's 13th Five-Year Plan for Containment and Prevention of AIDS", China advocates that HIV/AIDS prevention and treatment should achieve three 90%: the proportion of infected people and patients who have been diagnosed and know about their infection status should reach 90% or more; the proportion of infected people and patients who meet the treatment conditions receiving antiviral treatment should be more than 90%; the success treatment rate of infected people and patients receiving antiviral treatment should be more than 90%. These three 90% involve screening, treatment and drug development. According to our findings, if no more effective drugs are available, simply increasing the treatment rate will cause a slight decrease in the number of new infections and the number of people with primary drug resistance. Moreover, the epidemic is still on the rise; that is to say, the increase of treatment rate cannot effectively control the epidemic, and the prevention and control effect is not ideal. A similar phenomenon was also found in the study [36] by Lou Jie et al. A plausible explanation is that after treatment, the life expectancy of HIV-infected patients is extended, but the current drug's effect is not so good at reducing the infection rate, thus creating a greater possibility of transmission instead. In the meantime, free treatments of HIV have not yet covered the entire course of patients, so once drug resistance develops, the treatment led to failure. Therefore, after the implementation of the "90-90-90" strategic measures, continuous improvements of the treatment rate did not bring the HIV epidemic under control but inversely caused more drug resistance. The Chinese government needs to intervene in various aspects of different groups of people in combination with other perspectives in order to curb the current domestic AIDS epidemic. The sensitivity analysis on the two critical parameters, infection rate after treatment and progress rate of disease, indicates neither one is the most critical factor to impact the epidemic. It would be easier to control the epidemic by reducing the number of sexual partners of the two subpopulations in the MSM group through publicity and other means. In addition, it is worth mentioning that considering the absolute leading role of homosexual people in the epidemic at the present stage, we should tilt most of the resources to this group so that we can benefit from spending money wisely.
In short, gene sequences can tell us the history of HIV/AIDS spreading and the dynamic model can tell us its future. Combining the two together is an innovative approach, especially for epidemics where reliable surveillance data are lacking. In addition, we believe that this new method is not only suitable for HIV but also for other RNA viruses, such as novel coronavirus. Of course, our method is not perfect, since the influencing factors considered by the Bayesian model are relatively simple. However, at the very least, genetic sequence data can be used as a supplement to macro surveillance data for epidemic prediction.
Author Contributions: Z.W. and J.L. conceptualized the study and wrote the manuscript. Z.W., X.J. and Z.Z. analyzed the data and presented the results. Z.Z. and C.Z. was involved in manuscript integration. J.L. provided the epidemiological significance of the study, while Y.R. facilitated access to the data used in this study. C.Z., H.X., J.W., B.S. and Y.S. contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding: This work has received no external funding.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by The Institutional Review Board of National Center for AIDS/STD Control and Prevention (NCAIDS) of the China Center for Disease Control and Prevention (China CDC) (protocol code X140617334, approval date: 7 December 2021).

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: All relevant data are within the paper. Additional data cannot be shared publicly based on regulations of the National Center for AIDS/STD Control and Prevention (NCAIDS) of the China Center for Disease Control and Prevention (China CDC). These data are available from the the Institutional Review Boards of the NCAIDS for researchers who meet the criteria for access to confidential data (Tel.: 58900922, Fax: 58900920, E-mail: office606@chinaaids.cn).

Acknowledgments:
The authors would like to thank the Chinese Center for Disease Control and Prevention for approving accessing to data and other valuable documents used in the write-up of this manuscript. The manuscript preprint has been presented.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Appendix A.1. The Dynamic model where Ω g (t) and Λ g (t)(g = w, mw, mm, b) denote the number of newly infected people at the time of t at subgroup g infected with wild HIV strains and resistant HIV strains, respectively.
Here, we only show how to calculate the most complex basic reproduction number in the IV period, R d 0 (IV). The R d 0 of the other stages (I, I I, I I I) can be simplified from R d 0 (IV). We use the "regeneration matrix" method to find the basic reproduction number R d 0 (IV) of the model (A1); the process is as follows: Step 1: Construct matrices f 0 and v 0 , where f 0 is a column vector of n × 1, representing a new infection in each infection state, and v 0 is a column vector of n × 1, representing that each infected state has been transported (not newly infected), where n is the number of infected states. f g 0 = (Ω g , 0, 0, 0, 0, 0, Λ g , 0, 0) (A4) Step 2: Construct the matrix F and V, and calculate the value of the Jacobian matrix of f 0 and v 0 at the disease-free equilibrium, respectively.
We can easily find the disease-free equilibrium of the model (A1) (disease-free equilibrium) E 0 is The matrix F, V is the Jacobian matrix of f , v. Now, the matrix F, V is represented by a block matrix of 9 × 9. So, we can obtain F E 0 , V E 0 as follows.
Step 3: Obtain basic reproduction number where ρ represents the maximum eigenvalue.

Appendix A.3. Initial Population
According to the statistical results of all groups of people in Anhui in 2015, the initial value is shown in Table A1.

. Fitted Parameters
Details of the values and references can be found in Tables A2-A4. The details of other fitting parameters are shown in Table A5. The fitted curves of the two effective reproduction numbers G R e (t) and M R e (t) for four historical periods I, II, III and IV are shown in Figure A1.      Figure A1. The fitted curves of the two effective reproduction numbers G R e (t) and M R e (t) for four historical periods I (a), II (b), III (c) and IV (d), respectively.
Appendix A.5. Sensitivity Analysis on Therapeutic Effect There are two parameters related to the therapeutic effect: the post-treatment transmission rate (β T ) and the course of disease (1/θ T ). Next, we will discuss the effect of changes in these parameters on the effective reproduction number M R 0 (IV). We keep other parameters unchanged during this process. The sensitivity analysis results are shown in Figure A2b.  If only changing the infection rate after treatment at each infection stage, that is, reducing or increasing β T , the result is shown by the solid line in the Figure A2b. The abscissa is the multiple of the infection rate after the current baseline treatment (indicated by p, that is, the infection rate β T becomes p * β T after each stage of treatment), and the ordinate is the corresponding M R 0 (IV) value. The result shows that under the condition that other parameters remain unchanged, only when the infection rate of each infection stage after treatment simultaneously reduces to about 10% of the baseline infection rate after treatment β T can M R 0 (IV) drop below 1 (cyan solid line). Moreover, among the three infection stages, the infection rate in the second stage is the most sensitive to M R 0 (IV) (solid green line), while the infection rate after treatment in the third stage is the least sensitive to M R 0 (IV) (solid red line). This is because the course of the second stage is the longest among the three stages, and the patients in the third stage of infection have weaker sexual ability.
If only changing the course of the disease after treatment at each infection stage (1/θ T ), the result is shown as the dotted line in Figure A2b, where the abscissa is p, and it is the value of a multiple of the course parameter value after current baseline treatment. We are surprised to find that under the condition that other parameters remain unchanged, M R 0 (IV) is positively correlated with the course of disease after treatment, that is, the longer the course of disease after treatment, the higher the M R 0 (IV) will be. Moreover, only when the course of the disease after each stage of treatment is reduced to less than 25% of its baseline parameter value at the same time can M R 0 (IV) drop below 1 (cyan dotted line). The order of sensitivity to M R 0 (IV) in the course of the disease after treatment at different infection stages is consistent with the sensitivity to drug infectivity.
Appendix A.6. Sensitivity Analysis on the Number of Sexual Partners If other parameters remain unchanged, and only the number of annual sexual partners of various groups of people is changed, the result is shown in Figure A3a, where the abscissa is the multiple of the current parameter change, and the ordinate is the corresponding M R 0 (IV) value. The greater the slope of the straight line, the greater the influence of this parameter is on M R 0 (IV). The results show that the disease transmission can be effectively controlled when the annual number of sexual partners in all populations is reduced to 0.4 times the current number of sexual partners (cyan line). The order of the influence of the number of different groups annual sexual partners on M R 0 (IV) is: homosexual male (green line)> bisexual male (red line)> heterosexual people (the blue line, see the illustration on the Figure A3a). An interesting fact is that changing the number of heterosexual people's sexual partners has little impact on the spread of HIV/AIDS (blue line): almost negligible. The two types of people with homosexual sexual behavior (green line and red line) contributed a significant effect.
We further analyzed two types of MSM population and explored the impact of the number of sexual partners of bisexual men (c b ) and the number of heterosexual men (c mm ) on M R 0 (IV) ( Figure A3b). Traditionally, we believe that bisexual men play a bridge role in HIV transmission because they are not only involved in the transmission between homosexual groups but also in the transmission between heterosexual people. We would like to know what impact bisexual men would have on the HIV epidemic if they become purely homosexual or heterosexual. Figure A4a is the curve of the number of HIV-positive patients in the whole population of Anhui province from 2020 to 2025 over time after changing the nature of bisexual males. The results show that under the assumption that bisexual men are the pure MSM population, the total number of HIV/AIDS infections in Anhui province has shown a rapid upward trend, far exceeding the baseline status, and the number of surviving infections will increase to 38,171 by 2025 (95% CI: 34,694-42,078, the red curve in Figure A4a), and the number of new infections will also increase to 6333 accordingly. However, the resistance rate is lower than the baseline status and will decrease to 9.41% by 2025 (95% CI: 8.83-9.88%, the red line in Figure A4b). On the contrary, if you disconnect from male sex and become purely heterosexual, the number of HIV infections will be significantly lower than the baseline status by 2025, it will be as low as 26,957 by 2025 (95% CI: 25,890-28,633, Figure A4a green curve), and the number of new infections will reduce to 2465. Meanwhile, its resistance rate will exceed the baseline status, increasing to 10.59% (95% CI: 10.14-10.93%) (the green curve in Figure A4b) in 2025.
(a) (b) Figure A4. The impact of changing the sexual preferences of bisexual men on the on the number of infected people (a) and the rate of drug resistance (b). The blue box plots are the situation of no change, the green ones are the situation that disconnect from male sex, and the red ones are the situation that disconnect from female sex. The continuous blue, green and red curves are their median lines respectively.