Forecasting the Potential Number of Influenza-like Illness Cases by Fusing Internet Public Opinion

As influenza viruses mutate rapidly, a prediction model for potential outbreaks of influenza-like illnesses helps detect the spread of the illnesses in real time. To create a better prediction model, in addition to the traditional hydrological and atmospheric data, features such as popular search keywords on Google Trends, public holiday information, population density, air quality indices, and the numbers of COVID-19 confirmed cases were used to train the model. Furthermore, Random Forest and XGBoost were combined in the proposed prediction model to increase the prediction accuracy. The training data were historical data from 2016 to 2021. In our experiments, different combinations of features were tested. The results show that features such as popular search keywords on Google Trends, the numbers of COVID-19 confirmed cases, and air quality indices can improve the outcome of the prediction model. The evaluation results showed that the error rate between the predicted and actual numbers of influenza-like cases from Week 15 to Week 18 fell to less than 5%. The outbreak of COVID-19 in Taiwan began in Week 19 and resulted in a sharp rise in the number of clinic or hospital visits by patients with influenza-like illnesses. After that, from Week 21 to Week 26, the error rate between the predicted and actual numbers of influenza-like cases dropped to 13%. These experimental results indicate that the proposed ensemble learning prediction model can accurately predict the trend of influenza-like cases.


Introduction
The COVID-19 pandemic broke out at the end of 2019 and has spread all over the world at lightning speed. Large-scale pandemics, such as SARS, H1N1, influenza A, and MERS, have drawn much global attention to the damage that pandemics can bring to the world [1,2]. Due to the convenience of public transport, viruses can now easily spread to every corner of the world [3,4]. An influenza-like illness is any illness caused by a virus with symptoms similar to those caused by influenza viruses ("flu"), such as fever, respiratory symptoms, muscle pain, and fatigue. If such illnesses are not diagnosed as influenza, they are called influenza-like illnesses. In this research, an "influenza-like illness" is defined as a sudden onset of illness with a fever of 38 °C or more, accompanied by respiratory symptoms, muscular soreness, headache, or extreme fatigue, excluding mild rhinitis, tonsillitis, and bronchitis [5]. The results of many studies have shown a correlation between the survival rates and outbreak periods of most viruses and seasonal climate changes. Prel et al. [6] explored the effects of different climates on acute respiratory tract infections (ARI) and found that different

Related Works
In this section, related works on the factors affecting influenza-like illnesses and on machine learning models for predicting outbreaks of influenza-like illnesses are reviewed.

The Definition of Influenza-Like Illnesses
Influenza is an acute viral respiratory illness, often accompanied by fever, cough, headaches, muscle pain, and other symptoms. It is mainly transmitted from person to person by droplets produced while coughing or sneezing or by touching a contaminated object or surface. From clinical symptoms, routine examinations, and chest X-rays alone, it is impossible to accurately determine whether influenza-like symptoms, severe community-acquired pneumonia, or other similar illnesses are caused by influenza viruses or other pathogens [19][20][21]. There are four types of influenza viruses: influenza A, B, C, and D. However, influenza A (H1N1 and H3N2) and influenza B are the main influenza viruses that cause current seasonal influenza [22][23][24]. Although influenza cannot be easily distinguished in clinical diagnosis from other acute respiratory illnesses, such as the common cold, bronchitis, or viral pneumonia, it is usually more serious than the common cold, and its treatment lasts longer. Table 1 shows a comparison between influenza and the common cold [5].

Table 1. A comparison of influenza and the common cold [5].

| Item | Influenza | Common cold |
| --- | --- | --- |
| Pathogens | Influenza viruses | More than 200 viruses, such as the commonly seen respiratory syncytial viruses and adenoviruses |
| Modes of transmission | Droplet and contact transmission | Droplet and contact transmission |

Seasonal influenza is a respiratory illness that occurs when the human body is infected by influenza viruses. In most countries, periodic epidemics recur every year. The timing of seasonal influenza outbreaks differs between the southern and northern hemispheres. In the southern hemisphere, seasonal outbreaks occur between June and September every year, whereas in the northern hemisphere, they occur between November and March [5]. As Taiwan is in the northern hemisphere, its seasonal outbreaks occur between November and March (winter) every year.

The Selection of Training Features
There are four possible modes of transmission of influenza viruses [25]: (1) transmission through direct physical contact with an infected person; (2) transmission through media, usually inanimate objects (such as droplets on objects or surfaces); (3) transmission through droplets produced by an infected person through sneezing, coughing, etc., which reach the nasal cavity or oral mucosa of a recipient; and (4) transmission through particles of a radius of 2.5 µm propelled into the air by coughing or sneezing. Viruses can survive for a long time in particles that float in the air and be transmitted through these particles.
The relative importance of the four transmission modes is a controversial issue. Lowen et al. [26] used guinea pigs as mammalian test subjects to test the hypothesis that temperature and relative humidity affect the transmission rate of influenza viruses. They found that guinea pigs were very sensitive to influenza viruses that infect humans and that the pups of guinea pigs exposed to the viruses were more likely to be infected. Using a variety of relative humidity and temperature conditions and combinations of them, they evaluated the transmission rates of influenza viruses and found that transmission speeds depended on the temperature and relative humidity of the environment. Their findings support the hypothesis that meteorological conditions affect the spread of influenza viruses and help establish the link between meteorological factors and the spread and evolution of viruses, which had previously been uncertain.
In influenza-related prediction studies, outbreaks tend to be associated with climate and hydrological information. Prel et al. [6] explored the impacts of the climate on acute respiratory tract infection (ARI) hospitalization. Globally, ARI-related pneumonia is the leading cause of childhood deaths. It is worth noting that not all known ARI viruses cause epidemics in cold seasons, and many countries regard ARI as a common cold. The survival rates of ARI viruses may be influenced by cold air, but cold air is by no means the main factor that determines the survival of the viruses. Low temperature and other climatic factors may cause the viruses to increase their activity levels, adaptability, infection rates, and degrees of infection in virus hosts, pathogens, and the environment. For example, the activity levels of influenza A, respiratory syncytial viruses, and adenoviruses are related to temperature, and those of rhinoviruses are related to relative humidity. Cox and Subbarao [27] pointed out that influenza had an obvious and consistent seasonal distribution in temperate regions and that peak outbreak seasons in winter lasted 5-10 weeks, falling between November and March in the northern hemisphere and between May and September in the southern hemisphere. Yap et al. [28] proposed that in tropical and subtropical regions, influenza-prone periods varied greatly, and there might be several peak periods within a year. Chan et al. [7] investigated the relationship between influenza activity and two key meteorological factors, namely, temperature and relative humidity, in Hong Kong from 1997 to 2006.
There are many controversies about the impacts of wind speed on the spread of viruses. Xiao et al. [29] used multiple sets of climatic conditions to conduct their research and found that slow wind speeds helped the spread of influenza A virus pandemics. Sundell et al. [30] conducted a study on the impacts of four seasons on the transmission rates of influenza A virus pandemics in temperate climates. They speculated that when an infected person coughed, certain wind speed conditions helped spread particles of droplets that contained the virus for a longer time. In addition, wind speed can help lower outdoor temperature and reduce outdoor humidity. These two effects of wind speed increase the speed of the spread of influenza A virus pandemics. However, there are other scholars who do not consider that wind speed affects the spread of viruses. Peci et al. [31] used a variety of climatic factors to conduct their research. They found that there was no correlation between wind speed and any influenza virus test results, so wind speed did not affect influenza transmission.
Air quality is currently a public health issue. Air pollution is a by-product of a civilized society. Many studies have shown that air pollution causes a variety of diseases that are harmful to the human body [32][33][34]. There are significant interactions between different types of air pollutants and respiratory diseases. Influenza-like illnesses are respiratory illnesses. Influenza-like viruses spread through the air or through droplets, so suspended particles in the air are also one of the factors that affect influenza. Huang et al. [35] used the wavelet coherence analysis method to explore the possible correlation between suspended particles and influenza-like illnesses in Nanjing, China during a peak season of influenza. Their results showed that there was a significant correlation between PM2.5, PM10, and NO2 and influenza-like illnesses but that there was no such correlation among people over 25 years old. Contrary to Huang's finding, Feng et al. [36] found that PM2.5 had a positive correlation with influenza-like illnesses in all age groups, which was most evident for the age group of 25-29 years old, followed by the age group of 15-24 years old and then the 5-14 years old and the over 60 years old groups. It had the least impact on children under 5 years old. Su et al. [37] explored the potential relationship between air pollutants and influenza-like illnesses in Jinan, China. They found a potential correlation between PM2.5, PM10, and SO2 and peak periods of influenza-like illnesses. However, they found no correlation between NO2 and O3 and influenza-like illnesses. Xu et al. [38] discussed the impacts of air pollution and temperature on the occurrence of influenza cases among people aged between 0 and 14 years old in Brisbane, Australia. They used a regression model to analyze the correlation between occurrence rates of influenza cases in winter and air pollution and temperature. Their results showed that temperature was negatively correlated with occurrence rates of influenza cases and that high concentrations of O3 and PM10 had a significant correlation with occurrence rates of influenza cases. Therefore, O3 and PM10 are also important indicators when assessing occurrence rates of influenza cases.

Machine Learning Models for Predicting Outbreaks of Influenza-Like Illnesses
Cheng et al. [39] used four machine learning algorithms, namely, ARIMA, Random Forest, SVM, and XGBoost, to establish a real-time national system to monitor influenza outbreaks and predict influenza-like cases for a four-week period for the Taiwan Centers for Disease Control (CDC). To combine the prediction results of the four different machine learning models, a stacking ensemble learning method was used to form the final prediction model. Its most accurate prediction result for a week scored a MAPE of less than 0.75 and a hit rate of 0.75. Darwish et al. [40] used multiple machine learning and deep learning algorithms to establish a model to predict the number of influenza-like cases in Syria. The lowest MAPE of its prediction results was 3.52%, and the lowest RMSE was 0.01662. Chen et al. [41] used the Seasonal Autoregressive Integrated Moving Average (SARIMA) model to predict outpatient rates of influenza-like illnesses in Shenyang, China. The authors mentioned that the predicted values of influenza-like illnesses could be used as a reference for outbreaks of influenza-like cases in the short term, but other factors should be taken into consideration when forming strategies for influenza prevention and control. Hu et al. [42] proposed an IAT-BPNN model to predict the number of influenza-like illnesses in different regions of the United States. They used the artificial tree (AT) algorithm to train the model, which optimized the initial parameters of the BP neural network. They used BPNN, AT-BPNN, and IAT-BPNN in their experimental tests and comparisons. Their results showed that IAT-BPNN reduced the error rates and produced the most accurate predictions. Tapak et al. [43] used support vector machine (SVM), artificial neural network, and Random Forest time series models to predict weekly influenza-like illnesses in Iran. The results showed that the Random Forest time series models outperformed the other methods in simulating the weekly ILI frequencies.
Table 2 compares related works that use machine learning to predict outbreaks of influenza-like illnesses. The definition of influenza, the selection of features, and previously proposed influenza prediction systems have been discussed in this section. As stated above, some scholars have proposed different methods to build an influenza-like illness prediction system, but in terms of feature selection, few studies have included weather, air pollution factors, public holidays, and other data in their model training. Therefore, different features have been incorporated in the experiments and discussions of this research.

Prediction Framework
In this section, the methodology and techniques used in this research to predict outbreaks of influenza-like illnesses are described. The data used were hydrometeorological data taken from meteorological observations, statistics on emergency infectious diseases (influenza-like illnesses), keyword search volumes on Google Trends, air quality indices, total population, population density, and Taiwan public holiday information. XGBoost, Random Forest, SVR, and ensemble learning were selected for experiments and verifications. Descriptions of the experimental environment and related package versions are shown in Table 3.
The hydrometeorological data were taken from the ground weather observation stations and the automatic weather/rainfall observation stations of the Central Weather Bureau, Taiwan. The observation data came from two sources: the "Data Bank for Atmospheric and Hydrologic Research" in Taiwan for data up to April 2020, and the "Open Weather Data" in Taiwan for data from that date up to the date of this research. This research focused on predicting the number of influenza-like cases in counties and cities in Taiwan. However, as some counties and cities, such as Miaoli County and Chiayi County, had no ground weather observation stations of the Central Weather Bureau, data from the automatic weather/rainfall observation stations were used to fill in the missing data for these counties and cities.
Statistics on emergency infectious diseases (influenza-like illnesses) were taken from the "Taiwan National Infectious Disease Statistics System" of the Taiwan Centers for Disease Control (CDC), which contains the number of visits to hospital emergency departments by patients with influenza-like illnesses, for every age group, in every county/city, and in every week of the year.
Data on keyword search volumes were based on Google keyword searches. Various flu symptoms were selected as keywords, and their search volume values on Google were collected. The search volume values are relative values and refer to the popularity of a search term in a specific area within a specific period. The value range was set at [0, 100].
Monitoring data of the Environmental Protection Administration of the Executive Yuan in Taiwan were used as air quality indices in this research. Data on total population and population density were based on the statistical data of all counties, cities, towns, and villages in Taiwan as provided by the Department of Statistics of the Ministry of the Interior, Taiwan. The total population was taken from the population statistics, and the population density was taken from the population indicators. The Taiwan public holiday information was taken from the open government data platform at "data.gov.tw" (accessed on 1 August 2021).
Datasets required for this research were first imported from their sources. They were then pre-processed using their applicable data processing methods, and all the processed data were grouped into training and testing datasets. Assume that the prediction is made in week 0 (lag0) for the number of influenza-like cases in the following week. Because the features used in this research do not affect outbreaks in the same week but act with a delay of one week or longer, the data of the week before (lag1) were used in place of the week 0 data. Machine learning was then used to predict the number of influenza-like cases for the following week. Figure 1 shows the framework used in this research for predicting influenza-like cases. The following subsections discuss the techniques used and the reasons why they were chosen for this research.

Data Pre-Processing
The pre-processed datasets discussed in this section are the publicly available datasets mentioned in the section above. The features required for this research were selected from these datasets. The features contained in Table 4 are the original features used in this research. The features contained in Table 5 are features related to influenza-like symptoms that were subsequently added, as mentioned in Section 2.2, and the additional features are compared at the end of this section. PP, RH, TX, and WD may show abnormal negative values in the observation instruments for various reasons (please refer to Table 6). Anomalous values for PM10, PM2.5, SO2, O3, and NO2 are removed, as listed in Table 7.

| Feature | Unit | Description | Processing |
| --- | --- | --- | --- |
| PP | Millimeter (mm) | Precipitation | The minimum value of precipitation in this research is 0. All negative values that may be caused by instrumental and human factors are replaced by 0. After excluding outliers of a distance greater than three standard deviations, the average value for a week is calculated (using one week as a unit), and a log value is taken. |
| RH | Percentage (%) | Average relative humidity | After excluding outliers of a distance greater than three standard deviations, the average value for a week is calculated (using one week as a unit), and a log value is taken. |
| TX | Celsius (°C) | Average temperature | After excluding any average temperature of a negative value and obvious outliers of a distance greater than three standard deviations, the average value for a week is calculated (using one week as a unit), and a log value is taken. |
| TD | Celsius (°C) | Daily temperature differences | A week is taken as a unit. After the data for a week are tallied up, a log value is taken. |
| WD | Meter per second (m/s) | Average wind speeds | After excluding outliers of a distance greater than three standard deviations, the average value for a week is calculated (using one week as a unit), and a log value is taken. |
| ILI | Number of people | The number of influenza-like cases | The number of emergency visits by patients with influenza-like illnesses is obtained and tallied up for all age groups in each county/city. |
| HoC | | | The weekly number of public holidays in Taiwan in five years. |
| PIR | | | The statistics on total population in each county/city/town/village for every four quarters are obtained and converted into the weekly population data. |
| PD | Population density | Population density | The population indicators in each county/city/town/village for every four quarters are obtained and converted into the weekly population density data. |
| PDoI | Population density | Population density of influenza | The number of patients with influenza this week (ILI) is divided by the total population of a county/city (PIR) and then multiplied by the population density of the county/city (PD) to obtain the weekly population density of influenza cases. |
| PM10 | µg/m3 | Air quality index - PM10 | The PM10 data are obtained from the air quality index and computed to obtain weekly averages. |
| PM2.5 | µg/m3 | Air quality index - PM2.5 | The PM2.5 data are obtained from the air quality index and computed to obtain weekly averages. |
| SO2 | ppb | Air quality index - SO2 | The SO2 data are obtained from the air quality index and computed to obtain weekly averages. |
| O3 | | Air quality index - O3 | The O3 data are obtained from the air quality index and computed to obtain weekly averages. |
| NO2 | ppb | Air quality index - NO2 | The NO2 data are obtained from the air quality index and computed to obtain weekly averages. |
| Cov19 | Number of people | The number of confirmed cases of COVID-19 | The number of confirmed cases of COVID-19 is obtained and tallied up for all age groups in each county/city. |

Traces of rain
−9999: No data due to no observation
#: Indicates an invalid value after instrument checks
*: Indicates an invalid value after program checks
x: Indicates an invalid value after manual checks
NR: Indicates no rainfall
blank: Indicates no value
888: Indicates no wind
999: Indicates instrument failures

Lastly, to consolidate the knowledge about influenza-like illnesses, the features introduced in Tables 4 and 5 were processed differently according to the categories they belonged to. The hydrometeorological features were combined and calculated according to the observation stations of the county or city from which the data were collected. The minimum value of the PP data was 0, and there should be no negative values in the RH data. As to the TX data, only mountainous areas at a high altitude might experience temperatures below 0 °C in winter, while the rest of the areas should be above 0 °C, so negative values in this set of data were also excluded. There should be no negative values in the WD data either. All anomalous negative values in these four sets of data were caused by instrumental or human factors (please refer to Table 6); therefore, the negative values of the PP data were replaced by 0, and those of the RH, TX, and WD data were excluded due to their characteristics. Additionally, outliers of a distance greater than three standard deviations were excluded. Weekly average values were calculated (using one week as a unit), and log values were taken for the PP and RH data. TD was the most special among the hydrometeorological features. TD represented the difference between the highest and the lowest temperatures of TX, tallied on a weekly basis before a log value was taken. ILI represented the emergency infectious disease monitoring statistics for influenza-like illnesses, i.e., the aggregate number of emergency visits by patients with influenza-like illnesses for all age groups in each county/city.
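The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the code used in this research: the function names are hypothetical, the three-standard-deviation rule is assumed to be applied over the whole daily series, and `log1p` is assumed so that a week with zero precipitation stays finite (the text only states that a log value is taken).

```python
import math
from statistics import mean, pstdev


def replace_negatives_with_zero(daily_values):
    """PP rule: negative readings are treated as instrument errors and set to 0."""
    return [max(v, 0.0) for v in daily_values]


def weekly_log_means(daily_values, days_per_week=7, k=3.0):
    """Drop values more than k standard deviations from the series mean,
    average what remains in each week, and take log(1 + mean).

    Assumption: the outlier bounds come from the whole series, not per week.
    """
    mu = mean(daily_values)
    sigma = pstdev(daily_values)
    results = []
    for start in range(0, len(daily_values), days_per_week):
        week = daily_values[start:start + days_per_week]
        kept = [v for v in week if abs(v - mu) <= k * sigma]
        # A week whose readings were all excluded has no usable value.
        results.append(math.log1p(mean(kept)) if kept else None)
    return results
```

For RH, TX, and WD, the text excludes negative readings instead of replacing them; under the same assumptions, that would be a filter applied before `weekly_log_means`.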
ILI_D represented the weekly change in the number of emergency visits by patients with influenza-like illnesses, obtained by subtracting the number of visits in the week before from the number of visits in this week. Each feature value of GT was the numerical data of a Google Trends search volume within 5 years. The numerical values were relative values, not absolute values. HoC was the number of public holidays per week, from Sunday to Saturday, in Taiwan within 5 years. PIR was the statistical data on the total population in each county/city in Taiwan, converted from quarterly data into weekly units. PD was the statistical data on the population density of each county/city in Taiwan and was converted, like PIR, from quarterly data into weekly units. PDoI represented the value of ILI divided by PIR and multiplied by PD. As mentioned earlier, the air quality features were PM10, PM2.5, SO2, O3, and NO2. Anomalies (see Table 7) were excluded before weekly averages were calculated. Cov19 represented the number of confirmed cases of COVID-19. As the symptoms of COVID-19 are similar to those of influenza, the public often finds it difficult to tell whether they have caught influenza or COVID-19. The number of confirmed cases of COVID-19 is therefore considered to have an impact on the number of influenza-like illness cases and was taken into account in this research.
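The two derived features can be written out directly from their definitions. The sketch below is illustrative only; the function names are hypothetical, and the first week is given no ILI_D value because the text does not say how it is handled.

```python
def ili_weekly_change(ili_counts):
    """ILI_D: visits in a week minus visits in the week before.

    The first week has no predecessor, so it is returned as None (an assumption).
    """
    changes = [None]
    for i in range(1, len(ili_counts)):
        changes.append(ili_counts[i] - ili_counts[i - 1])
    return changes


def population_density_of_influenza(ili, pir, pd):
    """PDoI: ILI divided by the total population (PIR), multiplied by the
    population density (PD) of the same county/city and week."""
    return ili / pir * pd
```

For example, 50 visits in a county of 1,000,000 people with a density of 2000 people per unit area give a PDoI of about 0.1.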

Keyword Volumes from Google Trends
Google Trends displays the search volumes of users on the Google search engine within a specific geographic region as time series indices. A keyword search index is based on its search volume proportion, meaning that the search volume of a keyword is divided by the total search volume in the geographic region within a specific period so that the relative popularity of the keyword can be compared. The proportion of a keyword search in a designated region within a designated period is normalized to the range of [0, 100]. The maximum value is 100, and the minimum is 0 [45,46]. Figure 2 is the visual presentation of keyword search volumes of "common cold" on Google Trends, adjusted to show the search popularity of the keyword in Taiwan, as the designated region, in the last 5 years. Google Trends can also show the search popularity of a keyword in each sub-region of a specific geographic area within a specific period. The search popularity in this research was calculated in the range of [0, 100]. The sub-region in which the keyword showed the highest popularity in the relevant geographic region was marked 100. A sub-region in which the keyword was half as popular as in the top sub-region was marked 50. A sub-region in which the search volume of the keyword was insufficient was marked 0. Figure 3 presents the search popularity of the keyword "common cold" in each sub-region of Taiwan over 5 years. The sub-regions are the counties and cities in Taiwan, such as Taipei City, New Taipei City, and Taichung City. Figure 3 shows that the keyword "common cold" is most searched in New Taipei City, whose search popularity is marked 100.
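The normalization described above can be illustrated with a small sketch. It is not Google's actual procedure, which also involves sampling and thresholds that are not described here; the function name and the rounding to integers are assumptions made for illustration.

```python
def normalize_interest(keyword_counts, total_counts):
    """Scale keyword search shares so that the largest share becomes 100.

    keyword_counts[i] / total_counts[i] is the share of all searches that
    used the keyword in period (or sub-region) i.
    """
    shares = [k / t if t else 0.0 for k, t in zip(keyword_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        # No searches anywhere: every entry is marked 0.
        return [0] * len(shares)
    return [round(100 * s / peak) for s in shares]
```

With shares of 1%, 2%, and 0.5%, the second entry is the peak and is marked 100, while the others are marked 50 and 25.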

Lag Features
Individual data in a data series are arranged according to their time sequence (e.g., a second, minute, day, week, or month apart) in a chronological order in the time series. Time series can be divided into two types: systematic and non-systematic time series. A non-systematic time series contains random data changes, called noise. A systematic time series is divided into two types: a trend and a periodic time series. A trend time series refers to a trend of changes according to time periods, e.g., linear or exponential increases or decreases. A periodic time series refers to periodic changes, e.g., seasonal increases or changes in peak and off-peak seasons.
The datasets used in this research to predict the weekly numbers of influenza-like cases were all time series data. The data were organized into weekly units, as the features did not affect the number of influenza-like cases on the same day but usually had a delayed impact on the number of cases a week or more later. As shown in Table 8, assuming that the target prediction week of this research is Week 13 of 2021, the data of Week 11 (Lag1) of 2021 are used to fill in the data of Week 12 (Lag0), so that the data of Week 11 are used to predict the number of influenza-like cases for Week 13.
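The lag arrangement in this example can be sketched as a pairing of feature rows with target weeks. The helper below is hypothetical and assumes that both lists are ordered by week with no gaps.

```python
def make_lagged_pairs(weekly_features, weekly_cases, lag=1):
    """Pair each target week's case count with features observed earlier.

    For a target at index t, the prediction week (lag0) is t - 1, and the
    features come from index t - 1 - lag. With lag=1 and weeks 11, 12, 13,
    the Week 11 features are paired with the Week 13 case count.
    """
    pairs = []
    for target in range(lag + 1, len(weekly_cases)):
        pairs.append((weekly_features[target - 1 - lag], weekly_cases[target]))
    return pairs
```

Weeks at the start of the series that have no sufficiently old features are simply skipped in this sketch.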

Periodicity
In this section, the steps of feature selection are discussed. There were 34 features in total selected for this research, as mentioned in the previous section. Considering the periodicity of influenza-like illnesses, one-hot encoding was applied to the year and week numbers of the time data. There are 6 years from 2016 to 2021, and there are 52 or 53 weeks in each of those years, as shown in Table 9.

Table 9. The illustration of using one-hot encoding to process the time data of the numbers of years and weeks to deal with periodicity (columns: Year, Week).
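A minimal sketch of this encoding is shown below. The function names are hypothetical, and the layout (6 year slots followed by 53 week slots) is an assumption based on the year range and week counts stated above.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def encode_year_week(year, week, first_year=2016, last_year=2021, max_weeks=53):
    """Concatenate a one-hot year vector and a one-hot week vector.

    Weeks are numbered from 1, so week w occupies slot w - 1.
    """
    year_part = one_hot(year - first_year, last_year - first_year + 1)
    week_part = one_hot(week - 1, max_weeks)
    return year_part + week_part
```

Week 13 of 2021 becomes a 59-element vector with a 1 in the last year slot and a 1 in the thirteenth week slot.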

Machine Learning Models
Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Support Vector Regression (SVR), and ensemble learning were selected as training models in this research. Random Forest reports the relative importance of features, which facilitates the verification of hypotheses and ideas. XGBoost and Random Forest are based on similar principles, as both combine multiple decision trees to predict results more accurately. SVR differs from these two and was used as a control group. The final prediction results of the models were compared with the past data to obtain the accuracy rates. In this research, the outputs of the XGBoost and Random Forest models were combined by ensemble learning to form the final prediction results.
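The text does not state how the two sets of predictions are combined. One common choice is a weighted average, sketched below purely as an assumption; the function name and the default equal weighting are hypothetical.

```python
def combine_predictions(rf_predictions, xgb_predictions, rf_weight=0.5):
    """Blend two models' predictions with a weighted average.

    Assumption: equal weights by default. The actual combination rule used
    in this research is not specified in the text.
    """
    if len(rf_predictions) != len(xgb_predictions):
        raise ValueError("both models must predict the same weeks")
    return [
        rf_weight * rf + (1.0 - rf_weight) * xgb
        for rf, xgb in zip(rf_predictions, xgb_predictions)
    ]
```

Other combination schemes, such as stacking with a second-level model, would replace the averaging step while keeping the same inputs.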

eXtreme Gradient Boosting (XGBoost)
XGBoost [47] is a scalable machine learning system for tree boosting that extends and improves on the Gradient Boosted Decision Tree (GBDT). At each round, a new tree is added to correct the errors of the trees built so far, while the existing trees are retained. This is known as additive training, the formula for which is shown in Equation (1). Features are split each time a tree is constructed, and four methods are used to find the splits.
Method (1): An exact greedy algorithm is used to obtain the best split point. All features are enumerated, each candidate value of a feature is tried as a split point, and the corresponding score is calculated in accordance with Equation (2). The larger the value, the more the loss decreases, so the best split point is the one with the largest value.
Method (2): An approximate algorithm proposes sets of candidate split points from the quantiles of the feature distributions. Continuous features are placed into buckets delimited by these candidate points, the statistics of the samples in each bucket are accumulated, and the best split point is found from the accumulated values.
Method (3): A weighted quantile algorithm is used to deal with data that cannot be loaded in one go and with the low efficiency of the greedy algorithm.
Method (4): A sparsity-aware algorithm is used when a dataset is sparse, as most datasets contain missing values. Only the non-missing entries are used to find the split at a node, and each node determines directly which branch the samples with a missing value for that feature should be assigned to [48].
XGBoost is used to solve both classification and regression problems. It generates a set of classification and regression trees (CART). Each leaf of a CART corresponds to a score, which is used as the basis for the prediction.
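Equations (1) and (2) are not reproduced in this text. As a point of reference only, the standard forms given in the XGBoost paper [47] are sketched below, on the assumption that they correspond to the equations cited above. The symbols are those of that paper rather than of this study: f_t is the tree added at round t, G and H are the sums of the first- and second-order gradients of the loss over the samples in the left (L) and right (R) child nodes, and λ and γ are regularization parameters.

```latex
% Additive training: the prediction at round t adds a new tree f_t
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

% Gain of a candidate split into left (L) and right (R) child nodes
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A larger gain means a larger reduction in the regularized loss, which matches the selection rule described for Method (1).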

Random Forest (RF)
Random Forest [49] consists of a collection of decision tree classifiers. Each classifier is generated using a random vector sampled independently from the input vector. A bagging algorithm is used, i.e., samples are drawn at random from the training data to train multiple classifiers, each on its own features or feature combinations. Gini coefficients are used to select features: they measure the impurity of a feature with respect to the categories, and the split with the smallest Gini value is chosen. For classification, the trees vote on the final result. For regression, the outputs of the trees are averaged to obtain the result [50].
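The Gini-based split selection described above can be illustrated with a minimal sketch. The function names are hypothetical and are not taken from this study's implementation. Note also that Gini impurity applies to categorical targets; for a regression target such as weekly case counts, tree libraries generally use a variance-based criterion instead.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate split; smaller is better."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    return (
        len(left_labels) / n * gini_impurity(left_labels)
        + len(right_labels) / n * gini_impurity(right_labels)
    )
```

A split that separates the classes perfectly has a weighted Gini of 0, so it would be preferred over any split that leaves the child nodes mixed.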

Support Vector Regression (SVR)
SVR [51] is an extension of the support vector machine (SVM) that handles continuous prediction problems. As in the classification setting, SVR is characterized by the use of kernel functions, sparse solutions, and VC control of the margin and of the number of support vectors. One of the main advantages of SVR is that its computational complexity does not depend on the dimensionality of the input space. It has an excellent generalization ability and a high prediction accuracy [52].

Model Evaluation
RMSLE was used in this research to measure the performance of the models. RMSLE is the RMSE computed on log-transformed values. Like MSPE and MAPE, it considers relative errors, but the RMSLE error curve is asymmetrical: an under-prediction is penalized more heavily than an over-prediction of the same size. The closer its value is to 0, the smaller the prediction error. The RMSLE formula is shown in Equation (3). The error rates of the influenza-like prediction results were calculated as the percentage difference between the predicted number of cases and the actual number of confirmed cases of influenza-like illnesses. The formula for calculating the prediction error rates is shown in Equation (4).
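Equations (3) and (4) are not reproduced in this text, so the following is a minimal sketch of the two measures under stated assumptions. The RMSLE function uses the usual log(1 + x) form. The error rate is assumed to be the absolute difference divided by the actual count, expressed as a percentage, which may differ in detail from Equation (4). The function names are illustrative.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using the common log(1 + x) form."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def error_rate_percent(actual, predicted):
    """Assumed form of the error rate: |predicted - actual| / actual * 100."""
    if actual == 0:
        raise ValueError("actual count must be non-zero")
    return abs(predicted - actual) / actual * 100.0
```

For example, predicting 190 cases when 200 were recorded gives an error rate of about 5% under this assumed definition.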

Model Testing and Adjustments
During model testing and adjustment, much time was spent in the data pre-processing stage on the numerical processing of the data and on sorting out the data from different time periods, for example excluding outliers lying more than three standard deviations from the mean, replacing negative PP values with 0, and excluding other anomalous negative values. After multiple adjustments, the processing method that produced the better results was selected. During the adjustment process, it was discovered that the number of predicted cases dropped suddenly. While anomalies were being excluded in the pre-processing stage, it was found that data were missing from the Hsin-Wu Observation Station datasets obtained from the "Data Bank for Atmospheric and Hydrologic Research" in Taiwan. These were subsequently supplemented with data taken from the "Open Weather Data" in Taiwan, which resolved the problem of the sudden drop in the predicted results. During the consolidation of the data, the data for Miaoli County and Chiayi County were also found to be missing, so data from the automatic weather/rainfall observation stations of the Central Weather Bureau were used to fill them in.
The data taken from 10 remote observation stations, such as Wu-Fen-Shan Radar Station and An-Bu-Peng-Jia-Yu, were first excluded, and the data taken from stations located in more populated areas were retained. Features such as temperature differences (TD), differences in the number of influenza-like cases (ILI-D), and the number of public holidays in Taiwan per week (HoC) were then added. The process of adding and adjusting the various features was recorded. The first step was to add features such as daily temperature differences (TD) and the number of influenza-like cases from the previous week. The results showed that some predictions moved closer to the actual number of confirmed cases, while others deviated further from it. However, the average difference between the predicted results and the actual confirmed cases of influenza-like illnesses was slightly smaller after these features were added than before. In the second step, the data taken from remote observation stations were excluded, and the number of Taiwan public holidays (HoC) was added. After the features had been added and adjusted, the best parameters for the models of this research were selected. The predicted results were compared with the results in step two and were better, with the average differences further reduced [39,53].
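The two cleaning rules mentioned above can be sketched as follows. This is an illustration only: it assumes that the three-standard-deviation rule is applied around the mean of each variable, and the function names are hypothetical rather than taken from this study's code.

```python
from statistics import mean, pstdev


def drop_three_sigma_outliers(values):
    """Keep only the values within three standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so there is nothing to exclude.
        return list(values)
    return [v for v in values if abs(v - mu) <= 3 * sigma]


def clip_negative_precipitation(pp_values):
    """Replace negative precipitation (PP) readings with 0."""
    return [max(0.0, v) for v in pp_values]
```

With only a handful of observations, a single extreme value inflates the standard deviation so much that it can never exceed the three-sigma threshold, so a rule of this kind is only meaningful on reasonably long series.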
Previous studies have shown that relative humidity affects the spread and survival rates of influenza viruses. Shaman and Kohn [25] reported that absolute humidity had a more pronounced impact on the transmission and activity of viruses than relative humidity. In temperate regions, both indoor and outdoor absolute humidity follow strong seasonal cycles. These seasonal cycles are consistent with the increases in virus activity and transmission in winter and can be used to explain the seasonality of influenza. Therefore, differences in absolute humidity provide a single, coherent, and more physical explanation for the observed changes in the activity, transmission, and seasonality of influenza viruses in temperate regions. However, absolute humidity was not included as a feature in this research, as there were insufficient hydrometeorological data to calculate it.
As to the feature of average temperature differences, Suntronwong et al. [54] explored the relationships among influenza viruses, influenza activity, and meteorological variables. After analyzing average temperature differences, relative humidity, and accumulated rainfall, they found that all flu activity was positively correlated with average temperature, relative humidity, and rainfall. Kamigaki et al. [55] found that in the Philippines, average temperature differences were positively correlated with respiratory infections. Therefore, average temperature differences were added as a feature in this research.
As to the choice of data for Week 0 (Lag0), suppose that the number of influenza-like cases is to be predicted for Week 1. As there were no complete data for Week 0, there were two options. Option 1 was to use the known data from the week before Week 0 to fill in the data for Week 0. Option 2 was to use the average of the data from the preceding month to fill in the data for Week 0. After testing, parameter adjustment, and comparison with historical data, it was found that Option 1 brought the predicted results closer to the actual confirmed cases than Option 2. Option 1 was therefore adopted in this research.
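The two options for filling Week 0 can be sketched as follows. The representation of each week as a dictionary of feature values, the function names, and the four-week window used for "the month before" are assumptions made for illustration.

```python
def fill_lag0_previous_week(history):
    """Option 1: reuse the most recent complete week as the Week 0 row."""
    return dict(history[-1])


def fill_lag0_monthly_mean(history, weeks_per_month=4):
    """Option 2: average each feature over the preceding month of weeks."""
    recent = history[-weeks_per_month:]
    return {
        key: sum(row[key] for row in recent) / len(recent)
        for key in recent[0]
    }


# Hypothetical weekly rows, oldest first (TX: temperature, RH: relative humidity).
history = [
    {"TX": 20.0, "RH": 70.0},
    {"TX": 22.0, "RH": 72.0},
    {"TX": 24.0, "RH": 74.0},
    {"TX": 26.0, "RH": 76.0},
]
week0_option1 = fill_lag0_previous_week(history)
week0_option2 = fill_lag0_monthly_mean(history)
```

Option 1 keeps the most recent conditions unchanged, whereas Option 2 smooths them, which may explain why Option 1 tracked the actual cases more closely when conditions were changing from week to week.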
Features mentioned in Section 2.2 were also selected. Features, such as the popularity or search volumes of keywords selected from influenza-like symptoms on Google Trends, various air quality indices, the data on total population and population density in each county/city, and the number of confirmed cases of COVID-19 were also adopted as features in this research. All these factors have effects on the outbreaks of influenza-like cases. These features were used in our experiments as discussed in the following section.

Experimental Design
In terms of predicting the number of clinic or hospital visits by patients with influenza-like illnesses, it is known from previous studies that the number of visits is affected by atmospheric and hydrological factors, such as precipitation (PP), relative humidity (RH), temperature (TX), temperature differences (TD), and wind speed differences (WD). A new factor that has not been included in previous predictions of influenza-like cases, keyword search volumes on Google Trends, was added as a feature in this research. The use of search volumes on Google Trends is the highlight of this research. The popularity of keyword searches for influenza-like illnesses and symptoms in specific regions within specific periods was obtained from search volumes on Google Trends. The popularity values range from 0 to 100: the larger the number, the more often the keyword was searched on Google, and the smaller the number, the less often it was searched. In addition, the number of Taiwan public holidays was added as a feature, the data taken from remote observation stations were excluded, and the total population and population density were added as features. Air quality indices were added as features as well, as according to previous studies they also affect the speed at which influenza-like illnesses spread.
In Section 2.2, factors that affect the transmission rates of influenza-like illnesses were discussed. Some scholars consider that wind speed is not significantly related to the spread of influenza-like illnesses. Others have studied air quality indices and found that air pollution is significantly related to outbreaks of influenza-like illnesses. The symptoms of COVID-19 are very similar to those of influenza-like illnesses, and most people cannot tell them apart. Additionally, symptoms vary from person to person, which makes it even harder for people to identify for themselves which virus has caused their illness. Therefore, the number of COVID-19 confirmed cases was included as a feature in this research. In order to evaluate and select features, various combinations of features were tested in this research through training, testing, and subsequent evaluation.
Random Forest, XGBoost, SVR, and ensemble learning, which are among the most commonly used models for data prediction, were used to construct models in this research to predict the number of influenza-like cases. Random Forest and XGBoost were used to carry out the prediction, and SVR was used for comparison. Random Forest and XGBoost can output the weights of the features and are convenient for verifying hypotheses and ideas. The two models use similar principles for prediction but, generally speaking, XGBoost is more accurate in its prediction outputs than Random Forest. For this reason, an ensemble learning model combining Random Forest and XGBoost was used in this research to obtain more accurate results. In addition to these three models (Random Forest, XGBoost, and their ensemble), SVR was used to predict results for comparison. As the SVR prediction results were worse than those of the other three, only the ensemble learning model of Random Forest and XGBoost was used as the prediction model. The SVR prediction results were not adopted.
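The text does not state how the Random Forest and XGBoost outputs are combined. The following minimal sketch assumes the simplest scheme, a weighted average of the two sets of predictions with equal weights by default. The function name and the `rf_weight` parameter are hypothetical.

```python
def ensemble_average(rf_predictions, xgb_predictions, rf_weight=0.5):
    """Blend two prediction lists; rf_weight=0.5 gives a plain average (assumed)."""
    if len(rf_predictions) != len(xgb_predictions):
        raise ValueError("prediction lists must have the same length")
    if not 0.0 <= rf_weight <= 1.0:
        raise ValueError("rf_weight must lie between 0 and 1")
    return [
        rf_weight * rf + (1.0 - rf_weight) * xgb
        for rf, xgb in zip(rf_predictions, xgb_predictions)
    ]


# Hypothetical weekly case predictions from the two models.
blended = ensemble_average([100.0, 200.0], [110.0, 180.0])
```

Averaging tends to cancel errors that the two models make in opposite directions, which is one plausible reason for an ensemble to outperform either model alone.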

Datasets
The data used in this research were all taken from the open data provided by various government agencies or from data publicly available on the Internet, including the hydrometeorological data, the emergency infectious disease monitoring statistics on influenza-like illnesses, the Google Trends search volume data, the Taiwan public holiday information, the data on air quality indices, and the data on the total population and population density.
A total of 35 features were selected in this research, as listed in Tables 4 and 5. With the periodic week features (52 or 53 weeks per year over 6 years), which were converted by one-hot encoding, and the four columns "City," "Year," "Month," and "Week," there were 97 features in total. As discussed in Section 2.2, previous scholars have proposed that various hydrometeorological factors and air quality indices have either significant or insignificant effects on the number of influenza-like cases. Therefore, in our experiments, models were trained with different combinations of features using the above-mentioned research methods and processes. The effects of the different feature combinations are discussed at the end of this section.
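The one-hot encoding of the week number can be sketched as follows, assuming one indicator column per week of the year and a maximum of 53 weeks. The function name is illustrative.

```python
def one_hot_week(week, n_weeks=53):
    """Encode a 1-based week number as a list of n_weeks 0/1 indicators."""
    if not 1 <= week <= n_weeks:
        raise ValueError("week must be between 1 and n_weeks")
    return [1 if index == week else 0 for index in range(1, n_weeks + 1)]


week_15 = one_hot_week(15)
```

Encoding the week in this way lets a tree model treat each week of the year as its own category rather than as a number on a linear scale, which suits the periodic nature of seasonal illnesses.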

Comparisons of Different Combinations of Features
As mentioned before, features used in this research were adjusted many times before being finalized. Some previous studies have suggested that many features directly affect the number and the spread of influenza-like cases, whereas some are not significantly related to outbreaks of influenza-like illnesses. Various combinations of features were therefore used in this research to conduct training, testing, and subsequent evaluation, as shown in Table 10.

| Combinations | Features Used |
| --- | --- |
| Org_df | Atmospheric hydrological data, the number of influenza-like cases, Google Trends search volumes: influenza, public holidays, as listed in Table 4 |
| GT_df | Similar to Org_df, but adding the features of 17 keywords relating to influenza-like symptoms on Google Trends, such as runny nose, common cold, sore throat, etc., also adding the population data |
| GT_noWD_df | Similar to GT_df but excluding the feature of wind speed differences (WD) |
| AQI_df | Similar to GT_df but adding air quality indices, e.g., PM10, PM2.5, NO2, SO2, and O3 |
| AQI_noWD_df | Similar to AQI_df but excluding the feature of wind speed differences (WD) |
| Covid_noWD_df | Similar to AQI_noWD_df but adding the feature of the number of COVID-19 confirmed cases |

Two evaluation indicators, RMSLE and the prediction error rate, were used for the performance comparison of the models. As mentioned above, various feature combinations were used for training. The numbers of influenza-like cases were predicted from Week 15 to Week 28 of 2021. Next, performance evaluations were performed on the six different feature combinations. The results are shown in Tables 11-16. Each week, the number of influenza-like cases was predicted with the three models and evaluated by RMSLE. The error rates for these 14 weeks were then averaged. Table 17 shows a comparison of the performance evaluation of the six feature combinations.
According to the error rates of all feature combinations, most of the error rates of SVR were higher than those of XGBoost and Random Forest. As a result, the SVR-predicted results were not used in this research, and only the predicted results of XGBoost and Random Forest were used. Table 17 shows the overall comparison of the average evaluation results of XGBoost and Random Forest for each feature combination. When the RMSLE values were used for comparison, the three feature combinations with the lowest error rates were Covid_noWD_df, GT_noWD_df, and AQI_noWD_df. Therefore, in Section 4.3, only these three prediction results are discussed.

Evaluation Results and Discussion
In terms of results, the ensemble learning model combining Random Forest and XGBoost was used as the predictive model, and the data of the week before Week 0 were used as the data of Week 0 to predict the number of influenza-like cases for Week 1. As mentioned in the previous section, the three feature combinations with the smallest error rates were Covid_noWD_df, GT_noWD_df, and AQI_noWD_df. The results of these three feature combinations are discussed next.
The weekly numbers of influenza-like cases for 14 weeks of 2021 were examined and predicted in this research. As seen from Figures 4-6, the prediction results from Week 15 to Week 18 were close to the actual number of influenza-like cases, and the prediction error rate for that period was only about 5%. From Week 19, the difference between the predicted and the actual number of influenza-like cases was large, and the prediction error rate increased to 50%. The reason for the increase in the error rate was that in Week 19, there was an outbreak of COVID-19 in Taiwan, and the number of confirmed cases of COVID-19 surged. The symptoms of COVID-19, such as fever, coughs, and fatigue, are similar to those of influenza. The incubation period of influenza is 1-4 days, and it takes approximately 2-14 days for COVID-19 symptoms to appear [56]. Therefore, it is difficult for people to tell quickly whether they have caught COVID-19 or an influenza-like virus. According to the guidelines issued by the Taiwan Centers for Disease Control, all patients who exhibit symptoms of COVID-19 or influenza-like illnesses must report to the relevant authorities, take appropriate protective measures, and seek medical treatment. When people find that they have such symptoms, they choose to seek medical treatment directly, which leads to an increase in the number of patients with influenza-like illnesses. Since then, the increase in confirmed cases of COVID-19 in Taiwan has gradually slowed, and the number of clinic or hospital visits by patients with influenza-like illnesses has also decreased. From Week 21 to Week 26, the predicted and the actual numbers of influenza-like cases gradually converged, and the prediction error rate for that period was reduced to 13%.

Conclusions and Future Work
Due to the convenience of public transport, viruses can easily spread to every corner of the world these days, as can be seen in the COVID-19 pandemic. The symptoms of COVID-19, SARS, H1N1, and Influenza A are very similar. Before such illnesses are confirmed as being caused by influenza viruses, they are called influenza-like illnesses. If the potential number of influenza-like illness cases can be predicted early and accurately, the predicted results can help the government, hospitals, pharmacies, and companies prepare quickly for the spread of influenza-like cases, make informed decisions, and take preventive measures. In this research, an ensemble learning approach fusing the Random Forest and XGBoost learning models is proposed. Multiple features, such as the hydrometeorological data, the emergency infectious disease monitoring statistics on influenza-like illnesses, Google Trends keyword search volumes, the Taiwan public holiday information, the population density, average temperature differences, air pollution indices, and the number of COVID-19 confirmed cases, were used in the proposed model.
In our experiments, the weekly numbers of influenza-like cases were predicted for 14 weeks in 2021. The experimental results were compared with the actual numbers of influenza-like cases. The error rate for the period from Week 15 to Week 18 was within 5%. In Week 19, there was a sudden surge in the number of influenza-like cases. According to the seasonal flu periods in Taiwan, outbreaks do not occur in summer. Nevertheless, the number of COVID-19 confirmed cases suddenly increased at that time. It is speculated that this is because people cannot tell for sure whether they have a common cold, a flu-like illness, or COVID-19. Moreover, the Taiwan Centers for Disease Control requires all patients who exhibit symptoms of COVID-19 or influenza-like illnesses to seek medical treatment. This led to an increase in the number of patients with influenza-like illnesses being reported at that time. Three weeks later, the increase in confirmed cases of COVID-19 in Taiwan gradually slowed, and the number of clinic or hospital visits by patients with influenza-like illnesses also decreased. The prediction error rate between the predicted and the actual number of influenza-like cases for that period was reduced to 13%. The experimental results showed that our proposed ensemble learning approach could accurately predict the number of influenza-like cases. The outcomes of our experiments can be practically useful and widely applied, as they can provide an early warning of an influenza outbreak. The proposed model can be used to respond to possible threats from these illnesses in a timely manner, to allocate medical resources reasonably to reduce morbidity and mortality, and to reduce the risk of transmission of these illnesses.
In the future, the model will be built into a prediction system, which will be provided to the government, hospitals, pharmacies, and companies to predict the number of influenza-like illness cases at any time. This will enable them to understand quickly how influenza-like cases are likely to spread, so that they can make informed decisions and take preventive measures. Through the government and hospitals, it can also help the public learn of potential large-scale outbreaks of influenza-like illnesses in the near future, so that they can take measures to protect their own health and safety.