Semi-Supervised Text Classification Framework: An Overview of Dengue Landscape Factors and Satellite Earth Observation

In recent years there has been an increasing use of satellite Earth observation (EO) data in dengue research, in particular for the identification of landscape factors affecting dengue transmission. Summarizing landscape factors and satellite EO data sources, and making the information public, is helpful for guiding future research and improving health decision-making. A review of the literature would appear to be an appropriate tool for this purpose, but it is not easy to use. The review process mainly includes defining the topic, searching, screening at both title/abstract and full-text levels, and data extraction; it requires consistent expert knowledge and is time-consuming and labor-intensive. In this context, this study integrates the review process, text scoring, an active learning (AL) mechanism, and bidirectional long short-term memory (BiLSTM) networks, and proposes a semi-supervised text classification framework that enables the efficient and accurate selection of relevant articles. Specifically, text scoring and BiLSTM-based active learning were used to replace the title/abstract screening and full-text screening, respectively, which greatly reduces the human workload. In this study, 101 relevant articles were selected from 4 bibliographic databases, and a catalogue of essential dengue landscape factors was identified and divided into four categories: land use (LU), land cover (LC), topography and continuous land surface features. Moreover, various satellite EO sensors and products used for identifying landscape factors were tabulated. Finally, possible future directions for applying satellite EO data in dengue research were proposed in terms of landscape patterns, satellite sensors and deep learning. The proposed semi-supervised text classification framework was successfully applied to research evidence synthesis and could be easily applied to other topics, particularly in an interdisciplinary context.


Introduction
According to the World Health Organization (WHO), dengue affects over half of the global population, with an estimated 100-400 million infections each year worldwide [1]. In recent years, dengue has [...]
Int. J. Environ. Res. Public Health 2020, 17, 4509
[...] classification framework of literature by integrating the review process and text classification algorithms and provides an overview of dengue landscape factors and satellite EO data. The proposed framework allows for rational and effective selection of literature relevant to our objective from bibliographic databases.

Towards a Semi-Supervised Classification Framework of Literature
The semi-supervised text classification framework, which integrates the review process and semi-automatic text classification (Figure 1), includes: (1) defining the research question and specifying the inclusion criteria (Section 2.1); (2) conducting a broad search and removing the duplicates (Section 2.2); (3) screening titles and abstracts based on text scoring (Section 2.3); (4) preparing relevant and irrelevant samples, and conducting the BiLSTM-based active learning (Section 2.4); (5) verifying the performance of text scoring and BiLSTM-based active learning (Section 2.5); and (6) extracting dengue landscape factors and satellite EO data and charting the results (Section 2.6).
The text scoring in step 3 is needed to remove the records that are definitely irrelevant to our topic, which also reduces the amount of data for the BiLSTM-based active learning in step 4. It should be noted that the BiLSTM model was developed and implemented based on titles and abstracts, which differs from the full-text assessment in the eligibility step of the review. The detailed information is presented hereafter. No ethics approval is needed, as this method is based on published journal articles.

Research Question and Inclusion Criteria
The objective of this study is to provide an overview of landscape factors related to dengue transmission and of the satellite EO data used in the identification of dengue landscape factors. Relevant records should satisfy the following criteria: (1) being an original journal article published in English; (2) highlighting landscape factors derived from satellite EO data or geographic information system (GIS) techniques; (3) being applied to dengue cases or dengue vectors; (4) modelling or correlating dengue with landscape factors. These criteria were defined based on our objective and expert knowledge, and were used for text scoring and record sample selection for BiLSTM models.

Broad Searches and Removal of Duplicates
The searches were performed from inception to 31 December 2019 in four databases: ScienceDirect, Web of Science, PubMed and Scopus, by considering the titles and abstracts of English journal articles. The queries were formed by combining dengue-related terms (i.e., dengue and Aedes) with words related to "remote sensing", "landscape" and "weather" (i.e., remote sensing, satellite, earth observation, landscape, land cover, land use, household, dwelling, habitation, precipitation and temperature) using the Boolean operator "AND" (see more details in Table A1). All search records were combined and the duplicate records were removed using the MySQL database. The remaining records were organized in alphabetical order for further analysis.
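The query construction and duplicate removal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study used MySQL for deduplication and the exact query strings are those of Table A1, whereas here each dengue-related term is simply paired with each context term, and the `normalize_title` duplicate key and the sample records are hypothetical.

```python
import re
from itertools import product

# Term lists taken from the search description above.
DENGUE_TERMS = ["dengue", "Aedes"]
CONTEXT_TERMS = [
    "remote sensing", "satellite", "earth observation", "landscape",
    "land cover", "land use", "household", "dwelling", "habitation",
    "precipitation", "temperature",
]

def build_queries(left, right):
    """Pair every dengue-related term with every context term using AND."""
    return [f'"{a}" AND "{b}"' for a, b in product(left, right)]

def normalize_title(title):
    """Hypothetical duplicate key: lower-case, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep the first record per normalized title, sorted alphabetically."""
    seen = {}
    for rec in records:
        seen.setdefault(normalize_title(rec["title"]), rec)
    return sorted(seen.values(), key=lambda r: normalize_title(r["title"]))

queries = build_queries(DENGUE_TERMS, CONTEXT_TERMS)

# Invented records standing in for exports from several databases.
records = [
    {"title": "Dengue and Land Cover in a City", "source": "Scopus"},
    {"title": "dengue and land-cover in a city.", "source": "PubMed"},
    {"title": "Aedes habitats from satellite imagery", "source": "Web of Science"},
]
unique = deduplicate(records)
```

Matching on a normalized title is a deliberately simple choice; a production workflow would more likely also compare DOIs, years and author lists.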

Text Scoring
To efficiently eliminate the definitely irrelevant records, we used text weighting and text scoring to rank all the records. First, we pre-set some terms $KEY_i$ ($i = 1, \ldots, m$) and their priority levels (i.e., high, medium and low) (Table 1) according to the criteria in Section 2.1. Each of them was randomly assigned a weight value $WEIGHT_i$ ($i = 1, \ldots, m$) from the interval of weights that was set according to its priority level. The higher the priority level of a term, the greater its weight value. We then extracted the key terms $K_j$ ($j = 1, \ldots, n$) and their corresponding weight values $W_j$ ($j = 1, \ldots, n$) from the title and abstract using the Natural Language Toolkit (NLTK) in Python. The score of a record was calculated by summing, over every key term $K_j$ that contains a pre-set term $KEY_i$, the product of the two weights: $Score = \sum_{(i,j):\, KEY_i \subseteq K_j} WEIGHT_i \times W_j$ ($i = 1, \ldots, m$; $j = 1, \ldots, n$). For example, suppose that keyword extraction using NLTK yields two key terms for a bibliographic record, "dengue" and "satellite", with weights $W(\text{dengue})$ and $W(\text{satellite})$, and that the weights randomly assigned to these two terms according to Table 1 are 8 and 5. In this case, the score of this text is $W(\text{dengue}) \times 8 + W(\text{satellite}) \times 5$.
All the records were then ranked in decreasing order according to the scores, and the top 1000 records were selected and merged into a subset denoted as $U_k$. Finally, we iterated the second step 20 times, each time with newly assigned random weights, and the records in the 20 subsets $U_k$ ($k = 1, \ldots, 20$) were combined and used for the next analysis. It should be noted that the random assignment of weights allows multiple iterations of text scoring, which should make the results more reliable.
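The scoring and ranking procedure above can be illustrated with the following minimal Python sketch. Several parts are assumptions made for illustration only: the pre-set terms and weight intervals are invented (the real ones are in Table 1), relative word frequency stands in for the NLTK-based keyword weights $W_j$, and the function names are hypothetical.

```python
import random
import re
from collections import Counter

# Hypothetical weight intervals and pre-set terms; the study's own are in Table 1.
PRIORITY_INTERVALS = {"high": (7, 9), "medium": (4, 6), "low": (1, 3)}
PRESET_TERMS = {"dengue": "high", "satellite": "medium",
                "landscape": "medium", "temperature": "low"}

def extract_key_terms(text):
    """Stand-in for keyword extraction: relative word frequency as W_j."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()} if words else {}

def draw_weights(rng):
    """Randomly draw WEIGHT_i from the interval of each term's priority level."""
    return {t: rng.uniform(*PRIORITY_INTERVALS[p]) for t, p in PRESET_TERMS.items()}

def score(text, weights):
    """Sum WEIGHT_i * W_j over key terms K_j that contain a pre-set term KEY_i."""
    total = 0.0
    for term, w_j in extract_key_terms(text).items():
        for key, weight_i in weights.items():
            if key in term:
                total += weight_i * w_j
    return total

def candidate_union(records, top_n=1000, iterations=20, seed=0):
    """Union of the top-ranked subsets U_k over repeated random weightings."""
    rng = random.Random(seed)
    selected = set()
    for _ in range(iterations):
        weights = draw_weights(rng)
        ranked = sorted(range(len(records)),
                        key=lambda i: score(records[i], weights), reverse=True)
        selected.update(ranked[:top_n])
    return selected

# Invented titles; top_n is reduced to 1 so the tiny example is meaningful.
records = [
    "Dengue risk mapped from satellite land cover",
    "Temperature trends in alpine lakes",
    "A survey of database indexing methods",
]
kept = candidate_union(records, top_n=1)
```

Taking the union over iterations means a record is kept if it ranks highly under at least one random weighting, which favours recall over precision at this stage.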

BiLSTM-Based Active Learning
To efficiently and accurately select relevant records in the absence of sufficient labelled samples, we performed a BiLSTM-based active learning based on the titles and abstracts of the records derived from text scoring (Figure 1).
Prior to training the BiLSTM model (see more details in Appendix C) [23], we created an initial training dataset by selecting 15 relevant samples and 30 irrelevant samples from the results of text scoring based on the criteria in Section 2.1. The initial training dataset was used to train the BiLSTM model.
Based on the word embedding derived from the unlabelled data using the Word2Vec CBOW model [24] (see more details in Appendix B), the BiLSTM model was used to identify the "potential" records from unlabelled data, which were then manually labelled as either relevant or irrelevant based on the four criteria in Section 2.1. Meanwhile, we improved the training dataset by combining the newly selected relevant records with the previous relevant samples, and by randomly selecting irrelevant records from the results of text scoring in order to keep the ratio of relevant to irrelevant samples at 1:2. Finally, the BiLSTM model was re-trained using the new training dataset to identify the potential records from the remaining unlabelled data. The parameters of the BiLSTM were updated in each round, starting from the results of the previous round. BiLSTM learning and active learning were alternately implemented until we could not find any more relevant records.
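The alternation between model training and manual labelling can be sketched as the loop below. To keep the example self-contained, a simple word-overlap scorer (`OverlapModel`, a hypothetical stand-in) replaces the BiLSTM, the `oracle` function plays the role of the human reviewer, and irrelevant samples are drawn from the already labelled irrelevant records; none of these choices is taken from the study's implementation.

```python
import random
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

class OverlapModel:
    """Stand-in classifier (NOT a BiLSTM): flags a text when its words occur
    more often in relevant than in irrelevant training samples."""

    def fit(self, relevant, irrelevant):
        self.pos = Counter(w for t in relevant for w in tokens(t))
        self.neg = Counter(w for t in irrelevant for w in tokens(t))
        return self

    def is_potential(self, text):
        return sum(self.pos[w] - self.neg[w] for w in set(tokens(text))) > 0

def active_learning(pool, seed_relevant, seed_irrelevant, oracle,
                    seed=0, max_rounds=10):
    rng = random.Random(seed)
    relevant, irrelevant = list(seed_relevant), list(seed_irrelevant)
    labelled = set(relevant) | set(irrelevant)
    unlabelled = [t for t in pool if t not in labelled]
    for _ in range(max_rounds):
        # Keep relevant : irrelevant at 1:2 where enough irrelevant samples exist.
        n_neg = min(len(irrelevant), 2 * len(relevant))
        model = OverlapModel().fit(relevant, rng.sample(irrelevant, n_neg))
        potential = [t for t in unlabelled if model.is_potential(t)]
        labels = {t: oracle(t) for t in potential}  # manual labelling step
        found = [t for t in potential if labels[t]]
        if not found:  # stop when no new relevant record is found
            break
        relevant += found
        irrelevant += [t for t in potential if not labels[t]]
        unlabelled = [t for t in unlabelled if t not in labels]
    return relevant, unlabelled

# Invented titles for illustration.
pool = [
    "dengue incidence and land cover from satellite images",
    "aedes breeding sites and land use mapped by satellite",
    "dengue vectors and vegetation index from landsat",
    "stock market prediction with neural networks",
    "protein folding with molecular dynamics",
    "satellite orbit determination",
]
selected, remaining = active_learning(
    pool,
    seed_relevant=pool[:1],
    seed_irrelevant=pool[3:5],
    oracle=lambda t: "dengue" in t or "aedes" in t,
)
```

In this toy run the record on satellite orbits is proposed by the model but rejected by the reviewer, which mirrors how manual labelling corrects the classifier between rounds.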

Inclusion, Performance and Rationality
Because all the algorithms were implemented based on the titles and abstracts, we evaluated the full-texts of the records derived from BiLSTM-based active learning for final inclusion of the articles that met the criteria in Section 2.1. In fact, bibliographic databases might misclassify some records as English journal articles and store their English titles and abstracts.
To verify the performance of the text scoring and BiLSTM-based active learning algorithms, we randomly selected 10% of the unlabelled records remaining after BiLSTM-based active learning and manually labelled them as either relevant or irrelevant. This step was iterated three times. Moreover, to verify the rationality of text scoring and BiLSTM-based active learning, we computed the number of relevant records per score rank interval. Generally, the more relevant a record is to the topic in question, the greater the possibility it will receive a high score.
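Both checks described above are straightforward to express in code. The sketch below is illustrative only: the function names, the interval width and the toy labels are assumptions, and the three 10% samples are drawn independently of one another, which the text does not specify.

```python
import random

def relevant_per_rank_interval(ranked_labels, interval):
    """Count relevant records in consecutive score-rank intervals.

    ranked_labels: booleans ordered from the highest to the lowest score.
    """
    return [sum(ranked_labels[i:i + interval])
            for i in range(0, len(ranked_labels), interval)]

def sample_for_check(unlabelled_ids, fraction=0.10, repeats=3, seed=0):
    """Draw `repeats` random samples, each covering `fraction` of the records."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(unlabelled_ids)))
    return [rng.sample(unlabelled_ids, k) for _ in range(repeats)]

# Toy relevance labels, already sorted by decreasing score.
labels = [True, True, False, True, False, False,
          True, False, False, False, False, False]
counts = relevant_per_rank_interval(labels, interval=4)  # [3, 1, 0]

samples = sample_for_check(list(range(50)))
```

A decreasing sequence of counts, as in the toy output, is the pattern that would support the claim that higher scores go with more relevant records.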

Information Extraction and Analysis
The satellite EO data and landscape factors were extracted manually and synthesized narratively in two ways: (1) charting the dengue landscape factors and their typologies in order to appraise the current situation, regardless of the differences in study areas, methods and materials; (2) tabulating the key characteristics of satellite EO data.
Table 2 presents the number of records for each step of semi-supervised text classification. A total of 13,893 bibliographic records were obtained after the broad search, and 7696 records were included after the removal of duplicates. Then, based on text scoring, we identified 2034 possible records, and 131 records that met the inclusion criteria in Section 2.1 were included after the BiLSTM-based active learning. Finally, by reading the full texts, we included 101 articles (see more details in Appendix C). The non-English articles (e.g., Chinese, Spanish and Portuguese) and non-journal articles (e.g., book chapters, reviews or conference papers) were excluded.
Table 3 presents the results of each cycle of BiLSTM-based active learning. Evidently, all the relevant records were identified after the fourth cycle. Throughout the process of semi-supervised text classification, we manually evaluated 1056 titles/abstracts (Table 3). Moreover, the accurate and rational identification of relevant records is indicated by the following two facts. First, no relevant records were found by manually evaluating the records selected randomly from the unlabelled dataset (i.e., 925 records after BiLSTM-based active learning). This indicated a good performance of the semi-supervised text classification. Second, although each record probably received different scores in the 20 text scoring experiments, the number of relevant records per score rank interval showed a consistent decreasing trend (Figure 2). This indicated the rationality of text scoring using the preset terms and priority levels, that is, the more relevant a record is to the topic in question, the greater the possibility it will receive a high score.
The accurate and rational identification of relevant records can be explained by the following facts: (1) A clear topic was defined. In fact, modelling or correlating dengue epidemiological or entomological variables with landscape factors in different geographic contexts often includes the identification of landscape factors, landscape characterization and spatio-temporal analysis of dengue cases or vectors. This interdisciplinary topic provides evident features that support the definition of appropriate inclusion criteria. These criteria then help to define terms and priority levels for text scoring and active learning. (2) The union of the results of the 20 text scoring experiments enables the inclusion of as many potential records as possible, and excludes a large share of the irrelevant records. (3) BiLSTM has proved to be especially useful in understanding the context of words [23], and active learning based on clear and appropriate inclusion criteria allows for the accurate selection of relevant records and for the control of the balance of positive and negative samples in the training datasets for each cycle of BiLSTM learning.
Moreover, it should be noted that other models are possible, such as BiLSTM with an attention mechanism (AC-BiLSTM) [16] or a combination of CNN and LSTM (C-LSTM) [25], which might also achieve a high accuracy of text classification.

Dengue Landscape Factors
Due to the different study objectives, study areas and spatio-temporal scales, it is difficult to compare the 101 selected studies to find any underlying common viewpoints on the role of landscape factors in dengue transmission. The detailed landscape factors for each study are listed in Table A2. Here, we simply grouped these landscape factors into four categories according to the study [26] (Figure 3):

1. Land cover (LC) refers to the physical and biological cover over the land surface, including built-up areas, vegetation, water/wetlands, open land and savannah. Among them, vegetation often has an association with the vectors' behaviours and biological cycles, which could be linked with the spatial and temporal dynamics of vectors or the potential resting and breeding sites. Water and wetlands often provide information of places of stagnant water, which are potential breeding sites for dengue vectors.

2. Land use (LU) refers to a territory characterized by current and future planned functional or socio-economic purposes, including agricultural areas, commercial areas, construction areas, industrial areas, ponds, religious areas, residential areas, transport, unused areas, urban areas and rural areas. LU types not only indicate whether the areas are favourable to vector breeding, but also provide information of human behaviour and activities in the areas, the levels of human-Aedes encounters, dispersal of mosquitoes and people movement, which are significantly related to dengue epidemics.

3. Topographic factors may provide a proxy of habitat suitability or climate conditions, including elevation, aspect, slope, drainage network, and flow accumulation.

4. Spatially continuous land surface features include spectral indices of vegetation, water and built-up areas (e.g., normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), vegetation fraction index (VFC), normalized difference water index (NDWI), and normalized difference built-up index (NDBI)). Moreover, land surface temperature (LST) refers to a measure of radiative skin temperature of the land surface, which is a significant factor affecting the dengue transmission.
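Several of the spectral indices listed in the fourth category share the same normalized-difference form, which the short Python sketch below makes explicit. The band combinations are the commonly used definitions, not ones stated in the reviewed studies: NDVI from near-infrared (NIR) and red, NDWI in its green/NIR variant (a NIR/SWIR variant also exists), and NDBI from shortwave infrared (SWIR) and NIR. The reflectance values are invented for illustration.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def ndvi(nir, red):
    """Normalized difference vegetation index."""
    return normalized_difference(nir, red)

def ndwi(green, nir):
    """Normalized difference water index (green/NIR variant, assumed here)."""
    return normalized_difference(green, nir)

def ndbi(swir, nir):
    """Normalized difference built-up index."""
    return normalized_difference(swir, nir)

# Invented surface reflectances for a vegetated pixel.
pixel = {"green": 0.08, "red": 0.05, "nir": 0.45, "swir": 0.20}
indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "NDWI": ndwi(pixel["green"], pixel["nir"]),
    "NDBI": ndbi(pixel["swir"], pixel["nir"]),
}
```

For this vegetated example NDVI is high while NDWI and NDBI are negative, which is the contrast that makes such indices useful as continuous proxies for vegetation, water and built-up surfaces.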

Satellite Earth Observation Data
Among the 101 included articles, only 64 studies used satellite EO data. Table 4 presents the satellite EO sensors, derived products and spatio-temporal resolutions used for identifying dengue landscape factors in the selected studies. Evidently, for LU/LC mapping, most studies used very fine (i.e., pixel size < 10 m) and fine (i.e., 10 m ≤ pixel size < 100 m) spatial resolution data, including multi-spectral bands derived from [...]. [...] the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) mission were used to extract topographic features. For continuous land surface features, Moderate Resolution Imaging Spectroradiometer (MODIS) products with coarse resolution (i.e., 1000 m ≤ pixel size < 10,000 m) and moderate resolution (i.e., 100 m ≤ pixel size < 1000 m) were widely used. In addition, some EO data with fine resolution (i.e., 10 m ≤ pixel size < 100 m) have also been used, such as data from Landsat 5, 7 and 8, SPOT 5 and GaoFen-1.
Although satellite EO sensors and products are listed here, we do not discuss what should be considered when choosing satellite EO data and making effective use of them. This is an important issue, especially for non-specialized users. Hamm et al. [26] proposed that spatio-temporal scales, uncertainty, spatial quality of EO data and the interaction between uncertainty in EO and disease data should be considered when using EO data for the study of neglected tropical diseases (NTDs) (e.g., echinococcosis, schistosomiasis and leptospirosis). These considerations are also useful for evaluating EO data in dengue research.
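The spatial resolution classes used in this section can be written as a small lookup, which may help when sorting sensors into the categories of Table 4. The thresholds are those given in the text; the function name and the example pixel sizes (30 m for Landsat 8 multispectral bands, 250 m and 1000 m for MODIS products) are added for illustration.

```python
def resolution_class(pixel_size_m):
    """Map a pixel size in metres to the resolution class used in the text."""
    if pixel_size_m <= 0:
        raise ValueError("pixel size must be positive")
    if pixel_size_m < 10:
        return "very fine"
    if pixel_size_m < 100:
        return "fine"
    if pixel_size_m < 1000:
        return "moderate"
    if pixel_size_m < 10000:
        return "coarse"
    return "unclassified"  # beyond the ranges defined in the text

examples = {
    "Landsat 8 (30 m)": resolution_class(30),
    "MODIS (250 m)": resolution_class(250),
    "MODIS (1000 m)": resolution_class(1000),
}
```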

In Terms of Landscape Patterns
More in-depth landscape features (e.g., compositional and configurational patterns) could be explored in future studies. Our previous studies characterized forest/non-forest landscapes by computing various landscape metrics and established their links with malaria cases in order to understand the contribution of Amazon deforestation to human-vector contact [28,29]. We found very few examples that used landscape metrics in dengue epidemiology, although these metrics have been widely applied in the assessment of LULC changes.

In Terms of Satellite Sensors
LU/LC mapping has continued to be an important research area in recent years, in particular urban LU/LC mapping. Gong et al. [30] proposed the two-level essential urban land use categories (EULUC) and achieved preliminary results at 30 m resolution in China for 2018 using Sentinel-2 images, Luojia night-time light data, mobile phone locating request data and point of interest (POI) data. According to our findings (Figure 3), EULUC classes were mostly related to dengue transmission (e.g., residential, commercial, industrial and transportation). Global essential urban land use maps with fine spatial resolution could be useful for landscape-related studies of dengue. Moreover, developing LU/LC maps and integrating them for dengue research in tropical and subtropical regions is difficult due to the presence of clouds and cloud shadows. Synthetic aperture radar (SAR) signals can penetrate such barriers, and SAR images have recently been used for vector-borne disease applications [31,32]. However, we found no specific study that used SAR data in dengue research.

In Terms of Deep Learning
Deep learning frameworks have been increasingly used to predict dengue outbreaks. Many studies have used weather data (e.g., temperature, wind speed, precipitation, humidity), population data and previous dengue cases in deep learning models [33,34]. More recently, one study extracted landscape features (e.g., buildings, roads, trees, crops, waterways and standing water) from high resolution satellite EO data using CNN models and transfer learning, and added them to a time series prediction of dengue outbreaks based on weather data and population density in order to improve prediction performance [35]. This would be a new direction that is practical for identifying landscape factors with limited labelled data, understanding the landscape-dengue relationships or improving the deep learning-based temporal prediction of dengue risk.

Conclusions
Satellite EO has been increasingly used in dengue research over the past years, especially for the identification of dengue landscape factors. During that time, various types of landscape factors were considered, the study areas and research objectives became more complex, and the variety and volume of satellite EO data kept growing. There is an increasing need to know what dengue landscape factors have been studied and which of them have been derived from satellite EO data. In this study, by integrating the review process, AL mechanism, text scoring and BiLSTM model, we propose a semi-supervised text classification framework that enables the efficient evaluation of records retrieved from bibliographic databases and accurately selects the articles relevant to the research objective. Using the proposed approach, 101 relevant articles were efficiently selected from bibliographic databases. Among them, 64 articles used satellite EO data. Valuable information on dengue landscape factors and current satellite EO data was reported. A catalogue of essential dengue landscape factors was identified and divided into four categories: LU, LC, topography and continuous land surface features. These factors were considered as direct or indirect proxies of Aedes breeding and resting sites, human-Aedes encounters, human mobility and virus replication in dengue transmission. Moreover, future research directions on how to integrate satellite EO data in dengue research were proposed in terms of landscape patterns, satellite sensors and deep learning. This study is an important step towards an efficient method for research evidence synthesis that could be easily applied to other topics, particularly in an interdisciplinary context.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix B. Word Embedding
Word2Vec [24] is based on deep learning, which could learn grammar and semantic information from a large amount of unlabelled data. The Word2Vec Continuous Bag-Of-Words (CBOW) model maps each word to an $N$-dimensional word vector by training, and can calculate the similarity between word vectors to represent the semantic similarity of the text. The Word2Vec CBOW architecture predicts the current word based on the context. The input layer is composed of one-hot encoded input context words $X_1, \ldots, X_C$, where the window size is $C$, the vocabulary size is $V$ and the hidden layer is an $N$-dimensional vector. The final output layer is the output word $y$, which is also one-hot encoded. The one-hot encoded input vectors are connected to the hidden layer by a $V \times N$ weight matrix $W$, and the hidden layer is connected to the output layer by an $N \times V$ weight matrix $W'$.
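The CBOW forward pass described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the weights are random rather than trained, the hidden layer averages the context words' rows of $W$ (the usual CBOW convention, not stated in the text), and the output is normalized with a softmax. The function names are hypothetical.

```python
import math
import random

def cbow_forward(context_ids, W, W_prime):
    """Forward pass of CBOW.

    W: V x N input weight matrix (row i is the vector of word i).
    W_prime: N x V output weight matrix.
    Returns the hidden vector and the probability of each vocabulary word.
    """
    n = len(W[0])
    v = len(W_prime[0])
    # Averaging one-hot inputs and multiplying by W equals averaging rows of W.
    h = [sum(W[c][k] for c in context_ids) / len(context_ids) for k in range(n)]
    scores = [sum(h[k] * W_prime[k][j] for k in range(n)) for j in range(v)]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return h, [e / z for e in exps]

def cosine(u, w):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, w))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in w)))

rng = random.Random(0)
V, N = 5, 3  # toy vocabulary size and embedding dimension
W = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

# Predict the centre word (index 2) from its four neighbours.
h, probs = cbow_forward([0, 1, 3, 4], W, W_prime)
similarity = cosine(W[0], W[1])
```

After training, the rows of `W` are the word vectors whose cosine similarity is used as a measure of semantic similarity.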

Appendix C. Bidirectional Long Short-Term Memory Model
Generally, LSTM-based RNNs consist of three gates: one input gate $i_t$ with corresponding weight matrices $W_{xi}$, $W_{hi}$, $W_{ci}$ and bias $b_i$; one forget gate $f_t$ with corresponding weight matrices $W_{xf}$, $W_{hf}$, $W_{cf}$ and bias $b_f$; and one output gate $o_t$ with corresponding weight matrices $W_{xo}$, $W_{ho}$, $W_{co}$ and bias $b_o$. The operation can be summarized as the process of forgetting old information and memorizing new information in the state of the cell, so that information useful for subsequent operations is passed on and useless information is discarded. The hidden layer state $h_t$ is output at each time step. In the process, all gates are set to generate some parameters, using the current input $x_t$, the state $h_{t-1}$ that the previous step generated, and the current state of the cell $c_{t-1}$ (peephole), for the decisions on whether to take the inputs, forget the memory stored before, and output the state generated later. The computation follows the gate update equations of the LSTM. The BiLSTM uses two independent LSTMs to process the data in both directions and then connects the two final output vectors from both directions.
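For reference, the gate computations described above can be written in the commonly used peephole LSTM form shown below. This is a standard textbook formulation given as an assumption, not a transcription of the study's own equations: $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and the candidate-cell parameters $W_{xc}$, $W_{hc}$ and $b_c$ are not named in the text. Note that this form lets the output gate peek at the updated cell state $c_t$, whereas the description above mentions $c_{t-1}$ for all gates.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right) \\
y &= \left[\, \overrightarrow{h}_T \,;\, \overleftarrow{h}_1 \,\right]
\end{aligned}
```

The last line expresses the bidirectional step for a sequence of length $T$: the forward LSTM's final state $\overrightarrow{h}_T$ and the backward LSTM's final state $\overleftarrow{h}_1$ are concatenated into a single vector $y$ that is passed to the classification layer.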