Visually-Enabled Active Deep Learning for (Geo) Text and Image Classiﬁcation: A Review

: This paper investigates recent research on active learning for (geo) text and image classiﬁcation, with an emphasis on methods that combine visual analytics and/or deep learning. Deep learning has attracted substantial attention across many domains of science and practice, because it can ﬁnd intricate patterns in big data; but successful application of the methods requires a big set of labeled data. Active learning, which has the potential to address the data labeling challenge, has already had success in geospatial applications such as trajectory classiﬁcation from movement data and (geo) text and image classiﬁcation. This review is intended to be particularly relevant for extension of these methods to GISience, to support work in domains such as geographic information retrieval from text and image repositories, interpretation of spatial language, and related geo-semantics challenges. Speciﬁcally, to provide a structure for leveraging recent advances, we group the relevant work into ﬁve categories: active learning, visual analytics, active learning with visual analytics, active deep learning, plus GIScience and Remote Sensing (RS) using active learning and active deep learning. Each category is exempliﬁed by recent inﬂuential work. Based on this framing and our systematic review of key research, we then discuss some of the main challenges of integrating active learning with visual analytics and deep learning, and point out research opportunities from technical and application perspectives—for application-based opportunities, with emphasis on those that address big data with geospatial components.


Introduction
Big data are leading to dramatic changes in science (with the advent of data-driven science) and in society (with potential to support economic, public health, and other advances).Machine leaning and deep learning technologies are central to leveraging big data for applications in both domains.Recent advances in machine learning and especially in deep learning, coupled with release of many open source tools (e.g., Google TensorFlow [1]-an open-source software library for machine intelligence), creates the potential to leverage big data to address GIScience and Remote Sensing (RS) research and application challenges.But, doing so requires an in-depth understanding of the methods, their limitations, and strategies for overcoming those limitations.Two primary goals for this paper are: (1) to synthesize ideas and results from machine learning and deep learning, plus visual analytics, and (2) to provide a base from which new GIScience and RS advances can be initiated.
Machine learning (ML) and deep learning (DL), where DL is a sub-domain of ML, are increasingly successful in extracting information from big data (when mentioned together subsequently, we use the abbreviation of M&DL).The primary focus of research in M&DL has thus far been accurate results, often at the expense of human understanding of how the results were achieved [2][3][4][5][6].However, accurate results often depend on building large human-generated training data sets that can be expensive in both financial and person cost to create [7][8][9][10][11][12][13]. As a result, there remain several impediments to broader adoption of M&DL, along with a range of concerns about potential negative outcomes related to the explainability of results produced.We agree here with a range of authors who have pointed to the need for human-in-the-loop strategies to both improve performance of the methods for complex problems and to increase explainability of the methods and their results [2,4,5,11,[13][14][15].There is a clear need for methods that allow human decision-makers to assess when to accept those results and when to treat them with caution or even skepticism.
Further, we contend that advances in visual analytics offer a broad framework for addressing both the performance and explainability needs cited above.Visual analytics provides systems that enable analytical reasoning about complex problems [16].They accomplish this through close coupling of computational data processing methods with visual interfaces designed to help users make efficient choices: in building training data, in parameterizing and steering computational methods, and in understanding the results of those methods and how they were derived (further details about why and how visual analytics can aid M&DL, are elaborated in Section 3.2).
One rapidly developing ML method, active learning (Section 3.1), aims at achieving good learning results with a limited labeled data set, by choosing the most beneficial unlabeled data to be labeled by annotators (human or machine), in order to train and thus improve ML model performance [17,18].Active deep learning (Section 3.4) is a method introduced to help cope with the tension between the typical DL requirement to have a very large gold standard training set and the impracticality of building such a big training set initially in domains that require expertise to label training data.As we elaborate below, recent developments in visual analytics offer strategies to enable productive human-in-the-loop active learning.
In this paper, we argue specifically for taking a visual analytics approach to empowering active deep learning for (geo) text and image classification; we review a range of recent developments in the relevant fields that can be leveraged to support this approach.Our contention is that visual anaytics interfaces can reduce the time that domain experts need to devote to labeling data for text (or image) classification, by applying an iterative, active learning process.We also contextualize the potential of integrating active learning, visual analytics, and active deep learning methods in GIScience and RS through discussion of recent work.
Here, we provide a road map to the rest of the paper.Section 2 outlines the scope of this review and our intended audience.Section 3, is the core of the paper, focused on synthesizing important and recent developments and their implications and applications.Here, we focus on recent advances in several subfields of Computer Science that GIScience and RS can leverage.Specifically, we examine and appraise key components of influential work in active learning (Section 3.1), visual analytics (Section 3.2), active learning with visual analytics (Section 3.3), and active deep learning (Section 3.4), respectively.In Section 4, we review recent GIScience and RS applications in (geo) text and image classification that take advantage of the methods from one or a combination of different fields covered in Section 3. The paper concludes in Section 5 with discussion of key challenges and opportunities-from both technical (Section 5.2.1) and application (Section 5.2.2, particularly for GIScience and RS) perspectives.The paper covers a wide array of recent research from multiple domains with many cross-connections.Given that text must present the sub-domains linearly, we start with a diagrammatic depiction of the domains and their relations to preview the overall structure of the review and the key connections.Specifically, Figure 1 illustrates the links between different fields covered in this paper and the flows that can guide the reader through the core part of this review.To provide background for readers (particularly those from GIScience and RS) who are new to M&DL, in Appendix A, we introduce essential terms and the main types of classification tasks in M&DL.

Scope and Intended Audience
The potential to bring the advances in M&DL to GIScience is reflected in a fairly long history of work on spatial and spatio-temporal data mining.In 2001, for example, Han and Miller [19] provided a broad introduction to data mining and knowledge discovery methods for geographic data.In a second edition in 2009 [20], with reversed authorship, multivariate spatial clustering was discussed and visual exploration and explanation in geospatial analysis was touched upon.Directed to a broader audience, Han et al. [21] provided one of the most highly cited introductions to data mining; the third edition includes an introduction to active learning (Section 3.1) and briefly introduces neural networks (the core technology of DL), but visual analytics (Section 3.2) is not mentioned.Even though they include an introduction to data visualization and visual data mining, Han and colleagues' focus is on traditional data visualization methods for understanding data prior to making decisions on data mining methods and for understanding outcomes of data mining, not on the more integrated visual-computational approaches that characterize advances in visual analytics.Thus, their visual data mining approach, while it does propose leveraging visualization advances in productive ways, is comparable to ideas introduced in the late 1990s (e.g., [22,23]); it does not focus on visual interfaces to enable human input to the data mining process or on support of human reasoning about that process.
In work that complements that cited above, Guo and Mennis [24] also investigated spatial data mining and geographic knowledge discovery, where they briefly reviewed several common spatial data mining tasks, including spatial classification and prediction, spatial cluster analysis, and geovisualization.The authors argued that data mining is data-driven, but more importantly, human-centered, with users controlling the selection and integration of data, choosing analysis methods, and interpreting results-it is an iterative and inductive learning process.Guo and Mennis pointed out that handling big and complex spatial data and understanding (hidden) complex structure are two major challenges for spatial data mining.To address these challenges, both efficient computational algorithms to process large data sets and effective visualization techniques to present and explore complex patterns from big spatial data, are required.In earlier work outside the GISience context, Fayyad et al. [25] emphasized the potential role of information visualization in data mining and knowledge discovery.They proposed that the next breakthroughs will come from integrated solutions that allow (domain) end users to explore their data using a visual interface-with the goal being to unify data mining algorithms and visual interfaces , and thereby to enable human analysts to explore and discover patterns hidden in big data sets.
The main goals of this review paper, building on the long term GIScience interest in ML, are to: (1) survey recent work on active learning, DL, and active DL to provide suggestions for new directions built upon these evolving methods, and (2) bring active learning, DL, active DL, and complementary developments in visual analytics to GIScience, and by doing so extend the current GIScience "toolbox".
Through the synthesis of multiple rapidly developing research areas, this systematic review is relevant to multiple research domains, including but not limited to GIScience, computer science, data science, information science, visual analytics, information visualization, image analysis, and computational linguistics.This paper does not attempt to review pure/traditional active learning (see Figure 2, which illustrates a typical pool-based active learning cycle); for classic and recent reviews of these topics, see: [26,27].A survey aimed at making active learning more practical for real-world use can be found in [28]; a survey from the perspective of natural language processing (NLP) can be found in [29]; and a survey of active learning in multimedia annotation and retrieval can be found in [30].Our review focuses on investigating methods that extend and/or integrate active learning with visual analytics and DL for (geo) text and image classification, specifically on the two parts of the active learning cycle highlighted in Figure 3.

The State of the Art: Active Learning, Visual Analytics, and Deep Learning
As outlined above, leveraging the potential of DL to increase classification accuracy (for images or text) requires extensive amounts of manually labeled data.This is particularly challenging in domains requiring experts with prior knowledge that is often tacit [18,[31][32][33][34][35]-in such cases, even crowdsourcing [36], such as Amazon Mechanical Turk [37,38], will not help much.In this section, we review several techniques that are central to addressing this challenge-in particular, active learning (Section 3.1), visual analytics (Section 3.2), active learning with visual analytics (Section 3.3), and active deep learning (Section 3.4).

Active Learning (AL)
In this section, we introduce the core ideas and concepts of active learning (AL) to help the understanding of later topics in this paper.We start by defining AL and why we need it (Section 3.1.1).Then, some core AL concepts, components, and methods are elaborated, with grounding in relevant literature (Sections 3.1.2-3.1.5).Finally, we discuss some recent and important developments in AL (Section 3.1.6)and provide a brief summary and discussion (Section 3.1.7).

What's AL and Why AL?
Can machines learn with fewer labeled training instances than those needed in supervised learning (a full explanation of which is provided in Appendix A.2.1) if they are allowed to ask questions?The answer is "yes", with many encouraging results that have been demonstrated for a variety of problem settings and domains.AL [26][27][28]39] is a sub-field of semi-supervised learning (for details, see Appendix A.2.3) that implements this question-asking idea as an iterative process.AL differs from traditional "passive" learning systems that purely "learn from examples".AL systems aim to make ML more economical and more accurate, because the learning algorithms can participate in the acquisition of their own training data, and are able to avoid using unrepresentative or poorly annotated data based on query strategies (Section 3.1.5).
AL is well-motivated in many ML based applications, where unlabeled data is massive, but labels are difficult, time-consuming, or expensive to obtain.The key idea behind AL is that a ML model can achieve high accuracy with a minimum of manual labeling effort if the (machine) learner is allowed to ask for more informative labeled examples by selection query.A query is often in the form of an unlabeled instance (e.g., an image or a piece of text), picked by the machine learner according to a specific query strategy (Section 3.1.5),to be labeled by an annotator who understands the nature of the domain problem [27].Informative examples refer to those instances that can help improve the machine learner's learning performance, and the informativeness is measured by different query strategies (Section 3.1.5).
AL has been successfully applied to a number of natural language processing tasks [29], such as information extraction, named entity recognition, text categorization, part-of-speech tagging, parsing, and word sense disambiguation.Tuia et al. [7] surveyed AL algorithms for RS image classification.Nalisink et al. employed AL to reduce the labeling effort for image classification [8].A good example using AL to overcome label quality problems by combining experts and crowd-sourced annotators can be found in [40].Another good example of using AL from crowds can be found in [41], where a multi-annotator (see Section 3.1.6)AL algorithm was provided.Most AL based methods are for binary classification tasks (see Appendix A.4.1), see [42] for an example of multi-class classification (see Appendix A.4.2) AL for image classification.While there has been increasing attention to AL, with applications in many domains, systematic and comprehensive comparison of different AL strategies is missing in the literature.We will come back to this later in Section 3.1.7.

AL Problem Scenarios
The AL literature [26,27,30] showcases several different problem scenarios in which the active machine learner may solicit input.The three most common scenarios considered in the literature are: membership query synthesis, stream-based selective sampling, and pool-based sampling.All three scenarios assume that machine learners query unlabeled instances to be labeled by annotators (humans or machines).Figure 4 illustrates the differences among these three AL scenarios.The dashed lines connecting instance space (set of possible observations-also called input space [27,43,44]) in Figure 4, represent that the machine learner does not know the definition of the instance space (thus the features of the space and their ranges are not known [27,43]).

Unlabeled data pool (U)
Query is labeled by (human or machine) annotator

Pool-based sampling
Model (i.e., classifier) selects the best instance from the pool U to query Model (i.e., classifier) says "yes" or "no" to use this instance as a query

Stream-based selective sampling
Model (e.g., classifier) generates a new instance to query   [26]).Each starts from the set of all possible observations (the instance space at left) and applies a query strategy (light blue box) for selecting which instance to ask the human or machine annotator to label (dark blue box).

Sample a large pool of instances
Membership query synthesis was proposed in [45], and further developed and extended in [46][47][48][49][50].In this scenario, the machine learner knows the definition of the instance space (e.g., feature dimensions and ranges are known).The learner can generate (i.e., synthesize) a new instance (e.g., an image or a piece of text) from scratch (thus one that meets the parameters of the instance space, but may or may not actually exist [28]) that satisfies the instance space definition, and then enlist an annotator for labeling [49,50].Query synthesis can synthesize a new artificial (membership) query from scratch using a small amount of labelled data-it is therefore very efficient [49].Query synthesis is often tractable and efficient for finite problem domains [46].Thus, query synthesis has recently gained interest in some domains in which labels do not come from human annotators, but from experiments, where only heuristics are known.In such domains, artificial queries can be synthesized to elicit information (e.g., automated science [47,48]) or to detect and extract knowledge and design information with minimal cost (e.g., adversarial reverse engineering) [50].
Query synthesis is reasonable for some domain problems, but one major problem is that the synthesized (membership) queries are often not meaningful, and thus annotators, particularly human ones, can find it hard to assign labels [51,52].By contrast, the stream-based and pool-based scenarios introduced below can address these limitations, because the queries always correspond to real examples.Therefore, the labels can be more readily provided by annotators [52].
In stream-based selective sampling (also called stream-based or sequential AL), given an unlabeled instance, which is drawn one at a time from the data source, the machine learner must decide whether to query its label or to discard it [26,39,[52][53][54].In a stream-based selective sampling scenario, learners can use the following two ways to query: (1) use a query strategy (Section 3.1.5.), (2) compute a region of uncertainty and pick instances falling in that region.The stream-based scenario has been studied in several real-world tasks (e.g., learning ranking functions for information retrieval [55], social media text classifications [56], and word sense disambiguation [57], where a word such as "bank" in "river bank" can be distinguished from the word "bank" in "financial bank").One advantage of the stream-based selective sampling AL method is that it is suitable for mobile and embedded devices where memory and power is often limited, because in this scenario, each unlabeled instance is drawn one at a time from the data source.
In pool-based sampling AL [58,59], samples are selected from an existing pool for labeling using criteria designed to assess the informativeness of an instance.Informativeness has been defined as representing the ability of an instance to reduce the generalization error of a ML model [60,61]; query strategies designed to achieve informativeness of samples are discussed in Section 3.1.5.
A substantial proportion of AL methods in the literature are pool-based [26,56], with examples in domains that include: text classification (see examples later in this paper for text and image classification), image classification and retrieval [30], information retrieval [62], video classification and retrieval [30], speech recognition [63], and cancer diagnosis [64].Only a few AL methods employed stream-based selective sampling [56].For many real-world learning problems, large collections of unlabeled data can be gathered at once.This motivates pool-based sampling, because pool-based sampling evaluates and ranks the entire collection before selecting the best query [26].This helps build a classifier with better performance and less labeled examples.
As outlined above, the three sampling scenarios have different primary applications.Membership query synthesis is most applicable to limited applications such as automated scientific discovery and adversarial reverse engineering [50], due to fact that instances produced by synthesized queries might be not recognizable to human annotators [51].Stream-based methods are typically used for streaming data (as the name implies) because they scan the data sequentially and make individual decisions for each instance.Because they do not consider the data as a whole, stream-based selective sampling methods are typically less effective than pool-based for any situation in which data can be assembled ahead of time.Due to the limited focus of membership query synthesis and stream-based selective sampling, and the broad focus of pool-based sampling, a substantial proportion of AL methods discussed in the literature are pool-based [26,56].Not surprisingly, this is also true for application of AL to (geo) text and image classification.Given this overall emphasis in the literature, and within the subset directed to geospatial applications, the focus in the remainder of the paper is on pool-based sampling, with the alternatives mentioned only to highlight particular recent innovations.

AL Core Components
A typical AL system contains two components: a learning engine and a sample selection engine.A learning engine refers to a ML model for most AL methods in the literature or a committee of ML models when the AL query strategy used is QBC (see Section 3.1.5).A sample selection engine is the query strategy used to measure how informative an unlabeled instance is (elaborated in Section 3.1.5).
A typical AL system works in an iterative way, as illustrated in Figure 2. In each iteration, the learning engine trains a model based on the current training set.The sample selection engine then selects the most informative unlabeled samples for labeling, and these (newly labeled) samples are added to the training set.As a contrast, the random sampling strategy [65] selects instances randomly from the unlabeled pool, without considering whether they provide the most beneficial information to improve the classifier's learning performance.This strategy is equivalent to "passive" learning.

Batch-Mode AL
In most AL research, queries are selected in serial (i.e., labeling one instance at a time).This is not practical when training a model is slow or expensive.By contrast, batch-mode (also batch mode) AL [28] allows the machine learner to query a batch (i.e., group) of unlabeled instances simultaneously to be labeled, which is better suited to parallel labeling environments or models with slow training procedures to accelerate the learning speed.In batch-mode AL, the number of instances in each query group is called batch size.For some recent overview papers for batch-mode AL, see [56,66,67].

AL Query Strategies
Query strategies are central in AL methods; they are used to identify those training examples that can contribute most to the learning performance of ML models.Various AL query strategies have been proposed, defined, and discussed in several surveys to improve over random sample selection [7,26,[28][29][30].Here we highlight the most commonly used query strategies in AL: (1) uncertainty sampling, (2) diversity, (3) density, and (4) relevance.
Uncertainty sampling [58] picks the instances that the (machine) learner model is most uncertain about.Due to its simplicity, intuitiveness, and empirical success in many domains, uncertainty sampling is the most commonly used strategy.Though uncertainty sampling has many limitations, such as sensitivity to noise and outliers, it still works surprisingly well [68].The heuristic of selecting the most uncertain instances stems from the fact that in many learning algorithms the essential classification boundary can be preserved based solely on the nearby samples, and the samples that are far from the boundary can be viewed as redundant.For a binary classification, the samples that are closest to a classification boundary will be selected.When multiple learners exist, a widely applied strategy is selecting the samples that have the maximum disagreement among the learners [69,70].The disagreement of multiple learners can also be viewed as an uncertainty measure.This query strategy is called query-by-committee (QBC) [70].A committee of ML models are trained on the same data set.Each committee member then votes on the labelings of query candidates.The most informative query is the instance on which they most disagree.Two main disagreement measures have been proposed in the literature: (1) vote entropy [54] and (2) average Kullback-Leibler (KL) divergence [71].Vote entropy compares only the committee members' top ranked class [71], whereas KL divergence metric [72] measures the difference between two probability distributions.KL divergence to the mean [73] is an average of the KL divergence between each distribution and the mean of all the distributions.Thus, this disagreement measure picks the instance with the largest average difference between the label distributions of any committee member and the consensus as the most informative query [26].
Other commonly used uncertainty sampling variants include: least confident, margin sampling, and entropy.Least confident is an uncertainty sampling variant for multi-class classification (Appendix A.4.2), where the machine learner queries the instance whose prediction is the least confident (as the name implies).The least confident strategy only considers information about the most probable label, and thus, it "throws away" information about the remaining label distribution.Margin sampling [74] can overcome the drawback (mentioned in the preceding sentence) of the least confident strategy, by considering the posterior of the second most likely label [26].Entropy is an uncertainty sampling variant that uses entropy [75] as an uncertainty measure.Entropy-based uncertainty sampling has achieved strong empirical performance across many tasks [26].A detailed discussion about when each variant of uncertainty sampling should be used is provided in [26].
The second query strategy, based on a diversity criterion [76], was first investigated in batch-mode AL (Section 3.1.4),where Brinker [76] used diversity in AL with SVMs.Diversity concerns the capability of the learning model to avoid selecting query candidates that rank well according to the heuristic (i.e., query strategy), but are redundant among each other.More specifically, a diversity based query strategy is used to select those unlabeled samples that are far from the selected set and thus can reduce redundancy within the selected samples.Diversity has been studied extensively for margin-based heuristics, where the base margin sampling heuristic is constrained using a measure of diversity between the candidates.An algorithm for a general diversity-based heuristic can be found in [7].In many applications, we need to select a batch of samples instead of just one in an AL iteration.For example, updating (i.e., retraining) a model may need extensive computation, and thus labeling just one sample each time will make the AL process quite slow.Joshi et al. [42] proposed that the selected samples in a batch should be diverse.Dagli et al. [77] and Wu et al. [78] emphasized that the diversity criterion should not only be investigated in batch-mode but also be considered on all labeled samples, to avoid having selected samples being constrained in an (increasingly) restricted area.
The third strategy used by a machine learner is to select samples using a density [59] criterion that selects samples within regions of high density.The main argument for a density-based criterion [26] is that informative instances should not only be those that are uncertain, but also those that are "representative" of the underlying distribution (i.e., inhabit dense regions of the instance space).In density-based selection, the query candidates are selected from dense areas of the feature space because those instances are considered as most representative [26,60,61,78].The representativeness of an instance can be evaluated by how many instances among the unlabeled data are similar to it.Density-based selection of candidates can be used to initialize an AL model when no labels are available at all.Wu et al. [78] proposed a representativeness measure for each sample according to the distance to its nearby samples.Another strategy uses clustering-based methods [79,80], which first group the samples and then selects samples at and around the cluster centers.Qi et al. [80] combine AL with clustering, and their method can refine the clusters with merging and splitting operations after each iteration, which is beneficial for selecting the most informative samples in the AL process, and also helps further improve the final annotation accuracy in the post-processing step.
The fourth strategy, relevance criterion, is usually applied in multi-label classification tasks (Appendix A.4.3).Based on a relevance criterion, those samples that have the highest probability to be relevant for a certain class are selected [30].This strategy fosters the identification of positive examples for a class.Ayache and Quénot [81] have conducted an empirical study on different sample selection strategies for AL for indexing concepts in videos.Their experimental results clearly show that the relevance criterion can achieve better performance than an uncertainty criterion for some concepts.
It is difficult to directly compare these criteria.Seifert and Granitzer's experiments [82] showed that the benefits of these strategies depend on specific tasks, data sets, and classifiers (Appendix A.3). Wang et al. [30] provided several general suggestions: (1) for binary classification problems, applying a relevance criterion may achieve the best results for some extremely unbalanced cases where positive samples are much less frequent than negative ones, (2) in batch-mode AL (Section 3.1.4),integrating a diversity criterion will be helpful for computational efficiency, (3) in many cases, these criteria are combined explicitly or implicitly, (4) the diversity and density criteria are normally not used individually (because they are not directly associated with classification results) and most commonly they are used to enhance the uncertainty criterion.The uncertainty criterion relates to the confidence of a ML algorithm in correctly classifying the considered sample, while the diversity criterion aims at selecting a set of unlabeled samples that are as diverse (distant from one another) as possible, thus reducing the redundancy among the selected samples.The combination of the two criteria results in the selection of the potentially most informative set (Section 3.1.1)of samples at each iteration of the AL process.Patra et al. [83] combine the uncertainty and diversity criteria, where they proposed a batch-mode AL (Section 3.1.4)method for multi-class classification (Appendix A.4.2) with SVM classifiers.In the uncertainty step, m samples are selected from all over the uncertain regions of the classifiers.In the diversity step, a batch of h (m > h > 1) samples that are diverse from each other are chosen among the m samples that are selected in the uncertainty step.Xu et al. [84] also employed SVM-based batch-mode AL, whereas their method incorporated diversity and density measures.To improve classifier performance for interactive video annotation, Wang et al. [85] have combined uncertainty, diversity, density and relevance for sample selection in AL and named the comprehensive strategy as effectiveness.

Recent and Novel AL Methods
Yan et al. [41], Sharma et al. [9], and Sharma and Bilgic [68] introduced some very recent and novel AL based methods.Typical AL algorithms rely on a single annotator (i.e., oracle) who serves in the role of a "teacher".By contrast, the following multiple annotator AL scenario poses new challenges: an oracle, who knows the ground truth, does not exist, and multiple annotators, with varying expertise, are available for querying.Such scenarios are not uncommon in the real world, for example, decision making for emergency management.To bridge the gap, Yan et al. [41] focused on an AL scenario from multiple crowdsourcing annotators.The machine learner asks which data sample should be labeled next and which annotator should be queried to improve the performance of the classifier the most.Specifically, Yan et al. employed a probabilistic model to learn from multiple annotators-the model can also learn the annotator's expertise even when their expertise may not be consistently accurate across the task domain.The authors provided an optimization formulation that allows the machine learner to select the most uncertain sample and the most appropriate annotator to query the labels.Their experiments on multiple annotator text data and on three UCI benchmark data sets [86] showed that their AL approach combined with information from multiple annotators improves the learning performance.
One of the bottlenecks in eliciting domain knowledge from annotators is that the traditional supervised learning approaches (Appendix A.2.1) cannot handle the elicited rich feedback from domain experts.To address the gap, many methods have been developed, but they are often classifier-specific [9]; these methods do not transfer directly from one domain to another.To further address this problem, Sharma et al. [9] proposed an AL approach that can incorporate rationales elicited from annotators into the training of any existing classifier for text classification (Appendix A.5). Their experimental results using four text categorization datasets showed that their approach is effective for incorporating rationales into the learning of multinomial Naıve Bayes, logistic regression, and SVMs classifiers.
Traditional uncertainty sampling does not consider the reasons why a (machine) learner is uncertain on the selected instances.Sharma and Bilgic [68] addressed this gap by using an evidence-based framework to do so.Specifically, the authors focused on two types of uncertainty: conflicting-evidence uncertainty and insufficient-evidence uncertainty.In the former type of uncertainty, the model is uncertain due to presence of strong but conflicting evidence for each class; in the latter type, the model is uncertain due to insufficient evidence for either class.Their empirical evaluations on several real-world datasets using naıve Bayes for binary classification tasks showed that distinguishing between these two types of uncertainties has a drastic impact on the learning efficiency: conflicting-evidence uncertainty provides the most benefit for learning, substantially outperforming both traditional uncertainty sampling and insufficient-evidence uncertainty sampling.The authors, in their explanation of these results, showed that the instances that are uncertain due to conflicting evidence have lower density in the labeled set, compared to instances that are uncertain due to insufficient evidence; that is, there is less support in the training data for the perceived conflict than for the insufficiency of the evidence.

AL Summary and Discussion
Even though AL has been successfully applied to many problems in different domains, no systematic and comprehensive comparison of AL strategies have been examined.This might be caused by the fact that most of the work has been disconnected, using different data sets in different problem domains with insufficient consistency to easily compare AL strategies etc. Ramirez-Loaiza and colleagues [87] have made the first attempt to address this recently, but they only evaluated two classifiers and two AL strategies.
Ramirez-Loaiza et al., based on a meta-analysis of 54 published papers, found that most empirical evaluations of AL approaches in the literature have focused on a single classifier (83%) and a single performance measure (91%) [87].To provide important practical advice for AL methods, these authors also conducted an extensive empirical evaluation of common AL baselines, using two probabilistic classifiers (naive Bayes and logistic regression) and two of the most common AL strategies (uncertainty sampling and QBC).Their evaluation used several performance measures on a number of large datasets.They experimented with both synthetic and real-world datasets, specifically, 10 large real-world binary classification data sets.The smallest dataset had 19 K instances and the largest 490 K.The domains and class distributions of these data sets are diverse-from housing, through ecology, to handwriting and letter recognition.Ramirez-Loaiza et al. [87] concluded that AL algorithms can reduce the time, effort, and resources needed to train an accurate predictive model by carefully choosing which instances should be labeled.Ramirez-Loaiza and colleagues' findings also highlighted the importance of overlooked choices in AL experiments in the literature.For example, they showed that model selection is as important as development of an AL algorithm.

Visual Analytics (VA) and Human-in-the-Loop
VA focuses on the integration of computational methods (e.g., analytical reasoning algorithms) and interactive visual interfaces to extend the perceptual and cognitive abilities of humans [88], and thus to support human reasoning (via exploratory knowledge discovery) about complex phenomenon with big and often heterogeneous data.VA emphasizes the key role of visual representations as the most effective means to convey information to the human and prompt human cognition and reasoning.
VA can support at least three of the core challenges in the context of M&DL: (1) building labeled data efficiently, thus in ways that minimizes the time of human annotators, (2) tuning the methods to produce the most accurate classification results with the least amount of training data and processing time, and (3) helping end users understand both the process through which classifiers are constructed and applied and the end result of their applications (thus supporting "explainable" M&DL).
There is now more than a decade of research in VA, an annual conference (one of the three making up IEEE Vis), and increasing research on basic and applied VA across many domains.Thus, a comprehensive review of even the subset of VA focused on classification tasks is beyond the scope of this paper; for some recent overview papers see [4,[89][90][91][92][93][94].A VA agenda is provided in [95,96], and then for geovisual analytics and related topics in [97].Here, we focus specifically on the role of VA interfaces helping analysts understand M&DL, and then in Section 3.3 we review the recent efforts that are specifically focused on the integration of VA with AL methods.
After surveying a range of projects that support VA contextually in the sensemaking loop, Endert et al. [98] argued for a shift from a 'human-in-the-loop' philosophy to a 'human is the loop' viewpoint.A similar argument about the central role of analysts can be found in [99], where the authors emphasized that a human-centered understanding of ML can lead not only to more usable ML tools, but to new ways of framing learning computationally.Biewald [14] explained why human-in-the-loop computing is the future of ML, and the related need for explainable M&DL is discussed in [100].In related research, Liu et al. [5] provided a comprehensive review about using VA via interactive visualization to understand, diagnose, and refine ML models.Additional calls for a VA-enabled human-in-the-loop approach to improve the accuracy of black-box M&DL models are discussed in [101,102].
Beyond the arguments for the potential of VA to support ML, a few recent studies demonstrated empirically that VA based interactive interfaces can help users understand DL architectures and thus improve the models' classification accuracy.Wongsuphasawa et al. [103] (the Best paper of VAST 2017; IEEE VAST is the leading international conference dedicated to advances in VA) demonstrated a successful example of employing VA to visualize dataflow graphs of DL models in TensorFlow (one of the very popular M&DL libraries released open source in 2015 by Google).The approach used TensorBoard (a VA component for TensorFlow) to help TensorFlow developers understand the underlying behavior of DL models implemented in the system.
In research not so closely tied to one particular toolkit, Alsallakh et al. [104] presented VA methods to help inspect CNNs and improve the design and accuracy for image classification.Their VA interface can reveal and analyze the hierarchy of similar classes in terms of internal features in CNNs.The authors found that this hierarchy not only influences the confusion patterns between the classes, it furthermore influences the learning behavior of CNNs.Specifically, the early layers in CNNs detect features that can separate high-level groups of classes, even after a few training epochs (in M&DL, an epoch is a complete pass through all the training examples; in other words, the classifier sees all the training examples once by the end of an epoch).By contrast, the latter layers require substantially more epochs to detect specialized features that can separate individual classes.Their methods can also identify various quality issues (e.g., overlapping class semantics, labeling issues, and imbalanced distributions) in the training data.In complementary work, Ming et al. [6] developed a VA interface, RNNVis, for understanding and diagnosing RNNs for NLP tasks.Specifically, they designed and implemented an interactive co-clustering visualization of hidden state unit memories and word clouds, which allows domain users to explore, understand, and compare the internal behavior of different RNN models (i.e., regular RNN, LSTM, and GRU).In particular, the main VA interface of the RNNVis contains glyph-based sentence visualization, memory chips visualization for hidden state clusters, and word clouds visualization for word clusters, as well as a detail view, which shows the model's responses to selected words such as "when" and "where" and interpretations of selected hidden units.Their evaluation-two case studies (focused on language modeling and sentiment analysis) and expert interviews-demonstrated the effectiveness of using their system to understand and compare different RNN models.

AL With VA
AL alone has already been applied successfully to many applications (Section 3.1) where labeled data are limited.Here we review some work in AL empowered by VA.In the literature, the integration of interactive VA interfaces and AL methods is also known as interactive ML [10,[105][106][107].All of the reviewed work below strongly indicates that VA can play a powerful role in AL.
A number of case studies were investigated by Amershi et al. [105] to demonstrate how interactivity results in a tight coupling between learning systems and users.The authors report three key results: (1) although AL results in faster convergence, users often get frustrated by having to answer the machine learner's long stream of questions and not having control over the interaction, (2) users naturally want to do more than just label data, and (3) the transparency of ML models can help people provide more effective labels to build a better classifier.
Several additional strong arguments about the power to combine VA with AL to leverage the relative advantages of (experienced) human expertise and computational power can be found in the literature [10,11,15].In one of the more detailed accounts, Holzinger [10] emphasized that in the health (informatics) domain, a small number of data sets or rare events is not uncommon, and so ML based approaches suffer from insufficient training samples.They also present an argument for a human-in-the-loop approach with domain experts by integrating AL with VA, proposing that this integration can be beneficial in solving computationally hard health data problems (e.g., subspace clustering and protein folding), where human expertise can help to reduce an exponential search space through heuristic selection of samples.The ultimate goal of a human-in-the-loop methodology is to design and develop M&DL algorithms that can automatically learn from data and thus can improve with experience over time and eventually without any human-in-the-loop (other than to understand and act upon the results) [10].
Most existing AL research is focused on mechanisms and benefits of selecting meaningful instances for labeling from the machine learner's perspective [28].A drawback of this typical AL query strategy is that users cannot control which instances are selected to be labeled [11,18]-this may affect the performance of an AL model [18].Seifert and Granitzer [82] proposed user-based visually-supported AL strategies that allow the user to select and label examples posed by a machine learner.Their experiments showed that restricting human input to labeling only instances that the system picks is suboptimal.Giving users a more active role in terms of a visual selection of examples and in adapting their labeling strategies on top of tailored visualization techniques can increase labeling efficiency.In their experiments, the basis for the user's decision is a visualization of the a-posteriori probabilities of the unlabeled samples.
Bernard et al. [15] investigated the process of labeling data instances with users in the loop, from both ML (in particular, AL) and VA perspectives.Based on reviewing similarities and differences between AL and VA, they proposed a unified process called visual-interactive labeling (VIL), through which they aim to combine the strengths of VA and AL (first initiatives for the integration of AL and VIL can be found in [82,[108][109][110]).In follow on research, Bernard et al. [11] performed an experimental study to compare VIL and AL labeling strategies (used independently).In that project, they developed an evaluation toolkit that integrates 16 different established AL strategies, five classifiers, and four visualization techniques.Using their toolkit, Bernard et al. conducted an empirical study with 16 expert participants.Their investigation shows that VIL achieves similar performance to AL.One suggestion based on Bernard et al. [11]'s experiment findings was to incorporate (visual) analytical guidance in the labeling process in AL.Their investigation represents an important step towards a unified labeling process that combines the individual strengths of VA and AL strategies.We share the same vision with Bernard et al. [11,15]-while they call it VIL, we think that VA enabled AL is a more intuitive term for the integration of the power of AL and VA, because VIL "hides" the essential role of AL.
Recent developments in ML and VA signal that the two fields are getting closer [11]-for example, Sacha et al. [4] proposed a conceptual framework that models human interactions with ML components in the VA process, and makes the interplay between automated algorithms and interactive visualizations more concrete.At the core of the Sacha et al.'s [4] conceptual framework lies the idea that the underlying ML models and hyper-parameters, which cannot be optimized automatically, can be steered via iterative and accessible user interactions.Interactive visualizations serve as an aid or "lens" that not only facilitates the process of interpretation and validation, but also makes the interactions with ML models accessible to domain users [4].
AL and VA alone are not new, but interactive annotation tools empowered by M&DL classifiers for (geo) text and image data are not well developed, and the role of visualization in active learning for text and image related tasks has not been well developed, either.H 'oferlin et al. [109] extended AL by integrating human experts' domain knowledge via an interactive VA interface for ad-hoc classifiers applied to video classification.Their classifier visualization facilitates the detection and correction of inconsistencies between the classifier trained by examples and the user's mental model of the class definition.Visual feedback of the training process helps the users evaluate the performance of the classifier and, thus, build up trust in the trained classifier.The main contributions of their approach are the quality assessment and model understanding by explorative visualization and the integration of experts' background knowledge by data annotation and model manipulation (modifying a model based on users' expertise can boost the learner, especially in early training epochs, by including fresh domain knowledge).They demonstrated the power of AL combined with VA in the domain of video VA by comparing its performance with the results of random sampling and uncertainty sampling of the training sets.Huang and colleagues' [18,111] experiments and their early results showed that active learning with VA improves learning models performance compared to methods with AL alone for text classification, with an interactive and iterative labeling interface; their AL with visualization method is for a binary (i.e., positive and negative) classification problem (Appendix A.4.1).Heimerl et al. [108] incorporated AL to various degrees with VA for text document retrieval to reduce the labeling effort and to increase effectiveness.Specifically, their VA interface for visual classifier training has a main view (shows the classifier's state with projected documents) , a cluster view (shows the documents with most uncertain classification), a content view (shows the selected documents), a manual view used during evaluation, a classifier history for undo/redo navigation, a labeled document view for listing labeled documents, and most importantly the labeling controls with a preview of the estimated impact of the newly labeled documents on the classifier.
In more recent work, Kucher et al. [13] combined the power of VA and AL for stance classification of social media text.The stance categories Kucher et al. used are: agreement and disagreement, certainty, concession and contrariness, hypotheticals, need/requirment, prediction, source of knowledge, tact and rudeness, uncertainty, volition, irrelevant, neutral.Kucher et al. developed a system called ALVA, which has been used by their domain experts in linguistics and computational linguistics to improve the understanding of stance phenomena and to build a stance classifier for applications such as social media monitoring.Stance classification is a multi-label classification (Appendix A.4.3) problem.Their ALVA system supports annotation of text for multi-label classification, ML classifier training with an AL approach, and exploratory visual analysis of the annotated data.The VA component makes labeling examples for multi-label classification (Appendix A.4.3) much easier and intuitive for the analysts in linguistics and computational linguistics.

Active Deep Learning (ADL)
As discussed further in Appendix A.1, DL can discover intricate patterns hidden in big data.Advances in DL have been dramatic and rapid, and the landscape of M&DL is changing quickly as a result.For example, Jean and colleagues [112,113] in 2015 demonstrated for the first time that DL could beat Google's existing phrase-based statistical process for language translation and by November 2016, after Google switched to that approach, evidence showed that their new system was already on par with human translation [114].
We have seen above many successful use cases for AL (Section 3.1) and AL integrated with VA (Section 3.3).Now we review some recent work in AL combined with DL-active deep learning (ADL).It is also called deep active learning (e.g., see [115]), but active deep learning is a much more commonly used term in the literature.The main process of ADL is very similar to AL.The main difference is that the machine learner in regular AL is a traditional ML algorithm (e.g., SVM), whereas in ADL, the learner is a DL one, such as CNN.As emphasized in Appendix A.1, DL has better scalability for Big Data problems than traditional ML [116].This motivates ADL because it combines the power of DL and AL-better scalability than ML and less labeled data than regular DL for training a good machine learner.
AL has been investigated with some DL architectures for image classification and text classification (including sentiment analysis).Wang and Shang [17] applied AL methods in DL networks for image classification.The (DL) classifiers they used are stacked restricted Boltzmann machines (stacked RBMs) and stacked auto-encoders, with three commonly used uncertainty sampling based query strategies (i.e., least confidence, margin sampling, and entropy, see Section 3.1.5).Their experiments were run on the well-known MNIST [117] benchmark data set (one of the classic data sets for benchmarking ML algorithms).The authors conclude that their ADL method outperforms random sampling consistently by a significant margin, regardless of the selection of uncertainty-based strategy and classifier.Gal et al. [118] also developed an AL framework that integrates DL for image classification, whereas the classifier they used is Bayesian CNNs.Their result showed a significant improvement on existing AL approaches.
Another successful integration example of deep CNNs and AL for image classification can be found in [119]-the authors proposed an ADL framework called Cost-Effective Active Learning (CEAL), where the classifier can be simultaneously updated with progressively annotated informative samples.Unlike most traditional AL methods focusing on uncertain samples of low prediction confidence, their strategy selects two complementary kinds of samples to incrementally improve the classifier training and feature learning: (1) the minority informative kind contributes to training more powerful classifiers, and (2) the majority high confidence kind contributes to learning more discriminative feature representations.Although the number of samples that belongs to the first type is small (e.g., an image with a soccer ball and a dog is much more rare than images that contain only a soccer ball), the most uncertain unlabeled samples usually have great potential impact on the classifiers.Selecting and annotating them as part of the training set can contribute to a better decision boundary of the classifiers.Their framework progressively selects the minority samples among most informative samples, and automatically pseudo-labels (i.e., pick up the class which has the maximum predicted probability, and use it as if it was the true label [120]) majority high confidence samples from the unlabeled set for feature learning and model updating.The labeled minority samples benefit the decision boundary of the classifier and the majority pseudo-labeled samples provide sufficient training data for robust feature learning.Their experiment results, on two challenging public benchmark data sets (face recognition on CACD database [121] and object categorization on Caltech-256 [122]), demonstrated the effectiveness of their CEAL framework.
Most AL methods in the literature (Section 3.1) ask annotators to annotate data samples.By contract, Huijser and van Gemert [123] provide a recent example of combining AL with DL, where they took a completely different approach-it asks for annotators to annotate the decision boundary.At this point, their method focuses on a binary classification (Appendix A.4.1) and a linear classification model (i.e., SVM).Additionally, the method used a deep generative model to synthesize samples according to a small amount of labeled samples, which will not work for text related tasks (because deep generative models are designed for continuous data like images [124,125], rather than the discrete data of words and phrases that must be dealt with in NLP problems [126]).
After reviewing some ADL methods for image classification, we now introduce recent ADL work for text classification problems.Zhou et al. [127] integrated AL with DL for semi-supervised sentiment classification using RBMs.Their experiments on five sentiment classification data sets showed that their ADL methods outperform classic semi-supervised learning algorithms and DL architectures applied for sentiment classification.Zhang and Wallace [128] proposed an ADL method for text classification, where the classifier is a CNN.In contrast to traditional AL approaches (e.g., uncertainty sampling), the most novel contribution is that their method is designed to quickly induce discriminative word embeddings (Appendix A.6), and thus improve text classification.Taking sentiment classification as an example, selecting examples in this way quickly pushes the embeddings of "bad" and "good" apart.Their empirical results (with three data sets about sentiment where two were categorized as positive/negative, one as subjective/objective) show that the method outperforms baseline AL approaches.However, their method is for binary classification (Appendix A.4.1), other types of classification tasks (Appendixes A.4.2-A.4.4) are not touched upon.
Research on combining AL with RNNs for short-text classification is rare.To address the gap, Zhou [115] demonstrated using AL with RNNs as classifiers for (Chinese) short-text classification.The proposed ADL algorithm dramatically decreases the amount of labeled samples without significantly influencing the classification accuracy of the original RNNs classifier, which trained on the whole data set.In some cases, the proposed ADL algorithm even achieves better classification accuracy with less trained data than the original RNNs classifier.

GIScience and RS Applications Using AL and ADL
In this section, we review recent work in which AL and/or ADL have been applied to GIScience and RS problems.We provide some very recent applications in GIScience using AL, one of which integrated VA, in Section 4.1, and also cover some applications using AL/ADL in RS in Section 4.2 (AL applied in RS has much longer history than in non-RS components of GIScience, and ADL has recently emerged in RS areas.To our knowledge, no RS applications have integrated AL or ADL with VA).All of those applications shown that AL and ADL, combined with VA ideally, are promising methods to bring the power of M&DL, as well as VA, to GIScience and RS communities.

GIScience Applications Using AL/AL with VA
Júnior et al. [65]'s very recent (2017) work on GPS trajectory classification provides solid evidence that AL can be used together with VA to help domain experts perform semantic labeling of movement data.In this work, they pose three research questions: (1) Is there a ML method that supports building a good classifier for automatic trajectory classification but with a reduced number of required human labeled trajectories?(2) Is the AL method effective for trajectory data? and (3) How can we help the user in labeling trajectories?To answer the rest of their research questions, Júnior et al. developed a web-based interactive tool named ANALYTiC to visually assist domain experts to perform GPS trajectory classification using AL and a simple VA interface, where users can pick one of the six (traditional ML) classifiers (Ada boost, decision tree, Gaussian naive Bayes, k-nearest neighbors (KNN), logistic regression, and random forest) and one of the three query strategies (uncertainty sampling, QBC, and random sampling) to start with trajectory labeling.Their interactive tool supports only binary classification (Appendix A.4.1).Júnior et al. also conducted a series of empirical evaluation experiments with three trajectories data sets (animals, fishing vessels, and GeoLife).Their results showed how the AL strategies choose the best subset to annotate and performed significantly better than the random sampling (baseline strategy).Their examples also demonstrated how the ANALYTiC web-based visual interface can support the domain expert in the AL process and specifically in the trajectory annotation using a set of visual solutions that ease the labeling inference task.They concluded that ML algorithms can infer semantic annotations defined by domain users (e.g., fishing, non-fishing) from trajectories, by learning from sets of manually labeled data.Specifically, AL approaches can reduce the set of trajectories to be labeled while preserving good performance measures.Their ANALYTiC web-based interactive tool visually guides domain experts through this annotation process.
Another very recent AL study that is very closely related to GIScience problems can be found in [56], where Pohl et al. applied AL methods to social media data (i.e., tweets) for crisis management.Two ML classifiers (i.e., kNN and SVM) are used in their AL application with several uncertainty strategies for binary classification (Appendix A.4.1) to distinguish between relevant and irrelevant information contained in a data stream.The authors used stream-based (Section 3.1.2)batch-mode AL (Section 3.1.4).Two types of data sets are used in their experiments: synthetic and social media data sets related to crises.Their empirical results illustrate that batch-mode AL is able to, with good performance, distinguish between relevant and irrelevant information from tweets for crisis management.
Overall, the application of AL with ML (or DL) applied to non-RS GIScience problems is just beginning.Given the rapid advances in M&DL and AL, we anticipate this situation to change quickly, with additional applications to mobility data, geospatial text analysis, and a range of location-based service applications.An objective of this review, of course, is to enable such development.

RS Applications Using AL/ADL
DL has achieved success in many applications, however, a large set of good quality labeled samples are needed to train a good DL classifier, as emphasized in Appendix A.1.Zhu et al. [12] provided a very recent survey of DL in RS, where they reviewed the recent advances and analyzed the challenges of using DL with RS data analysis.More importantly, they advocate that RS scientists should adapt DL to tackle large-scale RS challenges, such as application of RS and DL to study climate change and urbanization.However, AL (Section 3.1) and ADL (Section 3.4) based methods are not touched on in their review.In their conclusions, the authors did emphasize that limited training samples in RS represents a challenging bottle-neck to progress.Our review provides a promising solution to the challenges they pointed out.To help RS researchers get started with DL, a technical tutorial on DL for RS data is provided in [129].
AL has a relatively long history and has been widely studied for RS applications (compared with attention given to AL in other components of GIScience).Many successful AL examples in RS in the literature (reviewed below in this section) have demonstrated that AL can aid RS image classification tasks, whereas ADL (Section 3.4) has only been recently applied to RS for image classification.Below, we first introduce some AL methods used for RS image classification, and then more recent ADL methods applied to RS image classification problems.Some pioneering work using AL for RS image classification can be found in [130][131][132][133][134]. Tuia et al. (2011) [7] surveyed and tested several main AL methods used in RS communities for (multispectral and hyperspectral) RS image classification.As introduced in Section 3.1, an AL process requires the interaction between the annotator (e.g., domain experts) and the model (e.g., a classifier)-the former provides labels, which integrates domain knowledge while labeling, and the latter provides the most informative pixels to enlist annotators for labels.This is crucial for the success of an AL algorithm-the machine learner needs a query strategy (Section 3.1.5)to rank the pixels in the RS image pool.Tuia et al. [7] used AL query strategies (Section 3.1.5),also called heuristics in the RS community [7], to group the AL algorithms they reviewed into three main categories [132]: committee, large margin, and posterior probability-based.Tuia et al. also analyzed and discussed advantages and drawbacks of the methods they reviewed, and provided some advice for how to choose a good AL architecture.One of the directions they pointed out is the inclusion of contextual information in heuristics (i.e., AL query strategies)-they emphasized that the heuristics proposed in the literature mainly used spectral criteria, whereas few heuristics directly considered positional information and/or textures.To address the gap of lacking heuristics that consider spatial constraints, Stumpf et al. [135] developed region-based AL heuristics for RS image classification.Empirical tests with multitemporal and multisensor satellite images of their region-based heuristics, which considered both uncertainty and diversity criteria, demonstrated that their method outperformed pointwise sampling and region-based methods that considered only uncertainty.
An early example of applying AL methods in RS can be found in [130], in which Mitra et al. employed an AL technique that selects the n most uncertain samples for segmentation of multispectral RS images, using SVMs for binary classification (Appendix A.4.1).Their AL query strategy is to select the sample closest to the current separating hyperplane of each binary SVM.Ferecatu and Boujemaa [136] also employed an SVM classifier in their AL method for remote-sensing image retrieval.Their experimental evaluation of classification performance confirmed the effectiveness of their AL approach for RS image retrieval.Their AL selection criterion focused on minimizing redundancy between the candidate images shown to the user.
Obtaining training data for land cover classification using remotely sensed imagery is time consuming and expensive, especially for relatively inaccessible locations.In an early step toward the goal of designing classifiers that use as few labeled data points as possible, Rajan et al. [131] proposed an AL technique that efficiently updates existing classifiers by using minimal labeled data points.Specifically, Rajan et al. [131] used an AL technique that selects the unlabeled sample that maximizes the information gain between the posteriori probability distribution estimated from the current training set and the (new) training set obtained by including that sample into it.The information gain is measured by the Kullback-Leibler divergence [71] (Section 3.1.5).One main contribution they made was that their AL method can adapt classifiers when there is substantial change in the spectral signatures between labeled and unlabeled data.Their AL approach is also useful for classifying a series of spatially/temporally related images, wherein the spectral signatures vary across the images.Their empirical results provided good performance, which was tested on both single and spatially/temporally related hyperspectral data sets.
As introduced in Section 3.1.4,batch-mode AL is better suited to parallel labeling environments or models with slow training procedures to accelerate the learning speed.Tuia et al. [132] proposed two batch-mode AL algorithms for multi-class (Appendix A.4.2) RS image classification.The first algorithm extended the SVM margin sampling (Section 3.1.5)by incorporating diversity (Section 3.1.5)in kernel space, while the second is an entropy-based (Section 3.1.5)version of the query-by-bagging algorithm.The AL algorithms in pseudo code were provided in their appendix.Demir et al. [133] also investigated several multi-class (Appendix A.4.2) SVM-based batch-mode AL techniques for interactive classification of RS images; one outcome of the research was a proposed cluster-based diversity criterion for informative query selection.Patra and Bruzzone [134] also proposed a fast cluster-assumption based AL technique, but they only considered the uncertainty criterion.In a follow up study, Patra and Bruzzone [83] proposed a batch-mode AL (Section 3.1.4)technique that considered both uncertainty and diversity criteria for solving multi-class classification (Appendix A.4.2) problems using SVM classifier with OAA architecture.Their experimental results running on two different RS data sets (i.e., hyperspectral and multispectral) confirmed the effectiveness of the proposed technique.
Above, we have seen some successful AL methods to tackle RS problems.Now, we will introduce recent ADL (Section 3.4) work for RS image classification.A RS scene can be classified into a specific scene theme (e.g., a part of a forest, a parking lot, and a lake).In this type of classification task, supervised learning techniques are usually employed.Zou et al. [137] used AL for RS scene classification to remove less informative deep belief network (DBN) features [138], before a t-test was applied on the remaining features for discriminative feature selection.Specifically, they used iterative execution of AL, with 200 iterations, to collect an informative feature set from the DBN features, and then perform a t-test for feature selection.
It is expensive to get good labeled samples in hyperspectral images for RS applications.To address this challenge, Liu et al. [139] proposed an ADL method for RS hyperspectral image classification, where their algorithm selects training samples that maximize two selection criteria (i.e., representativeness and uncertainty).The performance of their algorithm was compared with several other AL (but not integrated with DL) classification algorithms that used different query strategies (i.e., random sampling, maximum uncertainty sampling [26], and QBC [7]; see Section 3.1.5).Their results demonstrated that the proposed algorithm achieved higher accuracy with fewer training samples by actively selecting training samples.
DL has been widely studied to recognize ground objects from satellite imagery, whereas Chen and Zipf [140] also emphasized that finding ground truth especially for developing and rural areas is not easy and manually labeling a large set of training data is very expensive.To tackle this challenge, Chen and Zipf [140] propose an ongoing research project named DeepVGI, with the goal of employing ADL (Section 3.4) to classify satellite imagery with Volunteered Geographic Information (VGI) data.In their deepVGI method, Chen and Zipf [140] tested two classic CNNs (LeNet [141] and AlexNet [142]) and a multilayer perceptron (MLP) (a class of the feed-forward neural network) [143].The overall testing performance of their intial DeepVGI results, compared with Deep-OSM and MapSwipe, demonstrated that DeepVGI's performance (in particular, F1 score and accuracy) is significantly better than DeepOSM, but less good than the MapSwipe volunteers (each image is voted on by three volunteers).Training neural networks with OpenStreetMap (OSM) data, DeepOSM can make predictions of mis-registered roads in OSM data by classifying roads and features from satellite imagery [144].The DL architecture DeepOSM used is a simple one layer CNN.MapSwipe is a crowd-sourcing mobile application that allows volunteers to label images with buildings or roads.
Almost all reported methods applying DL in RS, shared the motivation that getting labeled data for RS imagery is challenging.Thus, AL/ADL, will help clear some hurdles in the process of empowering RS research with DL.

Challenges and Research Opportunities
In this section, we first provide a brief summary and discussion of key themes and overall insights (Section 5.1) derived from reviewing the range of research discussed above.Then, in Section 5.2, we discuss some challenges and related research opportunities identified through this systematic review of existing work on AL, AL combined with VA and/or DL.

Summary and Discussion
After introducing the essential terms and types of classification tasks in M&DL (Appendix A), we reviewed (Section 3) recent and influential research for (geo) text and image classification in AL, VA, AL with VA, and ADL, as well as some recent work in GIScience and RS using AL and ADL.
As we discussed briefly in Appendix A.1, DL algorithms do not require feature engineering, and they also have much better scalability to discover intricate patterns hidden in big data.However, pure supervised DL is impractical in some situations, such as those for which the labeling tasks require domain knowledge from experts, since there are only a few domain experts with time and willingness to label (geo) text or images for training.There is clear evidence in the research reviewed (Section 3) that an AL-enabled DL approach that uses VA methods to iteratively collect modest amounts of input from domain experts and uses that input to refine the DL classifiers will be productive.
While the work investigated in Section 3 has demonstrated the power of AL, AL with VA, and ADL, very few studies thus far combine the power of VA with ADL for (geo) text and image classification.Also, most existing research on AL is focused on single-label classification (Appendix A.4.3), and with substantially less attention to multi-label or hierarchical classification (Appendixes A.4.3 and A.4.4).However, many real world applications require multi-label and/or hierarchical classification.For example, consider an image that contains a major highway as well as a park.In this case, the image should be tagged as two categories, instead of one.Similarly, for a piece of text that talks about both a hurricane and flooding, we cannot simply group it into a single category of natural disaster when we need to make decisions on both wind and water risks, respectively.

Challenges and Research Opportunities
Among many challenges and thus opportunities, below we provide some major ones we identified through our systematic investigation.We group the challenges and opportunities into two sets: technical (Section 5.2.1) and application (Section 5.2.2).

Technical Challenges and Opportunities
Below we list some main technical challenges and opportunites, from classifier and AL problem scenarios related, to VA and AL/ADL integration.

•
Multi-label classification: Most existing multi-label classification research has been based on simple ML models (such as logistic regression [68,87], naive Bayes [68,87,145], and SVM [7,68,83,146]); but, very few on DL architectures, such as CNNs and RNNs.We need to extend the traditional ML models to DL ones for Big Data problems, because as we emphasized in Appendix A.1, DL algorithms have better scalability than traditional ML algorithms [116].Wang et al. [147] and Chen et al. [148] have developed a CNN-RNN framework and an order-free RNN for multi-label classification for image data sets, respectively, whereas few DL based multi-label classification methods for text data have been proposed.

•
Hierarchical classification: As Silla et al. [149] pointed out in their survey about hierarchical classification (Appendix A.4.4) across different application domains, flat classification (Appendix A.4.4) has received much more attention in areas such as data mining and ML.However, many important real-world classification problems are naturally cast as hierarchical classification problems, where the classes to be predicted are organized into a class hierarchy (e.g., for geospatial problems, feature type classification provides a good example)-typically a tree or a directed acyclic graph (DAG).Hierarchical classification algorithms, which utilize the hierarchical relationships between labels in making predictions, can often achieve better prediction performance than flat approaches [150,151].Thus, there is a clear research challenge to develop new approaches that are flexible enough to handle hierarchical classification tasks, in particular, the integration of hierarchical classification with single-label classification and with multi-label classification (i.e., HSC and HMC), respectively.

•
Stream-based selective sampling AL: As introduced in Section 3.1.2and discussed in [26,56], most AL methods in the literature use a pool-based sampling scenario; only a few methods have been developed for data streams.The stream-based approach is more appropriate for some real world scenarios, for example, when memory or processing power is limited (mobile and embedded devices) [26], crisis management during disaster leveraging social media data streams, or monitoring distributed sensor networks to identify categories of events that pose risks to people or the environment.To address the challenges of the rapidly increasing availability of geospatial streaming data, a key challenge is to develop more effective AL methods and applications using a stream-based AL scenario.

•
Intergration of different AL problem scenarios: As introduced in Section 3.1.2,among the three main AL problem scenarios, pool-based sampling has received substantial development.
But, there is a potential to combine scenarios to take advantage of their respective strengths (e.g., use of real instances that humans are able to annotate for the pool-based sampling and efficiency of membership query synthesis).In early work in this direction, Hu et al. [152] and Wang et al. [49] have combined membership query synthesis and pool-based sampling scenarios.
The conclusion, based on their experiments on several real-world data sets, showed the strength of the combination against pool-based uncertainty sampling methods in terms of time complexity.More query strategies (Section 3.1.5)and M&DL architectures need to be tested to demonstrate the robustness of the improvement of the combination.

•
Intergration of VA with AL/ADL: As Biewald explained in [14], human-in-the-loop computing is the future of ML.Biewald emphasized that it is often very easy to get a ML algorithm to 80% accuracy whereas almost impossible to get an algorithm to 99%; the best ML models let humans handle that 20%, because 80% accuracy is not good enough for most real world applications.To integrate human-in-the-loop methodology into ML architectures, AL is the most successful "bridge" [11,13,56,65,115], and VA can further enhance and ease the human's role in the human-machine computing loop [4,5,11,24,25].Intergrating the strengths of AL (especially ADL) and VA will raise the effectiveness and efficiency to new levels (Sections 3.1-3.4).Bernard et al. [11] provided solid evidence to support this thread of research (Section 3.3).

Challenges and Opportunities From Application Perspective (for GIScience and RS Audience)
As Raad emphasized in [153], "When data volume swells beyond a human's ability to discern the patterns in it ... GIS, infused with artificial intelligence, can help executives make better decisions", we share the same vision that GIScience researchers need to bring M&DL into our community, and start to build GeoAI.
Early achievements in M&DL have thus far been greater for image data than for text [154,155] (the main reasons are discussed in [156]).A major reason is the availability of big image repositories, such as ImageNet [157], that support such work for benchmarking.For example, well-known pre-trained CNN models (i.e., ConvNets)-AlexNet [142], VGG ConvNets [158], and GoogLeNet [159])-are trained on the ImageNet [157].Although substantial progress has been made in applying M&DL to image-based tasks, a range of challenges remain in RS and other geospatial image domains.One key challenge is related to leveraging image data collected by the increasing variety of drone-mounted sensors.Drones can easily get a big set of image data, for example, in disaster management applications.In this context, DL has already been applied to building extraction in disaster situations [160], as well as avalanche support focused on finding victims [161].Moving beyond "traditional" uses of supervised DL with image classification, one challenge is to develop interactive web apps that combine AL/ADL and VA to ask volunteers and domain experts to label a small set of data and then build a good classifier, which can help to quickly classify the images and then plot them on map.Doing so can help decision makers to get the big picture and generate insights in a quick and accurate manner.Such a system, of course, will require substantial testing to be usable in domains where life and property are at risk, but it is that risk that should drive research toward this objective.
While M&DL for image classification has a longer history [154,155], success in handling NLP tasks, such as language modeling and sentiment analysis [6,162], is catching up.As Knight [156] emphasizes, it is hard to envision how we will collaborate with AI machines without machines understanding our language, since language is the most powerful way we make sense of the world and interact with it.
These advances in text processing are particularly important since massive amounts of unstructured text are generated each day; based on industry estimates, as much as 80% of data generated by be unstructured [163].Estimates suggest that at least 60% of that unstructured text contains geospatial references [164].These unstructured data signify and give meaning to geospatial information through natural language.However, GIScience has paid limited attention to unstructured data sources.An important step in moving from the unstructured text to meaningful information is to classify the text into categories relevant to target tasks (i.e., text classification, Appendix A.5).
In Section 4, we have seen some successful applications using AL and ADL in the GIScience and RS fields.Even though most of these are based on RS imagery, with some on GPS trajectories, and only a few focus on geospatial text data, as outlined in the review above, advances in M&DL are rapidly being extending into a wide array of other domains, including to address NLP and other text related challenges.
Related to these image and NLP processing advances in M&DL, there are multiple GIScience and RS problems, such as geographic information retrieval (GIR), geospatial semantics, and geolocalization, to which VA, AL, and ADL based strategies can be applied productively.We highlight just a few of these below.

•
Geospatial image based applications: Based on the advances achieved in M&DL, many promising geospaital applications using big geospatial image data sets are becoming possible.Diverse GIScience and RS problems can benefit from the methods we reviewed in this paper, potential applications include: land use and land cover classification [165,166], identification and understanding of patterns and interests in urban environments [167,168], and geospatial scene understanding [169,170] and content-based image retrieval [136,171].Another important research direction is image geolocalization (prediction of the geolocation of a query image [172]), see [173] for an example of DL based geolocalization using geo-tagged images, which did not touch on AL or VA.

•
Geospatial text based applications: GIR and spatial language processing have potential application to social media mining [174] in domains such as emergency management.There have already been some successful examples of DL classification algorithms being applied to tackling GIScience problems relating to crisis management, sentiment analysis, sarcasm detection, and hate speech detection in tweets; see: [162,[175][176][177][178].
A review of the existing geospatial semantic research can be found in [179], but neither DL or AL, nor VA are touched upon in that review.Thus, the research topics and challenges discussed there can find potential solutions using the methods we have investigated in this paper.For example, the methods we investigated here will be useful for semantic similarity and word-sense disambiguation, which are the important components of GIR [180].Through integrating GIR with VA, AL and/or ADL, domain experts can play an important role into the DL empowered computational loop for steering the improvement of the machine learner's performance.Recently, Adams and McKenzie [181] used character-level CNN to classify multilingual text, and their method can be improved using the "tool sets" we investigated in this paper.Some specific application problems for which we believe that VA-enabled ADL has the potential to make a dramatic impact are: identification of documents (from tweets, through news stories, to blogs) that are "about" places; classification of geographic statements by scale; and retrieval of geographic statements about movement or events.• Geospatial text and image based applications: Beyond efforts to apply AL and related methods to text alone, text-oriented applications can be expanded with the fusion of text and geospatial images (e.g., RS imagery).See Cervone et al. [182] for an example in which RS and social media data (specifically, tweets and Flickr images) are fused for damage assessment during floods.The integration of VA and AL/ADL should also be explored as a mechanism to generate actionable insights from heterogeneous data sources in a quick manner.
Deep learning shines where big labeled data is available.Thus, existing research in digital gazetteer that used big data analytics (see [183] for an example, where neither DL or AL, nor VA was used) can also be advanced from the methods reviewed in this paper.More specifically, for example, the method used in [183]-place types from (Flickr) photo tags, can be extended and enriched by image classification and recognition from the geospatial image based applications mentioned above.
Overall, based on the review above, we contend that GeoAI, as implemented via M&DL methods empowered with VA, AL, and ADL, will have a wide array of geospatial applications and thus has considerable potential to address major scientific and societal challenges.domain [192].Labeling refers to the process of annotating each piece of text or image with one of a pre-defined set of class names.ML methods can use this information to learn a model that can infer the knowledge needed to automatically label new (i.e., never seen before) data instances.
Supervised ML methods usually divide the data set into two (i.e., training and test) or three (i.e., training, validation, and test) disjoint subsets.The labels of instances in the test set will not be given to the ML algorithm, but will only be used to evaluate its performance.The main idea of supervised learning is to build a ML model (e.g., a classifier for classification tasks, or a regression model for regression tasks) using the training data set and using the testing data set to validate the model's performance.With supervised learning there are several metrics to measure success.These metrics can be used to judge the adequacy of a method in particular situations and to compare the effectiveness of different methods over various situations [189].
Appendix A.2.2.Unsupervised Learning Unsupervised learning [189] is the ML task of inferring a function to describe hidden structure from "unlabeled" data (i.e., without human annotation).Since the examples given to the learner are unlabeled, expert knowledge is not a foundation of the learning and there is no evaluation of the accuracy of the structure learned by the relevant algorithm.A clustering algorithm called k-means and another algorithm called principal component analysis (PCA) [189] are popular unsupervised ML algorithms, among others.Appendix A.2.3.Semi-Supervised Learning Semi-supervised learning [193,194] is a learning paradigm concerned with the study of how computers and humans learn using both labeled and unlabeled data.One goal of research in semi-supervised learning is to understand how combining labeled and unlabeled data may change the learning behavior, and design algorithms that take advantage of such a combination.A survey focusing on semi-supervised learning for classification can be found in [194].In the survey, Zhu emphasized that there are some similarities between ML and human learning.Understanding human cognitive model(s) can lead to novel ML approaches [195,196].Do humans learn in a semi-supervised manner?The answer is "yes".Humans accumulate "unlabeled" input data, which (often unconsciously) are used to help build the connection between "labels" and input once labeled data is provided [194].
As emphasized in Section 1, labeled data sets are often difficult, expensive, and/or time consuming to obtain, as they require the efforts of experienced human annotators or domain experts.Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with a relatively small amount of labeled data, to build good classifiers (Appendix A.3). Semi-supervised learning has received considerable attention both in theory and in practice in ML and data mining because it requires less human effort and gives higher accuracy than supervised methods [194].Appendix A.2.4.Brief Discussion of Learning Types When a data set contains both labeled and unlabeled samples, ML methods can combine techniques from the two previous categories (i.e., supervised and unsupervised) to accomplish semi-supervised learning tasks [193].Labeled data instances can be used to induce a model, as in supervised learning, then the model can be refined with the information from unlabeled samples.Analogously, unsupervised tasks can be improved by introducing the clues given by the labeled instances.Active learning (Section 3.1) is semi-supervised learning, and most DL algorithms (e.g., CNN, RNN, and LSTM) belong to supervised learning.
In this paper, we focus on M&DL for classification (where the output of the process is categorical/discrete).Supervised/semi-supervised ML is also used for regression tasks (where the output of the process is continuous).The application of regression is beyond the scope of this paper; interested readers can find recent overviews in [194,[197][198][199][200].

Figure 1 .
Figure 1.An overview of the review flow and the links between different research domains-the dashed lines indicate the flows among the (interdisciplinary) domains.An introduction to essential concepts in machine learning (ML) and deep learning (DL) for understanding the core part of the review (i.e., Section 3) is provided in the Appendix A.

Figure 3 .
Figure 3. Review highlight for the pool-based active learning cycle (adapted based on [26]).
might create nonsense.The stream-based and pool-based scenarios can address the limitation.

Figure 4 .
Figure 4.An illustration of the three main active learning scenarios (adapted based on[26]).Each starts from the set of all possible observations (the instance space at left) and applies a query strategy (light blue box) for selecting which instance to ask the human or machine annotator to label (dark blue box).