Visual Active Learning for Labeling: A Case for Soundscape Ecology Data

Labeling of samples is a recurrent and time-consuming task in data analysis and machine learning, yet it is generally overlooked in terms of visual analytics approaches to improve the process. As the number of tailored applications of learning models increases, it is crucial that more effective approaches to labeling are developed. In this paper, we report the development of a methodology and a framework to support labeling, with an application case as background. The methodology performs visual active learning and label propagation with 2D embeddings as layouts to achieve faster and interactive labeling of samples. The framework is realized through SoundscapeX, a tool to support labeling in soundscape ecology data. We have applied the framework to a set of audio recordings collected for a Long Term Ecological Research Project in the Cantareira-Mantiqueira Corridor (LTER CCM), located in the transition between northeastern São Paulo state and southern Minas Gerais state in Brazil. We employed a pre-labeled data set of groups of animals to test the efficacy of the approach. The results showed the best accuracy at 94.58% in the prediction of labels for birds and insects, and 91.09% for the prediction of the sound event as frogs and insects.


Introduction
Active Learning (AL) is considered a special case of machine learning [1]. AL is also called "query learning" because it actively requests information about selected data from the set of unlabeled data, from which the model will learn. It is widely used in hard cases of big data learning; an example is given by [2], where the data have only 2000 labeled instances and 250,000 unlabeled instances. There are different strategies used for each stage of AL, such as those described and categorized by [1,3,4]. The effectiveness of AL has been demonstrated in various tasks, such as: (a) automatic speech recognition [5]; (b) classification of voicemail messages [6]; (c) malicious code detection [7]; (d) text classification [8]; (e) speech emotion classification [9]; and (f) audio retrieval [10]. When the user takes part in the process, there is cooperation between computer and analyst, and AL becomes part of the Human In The Loop (HITL) machine learning paradigm [11], targeted at improving learning processes by employing the user's expertise as well as computing strategies.
The main goal of AL is that the algorithm learns using the smallest possible number of instances during training, while generating the best predictions for unlabeled data [1]. There are several sampling strategies, but at least three categories can be generalized according to [1]: (1) certainty-based sampling, (2) query-by-committee, and (3) expected error reduction. In the first sampling strategy, a small set of selected samples is annotated at the beginning, and then the manually annotated labels are used to train a classifier, which classifies unlabeled samples. The second type of sampling strategy involves two or more classifiers, which may disagree with respect to some instances; if they disagree, those instances are delivered for human annotation or validation. The third type of strategy aims to estimate and select, for human annotation, the instances that can have a high impact on the expected model error; however, this last strategy may be the most computationally expensive.
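As an illustration of the first category, the following minimal sketch selects the instances about which a classifier is least certain. It is not the procedure of [1]: the synthetic data, the random forest learner, the least-confidence score, and the query size of 10 are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 300 points in 3 overlapping groups.
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)

# A small, randomly chosen set plays the role of the initially annotated samples.
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=15, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

# Train on the annotated samples only.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[labeled], y[labeled])

# Least-confidence score: 1 minus the highest predicted class probability.
proba = clf.predict_proba(X[unlabeled])
uncertainty = 1.0 - proba.max(axis=1)

# Query the 10 instances with the highest uncertainty for human annotation.
query = unlabeled[np.argsort(uncertainty)[::-1][:10]]
```

The indices in `query` would then be shown to the annotator, their labels added to the training set, and the classifier retrained.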
The final goal of Active Learning is to learn from a small set of labeled samples. However, a large number of current applications require a large subset of labeled data from the start. Examples are the building of certain machine learning models, such as those applied in deep learning, and applications targeted at selecting and reducing the set of attributes to represent a phenomenon (examples are certain biological data sets, medical records, fraud detection, etc.). Our aim in this work is to find a stable balance between the active learning and label propagation strategies to support the actual labeling process, so that this stage of the data science pipeline can be made more effective.
In order to allow user participation in the labeling processes, visualization tools are necessary, since they can help the user navigate in large data with multiple attributes, by facilitating data interpretation through graphical and interactive representations. Multidimensional projections or embeddings allow the reduction of a multidimensional space of m dimensions to another reduced space with p dimensions (X m ⇒ X p , p < m), while trying to preserve as much information as possible. The AL approach together with visualization techniques can facilitate the acquisition of knowledge from the data and support interpretation as well as sample labeling by the user [12,13].
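A minimal sketch of such a reduction from m dimensions to p = 2 is given below, using the t-SNE implementation in scikit-learn. The random input matrix and the perplexity value are assumptions chosen only so that the example runs; real feature vectors would replace `X_m`.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in (assumption): 200 instances described by m = 102 features.
rng = np.random.default_rng(0)
X_m = rng.normal(size=(200, 102))

# Project to p = 2 dimensions; each row of X_p is a point in the 2D layout.
X_p = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_m)
```

The two columns of `X_p` can be used directly as x and y coordinates of a scatter plot in which nearby points correspond to similar instances.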
In this study, we propose a method of user centered active learning that aims to optimize and give more dynamism to the entire process of labeling, by including visual strategies in the several stages of AL. In support of the strategy, we have compared different types of sampling strategies and estimated the appropriate number of samples that must be labeled by the annotator. Thus, we also performed an exhaustive assessment of learning power and performance of the models, in relation to the different sampling strategies and the defined number of clusters. We have also reflected on how the model learns from a data set with numerous features compared with the same data set restricted to known discriminative features (as illustrated in Figure 4). After data samples are annotated by expert users, a classification model is trained with the recorded data. Finally, the predictions of the labels for the data set are visualized and evaluated (see Table 1). The projections afford visual analysis and interaction in each stage of the active learning process to give the user a better understanding of the data and of the process of labeling (see Figure 5). Users can confirm or correct predicted labels. Therefore, this is a visual analytics approach that is meant to blend learning and user supervision in the process of labeling samples in general and in soundscape ecology data in particular (see Figures 3 and 5).
We realize the framework and its methods in the context of environment monitoring using sound recordings. Recordings are becoming central in understanding and describing the condition of natural environments. They are involved in a large number of studies, such as monitoring environmental noise [14], measuring biodiversity integrity and environmental health [15][16][17], freshwater lakes [18,19], and species identification [20]. Soundscape ecology studies [21] have increased in recent years [22] and, among other things, this research field aims to monitor and understand how different environments respond to changes induced by human activities [23][24][25] and to assess the impact through altered soundscapes.
Extracting information from such data is both very challenging and expensive. Therefore, it is necessary to build a bridge between ecoacoustics, machine learning, and visualization, which requires a multidisciplinary approach [26]. One important phase of soundscape ecology studies is the annotation or tagging of events of interest [27]. Therefore, developing new methods that improve the quality of labeling is essential for the success of many soundscape ecology studies worldwide. Here, we test our framework on the task of labeling soundscape ecology data, employing real data on birds, frogs, and insects to evaluate the performance of the method.
The main contributions of this work are summarized as follows:
• Proposal, implementation, and testing of a Visual Active Learning strategy for labeling and application of such strategy to soundscape ecology data;
• Anchoring the process of user centered labeling through visualization by multidimensional projections. Anchoring data detailing in visualizations of summarized data; in the case of our sample application, that is done by the proposal of "Time Line Spectrogram" (TLS) visualizations (Figure 3);
• New sampling strategies incorporated in the process of Active Learning for labeling, and their evaluation (Figure 4). Using this strategy, we demonstrate reduced annotation costs.

Visualization in Active Learning and Labeling
Visualization in support of active learning strategies has been observed for some time in a variety of applications such as image processing [28,29]. Later, [30] presented Visalix as a good alternative that also combines AL and visualization, where the user can annotate classes by managing the attributes in a 3D space, but with certain limitations. Reference [31] also contributed to the field by presenting the Case Base Topology Viewer for Active Learning (CBTV-AL), which considers density, uncertainty, and diversity in the sampling strategies. In the case of diversity and uncertainty, the CBTV-AL needs to be recalculated until the last instance can be labeled. CBTV-AL allows for visualizing the entire labeling process through layouts based on force-directed graph drawing algorithms. The authors emphasize the importance of visualization techniques for user participation in AL. Reference [12] enhances sample selection in AL based on visualization using scatter plots and iso-contours, employing the semi-supervised metric learning method to train with data annotated by the user. Reference [32] presents a pilot study using the t-Distributed Stochastic Neighbor Embedding (t-SNE) [33], force-directed graph layout, and chord diagrams to visualize data and facilitate labeling. According to the authors, this improves the text document labeling process. In these last two approaches, the selection of samples is carried out manually on the visualization (scatter plot) by the user, which leads to the conclusion that there is no sample selection and AL suggestion strategy directly involved, as described by [13]. In addition, other strategies are incrementally incorporated within active learning models, as can be seen in the approaches of [34][35][36].
Here, we present an approach to predict labels using label propagation by sample selection over a clustering process. The visualizations in our case are based on multidimensional projections, which aim to map data in 2D based on their similarity, giving an extra layer of effectiveness of the current stage of the labeling. In comparison to previous methods, we contribute both in terms of the interaction method and on the sample selection process.
To test the method, we built a framework for labeling of acoustic landscapes in soundscape ecology. We have employed a data set for animal group identification whose discriminant features we studied before [37] to evaluate the impact on results. Our results demonstrate efficiency in the label prediction task and show reduced manual annotation effort with the methodology proposed for soundscape data.

Labeling of Sound Data
Sound data labeling is a key task that, in general, precedes most of the remaining data analysis tasks or the development of new approaches to automatic interpretation. For labeling sound data, there are some computer-based solutions, such as the ones presented recently by [27]. Manual labeling is very costly in time and degree of complexity, and studies in AL for labeling are meant to minimize the manual annotation effort of samples by the users. In line with this objective, the study conducted by [38] addresses a methodology based on the combination of AL and self-training by considering the level of confidence of instances. Thus, instances with low confidence scores are delivered to expert users to be labeled and instances with high confidence scores are used in the prediction automatically. Because the method is based on these scores, the determination of the confidence threshold is crucial. Reference [38] used the FindSounds (FindSounds: https://www.findsounds.com/, accessed on 22 June 2021) database with duration ranging from 1 to 10 s for each instance of audio. Reference [39] proposed a new medoids-based active learning (MAL) method for generating clusters by K-Medoids; afterwards, the medoids from each cluster are presented to expert users and labeled by them. After obtaining the labels, instances are fed to a classifier. Reference [39] used the UrbanSound8k [40] database with maximum duration of four seconds for each instance of audio. Another study conducted by [41] proposed an AL strategy that combines two stages: (i) The first stage implements the same strategy as [39]; (ii) In the second stage, they proposed a selection of samples based on the prediction mismatch, looking mainly for segments with incorrect labels; labels are then corrected and predictions are updated in the groupings using a nearest-neighbor approach. Reference [42] presents an AL-based framework for classifying soundscape recordings. According to the methodology, the authors first generate 60 clusters, then randomly select 10 instances of each cluster to be manually labeled. This made it possible to define the most appropriate class names for each cluster. The reduced manual annotation effort with the active learning methodology in the paper was demonstrated empirically. Reference [42] used a database with maximum duration of 1 min for each instance of audio. As presented above, several avenues can be pursued by researchers when advancing the labeling task, with a varying degree of automation and confidence. On the other hand, Reference [43] proposed an active learning system for detecting sound events where the main objective is to improve the process of selection of samples based on the identification of audio segments with the presence of sound activity for annotation; and, in the same line of work, reference [44] proposed a new strategy to determine samples from the audio database, thus asserting the high influence of this sample selection task in reducing the manual labeling effort by the user.
Our approach focuses on displaying the data as submitted to several strategies or techniques for each stage of active learning process, with the main goal of labeling data with improved accuracy, but with less effort by users than with manual labeling. Here, we propose a visual analytics approach that is meant to blend learning and user supervision in the process of labeling samples in general and in soundscape ecology in particular. The goals are: (i) include the learning and supervision of the user in the model for labeling; (ii) know what the appropriate strategies or techniques are in this labeling process; and (iii) support evaluating the quality of labeling to soundscape data.

The Labeling Method
In Figure 1, we illustrate the main steps of our proposed labeling method. Given a data set that has no labels, the Clustering stage is initially undertaken. Then, samples are extracted from each cluster at the stage named Sampling. Expert users interact with the samples by listening to and labeling the audios in the Annotation step. Considering the samples labeled by the user, a learning model (classifier) is trained. The model is then used to predict the labels of the remaining instances of the data set (excluding the samples), and these predictions are performed in the Learning-Prediction step. The visualization task is intrinsically present in all stages. Finally, the results obtained with the proposed method are evaluated. The steps of each stage of our method are given in Algorithm 1, and each step is described in the next subsections. We implemented our method as a framework with application in soundscape ecology. A description of the interface of the system is shown in Appendix A.

Clustering
To start this stage, a data set of unlabeled instances for which features have been extracted is required. In the case of soundscape data, numerical features are computed from the audio itself and from the image of the audio spectrogram. In general, the input to clustering is only the features extracted for all instances. Firstly, k clusters are extracted from the data employing the Euclidean distance. Hierarchical Agglomerative Clustering and K-Means were used in our framework, both using the Scikit-learn package (available in scikit-learn: https://scikit-learn.org/, accessed on 22 June 2021). We consider that, through clustering, it will be possible to identify patterns between the instances of audio data, and, consequently, these patterns will allow the segregation of the event categories from the soundscape and produce good samples for labeling.
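The clustering step can be sketched with scikit-learn as follows. The feature matrix is random stand-in data, k = 24 is taken from the configuration shown later in Figure 5, and the Ward linkage is an assumption, since the text states only that the Euclidean distance is used.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Synthetic stand-in for the extracted features (assumption): 500 instances.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))

k = 24  # number of clusters

# Hierarchical Agglomerative Clustering; Ward linkage implies Euclidean distance.
hac_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

# K-Means, which also minimizes squared Euclidean distances to the centroids.
km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```

Both calls return one cluster index per instance, which is the only output the Sampling step needs.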

Sampling
The main goal of the sampling step is to extract representative samples from the data set, to be used later in learning tasks. Thus, p samples are extracted from each of the clusters, employing the following sample extraction methods: random (r), medoid (m), and contour (c), as well as their combinations: random-medoid (rm), random-contour (rc), medoid-contour (mc), and random-medoid-contour (rmc). In the random method, samples are taken randomly. In the medoid method, samples are the instances closest to the cluster centroid. The contour method takes the samples furthest from the centroid of the cluster. Figure 2 illustrates the three types of sampling methods.
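The three basic methods and their combinations can be sketched as below. The function `sample_cluster` is hypothetical, and the way a combined strategy shares the budget of p samples (split as evenly as possible, without duplicates) is an assumption, since the text does not specify it.

```python
import numpy as np

def sample_cluster(X, members, p, strategy, rng):
    """Pick up to p instance indices from one cluster.

    strategy is a string over {'r', 'm', 'c'}, e.g. 'r', 'rc' or 'rmc'.
    Assumption: the budget p is split as evenly as possible between the
    listed methods, applied in the order given, without duplicates.
    """
    members = np.asarray(members)
    p = min(p, len(members))
    centroid = X[members].mean(axis=0)
    dist = np.linalg.norm(X[members] - centroid, axis=1)
    order = np.argsort(dist)  # positions sorted from nearest to furthest

    n_methods = len(strategy)
    shares = [p // n_methods + (i < p % n_methods) for i in range(n_methods)]

    chosen = []
    for method, n in zip(strategy, shares):
        taken = set(chosen)
        if method == "m":    # medoid: closest to the centroid
            pool = [members[j] for j in order if members[j] not in taken]
        elif method == "c":  # contour: furthest from the centroid
            pool = [members[j] for j in order[::-1] if members[j] not in taken]
        elif method == "r":  # random
            rest = [i for i in members if i not in taken]
            pool = list(rng.permutation(rest))
        else:
            raise ValueError(f"unknown sampling method: {method!r}")
        chosen.extend(pool[:n])
    return np.array(chosen)

# Usage on synthetic data: one hypothetical cluster made of the first 20 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
members = np.arange(20)
near = sample_cluster(X, members, 5, "m", rng)
far = sample_cluster(X, members, 5, "c", rng)
mixed = sample_cluster(X, members, 5, "rmc", rng)
```

Calling the function once per cluster and concatenating the results gives the full set of samples to be presented to the annotator.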

Annotation
In this step, the induction of learning from the AL paradigm is initiated, through the interaction of expert users. The goal of this step is to capture information from users who are experts in recognizing and differentiating sound categories. Thus, this stage deals with the tasks of listening to and labeling the audio files corresponding to the most representative samples. To perform these tasks, multidimensional projections are employed to allow visualization and interaction with the samples. In a parallel view, the same projection facilitates the visualization of the previously generated clustering of the instances. In order to assist interaction with projections, visualizations from data summarization should be presented on the top banner of the interface. In our implementation, as a new proposal in visual AL, we deal with combined spectrograms over time. They are called Time-Line-Spectrogram (TLS); see the top of Figure 3. The goal of the TLS visualization is to provide more visual information about samples while the user interacts with the projections, in the tasks of labeling by listening. TLS is a supportive visual representation that summarizes the spectrograms of each audio recording (instance) according to the time of recording in the landscape. Other types of data, such as documents, image collections, and videos, can also be summarized by tag clouds or representative pictures.

Learning-Prediction
This step aims to train a learning model from the features and labels of the samples, and this learning is used to predict the labels of the other instances of the data set. To accomplish this step, the user needs to perform the following tasks: (i) Learning: Model training. In this case, the model to be used is the Random Forest Classifier (RFC). (ii) Prediction: After the learning, the labels of the instances other than the samples are predicted. Then, by examining the results using the same visualizations, and the criteria of the application, the steps of the proposed method can be repeated starting from the Clustering step (Section 3.1).
Figure caption (partially recovered): In this view, the t-SNE multidimensional projection is used for visualizing each data point.
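The two tasks of this step can be sketched with the scikit-learn Random Forest classifier. The synthetic data, the 40 randomly chosen "annotated" samples, and the hyperparameters are assumptions for illustration and do not reproduce the settings of the framework.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the extracted audio features (assumption): 3 classes.
X, y_true = make_classification(n_samples=400, n_features=30, n_informative=8,
                                n_classes=3, random_state=0)

# Hypothetical annotated samples: indices whose labels the expert has provided.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=40, replace=False)
rest_idx = np.setdiff1d(np.arange(len(X)), sample_idx)

# (i) Learning: train the classifier on the annotated samples only.
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X[sample_idx], y_true[sample_idx])

# (ii) Prediction: label every instance that was not part of the samples.
y_pred = rfc.predict(X[rest_idx])
```

In the interactive setting, `y_pred` would be shown on the projection so that the user can confirm or correct the predicted labels.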

Validation
To validate the proposed method, it is necessary to use a data set that has true labels, that is to say, one in which all instances of the audios have been previously labeled by expert users. Therefore, the validation in this stage of the method consists of comparing the real labels with the predicted labels. In order to validate the prediction, we use the classification Accuracy (AC), which is defined by Equation (1):
$$AC = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
where TP are the true positives, TN are the true negatives, FP are the false positives, and FN are the false negatives.
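The accuracy computation can be sketched in Python as follows. The formula used is the standard definition of classification accuracy, (TP + TN) / (TP + TN + FP + FN), which we assume is what Equation (1) denotes; the label vectors are made up for the example.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Standard classification accuracy from the four confusion counts."""
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up binary labels (1 = positive class, 0 = negative class).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # 3
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # 3
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # 1

ac = accuracy(tp, tn, fp, fn)  # 6 correct out of 8 -> 0.75
```

With more than two classes, as in a three-label data set, the same quantity reduces to the fraction of instances whose predicted label equals the true label.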

Visualization
Visualization techniques, particularly multidimensional projections, are used in all stages of the proposed method (Clustering, Sampling, Annotation, and Learning), to support verification of results and user interaction where feasible.
By employing projections, users have the same mental model of all steps, and, by interpreting similarity between samples and their neighbors on the projections, the user is equipped with a powerful tool for browsing through data. This should increase the performance of expert users' tasks.

Data Description and Case Study
To validate our proposed method, we used a data set provided in partnership with the Spatial Ecology and Conservation Lab (LEEC) of São Paulo State University (UNESP-Rio Claro). The soundscape recordings are part of Long-Term Ecological Research within the Ecological Corridor of Cantareira-Mantiqueira (LTER CCM or PELD CCM in Portuguese). The audio data were recorded within 22 landscapes distributed in the LTER CCM region, where the following types of environments were sampled: forest, swamps, and open area (mainly pasture). Originally, the region had been covered by forest, but, due to the expansion of agriculture, pasture, and urbanization, the region shows varying forest cover from 16% to 85% in different portions of study areas. The recordings occurred between October 2016 and January 2017, and each landscape was surveyed during three consecutive months (30 days for forest, 30 days for swamps, and 30 days for open areas). Half of the recordings were collected in the morning (from sunrise to 8:30 a.m.) and half in the evening (6:30 p.m. to 10:00 p.m.). For the purpose of the current study, the raw data were re-sampled, in order to represent the soundscape heterogeneity of all the sampled environments. Therefore, the re-sampled data set has more than 40,000 sound files of one minute each. A total of 2277 sound files were labeled by experts. To assess our method, we chose to work with the data set containing those 2277 instances of audio. The sound files were labeled according to the most dominant sound in each minute, which were divided into three labels: 615 for frogs, 822 for birds, and 840 for the insects. Hereafter, we will refer to this data set as DS1. 
In the same region of LTER CCM, two other soundscape ecology studies were conducted: (1) [22], which aimed to assess how spatial scale (i.e., extents) influences acoustic indices responses and how these indices behave according to natural vegetation cover (%); and the authors in (2) [37] developed a method to identify the most discriminant features for categorizing sound events in soundscapes. This present study and its realization in the field of soundscape ecology represents an important tool for further ecological studies in this and other natural areas.
For data analysis, we follow the methodology described in [37], which first suggests a preprocessing of the audios with a set of parameters to generate the spectrum. The soundfile (available at: https://pypi.org/project/SoundFile, accessed on 22 June 2021) and librosa (available at: https://librosa.github.io/librosa, accessed on 22 June 2021) packages in Python were used for this. The real part of the spectrum was used for obtaining spectrograms. The study also suggests that feature extraction from three different sources (descriptors based on acoustic indices, descriptors based on cepstral information, and descriptors based on the image of the spectrogram) can be used together to the benefit of the analysis. For feature extraction, we employed Essentia (available at: https://essentia.upf.edu, accessed on 22 June 2021), Python, Cython, and C.
In our experiments, for each audio minute, a total of 238 features were extracted. For the experiments, we set up four data sets (DS): (DS1) frogs, birds, and insects, with 2277 instances; (DS2) frogs and birds, with 1437 instances; (DS3) frogs and insects, with 1455 instances; and (DS4) birds and insects, with 1662 instances. Each data set has two feature configurations. The first configuration has 102 features in total for each set (original features). For the second configuration, we select the best features using a feature selection method based on feature importance, the Extra-Trees-Classifier (see [37]). After this step, we retained 30 features for DS1, 30 for DS2, 46 for DS3, and 31 for DS4.
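Importance-based selection with an Extra-Trees classifier can be sketched as follows. This is not the exact procedure of [37]: the synthetic data, the number of trees, and the rule of keeping the 30 highest-ranked features are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in (assumption): 300 instances with 102 original features.
X, y = make_classification(n_samples=300, n_features=102, n_informative=10,
                           n_classes=3, random_state=0)

# Fit the ensemble and read the impurity-based feature importances.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

# Keep the n_keep highest-ranked features (hypothetical cut-off).
n_keep = 30
top = np.argsort(importances)[::-1][:n_keep]
X_best = X[:, np.sort(top)]  # reduced matrix, original column order kept
```

The reduced matrix `X_best` plays the role of the "best features" configuration, while `X` corresponds to the configuration with all original features.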

Data Availability and Bioethics
All the raw data used in this study are available on the following platform: https://github.com/LEEClab/soundscape_CCM1_exp01, accessed on 22 June 2021. The Spatial Ecology and Conservation lab of Biodiversity Department of UNESP is in charge of keeping this repository available and updated over time. By the nature of data (audio data only), the UNESP bioethical committee does not require any specific authorization, as no live animals were handled during the sampling period.

Results and Discussion
The predictions of data set labels were computed by performing all of the steps described in Section 3, which are: Clustering, Sampling, Annotation, Learning-Prediction, and Visualization. In this scenario, it is important to highlight that, for the purpose of our analysis, user interaction in the Annotation step is simulated: the true labels are used, exceptionally, only to assign labels to the initial samples. This was done to guarantee the observation of learning effectiveness in the context of the experiments.
For the Learning-Prediction stage, the classifier used was Random Forest (RFC), because it showed robust classification of soundscape data in comparison with the other classifiers that were tested: Support Vector Classifier (SVC), K-nearest Neighbor Classifier (KNNC), and Extreme Gradient Boosting classifier (XGBoost). Details of the experiments carried out to analyze the steps of the methodology are described in the next sections.

Clustering and Sampling Analysis
In order to define the best parameters for the method, the goal of this experiment was to evaluate the first steps of Clustering and Sampling. The accuracy of these steps contributed to increasing the accuracy in the prediction of labels. Therefore, an accurate prediction of the labels can, at the same time, mean a certain segregation of categories of events in data of soundscapes.
In this experiment, we evaluated the following parameters: the number of clusters (k), the total number of samples (p), the number of samples per cluster (pk), and the types of strategies for extracting samples (r, m, c, rm, rc, mc, and rmc; see Section 3.2). The setup of the experiment was defined as follows: (1)  The visual active learning method was executed 7728 times for the four data sets, and eight sets of results were obtained in the form of tables: four for the data sets with all 102 features, and four for the data sets using the best features. The results for the DS1 data set are displayed in Table 1. Initially, the clusters were computed using the K-Means (KM) and Hierarchical Agglomerative Clustering (HAC) algorithms, but the best results were obtained with HAC for more than 20 clusters. As expected, we noticed that the larger the number of samples, the higher the accuracy. However, ideally, one wishes to use as few samples as possible to later predict most of the other instances. Therefore, the number of samples is limited by a threshold for the smallest possible number of samples. In this scenario, from the results, we can infer that setting the number of samples per cluster to |.| is the best option; thus, this parameter can be calculated automatically. Regarding the method for determining initial samples, the analysis focuses on the predominance of maximum accuracy values. After a visual analysis of the information in the table using heatmaps, the most suitable strategies for extracting samples, in order from best to worst, are: r, rc, rm, mc, rmc, m, and c.
In order to make a clearer and more specific analysis, Figure 4 illustrates the comparison of results between the best features and all 102 features, specifically considering the two best sampling strategies (r and rc) and the worst sampling strategy (c) with 10 samples per cluster. In the results presented in Figure 4, we can observe the superiority of the strategies r and rc for the data sets under analysis. Thus, through these experiments, the discriminatory capacity of the selected features was verified, achieving high accuracy in the prediction of sound categories.

Visual Analysis via Projections
Some of the results presented in Table 1 can be visualized in Figure 5. A set of the resulting visualizations from the framework is presented with: (1) k = 24 as the number of clusters; (2) pk = 5 as the number of samples per cluster; and (3) rc as the strategy for extracting samples. The visualizations were generated using t-SNE projections. Figure 5 illustrates each step of the proposed method: Clustering, Sampling, Annotation, and Learning-Prediction. The ground truth is added for comparison.
For each data set, the points of varying colors in Figure 5a,e,i,m represent the 24 clusters generated in the Clustering stage. The colored dots, with up to three colors, in Figure 5b,f,j,n represent the samples from the Sampling step; specifically, the colors represent the user's labeling interaction in the Annotation step. The colored dots in Figure 5c,g,k,o represent the instances with labels that were determined by the prediction in the Learning-Prediction step. It is important to note that only the labels of the samples were used to train the learning model, whereas the prediction was made for the remaining, unlabeled instances of the data sets. Finally, the colors of the points in Figure 5d,h,l,p represent the actual true labels of the data set instances. Visually, we can observe the great similarity between the pairs of labeled projections in Figure 5c,d,g,h,k,l,o,p, which reflects the degree of similarity between the predicted labels and the true labels for each of the four data sets. The accuracy achieved in the labeling task for the four data sets varied from high to very high: 72.6% for DS1 (Figure 5c), 72.36% for DS2 (Figure 5g), 91.09% for DS3 (Figure 5k), and 94.58% for DS4 (Figure 5o). Based on these results, at least for our experiments, we can say that, in order to achieve high accuracy in automatic labeling, the experts need to provide manual annotation for only the following percentages of the data: 5.3% for DS1, 8.35% for DS2, 8.2% for DS3, and 7.2% for DS4.
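The annotated percentages can be checked with a short calculation. It uses the configuration of Figure 5 (k = 24 clusters, pk = 5 samples per cluster) and the data set sizes given earlier, under the assumption that every cluster holds at least five instances, so that exactly k × pk = 120 samples are annotated in each data set.

```python
# Configuration reported for Figure 5.
k, pk = 24, 5

# Number of instances in each data set, as given in the data description.
sizes = {"DS1": 2277, "DS2": 1437, "DS3": 1455, "DS4": 1662}

# Assumption: every cluster has at least pk instances, so k * pk are annotated.
n_annotated = k * pk  # 120

# Percentage of each data set that requires manual annotation.
pct = {name: 100 * n_annotated / n for name, n in sizes.items()}

for name, value in pct.items():
    print(f"{name}: {value:.2f}%")
# DS1: 5.27%
# DS2: 8.35%
# DS3: 8.25%
# DS4: 7.22%
```

Up to rounding, these values agree with the percentages reported in the text, which suggests that the annotation effort is fixed by k and pk and is independent of the data set size.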

Conclusions, Future Work, and Opportunities
In this study, we presented a method to support sample labeling by employing visual active learning as a label prediction strategy. The method was tested in the context of a framework for labeling soundscape ecology data. Experiments evaluated each step of the process by employing a pre-labeled data set provided by our application partners. We have evaluated the number of clusters, the number of samples per cluster, and the sampling strategy within each cluster. Results of the experiments were evaluated according to classification accuracy, which determines the level of prediction of the labels. For clustering unlabeled data, Hierarchical Agglomerative Clustering (HAC) was adapted to the application since it allows for starting the process from user labeled samples, building groups incrementally. According to the results, the best sampling strategies were rc and r because these strategies reached the most representative and informative samples from each cluster. The tables in the form of heatmaps with all cluster results identify that trend.
Therefore, we identify an effective parameter configuration for the method and for the current experiment. The parameters considered were: (1) which clustering strategy to employ, (2) the number of clusters to generate, (3) the strategy to extract samples, (4) the number of samples per cluster, and (5) the visualization technique with which the user will interact in the labeling. Thus, the optimal configuration was: (1) Hierarchical Agglomerative Clustering (HAC) as the strategy to generate clusters; (2) a number of clusters greater than 20; (3) samples extracted with the random method or with random combined with contour; and (4) number of samples |.|, per cluster, where |.| is the number of instances per cluster.
The main visual tool employed by the user to track the process and interpret the results of each step was the multidimensional projection, which reflects the clustering, the sample selection, the result of label prediction, and the comparison with the ground truth.
With the results achieved in our study, we demonstrate that AL and multidimensional visualization can play an important role in achieving favorable performance with significantly less manual annotation effort. In addition, the accuracy of automatic labeling for our case study varied from high (72.36%) to very high (94.58%) for the four data sets, also reflecting a very good advance for the field of soundscape interpretation in the task of discriminating categories of sounds (in our case, groups of animals).
While our approach can be applied to any vector representation of the data set undergoing labeling, the best success in terms of accuracy is achieved when the data set is described by a set of features that has a good potential to discriminate target labels. Our experiments presented here show that aspect of the problem. It is our plan to tackle this problem in the next progression of our efforts in visual analytics for labeling.
Another contribution of our study is the framework itself (see Figure A1), which encapsulates all steps of our proposed method and its calculations. The framework is openly available. Although the framework is dedicated to soundscape data, the strategies it implements should be applicable to other data sets from which a set of attributes can be extracted that consistently discriminates the labels of interest.
In the particular case of soundscape ecology, recordings are being collected constantly, and the amount of data to be labeled increases exponentially in biodiversity monitoring worldwide [22]. Thus, developing solid and easy-to-use methods is of utmost importance for project managers who want to speedily extract information from their data. After data recording in the field, labeling is one of the most time-consuming tasks that precede knowledge extraction. Therefore, contributions such as automatic labeling combined with a good visualization tool can be crucial for the success of conservation projects. The application of this approach was limited to experiments with categorical sound events. While this could be construed as a limitation, the results of the study can benefit science, for example by helping to understand the behavior of certain environments, which can lead to the development of environmental monitoring strategies and conservation policies.
In the next steps of the application studies, we will try to focus the framework on other categories, such as primates, bats, rain, dogs, human conversation, cars, airplanes, guns, streams, wind, and background noises, and also verify its applicability to another level of audio data resolution, such as identifying species.
More crucially for the future development of visual analytics strategies, it is essential that more effort be dedicated to the problem of labeling, which drives data analysis and understanding, machine learning, and feature selection in many different applications, in order to find a balance and to generalize the approach so that it is applicable to a large variety of domains. Examples of areas where accelerating labeling can have a real impact are document analysis, biological data interpretation, and image and video interpretation, to name a few.

Data Availability Statement:
The data presented in this study are available on the following platform: https://github.com/LEEClab/soundscape_CCM1_exp01 (accessed on 28 June 2021).

Acknowledgments:
The authors acknowledge the work of students in LEEC lab, in particular Lucas Gaspar, for the labeling of the data set.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Overview of the Visual Active Learning Framework for Soundscape Ecology
Our methodology was realized by a framework that we call SoundscapeX (code available at https://github.com/hhliz/vizactivelearning, accessed on 22 June 2021). The software implementation is based on a Client/Server architecture developed in Python that mainly employs the following libraries and tools for the back-end: Tornado, Scikit-learn, Librosa, Pandas, and MongoDB. For visualizations, we employ the D3 JavaScript library.
On the server, algorithms written in Python perform the clustering, sampling, learning, prediction, and projection tasks. On the client side, code written with D3.js generates the layout of the views, mainly the projections, with which the user interacts to perform the manual labeling. The application window is a view created with D3.js in the web browser (client), whereas the visualization geometry is computed by the algorithms written in Python (server). Python and D3 communicate via JSON message passing.
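A message exchanged between server and client might look like the following sketch. The function name and the field names (`points`, `id`, `x`, `y`, `cluster`, `label`) are hypothetical and are not taken from the SoundscapeX code; the sketch only illustrates how projection coordinates could be serialized on the Python side so that a D3 view can draw them.

```python
import json


def projection_message(coords, cluster_ids, labels):
    """Serialize a 2D projection as a JSON string for the browser client."""
    if not (len(coords) == len(cluster_ids) == len(labels)):
        raise ValueError("all inputs must have one entry per instance")
    points = [
        {"id": i, "x": float(x), "y": float(y), "cluster": int(c), "label": lab}
        for i, ((x, y), c, lab) in enumerate(zip(coords, cluster_ids, labels))
    ]
    return json.dumps({"points": points})


# Unlabeled instances carry None, which becomes null in JSON.
message = projection_message(
    coords=[(0.12, -1.5), (2.0, 0.75)],
    cluster_ids=[3, 7],
    labels=["bird", None],
)
decoded = json.loads(message)
print(decoded["points"][1])
```

On the client, the decoded `points` array could then be bound to SVG circles through a D3 data join, with `cluster` or `label` mapped to color depending on the step being displayed.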
Here, we provide an overview of the data exploration and labeling functionalities. Figure A1 presents an overview of SoundscapeX and the interface functions for exploring and labeling the data. The main interface components are shown in the regions labeled A to D. (A) The process of labeling starts in the configuration panel, where the user selects the data set to be used, the set of features to represent the data, the type of normalization, the clustering technique (k-means, HCA), the number of clusters, and the visualization technique (t-SNE or Uniform Manifold Approximation and Projection, UMAP [45]). (B) At the top, the first seven buttons are used to interact with and explore the data within the projection in region D. The remaining buttons form a mini configuration panel to determine the type of sampling and the number of samples, and to launch a small interface where the user can listen to the audio and label the selected data samples. (C) This region presents the spectrogram of the whole set of audios in the form of a timeline; it is named the "Time-Line-Spectrogram" (TLS) and offers an overview of the audios under analysis. In addition, the TLS is coordinated with the projection, allowing coordinated exploration of the data. (D) In this region, the visualizations of the active learning process can be observed: the projection colored by cluster, the selection of samples to be labeled by the user, and finally the prediction of the labels. An example of the alternative visualizations in this region is seen in Figure 5.