Article

Enhancing Soundscape Characterization and Pattern Analysis Using Low-Dimensional Deep Embeddings on a Large-Scale Dataset

by Daniel Alexis Nieto Mora 1,*, Leonardo Duque-Muñoz 1 and Juan David Martínez Vargas 2
1 Laboratorio de Máquinas Inteligentes y Reconocimiento de Patrones MIRP, Instituto Tecnológico Metropolitano—ITM, Medellín 050034, Colombia
2 School of Applied Sciences and Engineering, EAFIT University, Medellín 050022, Colombia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 109; https://doi.org/10.3390/make7040109
Submission received: 22 July 2025 / Revised: 18 September 2025 / Accepted: 19 September 2025 / Published: 24 September 2025

Abstract

Soundscape monitoring has become an increasingly important tool for studying ecological processes and supporting habitat conservation. While many recent advances focus on identifying species through supervised learning, there is growing interest in understanding the soundscape as a whole while considering patterns that extend beyond individual vocalizations. This broader view requires unsupervised approaches capable of capturing meaningful structures related to temporal dynamics, frequency content, spatial distribution, and ecological variability. In this study, we present a fully unsupervised framework for analyzing large-scale soundscape data using deep learning. We applied a convolutional autoencoder (Soundscape-Net) to extract acoustic representations from over 60,000 recordings collected across a grid-based sampling design in the Rey Zamuro Reserve in Colombia. These features were initially compared with other audio characterization methods, showing superior performance in multiclass classification, with accuracies of 0.85 for habitat cover identification and 0.89 for time-of-day classification across 13 days. For the unsupervised study, optimized dimensionality reduction methods (Uniform Manifold Approximation and Projection and Pairwise Controlled Manifold Approximation and Projection) were applied to project the learned features, achieving trustworthiness scores above 0.96. Subsequently, clustering was performed using KMeans and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), with evaluations based on metrics such as the silhouette, where scores above 0.45 were obtained, thus supporting the robustness of the discovered latent acoustic structures. To interpret and validate the resulting clusters, we combined multiple strategies: spatial mapping through interpolation, analysis of acoustic index variance to understand the cluster structure, and graph-based connectivity analysis to identify ecological relationships between the recording sites. Our results demonstrate that this approach can uncover both local and broad-scale patterns in the soundscape, providing a flexible and interpretable pathway for unsupervised ecological monitoring.

1. Introduction

In recent years, the study of soundscapes has emerged as a powerful tool for ecological monitoring and environmental assessment. A soundscape is defined as the collection of biophonic, geophonic, and anthropophonic sounds that characterize a given environment [1]. Through passive acoustic monitoring, researchers can gather continuous, noninvasive, and cost-effective information about ecosystems, including biological activity, species richness, and anthropogenic disturbance [2]. Unlike traditional biodiversity surveys, which are often limited by spatial or temporal constraints, acoustic methods enable long-term sampling across large areas and can reveal patterns that would otherwise remain undetected. These advantages have contributed to the growing adoption of soundscape analysis in conservation programs, landscape-scale monitoring efforts, and biodiversity assessments [3]. As acoustic technologies and computational tools continue to improve, soundscapes offer increasing potential for understanding the dynamics and health of ecosystems at the spatial and temporal scales. At the same time, ecoacoustic research is experiencing rapid growth in both the scale of data collection and the demand for deeper insights into the ecological interactions embedded within acoustic communities [3,4]. These trends highlight the pressing need for analytical frameworks capable of handling massive datasets while also capturing the complexity of acoustic niches across frequency bands, ultimately enabling more comprehensive and interpretable assessments of ecosystem structure and function [5].
Recent developments in machine learning have significantly improved the ability to analyze ecoacoustic data [2,6], particularly in tasks involving species detection and classification. Supervised learning techniques, especially those based on deep neural networks, have enabled the automatic identification of animal vocalizations from large volumes of acoustic recordings [6,7]. Notably, tools such as BirdNET [8], which use convolutional neural networks trained on expert-labeled datasets, have achieved high accuracy in the identification of numerous species of birds in different environments. These models have facilitated large-scale biodiversity monitoring and made it possible to study species-specific patterns with high temporal resolution [9,10]. Despite these advances, many existing approaches focus primarily on taxonomic classification, often ignoring the broader acoustic structure of the landscape and the contextual information embedded in non-biological or unclassified sounds. This narrow focus limits the ecological interpretation of soundscapes and restricts the ability to assess ecosystem-level properties. However, the acoustic environment encodes more than just the presence of species or vocal activity. Soundscapes reflect the structure and function of ecosystems as a whole, including spatial patterns, temporal dynamics, and environmental stressors [11,12,13]. Attributes such as habitat connectivity, land use heterogeneity, and ecosystem degradation can be inferred from the composition of and variability in acoustic signals over time and space. These broader patterns are essential for understanding ecological processes, especially in landscapes undergoing rapid change. Yet, studies that treat the soundscape as a complex and integrated ecological signal remain relatively scarce. Most existing research has prioritized species-level outcomes, leaving a gap in our understanding of how acoustic patterns relate to landscape structure and ecosystem health. In this study, we address this gap by comparing multiple methodological pipelines that combine dimensionality reduction and unsupervised clustering for large-scale soundscape characterization. Our goals are to explore how these approaches reveal spatial organization in the acoustic environment and provide tools for interpreting the composition and distribution of clusters from an ecological perspective.
Our work is based on and motivated by recent efforts to explore soundscape patterns through unsupervised analysis. For example, the authors of [14] used acoustic indices to perform spatial exploration of soundscapes within the same study area examined here. Their work emphasized the value of unsupervised learning and evaluated clustering outputs through comparisons with species detections, spectrograms, and the spatial distribution of acoustic indices, highlighting how a soundscape-level structure can emerge without relying on taxonomic annotation. Similarly, the authors of [15] proposed an unsupervised framework that leverages passive acoustic monitoring data and network inference to examine acoustic heterogeneity across landscapes. By characterizing biophonic patterns through the use of sonotypes, they constructed site-level profiles and applied graphical models to infer ecological interactions. Their graph-based approach allowed them to represent similarities among sites and capture acoustic diversity in heterogeneous environments.
On the other hand, although autoencoders have been widely adopted in other fields such as bioinformatics [16], cybersecurity [17], anomaly detection [18], and even remote sensing and landscape monitoring applications [19,20], they remain relatively underused in ecoacoustics. Notable exceptions include the work of [21], who proposed a vector-quantized autoencoder for generating synthetic audio of underrepresented species, and [22], one of the earliest studies to explore autoencoders as an alternative to acoustic indices for clustering short audio recordings. Additionally, in our previous work [23], we evaluated unsupervised learning using a variational autoencoder in comparison with cepstral coefficients and a convolutional architecture known as KiwiNet, highlighting the potential of deep unsupervised representations for soundscape analysis. However, our current work distinguishes itself from the studies mentioned above and from our previous research in several key aspects. (1) We propose a fully unsupervised methodology for the characterization of soundscapes, with an emphasis on spatial representation and acoustic similarity. Within this framework, we introduce a connectivity-based approach that links geographically distant sites according to their acoustic resemblance, providing a novel perspective for exploring ecological structure and interactions through soundscapes. (2) We conduct a systematic evaluation of the relationship between the discovered patterns and ecological attributes derived from metadata, particularly highlighting the spatial structure as a central dimension of interpretation. (3) Finally, as a complementary contribution, we extend the methodological state of the art by incorporating projection techniques such as PaCMAP (understudied in ecoacoustic applications) and by developing a systematic process for parameter selection in both the dimensionality reduction and clustering stages, supported by quantitative evaluation. In the process, we emphasize that parameter choices directly shape the clustering outcomes obtained in subsequent stages, underscoring their critical role in ensuring robust and ecologically meaningful groupings.

2. Materials and Methods

2.1. Dataset Description

The dataset used in this study was obtained from passive acoustic recordings collected within the Rey Zamuro Reserve and Matarredonda Private Nature Reserve, located in the village of La Novilla, San Martín municipality, Meta Department, Colombia (at approximately 3°34′40″ N, 73°26′49″ W). Established in 1993, the reserve spans an area of 6000 hectares, predominantly composed of natural savannas and introduced pastures (around 60%), while the remaining 40% consists of forest cover [14]. The area is part of the tropical humid foothill biome of the Meta region, with elevations ranging between 260 and 300 m above sea level. It lies near the confluence of three hydrographic basins: Caños Cumaral, Chunaipo, and Camoa [24].
The reserve harbors a variety of ecosystems. Forested areas include gallery or riparian forests that line streams and rivers, functioning as critical biological corridors and refuges for fauna within the savanna matrix. In particular, approximately 1200 hectares of dense, well-conserved forest are found in the Matarredonda sector. Seasonally flooded forests of the várzea or igapó type are also present and are ecologically important, especially for primate species that frequently utilize canopy strata between 12 and 18 m in height [25].
The savanna complex consists of multiple formations, including ecologically significant morichales (wetlands dominated by the palm Mauritia flexuosa), which are known to support a wide variety of fauna. Acoustic monitoring was also conducted in other non-forest habitats such as dense shrublands, grasslands, and pasturelands. The ecological interface between open savanna habitats and forest patches, particularly the gallery forests that cut across the landscape, is regarded as a key component in sustaining regional biodiversity and ecological functionality.
The climate in this region is classified as humid tropical, with a mean annual temperature of 25.6 °C and average annual precipitation of approximately 2513 mm. Sunrise occurs around 6:06 a.m., and sunset occurs around 6:05 p.m., resulting in roughly 12 h of daylight year-round.
The acoustic recordings were obtained through a grid-based sampling design, deploying 94 AudioMoth devices (versions 1.0.0 to 1.2.0) spaced 200 m apart. Devices were mounted at a standardized height of 1.5 m above ground, enclosed in Ziploc bags for protection, and powered by AA alkaline batteries, using 32 GB Sandisk Extreme memory cards for data storage. Recordings were captured in mono at a sampling rate of 192,000 Hz, covering various habitats including forest interiors, edges, and open areas [14].
Figure 1 shows the geography of the study site and the points where the acoustic recorder units were located.

2.2. Methods

To characterize the soundscape of the Zamuro and Matarredonda dataset, we employed three sets of features derived from distinct methodological approaches: acoustic indices, embeddings extracted from the VGGish neural network, and a convolutional autoencoder architecture previously proposed in our earlier work [26]. This methodology builds upon our previous study regarding the characterization and clustering of large-scale ecoacoustic datasets. These methods were chosen for their complementary strengths: acoustic indices are widely adopted in ecoacoustics and allow direct ecological interpretation of soundscape dynamics [27,28]; VGGish provides a state-of-the-art reference in general audio representation learning, enabling benchmarking against a model recognized in the machine learning literature [5,29]; and the convolutional autoencoder offers a tailored approach optimized for ecoacoustic data, capable of capturing complex and latent structures beyond the reach of traditional indices or pretrained models [22]. The primary contribution of this study lies in the enhanced analysis and interpretation of results, particularly in revealing spatial and compositional patterns of the acoustic landscape.
Although more advanced machine learning approaches such as transformers or vision transformers have emerged in recent years, these were not considered in this study for several reasons. First, their computational cost grows substantially with the dataset size, making them impractical for large-scale ecoacoustic datasets such as ours. Second, their reduced interpretability poses a limitation in ecological contexts, where understanding and explaining the detected patterns is as important as achieving high predictive performance. Finally, the ecoacoustic research community has emphasized the importance of maintaining a balance between the complexity of the ecological questions and the sophistication of the analytical tools to ensure that the methods used remain proportionate and practically useful [29,30,31]. Based on these considerations, we selected methods that offer both efficiency and interpretability while remaining aligned with the needs and standards of ecoacoustic research.
All analyses and feature extraction methods were implemented using Python 3.10 with the following key libraries: scikit-maad v1.3 [32], TensorFlow v2.8 for VGGish embeddings, and PyTorch v1.13 for autoencoder implementation.

2.2.1. Acoustic Indices

Acoustic indices are computational descriptors extracted from audio signals to summarize the ecological, biological, and anthropogenic patterns within soundscapes. In this study, we computed a total of 60 acoustic indices using the scikit-maad toolbox [32], which provides a comprehensive suite of features derived from different analysis domains.
The indices were calculated using a sliding window approach across each audio file. For each window, all indices were computed and then averaged across time, resulting in a single representative value per index for each recording. This approach ensured robustness and comparability across the dataset.
The indices spanned three main categories: temporal, spectral, and time-frequency. Temporal indices are derived directly from the audio waveform and describe amplitude-based dynamics over time, such as envelope variation, energy, and entropy. Spectral indices focus on the distribution of signal energy across frequency bands and are calculated from the signal’s frequency representation, capturing properties such as the spectral entropy, centroid, and bandwidth. Time-frequency indices combine both temporal and spectral information and are computed via the fast Fourier transform (FFT), which enables the construction of spectrograms and the analysis of complex acoustic structures, such as modulations and transients.
These indices are particularly useful for large-scale ecoacoustic monitoring because they offer a compact and interpretable way to quantify soundscape dynamics without the need for manual annotation. For instance, the Acoustic Complexity Index (ACI) is often used to estimate the level of biological activity in an environment by detecting variations in intensity over short time scales. The Normalized Difference Soundscape Index (NDSI) distinguishes between biotic and anthropogenic sound components, while the Acoustic Diversity Index (ADI) reflects frequency band occupancy, potentially serving as a proxy for species richness. By combining multiple indices, it is possible to generate multidimensional acoustic signatures that can reveal spatial and temporal patterns related to biodiversity, habitat quality, and ecological change.
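To make this step concrete, the sketch below shows how per-recording index vectors could be assembled with scikit-maad; the 12-s window length, the use of non-overlapping windows, the spectrogram settings, and the file path are illustrative assumptions rather than the exact configuration used here.

```python
# Hedged sketch: per-recording acoustic index vector with scikit-maad.
# Window length, paths, and spectrogram settings are illustrative assumptions.
import numpy as np
import pandas as pd
from maad import sound, features

WINDOW_S = 12  # illustrative window length in seconds

def index_vector(path: str) -> pd.Series:
    """Compute indices per window and average them into one vector per file."""
    s, fs = sound.load(path)
    win = int(WINDOW_S * fs)
    rows = []
    for start in range(0, len(s) - win + 1, win):
        seg = s[start:start + win]
        # Temporal (waveform-based) indices for this window
        df_t = features.all_temporal_alpha_indices(seg, fs)
        # Spectral and time-frequency indices from the power spectrogram
        Sxx, tn, fn, _ = sound.spectrogram(seg, fs, mode="psd")
        df_s, _ = features.all_spectral_alpha_indices(Sxx, tn, fn)
        rows.append(pd.concat([df_t, df_s], axis=1))
    # Average across windows -> a single representative value per index
    return pd.concat(rows, ignore_index=True).mean(numeric_only=True)

# Example: one row of the feature table used downstream (hypothetical path)
# features_row = index_vector("recorder_01/20220101_060000.WAV")
```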

2.2.2. VGGish Embeddings

The second method relies on the use of VGGish, a convolutional neural network pretrained on the large-scale AudioSet dataset [33]. VGGish operates on log-mel spectrogram representations and extracts 128-dimensional feature embeddings that are known to capture perceptually relevant information from environmental audio. Each audio segment was transformed into a log-mel spectrogram and fed through the VGGish model to extract compact and transferable feature representations for downstream analysis. VGGish processes audio in 0.96-s segments with 50% overlap, generating log-mel spectrograms with 64 frequency bins covering 125–7500 Hz. The model was used without any fine-tuning, as it was considered a baseline method, and our implementation followed practices reported in other ecoacoustic studies that employed the same dataset [4,34]. Moreover, AudioSet is a strongly labeled dataset, and performing fine-tuning would require labels. While in our case, metadata from the supervised classification scheme could potentially serve this role, such fine-tuning would not necessarily translate into improved performance for the unsupervised tasks, which are the core focus of this study. Since the unsupervised framework is designed to generate insights into the spatial patterns of acoustic landscapes beyond predefined labels, we prioritized maintaining the generalization and comparability of VGGish embeddings by using the pretrained model as is.
In this sense, VGGish, being trained on a large and diverse corpus and validated across multiple tasks, provides a robust and widely recognized baseline against which we can evaluate the effectiveness of our proposed characterization method.
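The sketch below illustrates this extraction step using the pretrained VGGish model published on TensorFlow Hub; the hub URL, the resampling to 16 kHz, and the averaging of frame-level embeddings into one vector per recording reflect typical published VGGish usage and are assumptions, not a transcription of our implementation.

```python
# Hedged sketch of VGGish embedding extraction via TensorFlow Hub.
import numpy as np
import librosa
import tensorflow_hub as hub

vggish = hub.load("https://tfhub.dev/google/vggish/1")  # pretrained, no fine-tuning

def vggish_embedding(path: str) -> np.ndarray:
    """Return one 128-D vector per recording by averaging frame embeddings."""
    # VGGish expects a mono float32 waveform sampled at 16 kHz in [-1, 1]
    waveform, _ = librosa.load(path, sr=16000, mono=True)
    # The model frames the signal into 0.96-s log-mel patches and
    # emits a [num_frames, 128] embedding matrix
    frames = vggish(waveform.astype(np.float32)).numpy()
    return frames.mean(axis=0)
```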

2.2.3. Autoencoder Feature Extraction

Autoencoders are a class of deep neural networks designed for unsupervised feature learning by compressing input data into a lower-dimensional latent space and then reconstructing it. In our analysis, we reused a previously proposed architecture tailored for the characterization of soundscapes.
Let $x \in \mathbb{R}^{D}$ be an input vector representing a spectrogram segment. The encoder function $\phi_{\theta}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{d}$ maps $x$ into a latent vector $z$, where $d < D$. Conversely, the decoder function $\psi_{\theta'}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{D}$ reconstructs the input, producing $\hat{x} = \psi_{\theta'}(\phi_{\theta}(x))$.
The objective of the model is to minimize the reconstruction error between the input $x$ and its approximation $\hat{x}$, typically through the mean squared error (MSE) loss function. The training process involves optimizing the encoder and decoder parameters $(\theta, \theta')$ such that
$$\theta^{*}, \theta'^{*} = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} \left\| x^{(i)} - \psi_{\theta'}\!\left( \phi_{\theta}\!\left( x^{(i)} \right) \right) \right\|_{2}^{2}$$
This formulation ensures that the latent representation z captures the most informative patterns from the input data in a compact form. Once trained, the encoder is used to extract feature embeddings from the entire dataset, enabling subsequent dimensionality reduction and clustering analysis.
In our case, we used the same convolutional autoencoder architecture proposed in our previous work [26]. The network comprises a symmetric structure, with four convolutional layers in the encoder and four deconvolutional layers in the decoder, each followed by ReLU activation functions, except for the final layer, which uses a sigmoid function. The latent space has a dimensionality of 5184, corresponding to $64 \times 9 \times 9$, derived from the number of output channels and the residual spatial dimensions after the encoding path.
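A minimal PyTorch sketch consistent with this description is given below; the channel counts, kernel sizes, and the 144 × 144 input size are assumptions chosen only so that the encoder ends in a 64 × 9 × 9 (5184-dimensional) latent map.

```python
# Sketch of a four-layer convolutional autoencoder with ReLU activations,
# a sigmoid output, and a 64 x 9 x 9 latent space. Layer widths are assumptions.
import torch
import torch.nn as nn

class SoundscapeAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 1 x 144 x 144 -> 64 x 9 x 9 (5184-D when flattened)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder mirrors the encoder; sigmoid keeps outputs in [0, 1]
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)                    # latent feature maps
        return self.decoder(z), z.flatten(1)   # reconstruction and 5184-D embedding

# Training step sketch using the MSE reconstruction loss defined above
model, loss_fn = SoundscapeAE(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.rand(8, 1, 144, 144)             # stand-in for normalized spectrogram patches
recon, emb = model(batch)
loss = loss_fn(recon, batch)
loss.backward(); opt.step()
```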

2.2.4. Feature Projection and Dimensionality Reduction

To explore and visualize patterns in the high-dimensional feature spaces, we employed two widely used dimensionality reduction techniques from the state of the art: Uniform Manifold Approximation and Projection (UMAP) and Pairwise Controlled Manifold Approximation and Projection (PaCMAP). Both methods are nonlinear manifold learning techniques that aim to preserve relevant structural relationships from the original feature space in a lower-dimensional embedding, typically $\mathbb{R}^{2}$ or $\mathbb{R}^{3}$.
UMAP is based on Riemannian geometry and fuzzy topological representations [35]. It constructs a high-dimensional weighted graph where each edge represents the probability that two points are connected and then optimizes a low-dimensional embedding by minimizing the cross-entropy between the high- and low-dimensional fuzzy simplicial sets. Formally, the optimization minimizes the following loss:
$$\mathcal{L}_{\mathrm{UMAP}} = \sum_{(i,j)} \left[ w_{ij}^{(h)} \log \frac{w_{ij}^{(h)}}{w_{ij}^{(l)}} + \left( 1 - w_{ij}^{(h)} \right) \log \frac{1 - w_{ij}^{(h)}}{1 - w_{ij}^{(l)}} \right],$$
where $w_{ij}^{(h)}$ and $w_{ij}^{(l)}$ represent the edge weights in the high- and low-dimensional graphs, respectively.
PaCMAP [36] is a more recent technique that has shown improved performance in preserving both global and local structures, particularly in dense datasets. It introduces a more balanced approach by defining three types of pairwise relationships: near pairs, mid-near pairs, and further pairs. The method minimizes a loss function by combining these distances with dynamically adjusted weights:
$$\mathcal{L}_{\mathrm{PaCMAP}} = \sum_{\mathrm{near}} \frac{d_{ij}^{2}}{d_{ij}^{2} + a} + w_{\mathrm{mid}} \sum_{\mathrm{mid}} \frac{d_{ij}^{2}}{b + d_{ij}^{2}} + w_{\mathrm{far}} \sum_{\mathrm{far}} \frac{1}{c + d_{ij}^{2}},$$
where $d_{ij}$ is the Euclidean distance between points $i$ and $j$ in the low-dimensional space and $a$, $b$, and $c$ are fixed constants that shape the contribution of each term. The weights $w_{\mathrm{mid}}$ and $w_{\mathrm{far}}$ are updated over iterations to emphasize the local or global structure during different phases of optimization.
PaCMAP has been shown to be especially effective for ecoacoustic data, yielding compact and well-separated groupings even in highly dense datasets, thereby facilitating the identification of latent structure in soundscape representations.
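The following sketch shows how both projections could be obtained with the umap-learn and pacmap libraries; the parameter values anticipate the optimal settings reported in Section 3.2.1, and the input file name is hypothetical.

```python
# Sketch of the two projections with umap-learn and pacmap.
import numpy as np
import umap
import pacmap

X = np.load("autoencoder_embeddings.npy")  # hypothetical (n_samples, 5184) array

# UMAP: neighborhood size and minimum distance control the local/global balance
umap_2d = umap.UMAP(n_components=2, n_neighbors=75, min_dist=0.01,
                    metric="euclidean", random_state=0).fit_transform(X)

# PaCMAP: MN_ratio and FP_ratio set the proportion of mid-near and further pairs
pacmap_2d = pacmap.PaCMAP(n_components=2, n_neighbors=75,
                          MN_ratio=0.5, FP_ratio=20.0).fit_transform(X)
```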

2.3. Evaluation of Embedding Projections

To quantitatively evaluate the quality of the low-dimensional representations obtained through UMAP and PaCMAP, we used the trustworthiness metric [37]. This metric assesses how well the local structure of the original high-dimensional space is preserved in the lower-dimensional embedding. Unlike clustering or classification metrics, trustworthiness does not require ground-truth labels, making it especially appropriate for ecoacoustic datasets, where annotations are often unavailable [5].
Mathematically, given a dataset with $n$ points, let $X = \{x_1, x_2, \ldots, x_n\}$ denote the original high-dimensional data and $Y = \{y_1, y_2, \ldots, y_n\}$ denote its low-dimensional embedding. For each point $x_i$, we define the set of its $k$ nearest neighbors in the original space as $N_i^{X}$ and, similarly, $N_i^{Y}$ for the embedding space.
The trustworthiness $T(k)$ is defined as follows:
$$T(k) = 1 - \frac{2}{n k \, (2n - 3k - 1)} \sum_{i=1}^{n} \sum_{j \in U_i^{k}} \left( r_{i,j} - k \right)$$
where the following definitions apply:
  • $U_i^{k} = \{\, j : j \in N_i^{Y} \text{ and } j \notin N_i^{X} \,\}$ is the set of points that are among the $k$ nearest neighbors of $y_i$ in the embedding but not among the $k$ nearest neighbors of $x_i$ in the original space.
  • $r_{i,j}$ is the rank of point $x_j$ in the ordered list of distances from $x_i$ in the original space.
Intuitively, trustworthiness penalizes points that are neighbors in the embedding but not in the original space, weighting the penalty according to how far these points actually are in the original space. A value of $T(k) = 1$ indicates perfect preservation of the neighborhood structure up to $k$, while lower values indicate distortions.
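In practice, the score can be computed directly with scikit-learn, as in the short sketch below; the neighborhood size $k = 15$ and the file names are illustrative.

```python
# Trustworthiness of a low-dimensional embedding with scikit-learn.
from sklearn.manifold import trustworthiness
import numpy as np

X = np.load("autoencoder_embeddings.npy")     # hypothetical high-dimensional features
Y = np.load("pacmap_projection.npy")          # hypothetical 2-D embedding of X

# T(k) = 1 means the k-neighborhoods of the embedding never pull in points that
# were far away in the original space; lower values indicate distortion.
print(trustworthiness(X, Y, n_neighbors=15))  # k = 15 is an illustrative choice
```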

2.3.1. Clustering Methods

To analyze the structure of the low-dimensional embeddings generated via UMAP and PaCMAP, we employed two clustering techniques: K-Means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). K-means is a partitioning algorithm that divides the dataset into k clusters by minimizing the intra-cluster variance. The optimization criterion for k-means is defined as follows:
$$\mathcal{L}_{\mathrm{KMeans}} = \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - \mu_i \right\|^{2},$$
where $C_i$ denotes the $i$th cluster, $x$ is a data point assigned to that cluster, and $\mu_i$ is the centroid of $C_i$. While k-means is efficient, it assumes isotropic clusters and requires prior knowledge of the number of clusters $k$, which may not align with the complexity of large ecoacoustic datasets.
In contrast, DBSCAN is a density-based clustering algorithm that identifies clusters as regions of high point density. It defines clusters based on two main parameters: the neighborhood radius $\varepsilon > 0$ and the minimum number of points $MinPts$ required to form a dense region. Given a dataset $D$, the $\varepsilon$-neighborhood of a point $p \in D$ is defined by
$$N_{\varepsilon}(p) = \{\, q \in D \mid \lVert p - q \rVert \leq \varepsilon \,\},$$
where $\lVert p - q \rVert$ is typically the Euclidean distance between $p$ and $q$. If the cardinality of this neighborhood satisfies $|N_{\varepsilon}(p)| \geq MinPts$, then $p$ is considered a core point. A cluster is formed by connecting all core points that are density-reachable either directly or indirectly through chains of neighboring core points. Points not reachable from any core point are labeled noise or outliers.
Given the size and nature of our dataset, consisting of approximately 53,000 projected feature vectors, DBSCAN offers notable advantages over k-means. Its ability to discover arbitrarily shaped clusters and to automatically ignore outliers makes it particularly effective for high-density heterogeneous data. These conditions are often met in ecoacoustic datasets, especially when using dimensionality reduction techniques like PaCMAP, which tend to create compact and dense groupings in the embedded space. In this context, DBSCAN can robustly identify natural groupings without requiring the specification of the number of clusters beforehand.
Although HDBSCAN, a hierarchical extension of DBSCAN, was considered in the early stages of the study, it was ultimately excluded due to its high computational cost. Moreover, HDBSCAN did not produce significantly different clustering results from DBSCAN when qualitatively evaluated through visual inspection of the projections. As such, DBSCAN was chosen as the most suitable density-based clustering method for this analysis.
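A minimal sketch of both algorithms applied to a two-dimensional projection is shown below; the parameter values echo the settings selected later in Section 3.2 (nine clusters for k-means, $\varepsilon = 2$ and $MinPts = 300$ for DBSCAN), and the input file is hypothetical.

```python
# Sketch of the two clustering algorithms on a 2-D projection (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

Y = np.load("pacmap_projection.npy")           # hypothetical (n_samples, 2) embedding

km_labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(Y)

db = DBSCAN(eps=2.0, min_samples=300).fit(Y)
db_labels = db.labels_                          # -1 marks points treated as noise
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
```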
For the dimensionality reduction and clustering stages, we performed an exhaustive grid search over the parameter space of each method. For the projection techniques (UMAP and PaCMAP), we varied parameters such as the number of neighbors and the ratios of far and mid-near pairs. For the clustering algorithms (k-means and DBSCAN), we explored a range of values for k (number of clusters in k-means), ε (neighborhood radius), and M i n P t s (minimum points for DBSCAN). This optimization process was guided by metadata available in the dataset, including the time of recording, time-of-day categories (e.g., morning, afternoon, or night), and geographic location of each recording unit. These metadata served as surrogate labels to qualitatively assess the coherence of clusters and separability in the projected spaces.
To further refine the selection of optimal parameters for DBSCAN, we used a reachability plot, a diagnostic tool derived from the ordering of points based on density-connectivity. This plot helps to visualize the density structure of the dataset and identify potential cluster boundaries.
Mathematically, the reachability distance between a point p and a core point o is defined as follows:
$$\mathrm{ReachDist}_{\varepsilon}(p, o) = \max\left( \mathrm{CoreDist}_{\varepsilon}(o), \lVert p - o \rVert \right),$$
where $\mathrm{CoreDist}_{\varepsilon}(o)$ is the distance from $o$ to its $MinPts$-th nearest neighbor. For all points in the dataset, the reachability distances are computed relative to the order in which the DBSCAN algorithm visits them.
The reachability plot then displays these distances along the traversal order. The valleys in the plot correspond to dense regions (i.e., potential clusters), while the peaks indicate sparser areas or boundaries between clusters. By inspecting this plot, we identified appropriate values for ε and M i n P t s that revealed consistent and interpretable cluster structures, especially in combination with the dense groupings produced by the PaCMAP projection. This approach provided an intuitive and data-driven way to optimize clustering parameters in complex high-density ecoacoustic datasets.
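The reachability ordering can be obtained with scikit-learn's OPTICS implementation, as in the illustrative sketch below; the $MinPts$ value and the candidate $\varepsilon$ line are placeholders.

```python
# Reachability plot sketch via scikit-learn's OPTICS ordering; valleys suggest
# dense clusters and help pick epsilon for DBSCAN. Parameters are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

Y = np.load("pacmap_projection.npy")                  # hypothetical 2-D embedding

optics = OPTICS(min_samples=300).fit(Y)
reach = optics.reachability_[optics.ordering_]        # distances in traversal order

plt.plot(reach)
plt.axhline(2.0, linestyle="--")                      # candidate epsilon threshold
plt.xlabel("Points (ordered by density-connectivity)")
plt.ylabel("Reachability distance")
plt.show()
```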

2.3.2. Density Peak-Based Validation of Clusters (DPVC)

To support evaluation of the DBSCAN clustering results in the low-dimensional PaCMAP space, we developed a custom validation approach inspired by density peak clustering principles, which we term Density Peak-Based Validation of Clusters (DPVC). This metric quantifies the compactness of each detected cluster by measuring the average distance of its members to the most locally dense point within the cluster, referred to as the density peak.
Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the set of embedded data points, and let $C = \{C_1, C_2, \ldots, C_k\}$ represent the $k$ clusters obtained by DBSCAN, excluding noise. For each point $x_i \in X$, we estimate its local density $\rho_i$ as the average Euclidean distance to its $k$ nearest neighbors. Formally, we have
$$\rho_i = \frac{1}{k} \sum_{j=1}^{k} d(x_i, x_{i_j}),$$
where $x_{i_j}$ is the $j$th nearest neighbor of $x_i$ and $d(\cdot, \cdot)$ denotes the Euclidean distance.
For each cluster $C_j$, we identify its density peak $p_j$ as the point with the smallest $\rho_i$ (i.e., the highest local density):
$$p_j = \arg\min_{x_i \in C_j} \rho_i .$$
Then, we compute the mean distance of all points in $C_j$ to the density peak $p_j$:
$$\mathrm{DPVC}_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} d(x_i, p_j) .$$
Finally, the overall DPVC score is defined as the average of the per-cluster scores:
$$\mathrm{DPVC} = \frac{1}{k} \sum_{j=1}^{k} \mathrm{DPVC}_j .$$
This score captures the internal compactness of clusters relative to their densest region, making it well-suited for validating density-based clustering outcomes in nonlinear embedding spaces. Lower DPVC values indicate tighter and more coherent clusters.
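A direct implementation sketch of this score is given below; the neighborhood size $k$ used for the local density estimate is an assumption, and the labels follow DBSCAN's convention of marking noise as −1.

```python
# Sketch implementation of the DPVC score defined above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dpvc_score(Y: np.ndarray, labels: np.ndarray, k: int = 15) -> float:
    # rho_i: average distance to the k nearest neighbors (small = dense)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Y)
    dist, _ = nn.kneighbors(Y)
    rho = dist[:, 1:].mean(axis=1)            # drop the self-distance in column 0

    scores = []
    for c in np.unique(labels):
        if c == -1:                           # skip DBSCAN noise points
            continue
        idx = np.where(labels == c)[0]
        peak = idx[np.argmin(rho[idx])]       # density peak: smallest rho in the cluster
        d_to_peak = np.linalg.norm(Y[idx] - Y[peak], axis=1)
        scores.append(d_to_peak.mean())       # DPVC_j: mean distance to the peak
    return float(np.mean(scores))             # lower = tighter, more coherent clusters
```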

2.3.3. Connectivity and Graph Construction

To explore the relationships between acoustic recordings and their spatial origins, we constructed two types of graphs: one based on the proximity of audio embeddings and another representing connections between the recording devices.
First, a k nearest neighbors graph was created using the low-dimensional PaCMAP projection of the acoustic features. Given a set of $n$ acoustic samples $\{x_i\}_{i=1}^{n}$ embedded in $\mathbb{R}^{d}$, a graph $G_{\mathrm{audio}} = (V, E)$ was constructed such that each node $v_i \in V$ corresponded to a sample $x_i$, and an undirected edge $(v_i, v_j) \in E$ existed if $x_j$ was among the $k = 1$ nearest neighbors of $x_i$ in the Euclidean space.
Each acoustic sample was associated with a recorder identified by a label $l_i \in L$, where $L$ is the set of unique recorders. Using the sample-level graph $G_{\mathrm{audio}}$, we defined a recorder-level graph $G_{\mathrm{rec}} = (L, E')$, where each node corresponds to a recorder, and an edge $(l_i, l_j) \in E'$ was added if there existed at least one edge in $G_{\mathrm{audio}}$ connecting samples from recorders $l_i$ and $l_j$. The weight $w_{ij}$ of each edge was the count of such cross-recorder edges:
$$w_{ij} = \left| \left\{ (x_p, x_q) \in E \;:\; l_p = l_i,\; l_q = l_j,\; l_i \neq l_j \right\} \right| .$$
To normalize edge strengths, a softmax transformation was applied per node. The softmax normalization was only applied to nodes with at least one neighbor. For a node $l_i$ with neighbors $N(l_i)$, the weights $\{w_{ij}\}$ were transformed as follows:
$$\tilde{w}_{ij} = \frac{e^{w_{ij}}}{\sum_{l_k \in N(l_i)} e^{w_{ik}}} .$$
Finally, all edges with $\tilde{w}_{ij} < 0.75$ were removed to retain only the strongest normalized connections. This resulted in a sparsified graph representing the dominant acoustic similarities between locations.
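The sketch below illustrates this two-level construction with scikit-learn and NetworkX; the input arrays are hypothetical, and because the softmax is applied per node, an edge is kept here if the normalized weight from either endpoint reaches the 0.75 threshold.

```python
# Sketch of the sample-level and recorder-level graph construction.
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

Y = np.load("pacmap_projection.npy")        # hypothetical (n, d) embedded samples
recorders = np.load("recorder_ids.npy")     # hypothetical recorder label per sample

# Sample-level graph: connect each sample to its single nearest neighbor (k = 1)
nn = NearestNeighbors(n_neighbors=2).fit(Y) # neighbor 0 is the sample itself
_, idx = nn.kneighbors(Y)

# Recorder-level graph: edge weight counts cross-recorder nearest-neighbor links
G = nx.Graph()
G.add_nodes_from(np.unique(recorders))
for i, j in enumerate(idx[:, 1]):
    li, lj = recorders[i], recorders[j]
    if li != lj:
        w = G.get_edge_data(li, lj, default={"w": 0})["w"]
        G.add_edge(li, lj, w=w + 1)

# Per-node softmax normalization, then keep only strong edges (threshold 0.75)
keep = []
for u in G.nodes:
    nbrs = list(G[u])
    if not nbrs:
        continue
    e = np.exp([G[u][v]["w"] for v in nbrs])
    for v, s in zip(nbrs, e / e.sum()):
        if s >= 0.75:
            keep.append((u, v, float(s)))

G_strong = nx.Graph()
G_strong.add_weighted_edges_from(keep)       # sparsified recorder-level graph
```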

3. Results and Discussion

For the experiments, we processed the dataset using only recordings without rainfall. Noisy data and recordings with significant rain content were removed using the methodology described in [38]. This preprocessing step ensured that subsequent analyses focused only on biologically and ecologically informative acoustic content and avoided patterns and clusters biased by noisy data. After removing rainfall data, as part of the preprocessing pipeline, we implemented a custom data loader that resampled the original recordings from 192,000 to 22,050 Hz. This sampling rate was chosen because it encompassed the range of human-audible frequencies and retained most of the ecologically relevant acoustic information present in typical soundscapes. Each recording was then segmented into five non-overlapping 12-s clips. For each segment, a spectrogram was computed following the procedure illustrated in Figure 2.
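The following sketch reproduces this preprocessing step with librosa; the STFT parameters are assumptions, as the exact spectrogram settings are those illustrated in Figure 2.

```python
# Preprocessing sketch: resample to 22,050 Hz, cut into non-overlapping 12-s
# clips, and compute one spectrogram per clip. STFT settings are assumptions.
import librosa
import numpy as np

def clips_and_spectrograms(path: str, sr: int = 22050, clip_s: int = 12):
    y, _ = librosa.load(path, sr=sr, mono=True)       # load + resample in one step
    clip_len = sr * clip_s
    specs = []
    for k in range(len(y) // clip_len):
        clip = y[k * clip_len:(k + 1) * clip_len]
        S = np.abs(librosa.stft(clip, n_fft=1024, hop_length=512))
        specs.append(librosa.amplitude_to_db(S, ref=np.max))
    return specs                                      # up to five spectrograms per file
```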
We performed feature extraction using the baseline methods—VGGish embeddings and acoustic indices—according to the methodology described in Section 2.2. These two approaches served as standardized representations for capturing the spectral and temporal properties of the soundscape.
For autoencoder feature extraction, we trained the vanilla convolutional autoencoder proposed in our previous work using 20% of the dataset over 10 epochs. Unlike our previous study, where 98% of the data (about 17,000 samples) was required for training, in this study, we were able to substantially reduce the proportion of training data while still maintaining accurate reconstructions. This highlights the generalization capability of the network, as it was applied to two large ecoacoustic datasets and achieved comparable reconstruction performance even when trained with a reduced portion of the original data. Such consistency across datasets suggests that the proposed feature extraction approach can be reliably applied in large-scale ecoacoustic contexts without requiring exhaustive training resources. The architecture, illustrated in Figure 2, consists of an encoder with four convolutional layers interleaved with rectified linear unit (ReLU) activations, followed symmetrically by a decoder comprising four transposed convolutional (deconvolutional) layers, which also had ReLU activations, except for the final layer, which used a sigmoid activation to produce the reconstructed output.
To evaluate the performance and generalizability of the model, we monitored the mean squared error (MSE) on a test subset held out during training and visually inspected the reconstructed spectrograms. The embedding space was obtained by flattening the output of the final convolutional layer in the encoder, producing a representation of 5184 dimensions ($64 \times 9 \times 9$), where 64 corresponds to the number of filters and $9 \times 9$ corresponds to the spatial resolution after the encoding stages. This latent representation served as the input for subsequent projection and clustering analyses.
The analysis of the experimentation and results was structured into the following components: feature projection, clustering, acoustic component identification through indices, spatial pattern analysis, and finally evaluation of data connectivity. While the primary focus of this work is unsupervised exploration, the reliability of such analysis fundamentally depends on the ability of the model to extract meaningful and interpretable features. To assess the representational quality of the extracted features, we conducted a supervised learning evaluation using habitat cover-type labels as a reference standard, allowing a comparative assessment of the described feature extraction methods.

3.1. Multiclass Classification Using Cover Type and Time Metadata as Labels

For the Rey Zamuro dataset, three habitat cover classes were defined: forest (19.4%), pasture (57.7%), and savanna (22.7%). This classification task presents two main challenges—the multiclass nature of the problem and the class imbalance—particularly due to the overrepresentation of pasture samples. Although the forest and savanna classes were relatively balanced with respect to each other, the dominance of the pasture class introduced bias in the learning process.
To assess the discriminative power of the feature representations extracted from each method, we conducted a supervised classification task using a random forest (RF) classifier. The dataset was partitioned into two non-overlapping subsets; 80% of the total samples were allocated for training, and the remaining 20% were reserved exclusively for testing. This partitioning was performed using stratified sampling to preserve the class distribution across both sets. The training set was used to fit the classifier and assess performance during model development, while the test set was held out entirely during training and only used for final evaluation to measure the generalization capability. This approach builds on our previous work [26], where a binary forest vs. non-forest classification was carried out in an exploratory way to verify that our characterization method captured meaningful acoustic information. In this study, we made the evaluation more robust by extending it over multiple days, performing multiclass classification for several habitat covers, adding a multiclass time-of-day classification, and including statistical tests to demonstrate that the observed differences were statistically significant, an aspect not addressed in our earlier work.
The random forest classifier was configured with a fixed maximum tree depth of 16 and a random seed of 0 to guarantee reproducibility and consistency across experiments. Classification performance was then quantified using standard evaluation metrics including the accuracy, macro-averaged F1 score, and recall, allowing us to systematically compare how well each representation captured ecologically meaningful distinctions between landscape types.
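A compact sketch of this evaluation protocol is shown below; the feature and label files are hypothetical, while the split, tree depth, seed, and metrics follow the description above.

```python
# Supervised check: stratified 80/20 split, random forest (max depth 16, seed 0),
# scored with accuracy, macro F1, and macro recall.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score

X = np.load("autoencoder_embeddings.npy")   # hypothetical feature matrix
y = np.load("cover_labels.npy")             # hypothetical habitat cover labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(max_depth=16, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print("macro F1:", f1_score(y_te, pred, average="macro"))
print("macro recall:", recall_score(y_te, pred, average="macro"))
```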
Initially, the classification was performed on the entire dataset of 53,275 samples. Although this approach is not computationally demanding, it lacks statistical robustness to assess generalization. To address this, we implemented an alternative evaluation strategy by partitioning the dataset according to the day on which each sample was recorded. This resulted in 13 independent subsets, enabling day-wise classification and allowing for the assessment of metric variability across temporal segments. Within each daily subset, we again retained 80% of the data for training and 20% for evaluation. The results, summarized in the box plot presented in Figure 3, show that the features extracted using the autoencoder consistently achieved the highest scores for all evaluation metrics, thereby demonstrating superior representational capacity and robustness.
To quantitatively evaluate the differences in classification performance among the feature extraction approaches, we conducted non-parametric statistical tests on three evaluation metrics: the accuracy, F1 score, and recall. The Friedman test was used to assess whether there were significant differences across the three methods: autoencoders (AEs), VGGish (VGG), and acoustic indices (AIs). When significant differences were detected ( p < 0.05 ), pairwise comparisons were further examined using the Wilcoxon signed-rank test. Table 1 summarizes the results of these tests. The Friedman test revealed statistically significant differences across the methods for all three metrics ( p < 0.001 ). Post hoc Wilcoxon comparisons indicated that the autoencoder-based features significantly outperformed both VGGish and the acoustic indices in most cases, particularly in comparisons involving AEs vs. VGG and AEs vs. AIs. These results support the conclusion that the features extracted via autoencoders encoded more discriminative information relevant to the classification of habitat cover types.
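The statistical comparison can be reproduced along the lines of the sketch below, where the per-day metric vectors are hypothetical arrays with one value per sampling day.

```python
# Friedman test across the three methods on the 13 day-wise scores, followed by
# pairwise Wilcoxon signed-rank tests when the omnibus test is significant.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-day accuracy vectors (one value per sampling day, 13 days)
acc_ae, acc_vgg, acc_ai = (np.load(f"acc_{m}.npy") for m in ("ae", "vgg", "ai"))

stat, p = friedmanchisquare(acc_ae, acc_vgg, acc_ai)
if p < 0.05:                                   # only then run post hoc comparisons
    for name, other in (("AE vs VGG", acc_vgg), ("AE vs AI", acc_ai)):
        w, pw = wilcoxon(acc_ae, other)
        print(name, w, pw)
```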
Similarly, we investigated the use of temporal metadata as classification labels, given that several studies have demonstrated significant variations in soundscape composition across different times of day. Temporal dynamics in acoustic environments are crucial for understanding species behavior, activity patterns, and ecosystem processes [39,40]. For instance, diurnal and nocturnal shifts in vocal activity influence the acoustic community structure, which can be effectively captured and analyzed through time-resolved soundscape data [41]. Incorporating time-of-day information enables a more detailed characterization of ecological patterns and enhances the interpretability of unsupervised clustering and feature extraction methods.
Figure 4 presents box plots summarizing the classification performance across 13 sampling days, segmented into three distinct time-of-day intervals: dawn (5:00 a.m.–8:00 a.m.), day (8:00 a.m.–5:00 p.m.), and night (5:00 p.m.–5:00 a.m.). This temporal segmentation aligns with recent studies in Colombia [1,42] and accounts for the equatorial location of the site, where seasonal variation in sunrise and sunset times is minimal. This division captures relevant diel patterns in acoustic activity and aligns with ecological processes and animal behavior commonly observed in neotropical soundscapes [41].
A similar statistical evaluation was conducted for the classification performance across the three temporal segments: dawn, day, and night. The results, summarized in Table 2, show a consistent pattern with the habitat cover classification. The Friedman test again revealed statistically significant differences among the feature extraction methods for all metrics ( p = 0.00004 ). Subsequent Wilcoxon signed-rank tests confirmed that the autoencoder features significantly outperformed both VGGish and acoustic indices across the accuracy, F1 score, and recall metrics. Of particular note is the AE vs. AI comparison, which yielded a Wilcoxon statistic of 1.000 and a p value of 0.0020, indicating a nearly systematic advantage of the autoencoder.

3.2. Low-Dimensional Feature Embedding and Clustering

In this section, we detail the entire unsupervised procedure. As illustrated in Figure 5, the process begins with data characterization using acoustic indices and embeddings, which are then projected into a low-dimensional space using state-of-the-art methods that have demonstrated strong performance across diverse data types. Clustering is subsequently performed to uncover patterns across multiple dimensions and ecological aspects of the landscape. Finally, we analyze the results through multiple strategies: (1) by examining the spatial structure of each cluster via interpolation of features at each sampling point; (2) by interpreting the most relevant acoustic patterns using acoustic indices; and (3) by proposing a method to estimate the connectivity between locations, using the connectivity of individual recordings as a proxy. Each component of the process is described in detail below.

3.2.1. Feature Projections and Method Optimization

To investigate the underlying structure of the acoustic landscape, we employed two nonlinear dimensionality reduction techniques: Uniform Manifold Approximation and Projection (UMAP) and Pairwise Controlled Manifold Approximation Projection (PaCMAP). These methods were carefully selected because of their ability to preserve both local and global structures when projecting high-dimensional feature spaces [36,43]. UMAP in particular has gained wide recognition in recent years for its robustness and versatility, establishing itself as one of the most reliable methods for feature projection, including ecoacoustic applications [44,45]. PaCMAP, on the other hand, is a more recent approach that has demonstrated the capacity to produce projections with well-separated and highly compact regions [36,46]. This characteristic is especially advantageous for clustering tasks, as it facilitates the identification of distinct groups and improves the interpretability of the ecoacoustic feature space. Altogether, these methods are particularly suitable for ecoacoustic applications, where temporal, spectral, and spatiotemporal patterns coexist in complex ways and require techniques capable of capturing such complexity without losing interpretability.
We did not perform analyses directly in the original feature space for several reasons. (1) Recent studies emphasize the value of low-dimensional visualizations for enhancing interpretability and facilitating expert-driven ecological insights [27,45]. (2) Processing in the original 5184-dimensional space significantly increases computational demands, reducing the feasibility of applying the method in practical or large-scale ecological contexts. Finally, (3) as demonstrated in our previous work [26], the difference in pattern detection performance between using the original space and its low-dimensional projection is marginal, further supporting the use of dimensionality reduction as a reliable and efficient alternative.
We placed particular emphasis on a thorough exploration of the parameter space for both dimensionality reduction and clustering, aiming to enhance the reliability and interpretability of the low-dimensional embeddings. Unlike previous studies in ecoacoustics and bioacoustics, which often apply default or minimally adjusted parameters in dimensionality reduction and clustering techniques (e.g., [5,47]), we performed a detailed grid search to systematically assess how hyperparameter choices affect the structure and separability of the resulting data representations. This is a critical yet frequently overlooked aspect, as recent work has shown that parameter sensitivity in methods like UMAP or PaCMAP can significantly influence the topology of the low-dimensional space and, consequently, the ecological interpretations drawn from these embeddings [29].
For quantitative evaluation, Figure 6a(I) and Figure 6a(II) present the quality assessment of low-dimensional embeddings generated by UMAP and PaCMAP, respectively. We employed the trustworthiness metric, which evaluates the consistency between high-dimensional neighborhoods and their representations in the reduced space without relying on class labels or prior clustering. This makes it particularly appropriate for ecoacoustic datasets, where the annotated ground truth is typically unavailable or limited. For UMAP (Figure 6a(I)), the configuration with the highest trustworthiness score used a neighborhood size of 75 and a minimum distance of 0.01. For PaCMAP (Figure 6a(II)), the optimal configuration used a neighborhood size of 75, a mid-scale neighbor ratio of 0.5, and a far neighbor ratio of 20.

3.2.2. Clustering Analysis

To analyze the latent structures in the low-dimensional embeddings, we applied two commonly used unsupervised clustering algorithms, k-means and DBSCAN, which offer complementary perspectives. K-means assumes spherical and evenly spaced clusters, optimizing intra-cluster compactness, whereas DBSCAN identifies clusters based on the local density, allowing it to detect arbitrarily shaped groupings and exclude noise. Based on the geometric characteristics of the embeddings, we paired UMAP with k-means and PaCMAP with DBSCAN. UMAP tends to produce globally coherent layouts that align with the centroid-based partitioning of k-means, facilitating the separation of data into compact and uniformly distributed clusters. In contrast, PaCMAP often results in high-density, tightly grouped regions with flexible spacing, which aligns well with DBSCAN’s density-based detection mechanism. This strategic pairing allowed us to better exploit the strengths of each clustering algorithm in accordance with the topological properties induced by the respective projection method, yielding more interpretable and ecologically meaningful groupings in the ecoacoustic dataset.
To evaluate the clustering results, we used three internal validation metrics for the k-means and UMAP combination: the silhouette coefficient, Davies–Bouldin Index, and Calinski–Harabasz index. Following an initial global evaluation to identify promising parameter configurations, we performed a per-day analysis to examine the consistency and reproducibility of the clustering structures across different temporal subsets. This approach ensured that the selected combinations of the dimensionality reduction and clustering methods yielded stable and interpretable groups throughout the entire dataset. However, after a deeper analysis of the results, we found that the silhouette coefficient and Davies–Bouldin Index did not consistently favor a specific cluster configuration throughout the study days when observing temporal trends of the metrics versus the number of clusters. To address this variability, we employed a box plot-based visualization (Figure 7) to summarize the distribution of scores for each metric on all days. This visualization revealed a pronounced trend in the Calinski–Harabasz index, which favored a larger number of clusters, while the silhouette coefficient and Davies–Bouldin Index exhibited less consistent behavior, although these metrics displayed a local optimum around nine clusters, indicated by a local maximum in the silhouette score and a local minimum in the Davies–Bouldin Index (where lower values reflect better-defined clusters). Notably, the Calinski–Harabasz Index also exhibited a sharp inflection after nine clusters, suggesting this value as a feasible choice for the number of clusters. This result was unexpected, as it suggests a higher degree of ecological and acoustic heterogeneity compared with our previous assumption. Although ecoacoustic features are known to reflect various spatiotemporal dynamics and frequency-dependent behaviors associated with species composition and environmental structure (e.g., [9,47]), the emergence of nine distinct clusters suggests that the learned acoustic representations are capturing latent ecological patterns that transcend superficial spatial, temporal, or spectral distinctions. This underscores the potential of unsupervised clustering on low-dimensional embeddings as a powerful tool for revealing a nuanced ecoacoustic structure within complex soundscapes.
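The per-k evaluation underlying Figure 7 can be sketched as follows; the range of cluster numbers and the input projection file are illustrative.

```python
# Per-k internal validation for the UMAP + k-means pipeline (scikit-learn metrics).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

Y = np.load("umap_projection.npy")                 # hypothetical 2-D embedding

for k in range(2, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
    print(k,
          silhouette_score(Y, labels),             # higher is better
          davies_bouldin_score(Y, labels),         # lower is better
          calinski_harabasz_score(Y, labels))      # higher is better
```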
For the evaluation of parameter configurations using DBSCAN with PaCMAP, we extended the analysis beyond the internal metrics previously described by incorporating two additional methods specifically designed for density-based clustering algorithms. The first was the reachability plot, a visual tool commonly used with the OPTICS algorithm. This plot represents the reachability distances of points in the order they are processed, allowing the identification of cluster structures as valleys or drops in the curve, while flat regions typically indicate noise or transitions between clusters. Its main advantage lies in offering a flexible exploratory view of the data’s density structure without relying on a fixed density threshold.

3.2.3. Evaluation of Optimal DBSCAN Parameter Settings

In addition, we developed a custom procedure inspired by the principles of density peak clustering, which we refer to as Density Peak-Based Validation of Clusters (DPVC). This method begins by estimating the local density of each point using the average distance to its k nearest neighbors. Then, for each cluster, the density peak is identified as the point with the highest local density (i.e., the smallest average distance to its neighbors). The DPVC score for a given cluster is computed as the mean distance of all points in the cluster to this density peak. The final DPVC value is obtained as the average of these scores across all non-noise clusters. This metric provides an indication of within-cluster compactness, and it proved particularly useful for validating DBSCAN results in nonlinear spaces like those produced by PaCMAP, where traditional centroid-based metrics may fail to accurately capture cluster organization.
As shown in Figure 8, the reachability plots for days 5, 6, and 7 show deep and well-separated valleys, indicating the presence of well-structured groups. This is an important condition for the application of density-based clustering methods such as DBSCAN, as such valleys reflect dense regions that are clearly separated by lower-density transitions. Additionally, for the selection of the ε parameter based on the reachability distance, it can be observed that the most prominent valleys consistently appeared above a distance of two, suggesting this as a feasible value. Consequently, we selected ε = 2 for DBSCAN in our experiments. To support this choice, we also applied the Density Peak-Based Validation of Clusters (DPVC) metric, which confirmed that the selected configuration of ε = 2 and minPts = 300 produced compact and coherent clusters in the PaCMAP-embedded space. This validation is consistent with recent advances in adaptive density-based clustering, which highlight the importance of tuning parameters such as ε and minPts in datasets with heterogeneous density structures [48]. Furthermore, modern reviews of density peak clustering methods support the use of local density-based validation criteria such as DPVC to evaluate cluster cohesion [49].
The figure also shows the reachability plots for all sampling days. A consistent clustering structure is visible on most days, with clear anomalies for days 3 and 13. These deviations can be explained by incomplete sampling, as these days corresponded to the deployment or removal of recording devices in the field. As a result, fewer audio samples were available, leading to sparser representations in the PaCMAP space and fewer detectable clusters. This highlights the importance of the completeness of the data when applying this methodology to spatial or temporal subsets. In such cases, a reduced sampling rate can significantly alter the low-dimensional representation and, consequently, the clustering outcomes.

3.3. Soundscape Spatial Pattern Analysis

3.3.1. Methodological Description of the Spatial Pattern Analysis

From this point onward, we present the results analysis based on the previously described strategies, using their respective optimal parameter configurations. The analysis was structured into three main components. First, we conducted a spatial analysis based on the acoustic similarity revealed by the clustering results, allowing us to examine how soundscape patterns were distributed geographically. Second, in order to discern the specific acoustic patterns captured by each proposed method, we performed an index-based analysis. This step identified the dominant acoustic indices contributing to the clustering structure and examined their ecological relevance. Lastly, we introduced an approach to explore geographic connectivity by evaluating the acoustic similarity among locations, thus highlighting potential ecological links or discontinuities in the acoustic landscape. This analytical framework enabled a multifaceted examination of the soundscape structure and facilitated interpretation of the proposed methodology in relation to ecologically meaningful attributes. While the supervised classification experiments demonstrated high discriminative power using available labels such as habitat cover and time-of-day ranges, the core objective of the unsupervised framework was to uncover latent acoustic patterns beyond predefined class labels. Nevertheless, in the final stage of the analysis (focused on acoustic connectivity), we incorporated the habitat cover classes to interpret the similarity between geographically distinct sites. This integration allowed us to validate the ecological relevance of the uncovered acoustic structures while preserving the exploratory nature of the unsupervised approach. In contrast to previous studies conducted in the reserve [4,34], the proposed framework extends beyond supervised schemes and leverages deep learning methods capable of capturing more complex patterns than traditional approaches [14]. This enables a richer unsupervised analysis, complemented by spatially explicit interpretations, addressing a gap that remains largely unexplored in most ecoacoustic studies.
The clustering and projection analyses presented earlier were conducted with the objective of supporting the subsequent study of soundscape patterns while reducing the uncertainty associated with the configuration and parameterization of the methods. As demonstrated previously, these parameters significantly influence the outcome and therefore the interpretation of the acoustic landscape. In this section, we focus on spatial analysis, aiming to investigate whether the identified acoustic clusters were geographically concentrated in specific areas or if they shared similar soundscape features across the landscape. This analysis leverages the design of the sampling protocol in the Zamuro Natural Reserve, which was based on a structured grid layout. This design enables spatial reasoning and interpretation by providing systematic coverage of the study area.
For spatial pattern analysis, we used the cluster assignments obtained from both proposed methodologies, i.e., UMAP combined with k-means and PaCMAP combined with DBSCAN. For each of the 93 recording sites, we computed the number of audio samples belonging to each cluster. This allowed us to quantify the degree of association between each site and each acoustic group, revealing the extent to which certain soundscape patterns dominated specific areas. Cluster-site associations served as the foundation for exploring spatial acoustic structure in the reserve. These proportions of cluster membership, based on the number of audio samples assigned to each cluster per site, were used to represent the sampling points geographically and perform an interpolation procedure. This allowed us to visualize (1) how the clusters were spatially distributed across the landscape, and (2) the degree of similarity among locations based on their acoustic characteristics. In addition to the two clustering approaches previously described, this interpolation analysis was also performed using the characterization based on acoustic indices. This third approach was included to enhance interpretability, as acoustic indices capture relevant temporal and spectral dynamics of the recordings [50,51], thus providing a meaningful reference framework for the interpretation of emerging spatial patterns, as noted in previous studies [52,53].
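As an illustration of this site-level aggregation and interpolation, the sketch below shows one way the cluster-membership proportions and the interpolated surface could be computed. It is a hedged example rather than the original pipeline: it assumes a hypothetical DataFrame `df` with one row per audio segment and columns `site`, `x`, `y` (recorder coordinates), and `cluster` (assigned label); the column names and the choice of linear interpolation are ours.

```python
# Minimal sketch of the site-level cluster association and interpolation step.
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

# df: one row per audio segment, with hypothetical columns
# 'site', 'x', 'y' (recorder coordinates) and 'cluster' (assigned label)
counts = df.groupby(["site", "cluster"]).size().unstack(fill_value=0)
proportions = counts.div(counts.sum(axis=1), axis=0)  # cluster membership per site

# Interpolate the membership of one cluster (e.g., cluster 4) over the sampling grid
coords = df.groupby("site")[["x", "y"]].first().loc[proportions.index]
grid_x, grid_y = np.meshgrid(
    np.linspace(coords["x"].min(), coords["x"].max(), 200),
    np.linspace(coords["y"].min(), coords["y"].max(), 200),
)
heatmap = griddata(coords.values, proportions[4].values, (grid_x, grid_y), method="linear")
```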

3.3.2. Spatial Pattern Results of the PaCMAP and DBSCAN Approach

Initially, we applied the combination of PaCMAP projection and DBSCAN clustering to analyze the spatial distribution of acoustic patterns. The resulting clusters were represented as heat maps, providing a geographic visualization of acoustic similarity (Figure 9). In the heat map, it is possible to observe both broadly distributed patterns across the landscape, such as those represented by clusters 2 and 4, and clusters more concentrated in specific areas, such as clusters 3, 6, and 9. These results suggest that certain acoustic patterns were widespread and occurred under a variety of environmental or habitat conditions, while others were limited to particular zones, potentially linked to localized ecological features. This spatial contrast supports the idea that the clustering approach captures both general and site-specific soundscape structures, providing valuable insights into the acoustic organization of the study area. In addition, lateral patterns can be observed in certain clusters, such as cluster 1, which was concentrated toward the left side of the sampling area, and cluster 7, which appeared more frequently on the right side. In contrast, no evident trends appeared along the vertical (north-south) axis of the grid. As mentioned previously, the sampling design was implemented using a spatial grid, making this type of analysis appropriate for interpreting how soundscape variation occurs longitudinally or latitudinally. In this case, the observed patterns suggest stronger acoustic differentiation along the longitudinal gradient of the landscape.

3.3.3. Spatial Pattern Results of the UMAP and K-Means Approach

Similarly, we extracted heat maps for the clustering methodology using UMAP and k-means, as well as for the characterization based on acoustic indices. As a result, we identified several spatial patterns that were shared between the deep embedding-based methods (i.e., Soundscape-Net representations) and the clustering derived from acoustic indices. This finding is relevant in two main ways. First, it confirms that the deep neural network embeddings effectively captured landscape-level patterns that were also perceptible through classical ecoacoustic approaches. Second, this alignment contributes to enhancing the interpretability of the results, thereby supporting interdisciplinary collaboration with biologists and ecologists by bridging data-driven acoustic representations with ecologically meaningful indicators.
Figure 10 shows a direct comparison of the spatial heat maps derived from the clustering outputs of the three evaluated methods. The columns correspond to the three approaches: (a) autoencoder embeddings with k-means–UMAP, (b) autoencoder embeddings with DBSCAN–PaCMAP, and (c) acoustic indices with k-means. Each row displays a representative cluster selected from each method. Several clusters exhibited similar spatial patterns across the methods. For example, in the first row, a cross-shaped pattern appears in the upper left part of the grid (covering rows 3–6 of the recorder layout), and it was consistently detected by both the autoencoder-based methods and the acoustic index-based approach. Moreover, the lower central region of the grid shows activity for the same clusters, with a higher degree of similarity between the two deep learning-based methods, although the pattern was still observable in the acoustic indices. Similarly, in the second and third rows of Figure 10, shared spatial patterns can also be observed. The second row displays a more consistent cluster distribution across the three methods, indicating a stable acoustic pattern that emerged regardless of the feature extraction or clustering technique applied. In contrast, the third row exhibits clusters with high spatial variability, which showed less agreement between methods. This variability may reflect more complex or localized acoustic dynamics, making the clusters in this case more sensitive to the characteristics of the embedding space or the clustering algorithm used.
One key advantage of using UMAP in combination with autoencoders is that UMAP provides a transformation that allows inverse mapping from the low-dimensional embedding back to the original feature space, unlike PaCMAP, which does not offer this functionality. This reversibility enables the use of the decoder component of the autoencoder to reconstruct representative spectrograms for each cluster based on the corresponding embeddings, as can be seen in Figure 11. In the context of ecoacoustic analysis, this capability is particularly valuable as it allows us to identify the dominant frequency patterns associated with specific clusters and map them spatially. This offers a direct link between abstract latent representations and their interpretable acoustic content, enhancing both the explanatory power of the model and its ecological relevance.
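A minimal sketch of this decoding step is shown below. It assumes `umap_model` is the fitted UMAP instance, `embedding_2d` its two-dimensional output, `labels` the k-means assignments, and `decoder` the decoder half of the trained autoencoder exposed as a callable; all of these names, and the use of the cluster centroid as the representative point, are illustrative assumptions rather than details taken from the original implementation.

```python
# Minimal sketch: decode a representative spectrogram for one cluster.
import numpy as np

cluster_id = 4
centroid_2d = embedding_2d[labels == cluster_id].mean(axis=0, keepdims=True)

# UMAP supports inverse mapping from the 2-D embedding back to the input
# feature space (here, the autoencoder representation fed to UMAP)
latent = umap_model.inverse_transform(centroid_2d)

# Pass the recovered representation through the autoencoder decoder
# (the exact call depends on the deep learning framework used)
spectrogram = decoder(latent)
```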

3.4. Acoustic Index Distribution Among Clusters

To gain deeper insight into the ecological patterns captured by each method, we computed the acoustic indices for all audio samples within each cluster. We then calculated the variance of each index across the clusters to identify which indices varied most between groups. This allowed us to determine which acoustic indices were most relevant for characterizing and discriminating between soundscape clusters. Figure 12 shows the acoustic indices with the highest variance for both the autoencoder-based methods and the baseline method using purely acoustic indices, facilitating a direct comparison. For the autoencoder-based methods, the Normalized Difference Soundscape Index (NDSI) showed the highest inter-cluster variance. The NDSI quantifies the balance between biological and anthropogenic acoustic activity by comparing the energy in biophonic frequency bands (typically 2–8 kHz) to that in anthropophonic bands (1–2 kHz) [54,55]. Higher values indicate a dominance of natural sound sources over human-made noise, making it a strong indicator of ecological integrity. In contrast, for the baseline clustering derived from acoustic indices, the Acoustic General Index (AGI) exhibited the highest variance across clusters. The AGI is a composite metric that integrates the spectral entropy, signal-to-noise ratio, and frequency content to describe the overall complexity and richness of the soundscape [56]. These results highlight how different features—whether derived from deep embeddings or direct signal-based descriptors—emphasize the distinct ecological dimensions of the acoustic environment.
Another index that consistently appeared among the highest variance features across all three approaches was the gamma spectral entropy (H_gamma). This metric captures spectral entropy by modeling the energy distribution of the signal using a gamma function, effectively reflecting the complexity and irregularity of the frequency spectrum. Similar to the Normalized Difference Soundscape Index (NDSI), which measures the proportion of biophonic activity to anthropophonic activity, H_gamma is sensitive to acoustic heterogeneity, and higher values typically indicate a richer, more diverse soundscape. The presence of this index in both deep learning–based and traditional approaches suggests that it plays a key role in capturing ecologically meaningful variations in soundscape composition.
Additionally, in the autoencoder-based clustering with DBSCAN, we observed an influence of highly correlated features. For example, pairs such as the low-frequency equivalent level (LEQf) and the total equivalent level (LEQt), which both measure sound energy at different temporal or spectral resolutions, or entropy-based indices like the frequency entropy (Hf) and paired Shannon entropy (H_pairedShannon), often contributed simultaneously to inter-cluster variability. This redundancy may affect the sensitivity of density-based clustering methods, as DBSCAN is influenced by local density estimates that can be biased by overlapping information in the feature space. These findings underscore the importance of considering feature redundancy and correlation when combining handcrafted ecoacoustic metrics with unsupervised learning techniques.
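The variance ranking and the redundancy check described above can be sketched as follows, assuming `indices` is a DataFrame of per-segment acoustic indices (for example computed with scikit-maad) and `labels` the cluster assignment of each segment; the 0.9 correlation threshold is an illustrative value, not one used in the study.

```python
# Minimal sketch of the inter-cluster variance ranking of acoustic indices.
import numpy as np

# indices: DataFrame of per-segment acoustic indices; labels: cluster assignments
cluster_means = indices.groupby(labels).mean()

# Variance of the per-cluster means: higher values -> more discriminative index
ranking = cluster_means.var(axis=0).sort_values(ascending=False)
print(ranking.head(10))  # e.g., NDSI, AGI, H_gamma, ...

# Flag highly correlated index pairs (e.g., LEQf vs. LEQt) that may bias DBSCAN
corr = indices.corr().abs()
corr = corr.mask(np.eye(len(corr), dtype=bool))  # ignore self-correlations
redundant_pairs = corr.stack().loc[lambda s: s > 0.9]
print(redundant_pairs)
```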
Moreover, Figure 13 shows a heatmap of the mean values of the top 10 acoustic indices for each cluster and method. While the previous analysis identified the indices with the highest contribution to the clustering process, this visualization helps to understand how these indices varied across clusters and methods, providing more context for interpreting the cluster composition. For example, the Normalized Difference Soundscape Index (NDSI) showed high values in most clusters, especially under the AE DBSCAN-PaCMAP method. This suggests that many of these clusters represent soundscapes with a high proportion of biophonic activity relative to anthropophonic noise. In contrast, the AI k-means method showed generally low values for most indices across all clusters, except for cluster 2, where all indices reached high values. This pattern indicates that cluster 2 is acoustically distinct and may represent a particular environmental condition.
Another relevant observation is related to the background noise floor (BGNf) [57,58], which remained high in most clusters but was notably low in cluster 2. This suggests a significant difference in ambient noise levels for this cluster compared with the rest, which could be linked to differences in habitat type or human presence. For instance, Rao’s quadratic entropy [59] quantifies acoustic diversity by considering both the richness and dissimilarity of frequency components. Shannon’s acoustic entropy index (H) [60] measures the unpredictability of spectral and temporal patterns, reflecting the complexity of soundscapes. The Acoustic Complexity Index (ACI) [61] captures the variability in intensity between adjacent time steps, often associated with biological activity. The Acoustic Evenness Index (AEI) [62] evaluates how evenly acoustic energy is distributed across frequency bands. Finally, temporal (ACT) and spectral (ACTspMean) activity indices [63] represent the proportions of time and frequency bins, respectively, containing acoustic activity, offering insights into the distribution and persistence of sounds (principal indices’ descriptions are shown in Appendix A, Table A1).
In general, the presented heat maps complement the variance-based analysis by showing how each index behaved in the clusters, helping to interpret the ecological meaning of each group more clearly. Together, these heat maps offer an ecologically interpretable perspective on cluster composition, facilitating the identification of acoustic signatures that characterize each group and strengthening the utility of unsupervised learning in landscape-scale eco-acoustic monitoring.
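Reusing `cluster_means` and `ranking` from the previous sketch, a heat map of the mean index values per cluster, similar in spirit to Figure 13, could be produced as follows; the seaborn call and color map are illustrative choices.

```python
# Minimal sketch of the cluster-by-index heat map (in the spirit of Figure 13)
import matplotlib.pyplot as plt
import seaborn as sns

top10 = ranking.head(10).index
sns.heatmap(cluster_means[top10], cmap="viridis")
plt.xlabel("Acoustic index")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()
```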

3.5. Soundscape Connectivity Based on Audio Features

In this work, we approached the concept of connectivity from an engineering perspective, identifying links between recording sites based on the similarity of their acoustic profiles. Locations with comparable acoustic behavior were assumed to share key ecological and environmental characteristics, such as vegetation structure, species assemblages, or proximity to hydrological elements detected acoustically through geophonic signatures. Although landscape connectivity is traditionally defined as the degree to which the landscape facilitates or impedes the movement of organisms among resource patches [53,64,65], our interpretation focuses on spatial consistency and the propagation of acoustic patterns, rather than species dispersal.
Recent studies have highlighted the ecological relevance of acoustic environments as indicators of landscape integrity [53,66] and have shown that soundscapes can encode meaningful ecological information, including structural attributes, biological diversity, and environmental processes [48,67]. From this perspective, the acoustic similarity between sites can reflect not only shared biological or physical sources of sound but also deeper characteristics in the ecological configuration. This approach complements classical notions of connectivity and enhances the potential of acoustic monitoring by exposing spatial patterns within the biophonic, geophonic, and anthropophonic components of the soundscape [52,68,69].
For the analysis of acoustic pattern similarity using connectivity, we leveraged the high-dimensional embedding space generated by the autoencoder. In contrast to previous approaches that built graphs directly in low-dimensional space, here we constructed connectivity graphs based on nearest neighbors in the original feature space to preserve the intrinsic structure of the learned acoustic representations. Using this high-dimensional representation, we constructed an undirected graph by connecting each node to its nearest neighbor using a k-nearest neighbor graph (k = 1), effectively capturing the most acoustically similar recordings. This process was performed both for each individual day and for the entire dataset, allowing us to assess connectivity patterns at multiple temporal scales.
To visualize these relationships, we applied PaCMAP with the parameter configuration previously described, projecting the embeddings into a two-dimensional space. The resulting graph layout retained the connectivity structure derived from the original high-dimensional space while enabling spatial interpretation of the similarity patterns across samples. This strategy allowed us to explore acoustic connectivity with greater fidelity, as the graph was informed by the full representation learned by the neural network, while the PaCMAP projection provided an interpretable spatial embedding for visualization and pattern recognition. Figure 14 shows the connectivity graphs for a selection of sample days, illustrating how acoustically similar recordings were linked based on their proximity in the original autoencoder feature space.
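The graph construction and layout step can be sketched as follows, assuming `features` holds the high-dimensional autoencoder embeddings and `embedding` their two-dimensional PaCMAP projection; the variable names and the use of NetworkX are ours, not prescribed by the original pipeline.

```python
# Minimal sketch of the acoustic connectivity graph: links are computed in the
# high-dimensional autoencoder space; PaCMAP is used only for the 2-D layout.
import networkx as nx
from sklearn.neighbors import NearestNeighbors

# k = 1 neighbour per sample (index 0 is the sample itself, so request 2)
nn = NearestNeighbors(n_neighbors=2).fit(features)
_, neighbors = nn.kneighbors(features)

graph = nx.Graph()
graph.add_edges_from((i, int(j)) for i, j in enumerate(neighbors[:, 1]))

# Draw the graph using the PaCMAP coordinates as node positions
positions = {i: embedding[i] for i in graph.nodes}
nx.draw(graph, pos=positions, node_size=5, width=0.3)
```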
However, since interpreting the connections directly in the feature space can be difficult and would require inspecting individual samples to understand their links, we used these connections as a proxy to explore how acoustic similarity was reflected in the physical space of the Rey Zamuro Reserve. Figure 15 shows the resulting graph, where recorder locations are represented as nodes and edges indicate acoustic similarity derived from the high-dimensional autoencoder space. The background includes the land cover classification, showing forest, pasture, and savanna classes to provide ecological context.
As expected, many connections appeared between nearby sites with similar land cover types, confirming the method’s ability to capture ecologically consistent acoustic patterns. However, we also observed several long-range connections between sites with the same cover type, suggesting that this graph-based representation is particularly effective at identifying spatial relationships that are not constrained by geographic proximity. This offers a complementary perspective to the interpolation-based approach presented earlier which, while useful for highlighting general spatial trends, may introduce interpretive bias by emphasizing local continuity. In contrast, the connectivity analysis revealed both local and distant associations, providing a more direct view of the underlying acoustic structure. Furthermore, connections between dissimilar land cover types (such as the link between RZUA10 and RZUE12 as well as RZUH04 and RZUF12) illustrate that acoustically similar conditions can emerge across heterogeneous environments, underscoring the capacity of this method to uncover nuanced ecological dynamics across the landscape.
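Finally, lifting the sample-level links to connections between recorder locations, as visualized in Figure 15, could look like the sketch below. It reuses `graph` from the previous snippet and assumes `sites` is an array with the recorder ID of each sample and `site_coords` a dictionary mapping recorder IDs to map coordinates; both names are hypothetical, and edge weights here are simply the number of supporting sample pairs.

```python
# Minimal sketch: aggregate sample-level links into site-level connections.
import networkx as nx

site_graph = nx.Graph()
for i, j in graph.edges:
    a, b = sites[i], sites[j]
    if a != b:  # ignore links between samples from the same recorder
        weight = site_graph.get_edge_data(a, b, default={"weight": 0})["weight"]
        site_graph.add_edge(a, b, weight=weight + 1)

# Plot over the recorder coordinates; edge width reflects the number of links
widths = [0.2 * d["weight"] for _, _, d in site_graph.edges(data=True)]
nx.draw(site_graph, pos=site_coords, width=widths,
        node_size=20, with_labels=True, font_size=6)
```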

4. Conclusions

This work presents an unsupervised framework for soundscape analysis that integrates deep representation learning, dimensionality reduction, clustering, and spatial exploration. Feature validation through supervised classification further demonstrates the strength of our approach; the autoencoder-based method outperformed the baseline techniques, reaching an accuracy of 0.84 for the identification of three habitat cover types and 0.89 for time-of-day classification. These results highlight the versatility of the proposed representations in capturing both ecological and temporal patterns. Moreover, statistical tests confirmed that the improvements were significant, with p values below 0.002 in all comparisons, providing robust evidence of the effectiveness of our characterization method.
One of the key aspects of our approach was the careful and systematic optimization of parameters for both projection and clustering methods. Rather than relying on default values, we performed an extensive grid search to evaluate the impact of hyperparameters on the structure and interpretability of the results, an often overlooked step in ecoacoustic studies. As a result, we reduced the original 5184-dimensional feature space to 2 dimensions, achieving a trustworthiness score above 0.96 for UMAP and close to 0.99 for PaCMAP. These values indicate that the local neighborhood relationships of the high-dimensional space were highly preserved in the low-dimensional projections, ensuring meaningful interpretation of the latent structures. Using these projections with k-means, we obtained a silhouette score of 0.47, reflecting a moderate but consistent clustering structure. For DBSCAN, we proposed an evaluation strategy based on cluster density analysis to determine the optimal parameterization, which was further validated through reachability plots.
A particularly striking result was the emergence of nine clusters across both unsupervised pipelines (UMAP + k-means and PaCMAP + DBSCAN), despite their methodological differences. This consistency suggests that the acoustic landscape in our study area was structured around diverse and well-defined patterns. Moreover, the number of clusters exceeded what could be expected from an analysis based solely on spatial, temporal, or spectral properties, indicating that our method captured a combination of multiple ecological and acoustic dimensions. We also proposed the use of spatial interpolation to map the distribution of clusters across the landscape. Although interpolation in soundscape studies can be controversial, our use of a grid-based sampling design provided the spatial consistency needed to support this technique and interpret longitudinal and latitudinal trends in acoustic variation. The inclusion of acoustic indices further enhanced our ability to interpret the clusters, showing that macro-scale landscape patterns are associated with ecologically relevant features. Notably, indices such as the Normalized Difference Soundscape Index (NDSI) and the gamma entropy (H_gamma) appeared consistently across methods and were linked to biophonic richness and biodiversity gradients.
On the other hand, we recognize the value that more detailed metadata or species-level annotations would add to the interpretation and validation of our results. Unfortunately, such granular ecological labels were not available in the datasets used for this study. This limitation is common in ecoacoustic research and serves as one of the main motivations for developing and evaluating methodologies that rely on general metadata (e.g., time, location, and land cover), which are consistently accessible across datasets. We also acknowledge that pretrained transformer-based models such as AST, wav2vec2, or vision transformers may provide complementary perspectives, and their integration represents a promising direction for future work. For this reason, we consider it essential to further explore how these models can be efficiently adapted for application to large-scale datasets such as the one presented in this study while maintaining interpretability. Preserving interpretability is crucial, as it ultimately enables meaningful interaction between artificial intelligence tools and the application domain (in this case, contributing to effective and ecologically relevant soundscape monitoring).
Furthermore, the proposed methodology and processing pipeline are openly available and designed to be adaptable. We encourage researchers working with more detailed ecological datasets to build upon this framework, as such collaborations could enhance its applicability and validation, ultimately contributing to more robust ecoacoustic analysis across diverse environmental contexts.
Finally, we introduced a novel method for analyzing acoustic connectivity, transitioning from similarity graphs in a high-dimensional feature space to interpretable spatial connections among physical recording sites. This allowed us to detect not only local relationships but also long-range acoustic similarities that might reflect the ecological structure or shared sound sources. By bridging engineering-based techniques with ecological interpretation, this approach opens opportunities for interdisciplinary analysis and supports the development of scalable tools for landscape monitoring.
Overall, our findings demonstrate that unsupervised deep learning, when combined with thoughtful design and multi-layered analysis, can offer powerful insights into the organization of complex soundscapes. This methodology contributes to the growing need for data-driven, label-free approaches in ecoacoustics and provides a foundation for future work in biodiversity assessment, habitat monitoring, and conservation planning.

Author Contributions

Conceptualization, D.A.N.M., J.D.M.V. and L.D.-M.; methodology, D.A.N.M., J.D.M.V. and L.D.-M.; software, D.A.N.M.; validation, D.A.N.M., J.D.M.V. and L.D.-M.; formal analysis, D.A.N.M., J.D.M.V. and L.D.-M.; investigation, D.A.N.M. and L.D.-M.; resources, D.A.N.M., J.D.M.V. and L.D.-M.; writing—original draft preparation, D.A.N.M. and L.D.-M.; writing—review and editing, D.A.N.M.; visualization, D.A.N.M.; supervision, J.D.M.V. and L.D.-M.; project administration, J.D.M.V. and L.D.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by project PF2410, titled “Fortalecimiento Grupo Máquinas Inteligentes y Reconocimiento de Patrones”, funded by the Instituto Tecnológico Metropolitano (ITM).

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset from the Rey Zamuro and Matarredonda Nature Reserve used in this study is available and freely accessible upon request from the authors for research purposes.

Acknowledgments

We extend our gratitude to the Universidad de Antioquia for providing the acoustic dataset and for the valuable guidance and support throughout the development of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Most relevant acoustic indices identified by the proposed methods in the results analysis and discussion.
| Abbr. | Full Name | Description (Variants) | Reference |
| --- | --- | --- | --- |
| ACI | Acoustic Complexity Index | Measures variation in intensity over time within frequency bands; reflects biotic activity. Temporal and spectral variants exist. | [61] |
| ACTcount | Active segment count | Count of active segments in time or frequency. | [63] |
| ACTfraction | Active fraction | Proportion of signal above energy threshold. Exists in time and spectral forms. | [63] |
| ACTspMean | Mean active spectral width | Mean bandwidth of active spectral segments. Temporal variant: ACTtMean. | [63] |
| AEI | Acoustic Evenness Index | Energy evenness using Gini index. | [62] |
| AGI | Acoustic Generalized Index | Composite of multiple indices for biodiversity proxy. | [32] |
| BGN | Background noise | Ambient noise level. Estimated in time or frequency. | [57] |
| ECU | Entropy of cumulative spectrum | Cumulative entropy across frequency bins. | [32] |
| ENRF | Spectral energy ratio | Ratio of energy in frequency bands. | [62] |
| EPS | Entropy of power spectrum | Entropy of power spectral density. Variants include EPS_SKEW and EPS_KURT. | [58] |
| H_gamma | Gamma entropy | Entropy modulated by gamma; measures distribution complexity. | [60] |
| H_pairedShannon | Paired Shannon entropy | Shannon entropy for co-occurring components. | [60] |
| Hf | Spectral entropy | Entropy of energy across frequencies. Time-domain variant: Ht. | [60] |
| KURT | Kurtosis | Peakedness of amplitude or frequency distribution. Time and frequency variants. | [32] |
| LEQ | Equivalent continuous level | Averaged sound pressure level. Variants exist in time and frequency. | [58] |
| NDSI | Normalized Difference Soundscape Index | Compares biological vs. anthropogenic energy. Time and frequency variants exist. | [58] |
| RAOQ | Rao's quadratic entropy | Biodiversity metric accounting for trait dissimilarity. | [59] |
| SNR | Signal-to-noise ratio | Signal vs. noise energy ratio. Temporal and spectral forms exist. | [28] |

References

  1. Rendon, N.; Rodríguez-Buritica, S.; Sanchez-Giraldo, C.; Daza, J.M.; Isaza, C. Automatic acoustic heterogeneity identification in transformed landscapes from Colombian tropical dry forests. Ecol. Indic. 2022, 140, 109017. [Google Scholar] [CrossRef]
  2. Noble, A.E.; Jensen, F.H.; Jarriel, S.D.; Aoki, N.; Ferguson, S.R.; Hyer, M.D.; Apprill, A.; Mooney, T.A. Unsupervised clustering reveals acoustic diversity and niche differentiation in pulsed calls from a coral reef ecosystem. Front. Remote Sens. 2024, 5, 1429227. [Google Scholar] [CrossRef]
  3. Eldridge, A.; Casey, M.; Moscoso, P.; Peck, M. A new method for ecoacoustics? Toward the extraction and evaluation of ecologically-meaningful soundscape components using sparse coding methods. PeerJ 2016, 4, e2108. [Google Scholar] [CrossRef] [PubMed]
  4. Castro-Ospina, A.E.; Rodríguez-Marín, P.; López, J.D.; Martínez-Vargas, J.D. Leveraging time-based acoustic patterns for ecosystem analysis. Neural Comput. Appl. 2024, 36, 20513–20526. [Google Scholar] [CrossRef]
  5. Sethi, S.S.; Jones, N.S.; Fulcher, B.D.; Picinali, L.; Clink, D.J.; Klinck, H.; Orme, C.D.L.; Wrege, P.H.; Ewers, R.M. Characterizing soundscapes across diverse ecosystems using a universal acoustic feature set. Proc. Natl. Acad. Sci. USA 2020, 117, 17049–17055. [Google Scholar] [CrossRef]
  6. Hou, Y.; Ren, Q.; Zhang, H.; Mitchell, A.; Aletta, F.; Kang, J.; Botteldooren, D. AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance. J. Acoust. Soc. Am. 2023, 154, 3145–3157. [Google Scholar] [CrossRef]
  7. Colonna, J.G.; Carvalho, J.R.; Rosso, O.A. Estimating ecoacoustic activity in the Amazon rainforest through information theory quantifiers. PLoS ONE 2020, 15, e0229425. [Google Scholar] [CrossRef]
  8. Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Informatics 2021, 61, 101236. [Google Scholar] [CrossRef]
  9. Sharma, S.; Sato, K.; Gautam, B.P. A Methodological Literature Review of Acoustic Wildlife Monitoring Using Artificial Intelligence Tools and Techniques. Sustainability 2023, 15, 7128. [Google Scholar] [CrossRef]
  10. Tuia, D.; Kellenberger, B.; Beery, S.; Costelloe, B.R.; Zuffi, S.; Risse, B.; Mathis, A.; Mathis, M.W.; van Langevelde, F.; Burghardt, T.; et al. Perspectives in machine learning for wildlife conservation. Nat. Commun. 2022, 13, 792. [Google Scholar] [CrossRef]
  11. Nieto-Mora, D.A.; Rodríguez-Buritica, S.; Rodríguez-Marín, P.; Martínez-Vargaz, J.D.; Isaza-Narváez, C. Systematic review of machine learning methods applied to ecoacoustics and soundscape monitoring. Heliyon 2023, 9, e20275. [Google Scholar] [CrossRef]
  12. Gibb, K.A.; Eldridge, A.; Sandom, C.J.; Simpson, I.J. Towards interpretable learned representations for ecoacoustics using variational auto-encoding. bioRxiv 2023. [Google Scholar] [CrossRef]
  13. Fuller, S.; Axel, A.C.; Tucker, D.; Gage, S.H. Connecting soundscape to landscape: Which acoustic index best describes landscape configuration? Ecol. Indic. 2015, 58, 207–215. [Google Scholar] [CrossRef]
  14. Rendon, N.; Guerrero, M.J.; Sánchez-Giraldo, C.; Martinez-Arias, V.M.; Paniagua-Villada, C.; Bouwmans, T.; Daza, J.M.; Isaza, C. Letting ecosystems speak for themselves: An unsupervised methodology for mapping landscape acoustic heterogeneity. Environ. Model. Softw. 2025, 187, 106373. [Google Scholar] [CrossRef]
  15. Guerrero, M.J.; Sánchez-Giraldo, C.; Uribe, C.A.; Martínez-Arias, V.M.; Isaza, C. Graphical representation of landscape heterogeneity identification through unsupervised acoustic analysis. Methods Ecol. Evol. 2025, 16, 1255–1272. [Google Scholar] [CrossRef]
  16. Sun, W.; Guo, C.; Wan, J.; Ren, H. piRNA-disease association prediction based on multi-channel graph variational autoencoder. PeerJ Comput. Sci. 2024, 10, e2216. [Google Scholar] [CrossRef]
  17. Vaiyapuri, T.; Binbusayyis, A. Application of deep autoencoder as an one-class classifier for unsupervised network intrusion detection: A comparative evaluation. PeerJ Comput. Sci. 2020, 6, e327. [Google Scholar] [CrossRef]
  18. Wei, D.; Zheng, J.; Qu, H. Anomaly detection for blueberry data using sparse autoencoder-support vector machine. PeerJ Comput. Sci. 2023, 9, e1214. [Google Scholar] [CrossRef]
  19. Borowiec, M.L.; Dikow, R.B.; Frandsen, P.B.; McKeeken, A.; Valentini, G.; White, A.E. Deep learning as a tool for ecology and evolution. Methods Ecol. Evol. 2022, 13, 1640–1660. [Google Scholar] [CrossRef]
  20. Hirn, J.; García, J.E.; Montesinos-Navarro, A.; Sánchez-Martín, R.; Sanz, V.; Verdú, M. A deep Generative Artificial Intelligence system to predict species coexistence patterns. Methods Ecol. Evol. 2022, 13, 1052–1061. [Google Scholar] [CrossRef]
  21. Guei, A.C.; Christin, S.; Lecomte, N.; Hervet, É. ECOGEN: Bird sounds generation using deep learning. Methods Ecol. Evol. 2024, 15, 69–79. [Google Scholar] [CrossRef]
  22. Rowe, B.; Eichinski, P.; Zhang, J.; Roe, P. Acoustic auto-encoders for biodiversity assessment. Ecol. Inform. 2021, 62, 101237. [Google Scholar] [CrossRef]
  23. Guerrero, M.J.; Restrepo, J.; Nieto-Mora, D.A.; Daza, J.M.; Isaza, C. Insights from Deep Learning in Feature Extraction for Non-supervised Multi-species Identification in Soundscapes. Lect. Notes Comput. Sci. (Incl. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinform.) 2022, 13788, 218–230. [Google Scholar] [CrossRef]
  24. Casallas-Pabón, D.; Calvo-Roa, N.; Rojas-Robles, R. Seed dispersal by bats over successional gradients in the Colombian orinoquia (San Martin, Meta, Colombia). Acta Biológica Colomb. 2017, 22, 348–358. [Google Scholar] [CrossRef]
  25. Ramírez B, H.; Mejía, W.; Barrera Zambrano, V.A. Flora al Interior del Área 1 del Banco de Hábitat del Meta de Terrasos. v2.9; SiB Colombia: Bogotá, Colombia, 2023. [Google Scholar]
  26. Nieto-Mora, D.A.; Ferreira de Oliveira, M.C.; Sanchez-Giraldo, C.; Duque-Muñoz, L.; Isaza-Narváez, C.; Martínez-Vargas, J.D. Soundscape Characterization Using Autoencoders and Unsupervised Learning. Sensors 2024, 24, 2597. [Google Scholar] [CrossRef]
  27. Gibb, K.A.; Eldridge, A.; Sandom, C.J.; Simpson, I.J. Towards interpretable learned representations for ecoacoustics using variational auto-encoding. Ecol. Inform. 2024, 80, 102449. [Google Scholar] [CrossRef]
  28. Chen, L.; Xu, Z.; Zhao, Z. Biotic sound SNR influence analysis on acoustic indices. Front. Remote Sens. 2022, 3, 1079223. [Google Scholar] [CrossRef]
  29. Omprakash, A.; Balakrishnan, R.; Ewers, R.; Sethi, S. Interpretable and Robust Machine Learning for Exploring and Classifying Soundscape Data. bioRxiv 2024. [Google Scholar] [CrossRef]
  30. Yoh, N.; Haley, C.L.; Burivalova, Z. Time series methods for the analysis of soundscapes and other cyclical ecological data. Methods Ecol. Evol. 2024, 15, 1158–1176. [Google Scholar] [CrossRef]
  31. Cowans, A.; Lambin, X.; Hare, D.; Sutherland, C. Improving the integration of artificial intelligence into existing ecological inference workflows. Methods Ecol. Evol. 2024, 2024, 1–10. [Google Scholar] [CrossRef]
  32. Ulloa, J.S.; Haupert, S.; Latorre, J.F.; Aubin, T.; Sueur, J. scikit-maad: An open-source and modular toolbox for quantitative soundscape analysis in Python. Methods Ecol. Evol. 2021, 12, 2334–2340. [Google Scholar] [CrossRef]
  33. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar] [CrossRef]
  34. Castro-Ospina, A.E.; Solarte-Sanchez, M.A.; Vega-Escobar, L.S.; Isaza, C.; Martínez-Vargas, J.D. Graph-Based Audio Classification Using Pre-Trained Models and Graph Neural Networks. Sensors 2024, 24, 2106. [Google Scholar] [CrossRef] [PubMed]
  35. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426. [Google Scholar] [CrossRef]
  36. Wang, Y.; Huang, H.; Rudin, C.; Shaposhnik, Y. Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. J. Mach. Learn. Res. 2021, 22, 1–73. [Google Scholar]
  37. Venna, J.; Kaski, S. Neighborhood preservation in nonlinear projection methods: An experimental study. Lect. Notes Comput. Sci. (Incl. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinform.) 2001, 2130, 485–491. [Google Scholar] [CrossRef]
  38. Bedoya, C.; Isaza, C.; Daza, J.M.; López, J.D. Automatic identification of rainfall in acoustic recordings. Ecol. Indic. 2017, 75, 95–100. [Google Scholar] [CrossRef]
  39. Nolasco, I.; Singh, S.; Morfi, V.; Lostanlen, V.; Strandburg-Peshkin, A.; Vidaña-Vila, E.; Gill, L.; Pamuła, H.; Whitehead, H.; Kiskin, I.; et al. Learning to detect an animal sound from five examples. Ecol. Inform. 2023, 77, 102258. [Google Scholar] [CrossRef]
  40. Farina, A.; Pieretti, N.; Salutari, P.; Tognari, E.; Lombardi, A. The Application of the Acoustic Complexity Indices (ACI) to Ecoacoustic Event Detection and Identification (EEDI) Modeling. Biosemiotics 2016, 9, 227–246. [Google Scholar] [CrossRef]
  41. Ulloa, J.S.; Aubin, T.; Llusia, D.; Courtois, É.A.; Fouquet, A.; Gaucher, P.; Pavoine, S.; Sueur, J. Explosive breeding in tropical anurans: Environmental triggers, community composition and acoustic structure. BMC Ecol. 2019, 19, 28. [Google Scholar] [CrossRef]
  42. Dröge, S.; Budi, L.; Muys, B. Acoustic indices as proxies for biodiversity in certified and non-certified cocoa plantations in Indonesia. Environ. Monit. Assess. 2025, 197, 61. [Google Scholar] [CrossRef]
  43. Dorrity, M.W.; Saunders, L.M.; Queitsch, C.; Fields, S.; Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 2020, 11, 1537. [Google Scholar] [CrossRef]
  44. Poblete, V.; Espejo, D.; Vargas, V.; Otondo, F.; Huijse, P. Characterization of sonic events present in natural-urban hybrid habitats using umap and sednet: The case of the urban wetlands. Appl. Sci. 2021, 11, 8175. [Google Scholar] [CrossRef]
  45. Thomas, M.; Jensen, F.H.; Averly, B.; Demartsev, V.; Manser, M.B.; Sainburg, T.; Roch, M.A.; Strandburg-Peshkin, A. A practical guide for generating unsupervised, spectrogram-based latent space representations of animal vocalizations. J. Anim. Ecol. 2022, 91, 1567–1581. [Google Scholar] [CrossRef] [PubMed]
  46. Sanju, P. Advancing dimensionality reduction for enhanced visualization and clustering in single-cell transcriptomics. J. Anal. Sci. Technol. 2025, 16, 7. [Google Scholar] [CrossRef]
  47. Sueur, J.; Farina, A. Ecoacoustics: The Ecological Investigation and Interpretation of Environmental Sound. Biosemiotics 2015, 8, 493–502. [Google Scholar] [CrossRef]
  48. Wang, Z.; Ye, Z.; Du, Y.; Mao, Y.; Liu, Y.; Wu, Z.; Wang, J. AMD-DBSCAN: An Adaptive Multi-density DBSCAN for datasets of extremely variable density. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), Shenzhen, China, 13–16 October 2022. [Google Scholar] [CrossRef]
  49. Wang, Y.; Qian, J.; Hassan, M.; Zhang, X.; Zhang, T.; Yang, C.; Zhou, X.; Jia, F. Density peak clustering algorithms: A review on the decade 2014–2023. Expert Syst. Appl. 2024, 238, 121860. [Google Scholar] [CrossRef]
  50. Buxton, R.T.; McKenna, M.F.; Clapp, M.; Meyer, E.; Stabenau, E.; Angeloni, L.M.; Crooks, K.; Wittemyer, G. Efficacy of extracting indices from large-scale acoustic recordings to monitor biodiversity. Conserv. Biol. 2018, 32, 1174–1184. [Google Scholar] [CrossRef]
  51. Machado, R.B.; Aguiar, L.; Jones, G. Do acoustic indices reflect the characteristics of bird communities in the savannas of Central Brazil? Landsc. Urban Plan. 2017, 162, 36–43. [Google Scholar] [CrossRef]
  52. Bradfer-Lawrence, T.; Gardner, N.; Bunnefeld, L.; Bunnefeld, N.; Willis, S.G.; Dent, D.H. Guidelines for the use of acoustic indices in environmental research. Methods Ecol. Evol. 2019, 10, 1796–1807. [Google Scholar] [CrossRef]
  53. Sánchez-Giraldo, C.; Correa Ayram, C.; Daza, J.M. Environmental sound as a mirror of landscape ecological integrity in monitoring programs. Perspect. Ecol. Conserv. 2021, 19, 319–328. [Google Scholar] [CrossRef]
  54. Sousa-Lima, R.S.; Ferreira, L.M.; Oliveira, E.G.; Lopes, L.C.; Brito, M.R.; Baumgarten, J.; Rodrigues, F.H. What do insects, anurans, birds, and mammals have to say about soundscape indices in a tropical savanna. J. Ecoacoust. 2018, 2, 2. [Google Scholar] [CrossRef]
  55. Kholghi, M.; Phillips, Y.; Towsey, M.; Sitbon, L. Active learning for classifying long-duration audio recordings of the environment. Methods Ecol. Evol. 2018, 9, 1948–1958. [Google Scholar] [CrossRef]
  56. Bradfer-Lawrence, T.; Duthie, B.; Abrahams, C.; Adam, M.; Barnett, R.; Beeston, A.; Darby, J.; Dell, B.; Gardner, N.; Gasc, A.; et al. The Acoustic Index User’s Guide: A practical manual for defining, generating and understanding current and future acoustic indices. Methods Ecol. Evol. 2024, 16, 1040–1050. [Google Scholar] [CrossRef]
  57. Towsey, M.W. Noise Removal from Wave-Forms and Spectrograms Derived from Natural Recordings of the Environment. Available online: http://eprints.qut.edu.au/61399/ (accessed on 18 September 2025).
  58. Towsey, M.; Wimmer, J.; Williamson, I.; Roe, P. The use of acoustic indices to determine avian species richness in audio-recordings of the environment. Ecol. Inform. 2014, 21, 110–119. [Google Scholar] [CrossRef]
  59. Botta-Dukát, Z. Rao’s quadratic entropy as a measure of functional diversity based on multiple traits. J. Veg. Sci. 2005, 16, 533–540. [Google Scholar] [CrossRef]
  60. Metcalf, O.; Nunes, C.; Abrahams, C.; Baccaro, F.; Bradfer-Lawrence, T.; Lees, A.; Vale, E.; Barlow, J. The efficacy of acoustic indices for monitoring abundance and diversity in soil soundscapes. Ecol. Indic. 2024, 169, 112954. [Google Scholar] [CrossRef]
  61. Pieretti, N.; Farina, A.; Morri, D. A new methodology to infer the singing activity of an avian community: The Acoustic Complexity Index (ACI). Ecol. Indic. 2011, 11, 868–873. [Google Scholar] [CrossRef]
  62. Kasten, E.P.; Gage, S.H.; Fox, J.; Joo, W. The remote environmental assessment laboratory’s acoustic library: An archive for studying soundscape ecology. Ecol. Inform. 2012, 12, 50–67. [Google Scholar] [CrossRef]
  63. Pijanowski, B.C.; Villanueva-Rivera, L.J.; Dumyahn, S.L.; Farina, A.; Krause, B.L.; Napoletano, B.M.; Gage, S.H.; Pieretti, N. Soundscape ecology: The science of sound in the landscape. BioScience 2011, 61, 203–216. [Google Scholar] [CrossRef]
  64. Rudnick, D.; Ryan, S.J.; Beier, P.; Cushman, S.A.; Dieffenbach, F.; Trombulak, S.C. The Role of Landscape Connectivity in Planning and Implementing Conservation and Restoration Priorities; Issues in Ecology; Ecological Society of America: Washington, DC, USA, 2012; pp. 1–23. [Google Scholar]
  65. Dale, M.R.; Fortin, M.J. From graphs to spatial graphs. Annu. Rev. Ecol. Evol. Syst. 2010, 41, 21–38. [Google Scholar] [CrossRef]
  66. Quinn, C.A.; Burns, P.; Jantz, P.; Salas, L.; Goetz, S.; Clark, M. Soundscape mapping: Understanding regional spatial and temporal patterns of soundscapes incorporating remotely-sensed predictors and wildfire disturbance. Environ. Res. Ecol. 2024, 3, 25002. [Google Scholar] [CrossRef]
  67. Bertassello, L.E.; Bertuzzo, E.; Botter, G.; Jawitz, J.W.; Aubeneau, A.F.; Hoverman, J.T.; Rinaldo, A.; Rao, P.S. Dynamic spatio-temporal patterns of metapopulation occupancy in patchy habitats. R. Soc. Open Sci. 2021, 8, 201309. [Google Scholar] [CrossRef]
  68. Akbal, E.; Barua, P.D.; Dogan, S.; Tuncer, T.; Acharya, U.R. Explainable automated anuran sound classification using improved one-dimensional local binary pattern and Tunable Q Wavelet Transform techniques. Expert Syst. Appl. 2023, 225, 120089. [Google Scholar] [CrossRef]
  69. Fink, D.; Auer, T.; Johnston, A.; Ruiz-Gutierrez, V.; Hochachka, W.M.; Kelling, S. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecol. Appl. 2020, 30, e02056. [Google Scholar] [CrossRef]
Figure 1. Geographic location of the Rey Zamuro and Matarredonda Private Nature Reserve. (a) The sampling site with a distribution of 94 AudioMoth recording devices across different habitat types, arranged in a grid with 200 m of spacing covering forest, savanna, and pasture areas. (b) The location of the reserve within Colombia. (c) The Meta department, the region where the Rey Zamuro Reserve and Matarredonda Private Nature Reserve are situated.
Figure 2. Audio processing pipeline prior to feature extraction. (a) Original audio recordings collected at a sampling rate of 192,000 Hz. (b) Downsampling of the recordings to 22,050 Hz. (c) Segmentation of each recording into five 12-s clips. (d) Computation of spectrograms for each segment. (e) Input of spectrogram batches into the convolutional autoencoder network for feature extraction.
Figure 3. Performance comparison of feature extraction methods in habitat classification across 13 sampling days. The box plots represent the distribution of accuracy, F1 score, and recall results obtained using random forest classifiers trained on features extracted by acoustic indices, VGGish, and autoencoder embeddings. The autoencoder consistently outperformed the other methods, demonstrating higher median scores and lower variability, indicating improved representational quality and robustness.
Figure 4. Performance comparison of feature extraction methods in time-of-day classification across 13 sampling days. The classification was performed using three temporal segments: dawn, midday, and dusk. The box plots represent the distribution of accuracy, F1 score, and recall value obtained using random forest classifiers trained on features extracted by acoustic indices, VGGish, and autoencoder embeddings.
Figure 5. Overview of the proposed unsupervised methodology. (a) Data acquisition across the study area. (b) Feature extraction using convolutional autoencoders, with acoustic indices incorporated as a comparative baseline. (c) Projection of the learned features into low-dimensional spaces using UMAP and PaCMAP. (d) Clustering of the projected features, where PaCMAP is combined with DBSCAN and UMAP with k-means. (e) Evaluation of the framework through three complementary approaches: spatial pattern analysis, connectivity assessment, and comparative analysis with acoustic indices.
Figure 6. Evaluation and visualization of low-dimensional embeddings generated by UMAP and PaCMAP. (a) Trustworthiness scores computed for a grid of parameter configurations in each method. The highest score for UMAP was achieved with 75 neighbors and a minimum distance of 0.01. For PaCMAP, the optimal configuration was 75 neighbors, a mid-scale neighbor ratio of 1.5, and a far neighbor ratio of 20. (b) Final low-dimensional projections of the dataset (subsampled) using the selected hyperparameters.
Figure 7. Summary of k-means clustering performance across different numbers of clusters using internal validation metrics. Each subplot shows a box plot distribution of scores among days for a given number of clusters. (a) Silhouette coefficient, where higher values indicate more cohesive and well-separated clusters. (b) Davies–Bouldin Index, where lower values indicate better clustering quality. (c) Calinski–Harabasz Index, where higher values suggest better-defined cluster structure.
Figure 8. Reachability plots for PaCMAP-embedded data across different sampling days.
Figure 9. Spatial distribution of acoustic clusters across the Zamuro Natural Reserve using PaCMAP projection and DBSCAN clustering. Each panel represents a heat map corresponding to one of the nine clusters. Color intensity indicates the relative number of audio samples associated with that cluster at each recording site.
Figure 10. Comparison of spatial distributions between the different clustering pipelines. Red boxes highlight areas where spatial overlap or similarity is observed between (a) autoencoders with k-means clustering and UMAP projection, (b) autoencoders with DBSCAN clustering and PaCMAP projections, and (c) acoustic indices with k-means clustering.
Figure 11. Spatial distributions and decoded spectrograms for (a) cluster 1, (b) cluster 4, (c) cluster 8, and (d) cluster 9, derived from UMAP projections and autoencoder embeddings. Each panel shows the heat map of a cluster’s presence across the sampling grid (left) and the corresponding representative spectrogram reconstructed from the autoencoder’s latent space (right). This decoding enabled identification of the dominant frequency bands associated with each cluster, facilitating the interpretation of their ecological and acoustic significance.
Figure 12. Comparison of the acoustic indices with the highest inter-cluster variance for both autoencoder-based clustering (left) and baseline clustering using acoustic indices (right).
Figure 13. Mean values of the top 10 acoustic indices per cluster for each analysis methodology.
Figure 14. Examples of acoustic connectivity graphs for (a) day 4, (b) day 5, (c) day 6, and (d) day 7, constructed using PaCMAP projections and a threshold-based nearest neighbor approach.
Figure 15. Acoustic connectivity graph projected onto the physical layout of the Rey Zamuro Reserve. Nodes represent recorder locations, and edges indicate acoustic similarity based on high-dimensional autoencoder embeddings. The background layer displays land cover types (forest, pasture, and savanna), providing ecological context. While several connections occurred between nearby sites with similar cover types, long-range links were also present, revealing spatial acoustic patterns that extended beyond geographic proximity. This graph-based approach complements interpolation methods by highlighting both local and distant acoustic relationships across the landscape.
Table 1. Statistical test results for performance metrics using cover types as labels in a multiclass classification approach. Significant p-values are marked by ** ( p < 0.01 ) and *** ( p < 0.001 ).
| Metric | Test | Comparison | Statistic | p Value |
| --- | --- | --- | --- | --- |
| Accuracy | Friedman | AE, VGG, AI | 20.182 | 0.00004 *** |
| Accuracy | Wilcoxon | AE vs. VGG | 0.000 | 0.0010 *** |
| Accuracy | Wilcoxon | AE vs. AI | 1.000 | 0.0020 ** |
| Accuracy | Wilcoxon | VGG vs. AI | 0.000 | 0.0010 *** |
| F1 Score | Friedman | AE, VGG, AI | 20.182 | 0.00004 *** |
| F1 Score | Wilcoxon | AE vs. VGG | 0.000 | 0.0010 *** |
| F1 Score | Wilcoxon | AE vs. AI | 1.000 | 0.0020 ** |
| F1 Score | Wilcoxon | VGG vs. AI | 0.000 | 0.0010 *** |
| Recall | Friedman | AE, VGG, AI | 20.182 | 0.00004 *** |
| Recall | Wilcoxon | AE vs. VGG | 0.000 | 0.0010 *** |
| Recall | Wilcoxon | AE vs. AI | 1.000 | 0.0020 ** |
| Recall | Wilcoxon | VGG vs. AI | 0.000 | 0.0010 *** |
Table 2. Statistical test results for performance metrics using three ranges of hours as labels. Significant p values are marked by ** ( p < 0.01 ) and *** ( p < 0.001 ).
| Metric | Test | Comparison | Statistic | p Value |
| --- | --- | --- | --- | --- |
| Accuracy | Friedman | AE, VGG, AI | 16.909 | 0.0002 *** |
| Accuracy | Wilcoxon | AE vs. VGG | 0.000 | 0.0010 *** |
| Accuracy | Wilcoxon | AE vs. AI | 0.000 | 0.0010 *** |
| Accuracy | Wilcoxon | VGG vs. AI | 20.000 | 0.2783 |
| F1 Score | Friedman | AE, VGG, AI | 15.273 | 0.0005 *** |
| F1 Score | Wilcoxon | AE vs. VGG | 0.000 | 0.0010 *** |
| F1 Score | Wilcoxon | AE vs. AI | 1.000 | 0.0020 ** |
| F1 Score | Wilcoxon | VGG vs. AI | 13.000 | 0.0830 |
| Recall | Friedman | AE, VGG, AI | 14.364 | 0.0008 *** |
| Recall | Wilcoxon | AE vs. VGG | 0.000 | 0.0010 *** |
| Recall | Wilcoxon | AE vs. AI | 1.000 | 0.0020 ** |
| Recall | Wilcoxon | VGG vs. AI | 18.000 | 0.2061 |