Article

HierLabelNet: A Two-Stage LLMs Framework with Data Augmentation and Label Selection for Geographic Text Classification

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 College of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(7), 268; https://doi.org/10.3390/ijgi14070268
Submission received: 24 April 2025 / Revised: 6 June 2025 / Accepted: 7 July 2025 / Published: 8 July 2025

Abstract

Earth observation data serve as a fundamental resource in Earth system science. The rapid advancement of remote sensing and in situ measurement technologies has led to the generation of massive volumes of data, accompanied by a growing body of geographic textual information. Efficient and accurate classification and management of these geographic texts have become a critical challenge in the field. However, the effectiveness of traditional classification approaches is hindered by several issues, including data sparsity, class imbalance, semantic ambiguity, and the prevalence of domain-specific terminology. To address these limitations and enable the intelligent management of geographic information, this study proposes an efficient geographic text classification framework based on large language models (LLMs), tailored to the unique semantic and structural characteristics of geographic data. Specifically, LLM-based data augmentation strategies are employed to mitigate the scarcity of labeled data and class imbalance. A semantic vector database is utilized to filter the label space prior to inference, enhancing the model's adaptability to diverse geographic terms. Furthermore, few-shot prompt learning guides LLMs in understanding domain-specific language, while an output alignment mechanism improves classification stability for complex descriptions. This approach offers a scalable solution for the automated semantic classification of geographic text, unlocking the potential of ever-expanding geospatial big data and advancing intelligent information processing and knowledge discovery in the geospatial domain.

1. Introduction

Earth observation (EO) data, as an integrated resource in Earth sciences, encompasses a wide range of domains including cryosphere, land cover change, polar processes, field surveys, ocean surface dynamics, digital elevation models, climate dynamics and composition, as well as interdisciplinary studies [1,2]. The acquisition and application of such data involve not only remotely sensed observations—obtained via satellite and airborne platforms—but also in situ measurements [3]. Enhancements in data accessibility, availability, and usability play a critical role in informing policy formulation and supporting scientific decision-making across related disciplines [4,5,6,7]. With the continued advancement of Earth system science, the demand for high-quality Earth observation data has steadily increased [8]. However, the growing volume and complexity of geospatially rich datasets present significant technical challenges, particularly in terms of efficient data organization, classification, and management.
Text classification has become an increasingly vital technique for the intelligent processing of geographic observation data, serving as a core method for information extraction and knowledge construction [9]. While traditional text classification tasks—such as spam detection, sentiment analysis, news categorization, and user intent recognition [10]—have been widely applied across general domains, their deployment in specialized fields still faces substantial challenges.
For instance, in the medical domain [11], issues such as the scarcity of non-English corpora, imbalanced class distributions, and limited annotated data significantly impact model performance [12]; these challenges are analogously encountered in geographic text classification. Specifically, geographic texts often embed rich spatial relationships [13], such as the titles of data products, yet LLMs may exhibit bias when interpreting geographic knowledge [14], compromising the accuracy of named entity recognition. Many place descriptions are not simple toponyms but rather complex expressions composed of multiple entities [15,16]. The title "2019 Jingdezhen City Fuliang County Kangshan Dike Partial Area UAV LiDAR Point Cloud Dataset" exemplifies this: through its hierarchical structure of 'City–County–Specific Geographic Feature–Sub-region,' it clearly delineates relationships of spatial containment, localization, and hierarchy, moving from macro-level administrative divisions to micro-level observation extents. As geographic terminology continuously evolves and semantic shifts occur frequently, classification models require enhanced adaptability and the ability to update dynamically. Consequently, developing computationally efficient and semantically precise text classification methods is a critical need in EO analytics. Such methods address the demands of managing rapidly growing, increasingly heterogeneous datasets while improving the precision of information retrieval systems and named entity recognition (NER) frameworks. These gains, in turn, strengthen the efficiency and reliability of critical downstream applications, including geographic information systems (GIS), disaster mitigation, and urban intelligence services in smart cities.
In practical applications, high-quality annotated corpora are essential for training effective classification models [10,17]. However, in the geographic information domain, such resources are typically expensive to produce and require specialized expertise, making them especially scarce for low-frequency classes and exacerbating the problem of data sparsity [18]. This challenge is further intensified in short-text scenarios [19,20,21], where simplified syntax, narrow vocabulary coverage, and vague semantic representations hinder model generalization. A critical issue in understanding geographic information is semantic heterogeneity [22], where identical geographic concepts or terms are often assigned varied meanings, hindering precise comprehension.
Moreover, terminology in the geoscience domain is typically highly specialized, with certain phrases conveying nuanced, domain-specific meanings. Terms such as “remote sensing interpretation data” and “remote sensing application data” lack systematic definitions and semantic distinctions in general-purpose knowledge bases (e.g., Wikipedia) [23], rendering general pre-trained models insufficient for accurate semantic understanding in domain-specific applications. The absence of structured semantic knowledge in this field limits existing models’ comprehension of specialized terminology, thereby constraining the efficacy of geographic text classification systems.
This study addresses long-standing challenges in geographic text classification, including data scarcity, class imbalance, highly domain-specific terminology, and the prohibitive cost of manual annotation. We propose and validate HierLabelNet, a two-stage LLMs framework with data augmentation and label selection for Earth observation text classification. Leveraging the text generation capabilities of LLMs, the framework synthesizes training data that preserves geospatial semantic characteristics using only a minimal set of labeled samples. This approach effectively mitigates the limited size of the original corpus and its uneven category distribution. By narrowing the label selection space of the LLM, computational overhead is significantly reduced, while the model's susceptibility to variations in label formatting and ordering is simultaneously mitigated. This strategic alignment and restriction of LLM outputs enhances the stability and consistency of classification performance.
This framework not only significantly improves the accuracy and robustness of geographic text classification but also reduces reliance on large-scale manual annotation and diminishes the demand for annotators’ domain expertise. As such, it offers a viable path toward the automated and efficient management of geographic information texts. The proposed methodology provides both theoretical support and a practical foundation for advancing intelligent geospatial information processing technologies, holding substantial academic value and considerable application potential.

2. Related Works

2.1. Data Synthesis with LLM

The performance of text classification models heavily relies on the quality of training data. However, the processes of data collection and annotation often present significant challenges [24], including high costs, time-consuming procedures, and an increasing demand for large-scale, diverse datasets—challenges that are particularly pronounced in the specialized domain of geographic information text classification. To mitigate data scarcity and annotation costs, researchers frequently employ text transformation techniques—such as synonym replacement, paraphrasing, and back-translation—to generate augmented samples, thereby enriching dataset diversity and enhancing model robustness [25]. However, these methods demonstrate significant limitations regarding semantic preservation, contextual appropriateness, and generation control. For example, Wei and Zou’s Easy Data Augmentation (EDA) framework implements basic operations—word-level replacement, insertion, swapping, and deletion—to improve text classification performance in low-resource settings without requiring additional external resources [26]. Despite demonstrating effectiveness on small datasets, EDA-generated samples frequently exhibit reduced naturalness and semantic fidelity, rendering them inadequate for complex tasks that demand high linguistic quality. Bayer et al. [27] pointed out that conventional data augmentation techniques—including word substitution, sentence shuffling, and back-translation—despite showing moderate efficacy in low-resource contexts, frequently depend on static lexical resources or rule-based heuristics. Consequently, these approaches exhibit limitations in generating text with sufficient semantic diversity and contextual coherence, thereby introducing potential risks of label drift and semantic distortion.
In recent years, data synthesis techniques based on LLMs have emerged as a novel research paradigm to mitigate the challenges associated with limited data availability and annotation cost [28,29,30]. LLMs demonstrate robust capabilities in natural language understanding and generation, making them promising tools for generating task-relevant synthetic data [31,32], thereby exhibiting significant potential in supporting foundational yet critical NLP tasks such as text classification [33]. Ubani et al. introduced a zero-shot prompting method utilizing GPT-3.5 that generates semantically similar augmented samples without additional training [34]. Similarly, ChatAug reformulates sentences into multiple semantically equivalent variants to enhance model adaptability to diverse expressions [35]. Meanwhile, Genie achieves more natural, faithful, and varied synthetic data through fine-grained control of the generation process [36], and DIALOGIC efficiently expands limited dialog datasets with minimal or zero human intervention, significantly improving performance in low-resource conversational systems [37].
Building on these findings, we implement LLM-based data synthesis to address data sparsity and class imbalance in earth observation text classification. Through the integration of domain-specific knowledge into prompt engineering, we direct the model to generate synthetic samples enriched with geographic features while maintaining semantic consistency with target scenarios. Our experimental results demonstrate that augmenting real training data with these synthetic samples yields significant improvements in classification accuracy, thus validating the efficacy and applicability of our approach for earth observation text classification tasks.

2.2. Text Classification with LLM

Pretrained language models such as BERT [38] and RoBERTa [39], trained on large-scale corpora containing billions of tokens and comprising hundreds of millions of parameters, have demonstrated robust semantic modeling capabilities. These models can efficiently compute semantic similarity between texts and have been widely adopted across various NLP tasks, including text classification, where they have achieved state-of-the-art performance. However, their generalization ability to novel corpora remains highly dependent on the availability of substantial volumes of high-quality annotated data and typically requires task-specific fine-tuning to adapt effectively.
In recent years, the emergence of LLMs has led to a paradigm shift in NLP research. Researchers have increasingly explored zero-shot and few-shot in-context learning strategies to reduce dependence on manual annotations in text classification tasks [40,41,42], yielding promising results. Kostina et al. [43] conducted a comprehensive evaluation of multiple LLMs across two different classification scenarios, considering variations in model size, quantization strategies, and architectural configurations. They also examined the impact of prompt engineering on model performance, demonstrating that LLMs generally outperform traditional methods when handling complex classification tasks. Moreover, Ahmadnia et al. [44] proposed an active learning framework integrated with few-shot learning to further enhance LLM performance under limited supervision.
Nonetheless, challenges persist when LLMs are applied to classification tasks involving a large number of label options or semantically overlapping categories. Lu et al. [45] observed that the classification accuracy of LLMs significantly degrades in scenarios with high semantic similarity among candidate labels. Furthermore, LLMs exhibit biases toward specific label formats or positional arrangements during output generation, introducing additional variability and instability. To mitigate these issues, Lu et al. adopted an iterative label space compression strategy that progressively reduces the set of candidate labels through successive rounds of LLM inference. However, this approach incurs substantial computational and temporal costs due to the intensive reliance on model inference.
Building upon these insights, this study proposes a simplified label space compression strategy based on semantic similarity computation. By directly measuring the semantic relevance between the input text and each candidate label, the method performs label pre-ranking and filtering, thereby significantly reducing dependence on LLM inference and lowering computational resource consumption. Compared to iterative methods, the proposed approach not only demonstrates improved classification accuracy but also effectively circumvents the model’s bias toward specific label formats or positions, enhancing the consistency and robustness of classification outcomes.

3. Methodology

3.1. Data Synthesis Method

In this study, a vector database was constructed using a small set of manually annotated samples, enabling semantic expansion of input texts through an embedding-based similarity search mechanism, as illustrated in Figure 1. For example, the raw data from a groundwater resource survey conducted in September 2021 in the southwestern region of Aolunbulage Town, Alxa League, Inner Mongolia, is labeled as Remote Sensing Inversion Data in our classification system. Similarly, a dataset containing economic control variables for various U.S. states spanning 2012 to 2022 is labeled as Socioeconomic Data. Specifically, for each input instance, semantically similar samples were retrieved. Leveraging the natural language understanding capabilities of LLMs, entity recognition is performed on these example-driven inputs to extract key elements from each sample, including temporal information, spatial references, core data themes, and data resolution. These elements were then used to enrich the semantic context and incorporated into prompts fed to the LLMs, guiding the generation of synthetic samples aligned with the original class labels.
Considering the spatiotemporal characteristics and domain-specific terminology inherent in geographic information texts, additional rule-based constraints were introduced in prompt design. The generation rules primarily constrain key elements in the text—such as temporal expressions, spatial references, and core data themes—by specifying transformation requirements during the text generation process. For auxiliary expressions involving time and space, random substitutions are performed based on the LLM’s internal corpus to enhance sample diversity. For domain-specific terms, replacements are made using semantically similar terminology extracted from the data themes of similar examples, ensuring domain consistency and semantic coherence.
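As a concrete illustration, this retrieval-then-prompt synthesis step can be sketched as follows; `embed_fn`, the prompt wording, and the data layout are hypothetical stand-ins (the paper's bge-m3-backed vector database is abstracted behind the embedding callable), not the authors' released code:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_synthesis_prompt(seed, corpus, corpus_vecs, embed_fn, k=3):
    """Retrieve the k most similar labeled samples from the vector database
    and assemble a rule-constrained generation prompt for the LLM."""
    q = embed_fn(seed["text"])
    order = np.argsort([cosine(q, v) for v in corpus_vecs])[::-1]
    neighbors = [corpus[i] for i in order[:k]]
    examples = "\n".join(f'- "{n["text"]}" -> {n["label"]}' for n in neighbors)
    return (
        f"Generate one new geographic data title for the label '{seed['label']}'.\n"
        "Rules: randomly vary the temporal and spatial expressions; replace the "
        "core data theme only with semantically similar domain terms drawn from "
        "the examples; keep the label meaning unchanged.\n"
        f"Similar labeled examples:\n{examples}\n"
        f'Seed sample: "{seed["text"]}"'
    )
```

The rule text in the prompt mirrors the constraints described above: free substitution for auxiliary time and space expressions, but theme replacement only within semantically similar domain terminology.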
During generation, decoding parameters such as Temperature ∈ (1.0, 1.5) were systematically adjusted to balance output diversity and generation stability. Temperature constitutes a critical hyperparameter that regulates the stochasticity of text generation in language models. A low Temperature generates outputs characterized by heightened determinism and enhanced consistency. Conversely, a high Temperature produces outputs exhibiting greater lexical diversity and semantic novelty.
Furthermore, a semantic confidence score was computed for each generated sample-label pair, and only high-confidence samples were retained as valuable additions to the augmented training set. Based on our analysis, semantic complexity and model prediction stability exhibited variations across different categories. Consequently, instead of employing a uniform static threshold, we calibrated thresholds independently for each category, ultimately establishing confidence thresholds within the range of 0.85 to 0.95. This interval maintains data reliability while preserving maximum sample diversity.
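A minimal sketch of this per-category confidence filter follows, assuming a `score_fn` that returns the semantic confidence of a (text, label) pair; the individual threshold values are illustrative picks within the paper's calibrated 0.85–0.95 range, not reported numbers:

```python
def filter_synthetic(pairs, score_fn, thresholds, default=0.90):
    """Keep only generated (text, label) pairs whose semantic confidence
    meets the per-category threshold calibrated in the 0.85-0.95 range."""
    return [
        (text, label)
        for text, label in pairs
        if score_fn(text, label) >= thresholds.get(label, default)
    ]

# Illustrative per-label thresholds (hypothetical values): stricter for
# categories whose labels overlap semantically with their neighbors.
thresholds = {"RS Inversion": 0.95, "RS Application": 0.93, "Socioeconomic": 0.85}
```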

3.2. Retrieval Label

The set of candidate labels L = {l_1, l_2, …, l_N} is defined as the collection of potential labels retrieved through a hybrid retrieval process, as shown in Figure 2. Specifically, the retrieval process identifies the Top-K similar texts T = {t_1, t_2, …, t_K}, where each text t_i is associated with a set of labels. To refine the candidate labels, we define operations including identifying duplicate labels, merging similar labels, and removing redundant information. For a given text x, we retrieve the Top-K similar texts and their associated labels via a combination of vector-based and BM25 retrieval methods; the resulting similarity scores guide the weighting process.
To integrate lexical relevance and semantic similarity, a hybrid scoring function is proposed, computed as follows:

$$ \mathrm{Score}_{\mathrm{hybrid}} = w_{\mathrm{bm25}} \times \mathrm{Score}(D, Q) + w_{\mathrm{vector}} \times \cos(v, u) \quad (1) $$

The first component, Score(D, Q), represents the BM25 relevance score, scaled by w_bm25, while the second component incorporates the cosine similarity cos(v, u) between the embedding vectors v and u of the query text and the candidate document, scaled by w_vector. This formulation enables the simultaneous capture of lexical matching signals and semantic relationships, providing a comprehensive metric for ranking documents.
The BM25 relevance score between document D and query Q is computed as a summation over the n query terms:

$$ \mathrm{Score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \left( 1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}} \right)} \quad (2) $$

IDF(q_i) represents the inverse document frequency of the query term q_i, and f(q_i, D) denotes the term frequency of q_i in document D. The numerator of the fraction scales the term frequency by (k_1 + 1), while the denominator incorporates length normalization: f(q_i, D) is combined with the factor k_1 (1 − b + b |D| / avgdl), where |D| and avgdl are the current document length and the average document length in the collection, respectively. The parameters k_1 and b are tunable constants that regulate the impact of term-frequency scaling and length normalization.
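For concreteness, Equations (1) and (2) can be sketched in Python as follows; the IDF smoothing variant, the default k_1 and b values, and the fusion weights are common-practice assumptions, since the paper does not report them:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.5, b=0.75):
    """BM25 relevance of one document for a query, mirroring Equation (2).
    doc_freq maps each term to the number of documents containing it."""
    tf = Counter(doc_terms)
    score = 0.0
    for q in set(query_terms):  # sum once per unique query term
        if q not in tf or q not in doc_freq:
            continue
        idf = math.log(1 + (n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
        norm = tf[q] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[q] * (k1 + 1) / norm
    return score

def hybrid_score(bm25, cos_sim, w_bm25=0.4, w_vector=0.6):
    """Weighted fusion of lexical and semantic relevance (Equation (1));
    the weights shown here are illustrative assumptions."""
    return w_bm25 * bm25 + w_vector * cos_sim
```

In practice, the BM25 side would typically come from an off-the-shelf inverted index; the manual form here simply mirrors Equation (2) term by term.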
To minimize label complexity, R-Label is used to reduce the number of candidate labels, focusing on high-quality labels while filtering out irrelevant ones. This approach avoids iterative selection and improves label precision. By limiting the LLM calls and reducing token usage, the framework significantly improves efficiency. Extensive experiments with different T o p _ K values demonstrate the method’s ability to enhance classification accuracy, particularly in zero-shot settings and tasks involving large label sets.
Based on the integrated retrieval process described above, a subset L′ ⊆ L is obtained. We then utilize in-context learning (ICL) to instruct the LLM to select the appropriate label l* ∈ L′ for the input t, as in Formula (3):

$$ l^{*} = \mathrm{LLM}_{\theta}\left( \mathcal{I}_{\mathrm{ICL}}(t, L') \right) \quad (3) $$

where \mathcal{I}_{\mathrm{ICL}}(t, L') denotes the few-shot prompt constructed from the input text t and the reduced label set L′.
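A hedged sketch of this in-context label-selection step; the prompt wording and the `llm` callable (any text-in/text-out chat client) are illustrative placeholders rather than the authors' implementation:

```python
def select_label(llm, text, candidate_labels, demonstrations):
    """Instruct the LLM to pick one label from the reduced set L' using
    few-shot in-context examples (Formula (3))."""
    shots = "\n\n".join(f"Text: {t}\nLabel: {l}" for t, l in demonstrations)
    options = ", ".join(candidate_labels)
    prompt = (
        f"{shots}\n\n"
        f"Choose exactly one label from [{options}] for the following text.\n"
        f"Text: {text}\nLabel:"
    )
    return llm(prompt).strip()
```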
String-matching mechanisms fail to address the semantic equivalence problem between canonical labels and their paraphrased variants (e.g., mapping the model-generated "meteorology" to the standardized "atmosphere-ocean" category).
To overcome this limitation, we propose Vector-based Output Regularization (VOR). Using bge-m3, we encode both the LLM output v_output and the predefined canonical labels L = {l_1, …, l_n} into semantic embeddings, then compute the cosine similarity between the output embedding and each label embedding; the canonical label with the highest similarity is taken as the final classification result.
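A minimal sketch of VOR under these definitions, where the `encode` callable stands in for a bge-m3-style sentence encoder returning one vector per input string (the exact encoder API is an assumption):

```python
import numpy as np

def vor_align(llm_output, canonical_labels, encode):
    """Vector-based Output Regularization: map a free-form LLM output
    (e.g., 'meteorology') to the closest canonical label (e.g.,
    'atmosphere-ocean') by cosine similarity in embedding space."""
    vecs = np.asarray(encode([llm_output] + list(canonical_labels)))
    v_out, label_vecs = vecs[0], vecs[1:]
    sims = label_vecs @ v_out / (
        np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(v_out)
    )
    return canonical_labels[int(np.argmax(sims))]
```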

4. Experiments

4.1. Dataset and Evaluation Metrics

4.1.1. Dataset

The EO data utilized in this study originate from the China Earth Observation Data Center (https://noda.ac.cn/datasharing/search, accessed on 14 October 2024), featuring a hierarchical taxonomy comprising four primary categories: Earth observation data, value-added information products, remote sensing foundational data, and auxiliary datasets. The primary categories are partitioned into 10 distinct yet heterogeneous secondary subclasses, as shown in Table 1. A key observation is the substantial semantic overlap between the labels of these subclasses; this semantic similarity distribution, calculated via the bge-m3 model, is visualized in Figure 3. The dataset exhibits a characteristically imbalanced distribution, with orders-of-magnitude disparities between head classes (e.g., optical/inversion products) and tail classes (e.g., remote sensing sample data), as shown in Figure 4.

4.1.2. Evaluation Metrics

The Micro-F1 is calculated by aggregating the true positives (TP), false positives (FP), and false negatives (FN) across all classes, followed by computing the global precision and recall, and finally calculating their harmonic mean. Since this metric weighs each instance equally regardless of its class membership, it inherently emphasizes performance on majority classes. Consequently, Micro-F1 effectively reflects the overall classification performance of a model, making it particularly suitable for evaluating systems in scenarios with highly imbalanced class distributions.
In contrast, the Macro-F1 is computed by first calculating the precision, recall, and F1 score for each individual class independently, and then averaging these F1 scores across all classes. This metric assigns equal weight to each class regardless of its frequency, thereby focusing more on the model’s ability to distinguish minority classes. Consequently, Macro-F1 is more sensitive to performance disparities among different classes under imbalanced data conditions, offering a more comprehensive assessment of the model’s classification capabilities across the entire label space.
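In symbols, writing TP_c, FP_c, and FN_c for the per-class counts over C classes, the two metrics are:

$$ P_{\mathrm{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)}, \qquad R_{\mathrm{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FN_c)}, \qquad \text{Micro-F1} = \frac{2\, P_{\mathrm{micro}} R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}} $$

$$ \text{Macro-F1} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{F1}_c, \qquad \mathrm{F1}_c = \frac{2\, P_c R_c}{P_c + R_c} $$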

4.2. Experimental Settings

To comprehensively evaluate the proposed classification framework, we deployed the open-source LLMs Qwen2.5 [46] and LLaMA3.1 [47] in a local computing environment, selected for their open availability, strong performance, and fine-tuning support. Balancing the representational capacity of contemporary open-source LLMs against the constraints of our local computational infrastructure, we chose two representative architectures for the experimental evaluation: Qwen2.5-7B-Instruct and LLaMA3.1-8B-Instruct. During the experiments, class-balanced data synthesis was achieved by constructing fine-grained prompts for under-represented categories to guide data generation, while applying downsampling to overrepresented classes.
Subsequently, a semantic vector database was constructed using the bge-m3 embedding model. For each input instance, semantically similar samples were retrieved to construct enhanced prompts, which were then utilized to guide the LLMs in performing the text classification task. To improve instruction-following capabilities, we utilized all open-source models in their instruction-tuned versions. For consistency and generation stability across experiments, Temperature was uniformly set to 1.0 across all model configurations. When Temperature = 1.0, the model samples from the original softmax probability distribution. Setting a higher Temperature enhances the diversity of generated content but may lead to a drop in Macro-F1. Conversely, a lower Temperature results in more deterministic and repetitive outputs, which are more strongly influenced by specific examples in the input context. To establish comprehensive comparison baselines, we selected BERT and RoBERTa as representative pre-trained language models for comparative analysis in text classification tasks.
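Formally, with token logits z_i, Temperature-scaled sampling draws each token from

$$ p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)} $$

so that T = 1 recovers the unscaled softmax, T < 1 sharpens the distribution toward the top-ranked tokens, and T > 1 flattens it.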

4.3. Main Results

Experimental results in Table 2 demonstrate that the proposed HierLabelNet framework significantly enhances the performance of LLMs in geographic text classification tasks. We evaluated the framework under few-shot learning settings by systematically varying the number of contextual examples (denoted as Q) used in prompt construction, assessing its adaptability and scalability across different model architectures and data configurations.
Under extremely low-shot conditions (Q = 1), HierLabelNet yielded substantial performance improvements. For instance, when applied to the Qwen2.5-1.5B-Instruct model, micro-F1 increased by 3.30% and macro-F1 by a dramatic 28.85% in relative terms. These gains persisted as Q increased. In the case of Qwen2.5-7B-Instruct, micro-F1 and macro-F1 rose by 4.17% and 6.14%, respectively, under Q = 1, indicating robust scalability across model capacities.
Cross-model evaluations further revealed model-specific sensitivities to HierLabelNet. Generative LLMs such as the LLaMA3.1 series exhibited greater responsiveness to the framework. For example, LLaMA3.1-8B-Instruct showed consistent performance gains from Q = 1 to Q = 8, with the highest observed improvements being 3.86% (micro-F1) and 1.83% (macro-F1). These results highlight the generalizability and adaptability of R-Label in Chinese-language contexts, particularly for complex geographic text classification tasks.
However, the experiments also revealed inherent limitations in context-based learning with LLMs. As Q increases, performance improvements tend to plateau, with marginal gains diminishing to below 2%, suggesting that current LLM architectures face computational bottlenecks in assimilating large volumes of semantic examples. In contrast, HierLabelNet delivers more substantial gains in the low-Q regime, offering higher efficiency and reduced token consumption—favorable characteristics for real-world deployment.

4.4. Data Synthesis Analysis

As can be seen in Figure 4, we systematically contrast the original label distributions with their augmented counterparts, validating the effectiveness of class-balancing strategies through quantitative distribution comparison. The original dataset exhibits a highly concentrated distribution pattern, with the categories of “Optics” and “Remote Sensing Inversion” dominating disproportionately while other categories suffer from severe underrepresentation, reflecting significant class imbalance. This skewed distribution tends to induce models to overfit features of dominant categories while inadequately learning patterns from minority classes, thereby generating biased predictions and demonstrating performance degradation on low-frequency categories (e.g., “Remote Sensing Samples”). In contrast, the optimized dataset shows substantial improvement in class proportionality. Although “Optics” and “Remote Sensing Inversion” remain relatively prevalent, their dominance diminishes with enhanced representation of minority categories. This balanced configuration enables comprehensive feature learning across all classes, effectively mitigating distributional bias and strengthening generalization capability.
To evaluate the effectiveness of the proposed method under different data distribution conditions, we constructed two distinct datasets from the original geographic text corpus. The first dataset preserved the inherent label imbalance through random sampling, while the second balanced the label distribution by employing equal sampling across augmented minority class samples. To control for the influence of dataset size on model performance, both datasets were constrained to 1000 samples.
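The balanced-dataset construction can be sketched as follows; the dictionary layout and function name are illustrative assumptions, with only the 1000-sample cap taken from the text:

```python
import random
from collections import defaultdict

def balanced_subset(samples, n_total=1000, seed=0):
    """Build the class-balanced comparison dataset: equal sampling per label
    from the augmented pool, capped at n_total items overall (a sketch of
    the construction described above, not the authors' released script)."""
    random.seed(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)
    per_class = n_total // len(by_label)
    subset = []
    for group in by_label.values():
        subset.extend(random.sample(group, min(per_class, len(group))))
    return subset
```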
In the experiments, we employed Qwen2.5-7B-Instruct as the base model and evaluated classification performance under both the "Full Labels" and "Ours" settings across the two datasets. As shown in Table 3, model performance improves progressively and significantly in all scenarios as the number of contextual examples (Q) increases. Notably, at Q = 2, even within the Full Labels setting, models trained on the augmented (balanced) dataset achieved nearly 2% higher accuracy in both micro-F1 and macro-F1 scores compared to those trained on the original (imbalanced) data.
Further analysis indicates that although the traditional Full Labels model maintains a relative advantage under certain conditions on the original dataset, our proposed method exhibits more substantial performance gains when combined with data augmentation. Specifically, under the low-resource condition of Q = 1, the “Ours” method yields a macro-F1 improvement of over 10 percentage points compared to the Full Labels baseline, underscoring the critical role of data augmentation in supporting the core capabilities of our framework.
Notably, across all Q values, the “Ours” approach consistently outperforms the Full Labels baseline in micro-F1 scores. This indicates that even under limited-resource scenarios, our framework—leveraging augmented geographic text data and reduced label space—can significantly enhance classification accuracy. Overall, these experimental results validate the adaptability and generalizability of our method in complex geographic text classification tasks, while highlighting the crucial importance of high-quality data augmentation in the automation of geospatial information processing.

4.5. Ablation Study of R-Label

We evaluated the performance under different conditions: without label retrieval optimization (w/o R-Label), without similar samples (w/o Similar Samples), without sample learning (w/o Samples), and without both sample learning and label retrieval optimization (w/o Samples + R-Label). To mitigate random fluctuations, all ablation experiments were repeated five times under identical random seeds, with results reported as mean performance metrics accompanied by standard deviations. The experimental configuration maintained consistent hyperparameter settings, training epochs, and model architecture across all conditions. Target components were progressively disabled through masking mechanisms: the label retrieval optimization ablation was implemented by disabling the cross-modal semantic matching algorithm, while the similar samples module elimination employed a random sampling replacement strategy. Throughout the experiments, the data preprocessing pipeline remained identical to the baseline model to ensure rigorous variable control. This progressive ablation strategy enabled effective isolation of individual module impacts on overall performance. The experimental results are shown in Table 4, revealing statistically significant performance degradation across EO datasets, confirming the critical role of each component.
Our full model achieved 71.12% micro-F1 and 69.62% macro-F1. The removal of retrieved label guidance (w/o R-Label) led to consistent performance drops, indicating its importance in semantic grounding. More dramatically, eliminating similar samples demonstration (w/o Similar Samples) caused severe degradation, particularly highlighting the contextual samples’ crucial role in maintaining prediction stability. Complete removal of sample demonstrations (w/o Samples) results in 62.82% micro-F1 and 54.58% macro-F1, with the macro-F1 decrease emphasizing samples’ importance for class-balanced performance. The most severe degradation occurs when jointly removing both components (w/o Samples + R-Label), plunging to 39.78% micro-F1 and 31.61% macro-F1, suggesting synergistic interaction between sample-based contextual learning and label-guided reasoning.
The hierarchical degradation patterns across both datasets systematically verify that:
  • Retrieved labels provide essential semantic constraints for task understanding.
  • Contextual samples offer crucial domain-specific grounding.
  • Their layered integration enables complementary knowledge fusion.
The non-linear performance drop under joint ablation confirms the components' multiplicative rather than additive interaction.
The confusion matrix visualization analysis (as shown in Figure 5) demonstrates that the proposed method achieves substantial improvements in critical categories: the terrain category shows a 10.3% relative accuracy gain from zero-shot to optimization phases, while the geography category exhibits stable 25% enhancement. These improvements stem from our dual-stage optimization framework: the candidate label refinement stage enhances class separability through spatial feature re-weighting, whereas the retrieval mechanism reduces feature space noise interference via dynamic sample selection.
Experimental results reveal synergistic effects of algorithmic combinations on complex categories: (1) The non-linear improvement trajectory of optical categories empirically validates the parameter convergence characteristics of the feature selection mechanism; (2) The linear gains in remote sensing application and interpretation categories confirm the effectiveness of semantic granularity optimization; (3) The robustness improvement in ground categories demonstrates the critical role of low-dimensional feature modeling in classification stability. Particularly, the unified framework elevates average accuracy by 42.7% across six core categories (atmospheric and geographical domains), with primary advantages manifested in two dimensions: noise filtering and enhanced label-semantic alignment.
Ablation studies further attribute methodological improvements: The combined strategy of latent semantic analysis and prompt-tuning achieves F1-score enhancement in socio-economic categories, significantly outperforming conventional data augmentation approaches. The method effectively mitigates long-tailed distribution challenges.
This study establishes an interpretable optimization paradigm for multi-category classification in complex scenarios, whose methodological framework demonstrates direct transferability to other fine-grained classification tasks.

4.6. Sample and Label Quantity in Context Learning

This study conducted a systematic controlled experiment to explore how retrieval mechanism parameters influence the inference efficiency of LLMs. As shown in Figure 6, we constructed a comprehensive testing framework over an enhanced geographic text dataset, focusing on two key parameters: text_top (the number of retrieved samples used for prompt construction) and label_top (the number of candidate labels). The experiments covered 15 parameter combinations, with text_top ∈ {1, 2, 4, 8, 16} and label_top ∈ {5, 10, 15}. To ensure statistical stability, each experiment was repeated five times under strict variable control, and the average inference time was reported.
The results showed that, under a fixed label_top, inference time grew quasi-linearly with increasing text_top. For example, when label_top = 5, increasing text_top from 1 to 16 led to an approximately 15.9% rise in inference time. Similar trends were observed for label_top = 10 and label_top = 15, with time increases of 13.1% and 13.8%, respectively.
Under a fixed text_top, inference time increased in a stepwise manner with respect to label_top. Taking text_top = 1 as an example, increasing label_top from 5 to 15 raised the inference time from 89.4 s to 102.8 s, about 15.0%; for text_top = 16, the corresponding increase was 12.9%.
When both text_top and label_top increased simultaneously, the inference time exhibited superlinear growth. Specifically, from text_top = 1, label_top = 5 to text_top = 16, label_top = 15, the total increase in inference time reached 30.9%, exceeding the theoretical sum of the individual effects. This suggests a synergistic interaction between the two parameters, indicating a compound increase in computational load when constructing and processing complex prompts.
We further investigated the impact of the candidate label space size during the retrieval phase on geographic text classification performance. As illustrated in Figure 7, the results revealed a clear downward trend in classification performance as the label space grew. When the number of candidate labels was set to label_top = 5, the model achieved a micro-F1 score of 62.82% and a macro-F1 score of 56.58% on the EO dataset. However, as the label space expanded to label_top = 10 and 15, performance dropped significantly, with micro-F1 scores decreasing to 59.75% and 55.77%, respectively.
Notably, when using the full label space (i.e., without any filtering), a dramatic performance degradation was observed as follows: the micro-F1 score plummeted to 39.78%, while the macro-F1 score dropped further to 31.61%. These results suggested that an overly large candidate label space could severely interfere with the model’s decision boundaries, thereby weakening its discriminative capacity in complex multi-class classification tasks and substantially diminishing overall classification effectiveness.

5. Conclusions

EO data in real-world applications often suffers from issues such as data sparsity and class imbalance, which significantly impede the performance of automated classification models. Moreover, the acquisition of high-quality annotated data typically entails prohibitive costs and extended timeframes. To address these challenges, this study proposes and implements a data augmentation and efficient classification framework tailored for geographic text. The framework innovatively integrates LLMs, advanced data augmentation techniques, and an efficient semantic vector indexing mechanism. Comprehensive experimental evaluations demonstrate that the proposed framework significantly enhances both the predictive accuracy and computational efficiency of geographic text classification while substantially mitigating dependence on manually annotated datasets. The core contribution of this work lies in its ability to alleviate key bottlenecks in data processing efficiency for geographic text classification, offering a more effective and scalable approach for managing increasingly large and diverse EO datasets. Furthermore, it holds strong potential for enhancing the accuracy of related information retrieval systems, thereby supporting critical applications including GIS, disaster response, and smart city services.
Despite its promising performance, the effectiveness of the proposed framework is inherently influenced by the capabilities of the underlying embedding models and LLMs. We anticipate that as these foundational models continue to evolve and improve, the performance of the framework in text classification tasks will further advance. Future work will focus on the following directions: (1) constructing higher-quality and larger-scale domain-specific datasets to enable more effective fine-tuning of embedding models and LLMs, thereby enhancing their understanding and representation of EO domain knowledge; (2) extending the applicability of the framework to multimodal EO data classification scenarios, exploring methods for integrating heterogeneous sources such as images, text, and spatiotemporal data.

Author Contributions

Zugang Chen put forward the original idea and conducted the experiments. Le Zhao conducted the experiments and analyzed the results and also wrote the paper. Zugang Chen provided some suggestions for the revision of the paper. Guoqing Li contributed to the conceptualization of the study and participated in the writing, reviewing, and editing of the manuscript. Jing Li, Hengliang Guo, and Jian Wang were involved in reviewing and editing the manuscript, as well as providing resources for the research. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [Grant No. 42201505]; the Natural Science Foundation of Hainan Province of China [Grant No. 622QN352]; the National Key Research and Development Program of China [Grant No. 2021YFF070420304]; and the Computer Network and Information Special Project of the Chinese Academy of Sciences [Grant No. 2025000010]. The authors are very grateful to the anonymous reviewers and editors, who have greatly helped improve the quality of the paper.

Data Availability Statement

The analysis datasets for the current study are available from the first author on reasonable request.

Acknowledgments

The authors thank the reviewers and editors for their constructive comments on this paper.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Behnke, J.; Mitchell, A.; Ramapriyan, H. NASA’s Earth Observing Data and Information System—Near-Term Challenges. Data Sci. J. 2019, 18, 40. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Lu, L.; Zhao, Y.; Wang, Y.; Wei, D.; Wu, X.; Ma, X. Recent Advances in Using Chinese Earth Observation Satellites for Remote Sensing of Vegetation. ISPRS J. Photogramm. Remote Sens. 2023, 195, 393–407. [Google Scholar] [CrossRef]
  3. He, L.; Guo, K.; Gan, H.; Wang, L. Collaborative Data Offloading for Earth Observation Satellite Networks. IEEE Commun. Lett. 2022, 26, 1116–1120. [Google Scholar] [CrossRef]
  4. Roncella, R.; Zhang, L.; Boldrini, E.; Santoro, M.; Mazzetti, P.; Nativi, S. Publishing China Satellite Data on the GEOSS Platform. Big Earth Data 2023, 7, 398–412. [Google Scholar] [CrossRef]
  5. Ochiai, O.; Harada, M.; Hamamoto, K. Earth Observation Data Utilization for SDGs Indicators: 15.4.2 and 11.3.1. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2339–2342. [Google Scholar]
  6. Rinaldi, M.; Ruggieri, S.; Ciavarella, F.; De Santis, A.P.; Palmisano, D.; Balenzano, A.; Mattia, F.; Satalino, G. How Can Be Used Earth Observation Data in Conservation Agriculture Monitoring? In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2022–2025. [Google Scholar]
  7. Kavvada, A. Knowledge Generation Using Earth Observations to Support Sustainable Development. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 915–917. [Google Scholar]
  8. Caon, M.; Ros, P.M.; Martina, M.; Bianchi, T.; Magli, E.; Membibre, F.; Ramos, A.; Latorre, A.; Kerr, M.; Wiehle, S.; et al. Very Low Latency Architecture for Earth Observation Satellite Onboard Data Handling, Compression, and Encryption. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 7791–7794. [Google Scholar]
  9. Rousi, M.; Sitokonstantinou, V.; Meditskos, G.; Papoutsis, I.; Gialampoukidis, I.; Koukos, A.; Karathanassi, V.; Drivas, T.; Vrochidis, S.; Kontoes, C.; et al. Semantically Enriched Crop Type Classification and Linked Earth Observation Data to Support the Common Agricultural Policy Monitoring. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 529–552. [Google Scholar] [CrossRef]
  10. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning Based Text Classification: A Comprehensive Review. arXiv 2021, arXiv:2004.03705. [Google Scholar]
  11. Boukhers, Z.; Khan, A.; Ramadan, Q.; Yang, C. Large Language Model in Medical Informatics: Direct Classification and Enhanced Text Representations for Automatic ICD Coding. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024; pp. 3066–3069. [Google Scholar]
  12. Beliveau, V.; Kaas, H.; Prener, M.; Ladefoged, C.N.; Elliott, D.; Knudsen, G.M.; Pinborg, L.H.; Ganz, M. Classification of Radiological Text in Small and Imbalanced Datasets in a Non-English Language. arXiv 2024, arXiv:2409.20147. [Google Scholar]
  13. Wang, J.; Zhao, Z.; Wang, Z.J.; Cheng, B.D.; Nie, L.; Luo, W.; Yu, Z.Y.; Yuan, L.W. GeoRAG: A Question-Answering Approach from a Geographical Perspective. arXiv 2025, arXiv:2504.01458. [Google Scholar]
  14. Decoupes, R.; Interdonato, R.; Roche, M.; Teisseire, M.; Valentin, S. Evaluation of Geographical Distortions in Language Models: A Crucial Step towards Equitable Representations. In Discovery Science, Proceedings of the 27th International Conference, DS 2024, Pisa, Italy, 14–16 October 2024; Springer: Cham, Switzerland, 2025; Volume 15243, pp. 86–100. [Google Scholar]
  15. Chen, P.; Xu, H.; Zhang, C.; Huang, R. Crossroads, Buildings and Neighborhoods: A Dataset for Fine-Grained Location Recognition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 3329–3339. [Google Scholar]
  16. Hu, Y.; Mao, H.; McKenzie, G. A Natural Language Processing and Geospatial Clustering Framework for Harvesting Local Place Names from Geotagged Housing Advertisements. Int. J. Geogr. Inf. Sci. 2019, 33, 714–738. [Google Scholar] [CrossRef]
  17. Liu, H.; Qiu, Q.; Wu, L.; Li, W.; Wang, B.; Zhou, Y. Few-Shot Learning for Name Entity Recognition in Geological Text Based on GeoBERT. Earth Sci. Inform. 2022, 15, 979–991. [Google Scholar] [CrossRef]
  18. Vajjala, S.; Shimangaud, S. Text Classification in the LLM Era—Where Do We Stand? arXiv 2025, arXiv:2502.11830. [Google Scholar]
  19. Kong, E.; Zhang, J.; Yu, D.; Shen, M. Chinese Short Text Classification Method Based on Enhanced Prompt Learning. In Proceedings of the 2024 7th International Conference on Computer Information Science and Application Technology (CISAT), Hangzhou, China, 12–14 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 423–427. [Google Scholar]
  20. Sun, Z.; Harit, A.; Cristea, A.I.; Yu, J.; Shi, L.; Al Moubayed, N. Contrastive Learning with Heterogeneous Graph Attention Networks on Short Text Classification. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–6. [Google Scholar]
  21. Chen, J.; Hu, Y.; Liu, J.; Xiao, Y.; Jiang, H. Deep Short Text Classification with Knowledge Powered Attention. Proc. AAAI Conf. Artif. Intell. 2019, 33, 6252–6259. [Google Scholar] [CrossRef]
  22. Kuo, C.-L.; Chou, H.-C. An Ontology-Based Framework for Semantic Geographic Information Systems Development and Understanding. Comput. Geosci. 2023, 181, 105462. [Google Scholar] [CrossRef]
  23. Chen, H.; Zhao, Y.; Chen, Z.; Wang, M.; Li, L.; Zhang, M.; Zhang, M. Retrieval-Style In-Context Learning for Few-Shot Hierarchical Text Classification. Trans. Assoc. Comput. Linguist. 2024, 12, 1214–1231. [Google Scholar] [CrossRef]
  24. Li, Z.; Zhu, H.; Lu, Z.; Yin, M. Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 10443–10461. [Google Scholar]
  25. Liu, P.; Wang, X.; Xiang, C.; Meng, W. A Survey of Text Data Augmentation. In Proceedings of the 2020 International Conference on Computer Communication and Network Security (CCNS), Xi’an, China, 21–23 August 2020; pp. 191–195. [Google Scholar]
  26. Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6382–6388. [Google Scholar]
  27. Bayer, M.; Kaufhold, M.-A.; Reuter, C. A Survey on Data Augmentation for Text Classification. ACM Comput. Surv. 2022, 55, 146. [Google Scholar] [CrossRef]
  28. Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large Language Models for Data Annotation and Synthesis: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 930–957. [Google Scholar]
  29. Long, L.; Wang, R.; Xiao, R.; Zhao, J.; Ding, X.; Chen, G.; Wang, H. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. In Findings of the Association for Computational Linguistics, Proceedings of the ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11065–11082. [Google Scholar]
  30. Guo, X.; Chen, Y. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. arXiv 2024, arXiv:2403.04190. [Google Scholar]
  31. Choi, J.; Kim, Y.; Yu, S.; Yun, J.; Kim, Y. UniGen: Universal Domain Generalization for Sentiment Classification via Zero-Shot Dataset Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1–14. [Google Scholar]
  32. Tao, C.; Fan, X.; Yang, Y. Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation. In Proceedings of the 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT), Hangzhou, China, 20–22 December 2024; pp. 628–634. [Google Scholar]
  33. Patwa, P.; Filice, S.; Chen, Z.; Castellucci, G.; Rokhlenko, O.; Malmasi, S. Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 20–25 May 2024; ELRA and ICCL: Paris, France, 2024; pp. 6017–6023. [Google Scholar]
  34. Ubani, S.; Polat, S.O.; Nielsen, R. ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT. arXiv 2023, arXiv:2304.14334. [Google Scholar]
  35. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Zeng, F.; Liu, W.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. IEEE Trans. Big Data 2025, 11, 907–918. [Google Scholar] [CrossRef]
  36. Yehudai, A.; Carmeli, B.; Mass, Y.; Arviv, O.; Mills, N.; Shnarch, E.; Choshen, L. Achieving Human Parity in Content-Grounded Datasets Generation. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  37. Li, Z.; Chen, W.; Li, S.; Wang, H.; Qian, J.; Yan, X. Controllable Dialogue Simulation with In-Context Learning. In Findings of the Association for Computational Linguistics, Proceedings of the EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4330–4347. [Google Scholar]
  38. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
  39. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  40. Gretz, S.; Halfon, A.; Shnayderman, I.; Toledo-Ronen, O.; Spector, A.; Dankin, L.; Katsis, Y.; Arviv, O.; Katz, Y.; Slonim, N.; et al. Zero-Shot Topical Text Classification with LLMs—An Experimental Study. In Findings of the Association for Computational Linguistics, Proceedings of the EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 9647–9676. [Google Scholar]
  41. Tian, K.; Chen, H. ESG-GPT:GPT4-Based Few-Shot Prompt Learning for Multi-Lingual ESG News Text Classification. In Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing, Torino, Italia, 20–25 May 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 279–282. [Google Scholar]
  42. Liu, Y.; Li, M.; Pang, W.; Giunchiglia, F.; Huang, L.; Feng, X.; Guan, R. Boosting Short Text Classification with Multi-Source Information Exploration and Dual-Level Contrastive Learning. Proc. AAAI Conf. Artif. Intell. 2025, 39, 24696–24704. [Google Scholar]
  43. Kostina, A.; Dikaiakos, M.D.; Stefanidis, D.; Pallis, G. Large Language Models For Text Classification: Case Study And Comprehensive Review. arXiv 2025, arXiv:2501.08457. [Google Scholar]
  44. Ahmadnia, S.; Jordehi, A.Y.; Heyran, M.H.K.; Mirroshandel, S.A.; Rambow, O.; Caragea, C. Active Few-Shot Learning for Text Classification. arXiv 2025, arXiv:2502.18782. [Google Scholar]
  45. Lu, Z.; Tian, J.; Wei, W.; Qu, X.; Cheng, Y.; Xie, W.; Chen, D. Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models. In Findings of the Association for Computational Linguistics, Proceedings of the ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.-W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 7841–7864. [Google Scholar]
  46. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2025, arXiv:2412.15115. [Google Scholar]
  47. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
Figure 1. Data synthesis and judgment of the confidence level.
Figure 2. A visualization of the application architecture of label optimization and few-shot learning in LLM, which performs vectorization and output regularization on the data of LLM.
Figure 3. Distribution map of semantic similarity calculation between different labels.
Figure 4. Visualization of the comparison of label distributions, (a) label distribution of the original data; (b) label distribution after data augmentation.
Figure 5. A visualization of the confusion matrix demonstrates the effectiveness of our framework in enhancing single-label classification: (a) Zero-shot sample; (b) Random sample; (c) Random sample and R-Label; (d) Similar sample and R-Label.
Figure 6. Visualization of the changes in inference time of the dataset under the variable parameters text_top and label_top.
Figure 7. Visualization of the evaluation results of Micro-F1 and Macro-F1 for the dataset under the variable parameter label_top. "Full" represents the scenario with all labels.
Table 1. The corresponding relationship of hierarchical labels in the Earth observation dataset.

Root | Child
Earth Observation | Optical
Information and Deep-Processing | Remote Sensing Inversion (RS Inversion), Remote Sensing Interpretation (RS Interpretation), Remote Sensing Application (RS Application)
Remote Sensing Foundational | Remote Sensing Sample (RS Sample)
Other | Fundamental Geographic (Geographic), Socioeconomic, Topographic, Ground Monitoring (Ground), Atmospheric and Oceanic (Atmospheric)
Table 2. Micro-F1 and Macro-F1 scores on the EO dataset. We reported the average, standard deviation, and best results across five experiments. Bold: the best result.

Model | Q = 1 Micro-F1 | Q = 1 Macro-F1 | Q = 2 Micro-F1 | Q = 2 Macro-F1 | Q = 4 Micro-F1 | Q = 4 Macro-F1 | Q = 8 Micro-F1 | Q = 8 Macro-F1
BERT | 36.51 ± 0.00 | 28.50 ± 0.00 | 45.23 ± 0.00 | 40.47 ± 0.00 | 59.83 ± 0.21 | 57.19 ± 0.26 | 63.90 ± 0.00 | 61.31 ± 0.00
RoBERTa | 7.05 ± 0.00 | 2.96 ± 0.00 | 21.16 ± 0.00 | 12.36 ± 0.00 | 39.83 ± 0.00 | 33.08 ± 0.00 | 43.98 ± 0.00 | 39.95 ± 0.00
Qwen2.5-1.5B-Instruct | 60.58 ± 0.72 | 45.41 ± 3.07 | 57.93 ± 1.07 | 42.29 ± 4.45 | 57.68 ± 0.99 | 47.48 ± 3.07 | 62.99 ± 1.23 | 56.89 ± 7.43
Llama3.1-8B-Instruct | 59.67 ± 1.46 | 58.76 ± 1.19 | 62.66 ± 1.37 | 60.13 ± 3.37 | 62.91 ± 1.86 | 60.26 ± 1.65 | 63.76 ± 2.17 | 62.20 ± 2.58
Qwen2.5-7B-Instruct | 69.79 ± 0.35 | 67.06 ± 0.41 | 68.96 ± 0.50 | 66.16 ± 0.57 | 68.63 ± 0.21 | 65.68 ± 0.23 | 68.79 ± 0.50 | 66.38 ± 0.50
Qwen2.5-1.5B-Instruct (HierLabelNet) | 62.58 ± 1.07 | 58.53 ± 2.83 | 61.91 ± 0.77 | 53.54 ± 0.93 | 62.82 ± 0.99 | 60.36 ± 0.52 | 64.15 ± 0.50 | 62.37 ± 0.47
Llama3.1-8B-Instruct (HierLabelNet) | 60.03 ± 1.67 | 59.69 ± 1.77 | 64.24 ± 1.19 | 62.76 ± 0.79 | 64.98 ± 1.57 | 62.93 ± 1.80 | 66.22 ± 0.73 | 63.34 ± 0.79
Qwen2.5-7B-Instruct (HierLabelNet) | 72.70 ± 0.62 | 71.18 ± 0.62 | 71.53 ± 0.21 | 69.50 ± 0.30 | 71.12 ± 0.35 | 69.62 ± 0.38 | 70.37 ± 0.91 | 68.53 ± 0.98
Table 3. Micro-F1 and Macro-F1 scores on the dataset. We reported the average, standard deviation, and best results across three experiments. Bold: the best result.

Q | Method | Original Data Micro-F1 | Original Data Macro-F1 | Augmented Data Micro-F1 | Augmented Data Macro-F1
1 | Full Labels | 71.51 ± 4.15 | 67.27 ± 6.20 | 74.27 ± 0.62 | 67.91 ± 4.15
1 | Ours | 68.74 ± 0.62 | 65.98 ± 0.75 | 78.42 ± 0.63 | 77.34 ± 0.79
2 | Full Labels | 74.82 ± 0.42 | 72.77 ± 0.45 | 76.62 ± 0.21 | 68.24 ± 0.34
2 | Ours | 72.89 ± 0.42 | 70.61 ± 0.40 | 79.67 ± 0.42 | 78.71 ± 0.44
4 | Full Labels | 77.73 ± 0.42 | 76.02 ± 0.59 | 78.15 ± 0.63 | 76.23 ± 0.59
4 | Ours | 72.20 ± 0.42 | 69.58 ± 0.56 | 80.08 ± 0.42 | 78.62 ± 0.40
Table 4. Ablation experiments on our framework (based on Qwen2.5-7B-Instruct), evaluated on the EO dataset.

Setting | Micro-F1 | Macro-F1
Ours | 71.12 ± 0.35 | 69.62 ± 0.38
w/o R-Label | 68.63 ± 0.21 | 65.68 ± 0.23
w/o Similar Samples | 57.51 ± 2.08 | 51.42 ± 3.14
w/o Samples | 62.82 ± 0.89 | 54.58 ± 3.57
w/o Samples + R-Label | 39.78 ± 0.42 | 31.61 ± 3.63
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
