Review

A Survey of Multi-Label Text Classification Under Few-Shot Scenarios

1 School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 The Sixty-Third Research Institute, National University of Defense Technology, Nanjing 210007, China
3 Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 8872; https://doi.org/10.3390/app15168872
Submission received: 3 July 2025 / Revised: 2 August 2025 / Accepted: 6 August 2025 / Published: 12 August 2025

Abstract

Multi-label text classification is a fundamental and important task in natural language processing, with widespread applications in specialized domains such as sentiment analysis, legal document classification, and medical coding. However, real-world applications often face challenges such as high annotation costs, data scarcity, and long-tailed label distributions. These issues are particularly pronounced in professional fields like healthcare and law, significantly limiting the performance of classification models. This paper focuses on the topic of few-shot multi-label text classification and provides a systematic survey of current research progress and mainstream techniques. From multiple perspectives, including modeling under few-shot settings, research status, technical approaches, commonly used datasets, and evaluation metrics, this study comprehensively reviews the existing literature and advances. At the technical level, the methods are broadly categorized into data augmentation and model training. The latter includes paradigms such as transfer learning, prompt learning, metric learning, meta-learning, graph neural networks, and attention mechanisms. In addition, this survey explores the research and progress of specific tasks under few-shot multi-label scenarios, such as multi-label aspect category detection, multi-label intent detection, and hierarchical multi-label text classification. In terms of experimental resources, this review compiles commonly used datasets along with their characteristics and categorizes evaluation metrics that are widely adopted in few-shot multi-label classification settings. Finally, it discusses the key research challenges and outlines future directions, offering insights to guide further investigation in this field.

1. Introduction

Multi-label text classification (MLTC) is a fundamental task in the field of natural language processing (NLP), aiming to assign one or more predefined labels to each text sample simultaneously (e.g., sentiment analysis, topic classification). With the advent of the internet and the era of big data, the diversity and complexity of textual information have increasingly intensified. Single-label classification is no longer sufficient to comprehensively capture the multi-dimensional semantic information of texts. Consequently, multi-label classification has emerged as a solution and has become one of the current research hotspots. Existing methods demonstrate strong performance on large-scale datasets and complex task scenarios, effectively capturing multi-level dependencies within the text as well as intricate semantic relationships between texts and labels. However, these approaches often rely on a substantial number of labeled samples for sufficient training and fine-tuning to ensure model accuracy and stability. In practical applications, especially in specialized domains such as medical text processing, scientific paper classification, and legal document analysis, data scarcity, high annotation costs, and the frequent emergence of new labels result in severely limited training samples, thereby significantly constraining the classification performance of models. In particular, when training data for certain categories is extremely scarce, model performance on these categories is often suboptimal. Against this backdrop, it is essential to investigate the problem of multi-label text classification under few-shot scenarios. Traditional multi-label classification methods tend to suffer from severe overfitting when applied to limited data, preventing models from effectively learning the core features of the text. In recent years, the emergence of novel approaches such as few-shot learning [1,2], meta-learning [3], and prompt learning has significantly enhanced models’ ability to learn and generalize from a small number of samples, thereby enabling high classification accuracy even in data-scarce scenarios.
For the task of multi-label text classification under few-shot scenarios, extensive research efforts have been made, yielding a wealth of insights and promising results. However, there is still a lack of a comprehensive and systematic review to consolidate existing findings and provide guidance for future research. The literature surveyed in this review was primarily retrieved from authoritative academic databases, including Google Scholar, Web of Science, and ScienceDirect. The main search keywords included “multi-label text classification” and “few-shot multi-label text classification.” The inclusion criteria were as follows: (1) studies focusing on multi-label text classification tasks, with particular emphasis on few-shot and zero-shot learning scenarios; (2) formally published journal articles and conference papers; (3) works featuring a sound methodological framework and rigorous experimental design; and (4) the literature should demonstrate significant academic impact, with high citation counts and a strong focus on cutting-edge research. Unpublished studies and research unrelated to multi-label text classification under few-shot settings were excluded. Ultimately, 94 papers highly relevant to few-shot multi-label text classification were selected for in-depth discussion and systematic analysis, most of which were published within the last five years, thereby reflecting the latest advances in this field. Therefore, the main contributions of this paper are as follows:
(1) Conduct a comprehensive literature review to systematically organize and summarize recent advances in multi-label text classification under few-shot scenarios, providing valuable references for related research.
(2) Propose a rigorous classification framework to systematically categorize and structure existing studies.
(3) Identify the key challenges and methodological limitations faced by current approaches to multi-label text classification under few-shot scenarios.
(4) Review representative tasks in three specific application scenarios and provide a comparative analysis of the corresponding algorithms.
(5) Summarize the current research challenges in this field and discuss potential directions for future research.

2. Modeling and Current Research Status of Multi-Label Text Classification Under Few-Shot Scenarios

Multi-label text classification under few-shot scenarios refers to the task of assigning one or more relevant labels to a text under the condition of extremely limited training samples. This task combines the challenges of few-shot learning and multi-label classification: on the one hand, many labels in the training set correspond to very few or even zero samples; on the other hand, text instances often require multiple semantically related labels. Therefore, the model must maintain discriminative ability with limited samples while also modeling the potential dependencies between labels. This task typically presents two scenarios in practical research: the first is where the overall data volume is insufficient, and all label samples are scarce; the second is where the total data volume is relatively sufficient but highly imbalanced, with certain labels being rare due to high annotation costs, the emergence of new labels, or domain transfer, resulting in a significant long-tail distribution. Therefore, the key challenge of this task lies in developing classification models that can maintain robust performance under data scarcity while effectively identifying and optimizing tail labels.

2.1. Mathematical Description

Let there be a predefined label set $Y = \{y_1, y_2, \ldots, y_C\}$, where $C$ denotes the total number of labels. The training set is represented as $D_{train} = \{(x_i, Y_i)\}_{i=1}^{N}$, where $x_i$ denotes the $i$-th input text, $Y_i \subseteq Y$ represents the set of labels associated with $x_i$ (i.e., the multi-label setting), and $N$ is the number of training samples. The goal of the multi-label text classification task is to learn a classification function $f: x \rightarrow \{0, 1\}^C$ that maps an input text $x$ to a $C$-dimensional binary vector $\hat{Y} = f(x) = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_C)$ indicating the relevance of each label. Here, $\hat{y}_j = 1$ denotes that the text is predicted to contain label $y_j$, and $\hat{y}_j = 0$ otherwise. During model training, a sigmoid layer is typically applied to each output dimension, and binary cross-entropy (BCE) is used as the loss function to model each label independently. Under the conventional setting, the model’s generalization ability relies on a sufficient number of training samples (i.e., large $N$), which helps to adequately cover the label space $Y$ and improve overall classification performance.
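To make this setup concrete, the following minimal PyTorch sketch (all names, such as MultiLabelClassifier, are illustrative rather than drawn from any surveyed system) implements the per-label sigmoid head with a BCE loss:

```python
import torch
import torch.nn as nn

class MultiLabelClassifier(nn.Module):
    """Maps an encoded text vector to C independent label logits."""
    def __init__(self, encoder_dim: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(encoder_dim, num_labels)

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        return self.linear(text_embedding)  # raw logits, one per label

# BCEWithLogitsLoss fuses the sigmoid with binary cross-entropy,
# modeling each of the C labels as an independent binary decision.
model = MultiLabelClassifier(encoder_dim=768, num_labels=20)
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(4, 768)                        # batch of 4 encoded texts
y = torch.randint(0, 2, (4, 20)).float()       # binary label vectors Y_i
loss = criterion(model(x), y)
y_hat = (torch.sigmoid(model(x)) > 0.5).int()  # predicted label vector
```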
However, under the few-shot setting, the number of training samples is extremely limited. In particular, in extreme few-shot scenarios, each label $y_j \in Y$ in the training set is associated with at most $K$ labeled examples (where $K$ is a small constant, such as 1, 5, or 10), i.e., satisfying

$$\left| \{ (x_i, Y_i) \in D_{train} \mid y_j \in Y_i \} \right| \leq K, \quad \forall y_j \in Y$$

As a result, the total number of training samples is approximately $N \approx K \cdot C$. While conventional approaches often employ validation sets with a large number of instances to enhance model performance [4], this does not align with the requirements of few-shot learning. To more accurately simulate training and tuning under low-resource conditions, the size of the validation set $D_{val}$ is similarly constrained—each label appears at most $K$ times—ensuring that the training, validation, and testing phases are all conducted under a consistent few-shot setting. Moreover, in more rigorous few-shot learning scenarios, the N-way K-shot paradigm is commonly adopted, where the training data is divided into a support set, which provides $K$ examples per label for model learning, and a query set, which is used to evaluate the model’s generalization to unseen instances.
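As an illustration of how such a constrained training set can be assembled, the sketch below greedily subsamples a labeled pool so that every label is covered by at most K examples. The greedy policy and all names are our own simplifications; because a single text may carry several labels, exact per-label K-shot sampling is nontrivial in the multi-label setting.

```python
import random
from collections import defaultdict

def sample_k_shot(pool, K, seed=0):
    """Greedily keep a sample only if every one of its labels still has
    fewer than K selected examples, enforcing the per-label cap above."""
    rng = random.Random(seed)
    candidates = pool[:]
    rng.shuffle(candidates)
    counts, support = defaultdict(int), []
    for text, labels in candidates:
        if all(counts[y] < K for y in labels):
            support.append((text, labels))
            for y in labels:
                counts[y] += 1
    return support

# pool: list of (text, label_set) pairs
pool = [("doc a", {"sports"}), ("doc b", {"sports", "politics"}),
        ("doc c", {"politics"}), ("doc d", {"tech"})]
support_set = sample_k_shot(pool, K=1)
```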

2.2. Differences Between Conventional Multi-Label Text Classification and Multi-Label Text Classification Under Few-Shot Scenarios

As shown in Figure 1, the overall classification process consists of four key stages: data preparation, feature extraction and representation, classifier training, and label prediction. In the data preparation stage, conventional multi-label text classification tasks typically benefit from abundant training samples. Common preprocessing operations include tokenization, stopword removal, and label normalization. The dataset is usually split into training, validation, and test sets to support standard training and evaluation procedures. However, under few-shot settings, training samples are extremely scarce, and some labels may be associated with very few or even no examples. To compensate for the lack of information caused by data scarcity, external knowledge or metadata is often incorporated during preprocessing to enrich the semantic representation of the input text. Additionally, dataset partitioning commonly follows the N-way K-shot strategy, forming a support set and a query set to meet the experimental requirements of the few-shot learning paradigm.
During the feature extraction and representation stage, conventional multi-label text classification methods typically rely on classic word-embedding techniques and deep-neural-network architectures to transform discrete textual information into semantically rich continuous vector representations that serve as effective features for subsequent multi-label prediction tasks. In this process, pre-trained language models such as BERT and RoBERTa are often incorporated and fine-tuned to further enhance text representation capabilities. In contrast, under few-shot learning scenarios, to alleviate the challenges posed by scarce training data, data augmentation techniques—such as synonym replacement, back-translation, and synthetic sample generation—are commonly applied prior to feature extraction to increase input diversity and mitigate overfitting risks. Subsequently, the model performs representation learning based on pre-trained language models, achieving rapid adaptation to specific tasks through lightweight fine-tuning strategies. These methods, while maintaining parameter stability, contribute to enhancing the model’s robustness and generalization ability under low-resource conditions.
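As a simple illustration of the augmentation step described above, the toy sketch below performs probabilistic synonym replacement. The hand-written synonym table is purely illustrative; practical pipelines typically draw substitutes from WordNet or a back-translation model instead.

```python
import random

# Toy synonym table; real pipelines draw from WordNet or a
# back-translation model rather than a hand-written dictionary.
SYNONYMS = {"scarce": ["limited", "sparse"], "improve": ["enhance", "boost"]}

def synonym_replace(text: str, p: float = 0.3, seed: int = 0) -> str:
    """Replace each word that has a synonym entry with probability p,
    producing a label-preserving variant of the input text."""
    rng = random.Random(seed)
    words = text.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < p else w
           for w in words]
    return " ".join(out)

augmented = synonym_replace("labels are scarce so we improve augmentation")
```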
During the classifier training stage, traditional multi-label text classification methods primarily adopt supervised learning paradigms, such as linear classifiers or deep neural networks, to model the associations between labels. However, these methods are highly dependent on data scale and are difficult to directly adapt to few-shot scenarios. To address this issue, researchers have proposed various few-shot learning strategies. Common approaches include metric learning-based methods, which enable rapid generalization by constructing similarity measures between categories; prompt learning-based methods, which guide pre-trained models to understand task semantics through manually designed or automatically generated natural language prompts, thereby improving learning efficiency under low-resource conditions; and meta-learning-based methods (e.g., MAML [5]), which enable the model to quickly adapt to new tasks with few gradient updates by sharing initializations across tasks. These strategies provide new insights for few-shot multi-label text classification.
In the label prediction stage, conventional multi-label text classification methods typically perform parameter tuning using a validation set, followed by evaluating model performance on a test set. In contrast, under few-shot learning scenarios, the dataset is often divided into support and query sets to simulate low-resource conditions for assessing generalization ability, including the model’s predictive performance on unseen labels.
The core differences in challenges lie in the following aspects: The challenge in traditional multi-label text classification tasks lies in effectively modeling the complex relationships between labels, addressing the high-dimensional and sparse label space, and mitigating the impact of label imbalance and model complexity. In the few-shot setting, however, the focus shifts to feature learning and label association modeling under extremely low-resource conditions, particularly in scenarios involving long-tail and emerging labels, where maintaining model stability and generalization ability becomes crucial.
In summary, traditional multi-label text classification methods rely on abundant labeled data with well-established and standardized processes. However, in few-shot settings, due to data scarcity, insufficient information, and the prominence of long-tail label issues, significant challenges arise in model training. To address these challenges, research focus has gradually shifted from pure supervised learning to strategies combining pre-trained models, lightweight adaptation, and external knowledge enhancement, aiming to improve the robustness and generalization ability of models under low-resource conditions.

3. Technical Approaches

In current research, multi-label text classification under few-shot scenarios is primarily addressed through two technical approaches: data augmentation and model training. In terms of data augmentation, researchers commonly employ various strategies such as back-translation, synonym replacement, and pseudo-label expansion to enhance the model’s generalization capability and robustness to limited sample distributions. Additionally, some studies incorporate external knowledge [6] (e.g., predefined label terms and knowledge bases) to enrich the semantic representation of texts. For model training, existing methods can generally be categorized into two aspects: training paradigms and model architecture design. Training paradigms emphasize strategies for model learning, including transfer learning, prompt learning, metric learning, and meta-learning. In contrast, model architecture design focuses on the structural and representational capacity of the models, with representative architectures such as graph neural networks and attention mechanisms being employed to better capture complex inter-label relationships. A detailed taxonomy is illustrated in Figure 2.

3.1. Methods Based on Data Augmentation

In multi-label text classification under few-shot scenarios, data augmentation is a direct and effective strategy that addresses data scarcity by incorporating external auxiliary information or applying meaning-preserving transformations to original samples. This not only expands the volume of training data but also enhances the representational quality of the input features. Consequently, data augmentation improves the model’s generalization ability and robustness in low-resource scenarios, and it has been widely adopted across natural language processing tasks to mitigate training data insufficiency and boost classification performance.
To address the scarcity of annotated data in legal artificial intelligence tasks, Zhou et al. [7] proposed LAIAugment. The approach leverages self-training techniques to generate pseudo-labels for unlabeled samples extracted from large-scale judicial documents, thereby enabling semi-supervised learning. In parallel, an improved text similarity function is employed to retrieve semantically similar corpora to the labeled instances, enhancing both sample quality and feature representation. Empirical evaluations across various legal tasks—such as evidence extraction, legal element recognition, and multi-label prediction—demonstrate that LAIAugment consistently outperforms pre-trained models like RoBERTa, underscoring its effectiveness and practicality in low-resource legal text modeling.
Data augmentation has garnered increasing attention in recent years as a promising approach to address the challenge of long-tail label distributions. However, the extremely limited number of training instances associated with tail labels poses significant difficulties for conventional augmentation techniques in generating high-quality and diverse synthetic samples. Consequently, developing more targeted augmentation strategies to enhance model performance on underrepresented labels has emerged as a key research focus.
Taking extreme multi-label classification (XMC) tasks as an example, early methods were mostly based on sparse linear models or neural networks, but they performed poorly on long-tail labels. To improve the generalization ability of tail labels, Zhang et al. [8] proposed generative data augmentation (GDA), which expands the training data by generating label-invariant input perturbations and introduces a classifier with a label attention mechanism (e.g., LA-RoBERTa), significantly improving performance. Even rule-based augmentation methods, such as synonym replacement, perform well in data-scarce scenarios. In medical coding tasks, for instance, such augmentation improves the recognition of rare labels but also exposes a tendency to underestimate the overall label family. To address this, Falis et al. [6] combined data augmentation and synthesis techniques to alleviate the challenges posed by supervision scarcity at the data level and used analysis tools to reveal model prediction biases, guiding subsequent optimization. However, this method relies on external named entity recognition and linking (NER + L) tools, which have limitations. The difficulty of tail labels in long-tail label learning is also closely related to co-occurrence interference between labels. Representative methods such as LSFA [9], a multi-label pair-level data augmentation framework, specifically enhance the positive feature–label pairs of tail labels, significantly reducing the impact of co-occurring labels. Through prototype-supervised contrastive learning, LSFA obtains decoupled document representations and uses intra-class semantic transfer to carry statistical features of head labels over to tail labels, improving the representation ability of tail labels. Furthermore, XDA [10] explored a prompt-driven data augmentation method based on pre-trained language models (PLMs) specifically designed for XMC. During the fine-tuning of T5, XDA introduces soft prompts to generate label-conditioned samples, balancing sample diversity and label consistency, thereby alleviating label quality issues. Unlike traditional sample-level augmentation, XDA employs a pair-level augmentation mechanism, significantly enhancing the performance of tail labels by masking augmented samples of head labels.
In summary, contemporary multi-label data augmentation strategies are evolving from generic approaches toward more customized methods with label-aware and label-specific modeling capabilities. These advancements place particular emphasis on improving the recognition and generalization of low-frequency (tail) labels, offering more targeted solutions to the long-standing challenges posed by long-tail distributions in multi-label learning tasks.

3.2. Model-Based Training Approaches

In this task, the design of the model training strategy is crucial. Researchers have proposed various training methods with strong generalization ability to address the few-shot problem. Overall, current mainstream training methods can be categorized along two dimensions: training paradigm design and model architecture design.

3.2.1. Transfer Learning-Based Approaches

The motivation behind transfer learning [11] stems from the human ability to apply previously acquired knowledge to novel tasks, thereby accelerating problem-solving in new domains. Traditional machine learning typically assumes that training and test data share the same feature space and distribution; transfer learning relaxes this constraint by enabling models to learn from different yet related data distributions. The core idea is to pretrain a model on a source domain with abundant data and then transfer the acquired knowledge to a target domain with limited data, thereby enhancing performance on the target task. This approach is particularly effective when target domain data is scarce. Formally, given a source domain D s and its associated learning task L s , the objective is to leverage knowledge from D s to improve learning in a target domain D t and task L t , even when D s D t or L s L t .
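The following sketch illustrates this transfer recipe in its most common form: a pre-trained encoder carries the source-domain knowledge, and only a lightweight head is trained on the scarce target data. It assumes the Hugging Face transformers package; the model choice and freezing policy are illustrative, not taken from any surveyed system.

```python
import torch.nn as nn
from transformers import AutoModel

class TargetTaskModel(nn.Module):
    """Source knowledge lives in the pre-trained encoder; only the new
    classification head is trained on the small target-domain dataset."""
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # [CLS] pooling

model = TargetTaskModel(num_labels=50)
for p in model.encoder.parameters():  # freeze to protect source knowledge
    p.requires_grad = False
```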
In the domain of medical text processing, transfer learning has demonstrated significant advantages. Rios et al. [12] proposed a CNN-based transfer learning approach that first trains a model on PubMed abstracts to predict MeSH terms and subsequently transfers this knowledge to electronic medical record (EMR) data for ICD coding. This method effectively alleviates the challenge of predicting rare labels, with particularly notable improvements for low-frequency codes. To address the long-tail distribution problem, Li et al. [13] proposed the LCOAKT model, which leverages label co-occurrence information to perform semantic transfer between head and tail labels. This not only enhances the recognition ability of tail categories but also maintains the performance of head labels.

3.2.2. Prompt Learning-Based Approaches

In recent years, the rapid advancement of pre-trained language models (PLMs) has propelled prompt learning into prominence as an emerging paradigm to tackle the challenges of few-shot learning. The core idea behind prompt learning is to leverage natural language templates to reframe downstream tasks into cloze-style problems that align with the original training objectives of PLMs, thereby enabling task completion with little or no parameter fine-tuning. Prompt learning addresses these limitations by designing input templates and label verbalizers that exploit the inherent semantic knowledge embedded within PLMs, significantly enhancing generalization performance in few-shot and even zero-shot settings. As a result, an increasing number of studies have begun to explore the application of prompt-learning methods to multi-label classification tasks in few-shot environments.
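A minimal sketch of the cloze reformulation follows: the task is recast as predicting a masked token, and a hand-written verbalizer maps label words to classes. The template, verbalizer, and decision threshold here are illustrative assumptions; surveyed methods such as AMuLaP learn these mappings automatically.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Cloze template plus a one-word verbalizer per label (illustrative).
VERBALIZER = {"sports": "sports", "politics": "politics", "health": "health"}
text = "The team won the championship after a dramatic final."
prompt = f"{text} This text is about {tokenizer.mask_token}."

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = mlm(**inputs).logits[0, mask_pos]

# Score each label by the MLM logit of its verbalizer token; a threshold
# (rather than argmax) yields a multi-label prediction.
scores = {lab: logits[tokenizer.convert_tokens_to_ids(word)].item()
          for lab, word in VERBALIZER.items()}
predicted = [lab for lab, s in scores.items() if s > 0.0]  # toy threshold
```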
Early studies attempted to adapt multi-label tasks into the prompt-learning framework through manually designed or automatically generated prompt templates. However, these methods generally suffered from issues such as reliance on human expertise, limited applicability, and unstable performance. To address these problems, Wang et al. [14] proposed the AMuLaP method, which combines pre-trained language models and utilizes a one-to-many label mapping approach along with statistical strategies to automatically select label words under a fixed template, reducing the reliance on manual prompt design. Building on this, Livernoche et al. [15] further improved the method by automatically generating multi-word label mappings through statistical techniques. They validated its effectiveness across different sample sizes on models such as BERT, RoBERTa, and DeBERTa. The results showed that even under extremely low-resource scenarios, AMuLaP maintained strong performance, demonstrating the potential of prompt learning in multi-label few-shot tasks. Furthermore, addressing the challenges of label scarcity and knowledge association modeling in multi-label text classification, several studies have explored combining prompt tuning with external knowledge to improve model performance. For example, Wei et al. [16] constructed a prompt tuning model using a unified template and employed a thresholding mechanism to automatically link student exercises with course knowledge points, significantly enhancing classification performance. Hu et al. [17] alleviated the limitations of verbalizer design in traditional prompt learning by extending the label vocabulary (verbalizer) through an external knowledge base. Both studies demonstrate that introducing knowledge-aware prompt tuning mechanisms into multi-label few-shot tasks not only enhances model accuracy and generalization but also offers a feasible path for integrating structured knowledge with prompt learning in future research.
In the healthcare domain, the ICD coding task faces challenges of an extremely large label space (over 150,000 labels) and severe long-tail distribution. The KEPTLongformer model proposed by Yang et al. [18], which combines prompt-based fine-tuning techniques with label semantics, significantly enhances the recognition ability for rare disease labels. Subsequently, they reformulated ICD coding as an autoregressive text generation task [19], integrating the SOAP structure of clinical notes and designing multi-label generation prompt templates to map ICD codes through generated text. Their proposed GPsoap model demonstrates superior performance under few-shot conditions, offering a novel approach to mitigating the issue of rare ICD labels. In multi-intent recognition tasks, traditional methods often fail due to difficulties in accurately setting label thresholds and neglecting the relationships between intents, leading to poor performance. To address this, Zhou et al. [20] proposed a two-stage prompt fine-tuning method (PFT), as shown in Figure 3, which constructs a prompt template for intent count prediction to resolve the threshold estimation problem, significantly improving performance in few-shot multi-intent detection. Zhuang et al. [21] introduced the PLMA framework, which leverages additional clues generated by large language models (LLMs) and small language models (SLMs) to tackle challenges in few-shot multi-label intent recognition. In Chinese named entity recognition (NER), traditional methods perform well on large-scale datasets but struggle in small-sample and nested-entity recognition scenarios. To address this, Zhou et al. [22] proposed the MPBCNER model, which combines multi-label prompts and boundary information, effectively alleviating recognition challenges under few-shot conditions.
In summary, prompt learning, as an emerging and efficient paradigm for few-shot learning, has demonstrated remarkable advantages across various scenarios of few-shot multi-label text classification. By flexibly constructing templates, designing label verbalizers, and incorporating external knowledge and structural information, prompt learning not only alleviates challenges such as label scarcity and limited samples but also significantly enhances model generalization and stability under low-resource conditions. Although challenges remain—such as automating template design and improving prompt generalization—the paradigm’s strong transferability and scalability have opened new avenues for research and practical application in few-shot learning. With the continuous advancement of large language models, prompt learning is poised to unlock greater potential across increasingly complex tasks.

3.2.3. Metric Learning-Based Approaches

Metric learning is a machine-learning approach aimed at learning distance or similarity functions between samples. Its core objective is to map data into a feature space where similar samples are placed closer together while dissimilar samples are positioned farther apart. In few-shot environments, metric learning offers distinct advantages by capturing the similarity structure among samples, thereby facilitating the modeling of semantic correlations between labels and enhancing model robustness and generalization. Recent years have seen the development of various metric learning-based models to address these challenges, with representative methods including Siamese networks, prototypical networks, and matching networks.
The Siamese network is a typical metric-learning architecture consisting of two subnetworks with shared weights. By inputting paired samples and learning their similarity distribution, the network maps similar samples to adjacent representation spaces while pushing dissimilar samples farther apart. In few-shot settings, especially in multi-label text classification tasks with severe long-tail label distributions, Siamese networks demonstrate distinct advantages. For example, the hybrid Siamese convolutional neural network (HSCNN) proposed by Yang et al. [23] combines a conventional CNN model for head labels with a Siamese network structure for tail labels. This method effectively alleviates the extreme imbalance problem through a discriminative training strategy, significantly improving model performance both at the tail label and overall category levels. Moreover, Csányi et al. [24] explored the applicability of Siamese networks with a triplet loss function in multi-label few-shot scenarios. They found that, even with only 10 samples per class, the BERT-based Siamese network with a fully connected layer performed comparably to traditional models (e.g., TF-IDF vectorization + logistic regression) across multiple metrics, particularly excelling in new label recognition tasks.
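The contrastive training signal at the heart of such Siamese architectures can be sketched in a few lines; the small encoder below is a stand-in for the shared BERT/CNN branches used in the cited works, and the random tensors stand in for encoded texts.

```python
import torch
import torch.nn as nn

# Shared-weight encoder: the same network embeds anchor, positive, negative.
encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor = encoder(torch.randn(8, 768))    # texts with a given (tail) label
positive = encoder(torch.randn(8, 768))  # texts sharing that label
negative = encoder(torch.randn(8, 768))  # texts without it
loss = triplet_loss(anchor, positive, negative)
# Minimizing pulls same-label pairs together and pushes other pairs apart.
```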
In addition to Siamese networks, prototypical networks are another commonly used metric-learning method in few-shot learning. The core idea is to construct a prototype for each class using a small number of support samples, representing the center of the class in the embedding space. During classification, the query sample is assigned to the closest class by calculating the distance between it and the class prototypes. To enhance the quality of the prototypes, Hui et al. [25] proposed a prototype network based on a context-aware attention mechanism, which scores key support samples to generate better prototypes, thereby alleviating the issue of prototype bias. Wang et al. [26] employed 3D convolutional neural networks to construct prototype networks. However, most of these methods focus on single-label settings and struggle to address challenges such as sample noise and label co-occurrence in multi-label scenarios. To address this, Luo et al. [27] proposed a multi-label few-shot prototype network with an instance-level attention mechanism, which enhances the representation weights of support samples highly correlated with the current label while suppressing the interference from other labels, thus improving the distinguishability of prototypes and classification performance. However, this method heavily depends on the quality of semantic embeddings. For the issue of long-tail distribution in multi-label tasks, Xiao et al. [28] proposed a triad prototype-orthogonal network (TAPON), which constructs a universal mapping mechanism between few-shot prototypes and multi-sample classifier parameters, effectively improving the generalization ability of tail labels. More recently, Kong et al. [29] extended the prototype idea to multi-label structural modeling by proposing a prototype-based regularization method, aiming to retain both explicit and implicit label correlations. This work not only broadens the application scope of prototype networks but also provides new insights for enhancing structural preservation and generalization capability in extreme multi-label text classification tasks.
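The prototype construction and nearest-prototype classification underlying these methods reduce, in the single-label episodic case, to the skeleton below; the multi-label extensions discussed above add attention-weighted supports and per-label prototypes on top of it.

```python
import torch

def prototypes(support_emb: torch.Tensor, support_y: torch.Tensor) -> torch.Tensor:
    """support_emb: (N*K, D) embeddings; support_y: (N*K,) class ids 0..N-1.
    A class prototype is the mean of its K support embeddings."""
    n_classes = int(support_y.max()) + 1
    return torch.stack([support_emb[support_y == c].mean(0)
                        for c in range(n_classes)])

def classify(query_emb: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    # Assign each query to the nearest prototype; negative distance
    # doubles as a logit for a softmax over the N classes.
    dists = torch.cdist(query_emb, protos)  # (Q, N)
    return (-dists).softmax(dim=-1)

support_emb, support_y = torch.randn(10, 64), torch.arange(5).repeat(2)
probs = classify(torch.randn(3, 64), prototypes(support_emb, support_y))
```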
The core idea of matching networks is to compute the similarity between new samples and samples from the training set in the embedding space, combined with known labels for non-parametric prediction. This approach demonstrates strong generalization ability, especially in cases with very few samples. In multi-label text classification tasks, particularly in the context of ICD automatic coding in the healthcare domain, matching networks have shown significant advantages. Electronic medical records (EMRs) typically involve a large number of labels, uneven distribution, long text, and complex information, making traditional neural networks limited in predicting rare labels. To address this, Rios et al. [30] proposed the match–CNN model, which combines matching networks with convolutional network structures and designs a multi-label loss function to tackle label scarcity and co-occurrence issues, significantly improving classification performance. Yuan et al. [31] introduced a multi-sense matching mechanism (MSMN), which enhances label semantic representation by incorporating synonym knowledge of codes, improving label expression and sample matching quality and achieving excellent results on datasets such as MIMIC-III. These studies indicate that matching networks, by modeling instance-level similarity, provide an effective alternative for multi-label few-shot tasks, particularly demonstrating strong potential in label-sparse scenarios such as medical coding.
In summary, metric learning demonstrates strong modeling and generalization capabilities in few-shot multi-label text classification through its similarity measurement mechanism. Methods such as Siamese networks, prototype networks, and matching networks effectively alleviate the few-shot problem, particularly the challenges posed by long-tail label distributions. In the future, combining label knowledge modeling and structure-aware techniques, metric learning is expected to realize greater potential in complex tasks such as extreme multi-label classification and cross-domain transfer.

3.2.4. Meta-Learning-Based Approaches

Meta-learning is also referred to as “learning to learn” [32]. Its core principle involves training across multiple related tasks to extract transferable knowledge, thereby optimizing the learning mechanism so that the model can swiftly improve performance on entirely new tasks or novel labels with only a limited number of samples. Unlike traditional approaches that focus solely on optimizing performance for individual tasks, meta-learning fundamentally enhances adaptability and robustness in low-resource settings by enabling the model to abstract common patterns across tasks.
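The optimization-based flavor of meta-learning can be summarized by the inner/outer loop below, a first-order simplification in the spirit of MAML [5]; the linear model and synthetic task stream are placeholders for an actual text encoder and episode sampler.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)  # stand-in for a text encoder plus label head
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
inner_lr = 0.1

def tasks(n=100):  # synthetic episodes: (support_x, support_y, query_x, query_y)
    for _ in range(n):
        yield (torch.randn(5, 64), torch.randint(0, 2, (5, 10)).float(),
               torch.randn(5, 64), torch.randint(0, 2, (5, 10)).float())

def logits(x, weights):
    w, b = weights
    return x @ w.t() + b

for sx, sy, qx, qy in tasks():
    # Inner loop: one adaptation step away from the shared initialization.
    fast = [p.clone() for p in model.parameters()]
    g = torch.autograd.grad(loss_fn(logits(sx, fast), sy), fast)
    fast = [p - inner_lr * gi for p, gi in zip(fast, g)]
    # Outer loop: the adapted weights' query loss updates the initialization.
    meta_opt.zero_grad()
    loss_fn(logits(qx, fast), qy).backward()
    meta_opt.step()
```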
In practical applications, the attentive task-agnostic meta-learning (ATAML) method [33] represents an effective extension of optimization-based meta-learning frameworks, as illustrated in Figure 4. Building upon MAML [5], ATAML incorporates an attention mechanism to facilitate universal representation learning through task-agnostic parameter initialization, while enabling rapid task-specific adaptation via attention. This architecture not only enhances model generalization under low-resource conditions but also improves the modeling of internal semantic structures within text. Experimental results demonstrate that ATAML outperforms random initialization, pre-trained models, and the standard MAML approach across both single-label and multi-label tasks, highlighting the synergistic benefits of integrating attention mechanisms with meta-learning.
In addition to ATAML, recent research has introduced various innovative methods combining meta-learning to address the complex characteristics of few-shot multi-label text classification. For example, Meta-LMTC [34] is the first to incorporate meta-learning into large-scale multi-label text classification (LMTC), proposing a meta-learning framework that combines task construction strategies with low-resource adaptation objectives. This approach demonstrates strong task transferability and scalability, particularly in scenarios involving label scarcity, few-shot, and zero-shot tasks under long-tail distributions. In addressing the challenges of long-tail labels, HTTN [35] (head-to-tail network) extracts meta-knowledge from head labels and transfers it to the tail label classifier, significantly alleviating the performance bottleneck of tail labels. MetaRisk [36] integrates meta-learning with semi-supervised learning to resolve data scarcity and multi-label few-shot generalization issues in banking operational risk classification, providing a practical tool for the financial sector. In the field of medical coding, EPEN [37] (evidence-based meta network) introduces a meta-learning model that integrates knowledge transfer with evidence representation, enhancing the generalization ability from common to rare diseases by memorizing common category knowledge and constructing robust disease representations, demonstrating strong cross-task stability and clinical adaptability.
Overall, meta-learning offers a flexible and generalizable solution to few-shot multi-label text classification. By optimizing initialization parameters, incorporating task-sensitive attention mechanisms, and combining task construction strategies with label transfer mechanisms, these methods showcase the potential of meta-learning in low-resource environments from various perspectives. Furthermore, the integration of domain knowledge and semi-supervised strategies further expands the practical application of meta-learning. In the future, with more refined task modeling and the development of universal meta-knowledge representation methods, meta-learning is poised to play a central role in complex few-shot text classification tasks.

3.2.5. Graph Neural Network-Based Approaches

To alleviate the few-shot problem, graph neural networks (GNNs) have been widely adopted in few-shot multi-label text classification tasks. By jointly leveraging node features and adjacency relationships within a graph structure, GNNs effectively capture high-order semantic dependencies and improve both feature representations and label prediction accuracy. As a representative GNN model, graph convolutional neural networks (GCNNs), which generalize the convolution operation from grid-structured data such as images to arbitrary graphs, have demonstrated strong modeling capabilities in multi-label text classification. Prior research [38] has shown that GCNNs can construct text–label graph structures, significantly enhancing classification performance in low-resource textual settings.
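A single graph-convolution layer of the kind these methods stack over a label graph can be sketched as follows; the toy adjacency stands in for, e.g., a label co-occurrence or hierarchy graph, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = relu(A_hat @ H @ W), where A_hat
    is the symmetrically normalized adjacency of the label graph."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a = adj + torch.eye(adj.size(0))          # add self-loops
        d_inv_sqrt = a.sum(-1).pow(-0.5)          # D^{-1/2}
        a_hat = d_inv_sqrt.unsqueeze(-1) * a * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_hat @ self.weight(h))

# Toy label graph: 4 labels, node features from label-name embeddings.
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
label_feats = torch.randn(4, 32)
label_repr = GCNLayer(32, 16)(label_feats, adj)
```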
The novel architecture proposed by Rios et al. [39], ZAGCNN, combines label description information with a hierarchical label graph, propagates semantics through a GCNN, and introduces a label attention mechanism to enhance document representation, thereby significantly improving label prediction performance. Building upon this foundation, the DKEC method [40] further explores the synergistic role of domain knowledge and graph neural networks in modeling long-tail labels. It systematically enhances both graph construction and label alignment mechanisms by employing a heterogeneous graph structure and a heterogeneous graph transformer model, effectively addressing the limitations of conventional GCNs in representing heterogeneous medical knowledge and low-frequency labels.
To address the challenge of long-tail label distributions, Lu et al. [41] proposed KAMG, a multi-graph knowledge aggregation model. This approach constructs graph structures using label word embeddings, label descriptions, and prior label relationships, and introduces two variants based on graph convolutional neural networks (GCNs)—ACNN-KAMG and AGRU-KAMG. These models effectively capture high-order semantic dependencies among labels and offer structurally aware representation learning mechanisms for few-shot and zero-shot labels, making them powerful tools for tackling long-tail label challenges in multi-label learning. Building on this work, Chen et al. [42] proposed the NAS-HRL framework, which incorporates heterogeneous representation learning and neural architecture search. The model designs separate encoding subspaces for textual and label structures and uses GCNs to model semantic label relationships. By automatically searching for the optimal architecture to adapt to the complex structure between labels, this approach demonstrates stronger structural transfer and generalization capabilities. Chalkidis et al. [43] proposed a method that combines graph convolutional networks (GCN) and an extended Node2Vec-based label hierarchy, significantly enhancing model prediction performance by modeling the topological relationships between labels.
For medical tasks, Wang et al. [44] introduced the CoGraph framework, which incorporates graph contrastive learning into few-shot multi-label learning. By constructing a heterogeneous word-entity graph and introducing graph contrastive learning mechanisms (GSCL and GECL), the method effectively enhanced the representation of low-frequency labels, significantly improving medical coding performance. Chen et al. [45] proposed a relationship-enhanced method based on multi-level graph convolutional networks (GCN), which explores the hierarchical and co-occurrence relationships between labels. This approach improved the representation learning of rare ICD codes and enhanced classification performance under long-tail label distributions. In the tourism recommendation scenario, Rajaonarivo et al. [46] addressed the data scarcity issue of “lesser-known points of interest (POIs)” in social media by proposing a model that combines LightGCN with few-shot learning. This model constructs the POI-label relationship through graph structures and integrates graph embeddings with original embeddings, achieving more accurate label inference. The study demonstrated the transferability and scalability of GNNs in low-resource multi-label classification tasks.
In summary, graph neural networks (GNNs), owing to their strengths in modeling structured relational data, have emerged as a key technique in few-shot multi-label text classification. By constructing text–label or label–label graph structures and leveraging methods such as graph convolution, attention mechanisms, and heterogeneous graph modeling, GNN-based approaches effectively alleviate the challenges posed by long-tail label distributions and the representation of low-frequency labels.

3.2.6. Attention Mechanism-Based Approaches

Attention mechanisms date back to early studies in neural modeling; an early instance is the visual attention model proposed in [47]. In natural language processing, Bahdanau et al. [48] introduced attention into neural machine translation, where it was used to align and translate sequences more effectively. The core idea of attention lies in mimicking the human cognitive process of selectively focusing on salient information. In language tasks, attention mechanisms compute a distribution of weights over different positions in the input sequence—referred to as the attention distribution—to dynamically generate a context vector that guides the model toward the most relevant information for the task at hand. As research has progressed, attention mechanisms have evolved from early global and local attention models to more advanced forms such as self-attention and multi-head attention. These mechanisms have been widely applied across a range of tasks, including text classification, machine translation, action recognition, and recommendation systems [49].
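In its simplest form, the mechanism described above amounts to three steps: score each position, normalize the scores into an attention distribution, and take the weighted average. The dot-product scorer below is a deliberate simplification of the additive scoring network of Bahdanau et al. [48].

```python
import torch

def attention_context(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Weight each input position by its relevance to the query and
    return the attention-weighted context vector."""
    scores = keys @ query            # (seq_len,) relevance per position
    weights = scores.softmax(dim=0)  # the attention distribution
    return weights @ keys            # context vector, shape (dim,)

keys = torch.randn(12, 64)  # encoded sequence of 12 token positions
query = torch.randn(64)     # e.g., a label embedding used as the query
context = attention_context(query, keys)
```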
To address the challenges in few-shot multi-label text classification tasks, attention mechanisms have become a commonly used and effective approach due to their ability to dynamically focus on key information and enhance the model’s ability to distinguish between different label semantics. Some studies have combined attention mechanisms with meta-learning. For example, the ATAML [33] model integrates task-agnostic meta-learning strategies with label-guided attention modules, enabling adaptive attention to task-relevant text features, thereby significantly improving generalization ability under few-shot conditions.
In the domain of medical text classification, attention mechanisms have garnered significant interest due to their strengths in modeling label-specific semantics and enhancing generalization under few-shot scenarios. Vu et al. [50] proposed the LAAT (label attention model for ICD coding), a model specifically designed for multi-label ICD coding tasks (as illustrated in Figure 5). LAAT constructs independent attention representations for each label, enabling the extraction of highly relevant semantic features from lengthy clinical texts. This effectively mitigates issues related to the uneven distribution and wide span of label-relevant segments. Building upon this, the authors further introduced JointLAAT, which incorporates the semantic hierarchy of ICD codes to significantly improve the recognition of low-frequency labels. To further enhance the model’s capacity for representing tail and low-resource labels, Ge et al. [40] developed the DKEC method, which integrates medical domain knowledge into the attention mechanism. This model constructs label embeddings via a heterogeneous medical knowledge graph and introduces a heterogeneous label attention (HLA) mechanism to facilitate bidirectional interaction between labels and text representations. Wang et al. [51] proposed an attention-based soft prompt-learning method, enabling the prompt vectors to dynamically focus on label-relevant text regions. On the Chinese medical datasets KUAKE-QIC and CHIP-CTC, this method significantly outperforms traditional fine-tuning methods in terms of macro-F1-score, with a performance improvement of over 22%, particularly when only 20% of the training data is used.
In summary, the attention mechanism, with its advantages in modeling key information and semantic associations, has become an important technique in few-shot multi-label text classification tasks. From the early use of global and local attention to the evolution of self-attention and multi-head attention, this mechanism has been widely applied across various tasks. In the field of text classification, attention mechanisms not only improve the model’s generalization ability under limited samples but also enhance its capacity to model label semantic differences, offering new solutions to the few-shot challenge.

3.3. Other Research Approaches

In the study of few-shot multi-label text classification, aside from methods such as transfer learning, prompt learning, metric learning, and meta-learning, as well as approaches based on graph neural networks and attention mechanisms, a series of representative techniques have emerged in recent years focusing on enhancing the model’s generalization and label modeling capabilities. Yogarajan et al. [52] developed a cascading domain-specific transformer architecture, improving the model’s ability to recognize low-frequency labels in medical long texts. Rethmeier et al. [53] proposed a label embedding-based self-supervised contrastive learning method, which maps text and labels to a unified semantic space, employing noise contrastive estimation for training. This approach demonstrated good performance in few-shot and long-tail label classification tasks. Yao et al. [54] designed a dual-branch learning model (DBGB), introducing a gradient balancing loss function and a head class label under-sampling strategy to mitigate performance bias caused by label distribution imbalance, thereby enhancing the prediction capability of tail labels. Xu et al. [55] proposed the X-shot framework, which uniformly handles high-frequency, few-shot, and zero-shot labels. By transforming multi-label tasks into a triplet binary classification problem, they achieved unified modeling and inference of different label frequencies through instruction learning combined with indirect supervision mechanisms, demonstrating strong adaptability. Schopf et al. [56] introduced a fusion-based sentence embedding fine-tuning method (FusionSent), which, through dual contrastive learning and parameter fusion strategies, constructed an efficient classification model suitable for label-scarce, multi-label settings.
For ease of comparison and reference, Table 1 summarizes all the methods discussed in this section. Through analysis of the original texts, the shortcomings of the relevant methods are highlighted for a more comprehensive critical evaluation. The advantages of each method have already been detailed in the article, so they will not be reiterated here.

3.4. Multi-Model Performance Evaluation Under Similar Conditions

To better demonstrate the effectiveness of the aforementioned methods, we summarize the experimental results of the relevant models under identical conditions in Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7. The conditions include (1) the same dataset, (2) the same evaluation metrics, and (3) a primary focus on the challenge of long-tail label distribution. The optimal results are indicated in bold.
By analyzing the experimental results across multiple datasets, including AmazonCat-13K, RCV1, AAPD, EUR-Lex, and MIMIC-III, we can observe the performance differences among various method categories in the few-shot multi-label text classification task. These differences reflect the effectiveness of each method in addressing challenges such as long-tail label distributions and data sparsity.
Firstly, on the AmazonCat-13K and EUR-Lex datasets, the metric learning-based DBGB method and its extended version DBGB-ens outperform all other methods. This indicates that, in these two datasets, leveraging metric learning combined with ensemble learning effectively addresses label imbalance, particularly improving the prediction accuracy of tail labels.
On the RCV1 dataset, the LSFA method based on data augmentation and the ProtoMix method based on metric learning show similar performance and both perform excellently compared to other methods. Notably, LSFA enhances performance through label-specific feature augmentation, while ProtoMix addresses the data sparsity issue in long-tail label classification using prototype contrastive learning.
On the AAPD dataset, the LSFA method based on data augmentation achieves the best performance, surpassing other methods. This suggests that, for the AAPD dataset, data augmentation plays a key role in mitigating data sparsity and enhancing model robustness.
For the MIMIC-III-full dataset, the graph neural network-based method proposed by Chen et al. [45] (combined with EnrichedDescriptions) and the prompt learning-based MSMN + GPsoap combination method proposed by Yang et al. [19] both achieve strong results. This indicates that, in the complex medical data domain, combining graph neural networks with enriched label descriptions can effectively mine structured information from medical data, and integrating prompt learning-based methods further enhances the model’s performance in multi-label classification tasks.
Finally, in the MIMIC-III 50 dataset experiments, the prompt learning-based KEPTLongformer method demonstrated the best results, especially in rare disease coding tasks, where it significantly improved the accuracy of multi-label small-sample ICD code prediction. This result shows that, for rare label or rare disease tasks, combining prompt-based fine-tuning techniques with label semantics offers significant advantages.

4. Scenario-Specific Studies

In the context of few-shot learning, multi-label text classification has emerged as a critical research direction in natural language processing due to its capacity to model complex label structures under data-scarce conditions. This is particularly evident in downstream tasks characterized by inherent structural or semantic hierarchies—such as aspect category detection, intent detection, and hierarchical text classification—where few-shot multi-label approaches have demonstrated substantial practical potential and theoretical significance. This section focuses on these three representative tasks, systematically outlining the unique challenges and open problems associated with each, while also summarizing key existing methods and their corresponding solutions. The aim is to provide a comprehensive reference for future research efforts in these domain-specific applications.

4.1. Few-Shot Multi-Label Aspect Category Detection

Aspect category detection (ACD), also referred to as aspect category recognition [57], is a fundamental task in natural language processing that aims to identify the specific aspects or topics discussed within a given text. Formally, it is framed as a multi-label text classification problem, wherein a single review or opinionated sentence may correspond to multiple aspect category labels (as illustrated in Table 8). The task is characterized by several intrinsic challenges, including high semantic similarity among labels, complex label co-occurrence structures, and severe class imbalance. These challenges are further exacerbated under few-shot learning conditions. On the one hand, the scarcity of annotated data hinders the model’s ability to capture discriminative features for each category. On the other hand, semantic overlap and frequent co-occurrence among labels often lead to prototype confusion, thereby compromising classification performance. Moreover, the presence of noisy information and ambiguous expressions within user-generated texts significantly weakens model robustness and generalization. Consequently, addressing aspect category detection in few-shot multi-label settings necessitates tackling several core challenges: constructing highly discriminative category representations with limited supervision, mitigating semantic interference, enhancing multi-label inference capabilities, and suppressing the impact of textual noise. These issues have become central research foci in the development of more effective and resilient few-shot multi-label aspect category detection models.
To address the aforementioned challenges, a growing body of recent work has sought to enhance few-shot aspect category detection by integrating prototypical networks with multi-label reasoning mechanisms. Hu et al. [58] were the first to explore aspect category detection in the few-shot setting, introducing proto-AWATT, a multi-label few-shot learning framework based on prototypical networks. As shown in Figure 6, the model employs support-set and query-set attention mechanisms to mitigate noise interference and incorporates a dynamic thresholding strategy to accommodate the variable number of aspect labels within each sentence. This design effectively alleviates key challenges associated with multi-label aspect category detection under limited supervision. Building on this foundation, Liu et al. [59] proposed the label-enhanced prototypical network (LPN), which integrates label semantic information and contrastive learning. By leveraging textual descriptions of aspect categories as auxiliary knowledge, the LPN enhances prototype discriminability at the source level. A contrastive loss function is further introduced to cluster instances of the same class and separate those from different classes, thereby reducing prototype confusion and suppressing noise. Focusing on noise modeling and category distinctiveness, Zhao et al. [60] developed the label-driven denoising framework (LDF). This method uses label-guided attention to filter out noisy words and generate high-quality prototypes, while a label-weighted contrastive loss distinguishes between semantically similar prototypes. The approach yields substantial improvements in classification accuracy for few-shot multi-label aspect category detection. While the use of label textual descriptions has shown promise in enhancing attention mechanisms, such texts are often short and semantically sparse, limiting their ability to effectively differentiate among categories. Furthermore, traditional prototypical networks typically compute class prototypes by averaging support instance embeddings, overlooking the variation among samples that is particularly pronounced in multi-label aspect detection. To overcome these limitations, Wang et al. [61] proposed proto-SLWLA, a prototype network incorporating sentence-level weighting and label enhancement. This method addresses instance-level noise and label semantic ambiguity by assigning greater importance to high-quality samples via sentence-level attention and enriching label representations with language model-generated expansions. A query attention mechanism is also introduced to further suppress irrelevant information. To more effectively manage noise, Zhao et al. [62] proposed the FSO framework, which innovatively models semantic relations among samples using set-theoretic operations (intersection, difference, and union) to distinguish between relevant and irrelevant aspect content. The framework enhances prototype discriminability while employing class-specific query vectors and a multi-label joint loss optimization strategy to eliminate reliance on fixed thresholds, thereby improving multi-label prediction performance. Peng et al. [63] further advanced the field with VHAF, a variational hybrid attention framework. To address noise, VHAF introduces a hybrid mechanism combining aspect-level and cross-instance attention to enhance the discriminative power of aspect-specific embeddings and suppress irrelevant information. To mitigate estimation bias caused by limited samples, the model replaces point-based prototype estimation with variational distribution inference, thereby improving robustness and generalization in low-resource settings.
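Although these methods differ in how they denoise and calibrate, they share a common prototypical backbone: class prototypes are built from the support set, queries are compared against them, and a threshold decides which labels to emit. The following minimal sketch illustrates that shared skeleton in Python; the mean-pooled prototypes and the fixed cosine threshold are simplifying assumptions (proto-AWATT, for example, learns a dynamic threshold), so it is illustrative rather than a reimplementation of any particular model.
```python
import numpy as np

def build_prototypes(support_emb, support_labels):
    """Class prototypes as the mean of the support embeddings of each class.
    support_emb: (n_support, dim); support_labels: (n_support, n_classes) multi-hot.
    Under the N-way K-shot protocol every class has K support instances."""
    n_classes = support_labels.shape[1]
    return np.stack([
        support_emb[support_labels[:, c] == 1].mean(axis=0)
        for c in range(n_classes)
    ])

def predict_labels(query_emb, prototypes, threshold=0.5):
    """Cosine similarity of each query to each prototype; every class whose
    similarity exceeds the threshold is emitted (multi-label decision)."""
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    sims = q @ p.T  # (n_query, n_classes)
    return (sims > threshold).astype(int), sims
```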
Building upon the aforementioned advances, recent studies have sought to further enhance few-shot multi-label aspect category detection by incorporating prompt-learning mechanisms. These approaches leverage the prior knowledge and contextual semantic representations of pre-trained language models to construct prompt-augmented prototypical frameworks, achieving notable progress in both robustness and classification performance. Guan et al. [64] introduced ProtPrompt, a prompt-enhanced prototypical network that innovatively integrates prototype learning with prompt-based paradigms. By reformulating the downstream classification task as a masked language modeling (cloze) problem, ProtPrompt explicitly guides pre-trained language models to attend to aspect category information, thereby strengthening the discriminative capacity of sentence embeddings. Through prompt learning combined with cosine similarity optimization, the model effectively addresses noise interference and the lack of prototype separability in few-shot multi-label aspect category detection. Expanding on this line of work, Guan et al. [65] proposed the label-guided prompt (LGP) framework, which targets the dual challenges of noise interference and representation misalignment. LGP introduces a label-guided prompt-learning mechanism to enhance sentence-level semantic representations and employs category descriptions generated by large language models to guide prototype construction. This strategy significantly improves class discriminability and intra-class cohesion, while alleviating the adverse effects of limited training data and semantic overlap among multiple labels. To address the combined challenges of fragile prototype representations and the complexity of multi-label reasoning, Zhao et al. [66] proposed a novel relation graph-guided few-shot learning approach. This method explicitly models both intra-class and inter-class sample relationships by constructing a fully connected relation graph and performing graph propagation and aggregation to yield robust prototype embeddings. A multi-label inference mechanism is further integrated to strengthen label-query relevance, and graph-based contrastive learning is employed to reinforce intra-class consistency and inter-class distinctiveness. As a result, this framework effectively mitigates the issues of data sparsity and complex label prediction inherent in few-shot multi-label aspect category detection, particularly under extremely low-resource conditions such as the 1-shot setting.
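To make the cloze reformulation concrete, the sketch below scores candidate aspect categories by the masked-language-model probability of a label verbalizer at a [MASK] position, the core mechanism that prompt-enhanced methods build on. The prompt template and the single-token verbalizer map are hypothetical illustrations, not the templates used by ProtPrompt or LGP.
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def aspect_scores(sentence, verbalizers):
    """Score each aspect category by the MLM probability of its
    (single-token) verbalizer at the [MASK] position."""
    prompt = f"{sentence} This review is about [MASK]."  # illustrative template
    inputs = tok(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    return {label: probs[tok.convert_tokens_to_ids(word)].item()
            for label, word in verbalizers.items()}

# hypothetical verbalizer map; labels are predicted where the score is high enough
print(aspect_scores("The staff were rude but the pasta was great.",
                    {"service": "service", "food": "food"}))
```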
To facilitate a systematic comparison of the various modeling approaches, we present a summary of the experimental results in Table 9. All experiments were conducted under consistent settings using the FewAsp (multi) dataset [58], with task configurations following the standard N-way K-shot paradigm (e.g., 5-way 5-shot), where each meta-task includes N aspect categories, each supported by K labeled instances and evaluated with five fixed query samples per class. The evaluation metrics include Macro-F1 and AUC. It is worth noting that due to differences in experimental setups, the results of proto-SLWLA and the method proposed by Zhao et al. [66] are not included in the table. As shown in Table 9, under the 5-way 5-shot setting, the LGP method achieves the best overall performance. In contrast, under the 5-way 10-shot, 10-way 5-shot, and 10-way 10-shot configurations, VHAF consistently outperforms the other methods. Although VHAF's AUC and Macro-F1 scores are marginally lower than those of LGP in the 5-way 5-shot setting, the performance gap remains small. These findings suggest that VHAF, which leverages variational distribution inference, exhibits superior robustness and greater performance stability in scenarios with limited training data.
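For readers unfamiliar with the episodic protocol, the following sketch shows how a single N-way K-shot meta-task could be sampled from such a dataset. It simplifies the multi-label setting by indexing each sampled text under one of its aspect labels; actual FewAsp episodes retain the full label sets of each sentence.
```python
import random

def sample_episode(data, n_way=5, k_shot=5, n_query=5):
    """Sample one meta-task in the N-way K-shot protocol.
    data: dict mapping each aspect category to a list of the texts annotated
    with it (each category is assumed to hold >= k_shot + n_query texts)."""
    classes = random.sample(list(data), n_way)
    support, query = [], []
    for c in classes:
        texts = random.sample(data[c], k_shot + n_query)
        support += [(t, c) for t in texts[:k_shot]]
        query += [(t, c) for t in texts[k_shot:]]
    return classes, support, query
```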

4.2. Few-Shot Multi-Label Intent Detection

Multi-label intent detection [67] is a core component of task-oriented dialogue systems, aiming to identify multiple potential intent labels from a single user utterance. Compared to traditional single-label detection tasks, multi-label intent recognition is more aligned with real user expressions but also presents greater modeling challenges. Figure 7 illustrates a typical example of multi-label intent detection. In recent years, this task has been widely regarded as a multi-label classification problem and has attracted considerable attention [68,69]. However, in practical applications, it still faces two major challenges: (1) the frequent emergence of new intent labels in rapidly evolving business domains, leading to extremely limited annotated data; and (2) existing methods, when modeling the complex relationships between multiple labels, typically rely on a unified semantic representation, which fails to adequately capture the inherent dependencies and semantic differences between labels, resulting in prediction uncertainty.
The challenges outlined above have prompted increasing attention toward few-shot multi-label intent detection, which aims to accurately model and predict multiple potential intent labels under severely limited supervision. In this context, Hou et al. [70] conducted one of the earliest studies in few-shot multi-label classification. To address the difficulties of threshold estimation and label representation confusion in low-resource settings, they proposed the meta-calibrated threshold (MCT) mechanism. This method combines prior domain knowledge with domain-adaptive kernel regression calibration to enable more effective threshold estimation. Additionally, they introduced anchored label representation (ALR), which leverages intent label embeddings as anchors to construct more discriminative label representations, thereby reducing representation confusion and improving the accuracy of label-sample relevance scoring. However, this approach does not fully mitigate the influence of irrelevant labels present in individual utterances. To address this limitation, Zhang et al. [71] proposed HCC-FSML, a method based on hybrid correlation computation. It incorporates instance-level and feature-level attention mechanisms to suppress negative label noise in both the support and query sets, enabling the extraction of more accurate label prototypes and similarity scores. This method effectively reduces the semantic interference arising from the inherent complexity of multi-label intent expressions. Nevertheless, Hou et al.’s [70] framework remains limited in its capacity to model semantic representations of multi-label utterances and to capture intra-class and inter-class relations. To address this challenge, Zhang et al. [72] proposed the dual-class knowledge propagation network (DCKPN), which incorporates a label semantic enhancement module to embed label name information and generate more discriminative representations. The model employs a dual-layer graph neural network at the instance and class levels to capture both sample-level feature propagation and inter-label semantic dependencies, thereby improving representation precision. Furthermore, an adaptive intent count prediction module is integrated to enable effective multi-label intent detection.
As shown in Table 10, user intents are typically composed of two primary components: an action and an object—commonly reflected in the verb–noun structure. However, existing approaches have largely overlooked this inherent compositionality. A more effective strategy would involve explicitly extracting and modeling the semantic information of both components to improve intent recognition accuracy. Motivated by this, Zhang et al. [73] addressed two core challenges in few-shot multi-label intent detection: (1) the difficulty of distinguishing the semantic structure of intents composed of actions and objects, and (2) the underutilization of hierarchical relationships among intent labels. To this end, they proposed LHS (label hierarchy-aware soft prompting), a prompt-learning framework that incorporates label hierarchies. LHS enhances sentence representations by generating label-enriched textual prompts via GPT-4 and models the hierarchical structure of action–object pairs using a graph neural network. Furthermore, it integrates a soft prompting mechanism with prototypical contrastive learning to improve the model’s discriminative capability. In addition, to address the problems of data scarcity and the challenges of cross-domain and multilingual adaptation, a coarse-to-fine prototypical learning (CFPL) approach was proposed by Zhang et al. [74]. This method enriches support set representations by augmenting them with label synonym sets to enhance semantic expressiveness. A knowledge distillation-based prototype refinement mechanism is designed to transfer coarse-grained domain knowledge from a general teacher model into fine-grained, class-specific prototypes, thereby improving class separability. Moreover, efficient fine-tuning with cross-lingual teacher models enables rapid adaptation in multilingual settings.
Recent studies have proposed innovative solutions to few-shot multi-label intent detection from the perspective of prompt learning. Addressing two central challenges, namely the difficulty of threshold setting under limited supervision and the insufficient modeling of intent label dependencies, Zhou et al. [20] introduced a two-stage prompt-based fine-tuning (PFT) method. This approach reformulates threshold estimation as an intent quantity prediction task. In the first stage, a quantity-prompting template is used to estimate the number of intents present; in the second stage, intent-prompting templates are constructed accordingly, where multiple [MASK] tokens are inserted and filled using a pre-trained language model (PLM) via masked language modeling to identify specific intent labels. To better model inter-intent dependencies, the authors further proposed a multi-view multi-head self-attention (MSA) mechanism, which effectively captures complex relationships among intents. In parallel, Zhuang et al. [21] proposed PLMA, a prompt-learning framework enhanced by large language models (LLMs). This method leverages the complementary strengths of large and small language models by using the LLM to extract intent spans and expand the label space, thereby improving semantic understanding of both the query and intent labels. Additionally, they developed an improved question-answering-style fine-tuning procedure with efficient one-step training and inference templates. These two approaches offer effective solutions to the challenges of few-shot multi-label intent detection, making substantial progress in prompt structure optimization, the modeling of intent dependencies, and the enhancement of semantic understanding.
To systematically compare the performance of different models on the few-shot multi-label intent detection task, we summarize the experimental results of various methods in Table 11 and Table 12. Since the datasets used by the PFT and PLMA methods differ, their results are not included. All experiments were conducted under consistent settings to ensure fairness and reproducibility: the same embedding models (Electra-small and BERT-base) and datasets (TourSG and StanfordLU, detailed in Section 5) were used, with evaluations performed under 1-shot and 5-shot few-shot multi-label scenarios (i.e., only one or five labeled samples per intent were provided for training). Micro-F1 was adopted as the unified evaluation metric, and boldface in the tables indicates the best performance under the current setting.
From the tables, it is evident that on the TourSG dataset, the LHS method achieves the best performance regardless of the embedding model used, indicating its stronger adaptability and robustness for tourism-related tasks. On the StanfordLU dataset, the LHS method also attains optimal results when Electra-small is employed as the embedding model. Notably, the CFPL method was evaluated only on the StanfordLU dataset using the BERT-base embedding model; under this specific setting, it demonstrates superior performance, outperforming other methods across nearly all metrics, highlighting its strong potential when supported by high-quality embeddings. However, since CFPL has not been assessed under other configurations, its generalizability to broader scenarios remains inconclusive. These experimental results preliminarily validate the stability and wide applicability of the LHS method across various embedding models and datasets, while the exceptional performance of CFPL in limited settings calls for further comprehensive empirical investigation to substantiate its overall superiority.

4.3. Few-Shot Multi-Label Hierarchical Text Classification

Multi-label hierarchical text classification is a complex and challenging task in the field of text classification. It requires models not only to assign multiple relevant labels to a given text but also to account for the hierarchical structure among those labels. As illustrated in Figure 8, under a predefined label taxonomy, multi-label hierarchical text classification models must predict multiple labels that typically follow a path from coarse-grained concepts to fine-grained semantics, progressively refining the categorization from higher-level to more specific labels [75]. In general, fine-grained labels are more closely aligned with the detailed semantics of the text, offering greater precision, whereas coarse-grained labels often serve as parent nodes in the hierarchy and reflect broader semantic categories. The task is of high practical relevance and is widely applied in areas such as product categorization, news recommendation, and biomedical literature organization.
Compared with conventional multi-label classification, multi-label hierarchical text classification incorporates structured prior knowledge into label modeling, requiring the predicted label set to maintain path-level hierarchical consistency. This significantly increases the complexity of the task. In practical applications, however, the intricate nature of label hierarchies and the demand for cross-domain transfer often lead to prohibitively high costs for obtaining sufficient annotated data. As a result, many label paths suffer from extremely limited supervision, with some labels having only a few or even zero annotated examples. Consequently, the problem of multi-label hierarchical text classification under few-shot conditions has increasingly attracted the attention of researchers.
HiMatch [76] represents one of the earliest efforts to explicitly model hierarchical label semantics in multi-label classification. The method formulates the task as a multi-label problem and enhances the model's understanding of label hierarchies by incorporating label semantics. Its core design involves a hierarchy-aware semantic matching network, which aligns label embeddings with text representations to support top-down, layer-wise label prediction, thereby effectively exploiting hierarchical information. Although not originally designed for few-shot scenarios, HiMatch introduces a low-resource label partition on the EURLEX-57K dataset (with no more than 50 training instances per class) and demonstrates strong structural modeling capabilities and robustness under this setting. This foundational work has inspired further research on few-shot multi-label hierarchical text classification. In subsequent research, Ji et al. [77] proposed HierVerb, a multi-prompt-driven hierarchical path modeling approach. This model reformulates the task as a path-level prediction problem and introduces a hierarchical verbalizer that employs distinct prompt templates at each layer of the label hierarchy to guide pre-trained language models in generating labels step by step. By incorporating path-level sampling and a path-consistency metric, HierVerb significantly enhances the model's ability to capture hierarchical label structures while ensuring both semantic precision and structural consistency. To further improve hierarchical modeling in few-shot scenarios, Ji et al. [78] introduced HierICRF, a unified framework designed to address the challenge of transferring unstructured semantics from pre-trained language models to structured label hierarchies. By combining a hierarchical reasoning chain with an iterative conditional random field (ICRF), the model performs layer-wise label generation and path correction. HierICRF is compatible with various pre-trained language models (e.g., BERT, T5) and demonstrates superior performance across multiple few-shot settings, achieving state-of-the-art results, particularly in terms of path and semantic consistency.
In multi-label hierarchical text classification, static threshold setting and label imbalance are common challenges, particularly in few-shot scenarios. To address this, Kim et al. [79] introduced the H2B (hierarchy-aware biased bound) loss function, which replaces the traditional fixed thresholds with learnable dynamic thresholds. This approach incorporates positive and negative label biases during the training phase to adaptively adjust the thresholds, alleviating the difficulty of recognizing underrepresented labels. Experimental results demonstrate that H2B enhances accuracy and robustness in multi-label tasks with sparse data and complex label hierarchies. Additionally, Chen et al. [80] proposed a retrieval-based contextual learning framework leveraging large language models to address the complexities of label hierarchies and sample scarcity in few-shot multi-label classification. This method constructs a retrieval database with hierarchy-aware label representations for hierarchical text classification (HTC) and employs continuous training of pre-trained language models, incorporating masked language modeling (MLM), hierarchical classification (CLS), and divergent contrastive learning (DCL) objectives to generate more effective label representations. Furthermore, Chen et al. designed a hierarchical iterative prediction strategy that generates labels layer-by-layer, significantly reducing the candidate label space. In the task of fine-grained event identification in aviation accident reports, Zhao et al. [81] proposed a hierarchical multi-label classification method incorporating the BERT model to address challenges such as large label sets and the difficulty in identifying underrepresented categories. By utilizing the NTSB event hierarchy, they designed three key mechanisms: a hierarchical attention mechanism to guide coarse-grained information to fine-grained classification, recursive regularization to strengthen parameter sharing across label hierarchies, and a label distribution penalty term to mitigate the challenges of rare label recognition. This method significantly improved the accuracy of rare labels, validating the effectiveness of hierarchical label information under low-resource conditions.
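A recurring ingredient in these systems is layer-by-layer decoding that keeps predictions path-consistent and shrinks the candidate label space at each level. The sketch below shows one such decoding loop under the assumption of precomputed per-label probabilities and an explicit parent-child map; it mirrors the general strategy rather than the exact procedure of any single paper.
```python
def topdown_decode(probs, children, root="ROOT", threshold=0.5):
    """Layer-by-layer decoding: a label is emitted only if its parent was
    emitted, so the predicted set always forms consistent root-to-leaf paths
    and the candidate space shrinks at every level."""
    kept, frontier = [], [root]
    while frontier:
        frontier = [c for parent in frontier
                    for c in children.get(parent, [])
                    if probs.get(c, 0.0) >= threshold]
        kept.extend(frontier)
    return kept

# toy taxonomy and per-label probabilities
children = {"ROOT": ["Science", "Sports"], "Science": ["Physics", "Biology"]}
probs = {"Science": 0.9, "Sports": 0.2, "Physics": 0.7, "Biology": 0.3}
print(topdown_decode(probs, children))  # ['Science', 'Physics']
```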

5. Commonly Used Datasets

Table 13 summarizes several widely used benchmark datasets for few-shot multi-label text classification. These datasets span various domains and text types and are frequently adopted for evaluating and comparing the performance of emerging methods due to their diversity and representativeness. In the table, $N$ refers to the total number of instances, $L$ denotes the number of unique labels, and $L_C$ represents the label cardinality (i.e., the average number of labels per sample). These statistics offer a comprehensive view of dataset scale and label distribution, serving as a valuable reference for experimental setup and algorithmic evaluation. They also facilitate deeper analysis of a model's generalization ability and adaptability under few-shot multi-label conditions.
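As a quick aid to reading the table, the label cardinality $L_C$ is simply the mean number of labels per instance, e.g.:
```python
def label_cardinality(label_sets):
    """L_C: average number of labels per instance."""
    return sum(len(s) for s in label_sets) / len(label_sets)

print(label_cardinality([{"a", "b"}, {"a"}, {"b", "c", "d"}]))  # 2.0
```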
AmazonCat-13K [82]: This dataset is derived from product data on the Amazon platform and is used for multi-label text classification tasks. Each product is associated with a title and description, along with one or more category labels. It contains approximately one million samples and 13,330 labels, with an average of about 5.04 labels per sample.
MIMIC-III [83]: A large-scale publicly available intensive care database containing clinical data of patients admitted to the intensive care unit (ICU) at Beth Israel Deaconess Medical Center in Boston, USA, from 2001 to 2012. This dataset encompasses a wide range of information, including vital signs, medication records, laboratory test results, imaging reports, diagnosis codes, and nursing notes, featuring rich content and diverse structure. Due to its extensive temporal coverage, comprehensive data scope, and open-access nature, MIMIC-III has become a critical resource in clinical research, industrial applications, and medical education, with particular significance in intensive care studies.
MIMIC-II [84]: A publicly available intensive care research database that collects clinical and physiological waveform data from four adult ICUs (MICU, SICU, CCU, CSRU) at Beth Israel Deaconess Medical Center between 2001 and 2008. This dataset includes patient demographic information, medication usage, laboratory tests, nursing notes, and high-resolution (125 Hz) physiological waveforms such as electrocardiograms and blood pressure. Due to its retention of real-world clinical noise and artifacts, it serves as an important resource for multidisciplinary research.
AAPD [85]: This dataset comprises 55,840 academic paper abstracts from the field of computer science, each annotated with multiple topic labels. The dataset contains a total of 54 labels, with an average of 2.41 labels assigned per paper. It is well-suited for multi-label classification research, particularly for exploring complex relationships between textual content and associated labels.
RCV1 [86]: Provided by Reuters, this raw text classification dataset contains over 800,000 manually annotated news articles, covering three category sets: Topics, Industries, and Regions. However, the original version suffers from several issues due to encoding errors and inconsistent expansion of hierarchical structures. For example, some documents lack essential label information, or their hierarchical label structures are incompletely unfolded. These problems may impact subsequent classification research and model training. Consequently, a revised version, RCV1-v2, was introduced to address these deficiencies.
EUR-Lex [87]: A large-scale text classification dataset consisting of European Union legal documents, widely used in multi-label text classification research. The dataset contains 19,596 legal texts covering various document types such as treaties, regulations, case law, and legislative proposals. It provides three multi-label classification schemes, among which the EUROVOC taxonomy is the most complex. EUROVOC is a hierarchical thesaurus comprising 3993 labels, with each document associated with an average of 5.37 labels, exhibiting high label sparsity and diversity.
Wiki10-31K [88]: This dataset is a large-scale corpus designed for multi-label classification research, derived from English Wikipedia articles and enriched with user-provided tags from the social bookmarking platform Delicious. It includes 20,764 Wikipedia articles, each annotated by at least 10 users. All tags have undergone preprocessing and cleaning to retain representative open vocabulary labels. A key feature of this dataset is that labels carry weight attributes reflecting the frequency of user annotations, thereby more accurately representing the importance and relevance of each label.
DBPedia [89]: DBPedia is a large-scale, multilingual, and open knowledge graph constructed by extracting structured knowledge from Wikipedia. The project covers 111 language editions of Wikipedia and employs crowdsourcing to map information boxes from different languages onto a shared ontology, enabling cross-lingual knowledge integration. It is widely used in research tasks such as the semantic web, knowledge representation, question answering, text classification, and information extraction.
TourSG [90]: TourSG is a publicly available dataset designed for multi-label intent recognition and dialog state tracking (DST) research. Initially released as part of the Dialog State Tracking Challenge (DSTC-4), this dataset is derived from interaction logs between real users and a guide-style voice dialog system. It is built around the tourism scenario of Singapore, covering six service domains: itinerary, accommodation, attractions, food, transportation, and shopping. TourSG contains a total of 25,751 multi-turn user utterances, each annotated with multiple intent labels. The dataset also provides multi-hypothesis outputs for automatic speech recognition (ASR) and spoken language understanding (SLU). Organized in JSON format, it includes dialog content, semantic annotations, and baseline system outputs, making it suitable for research in multi-label intent recognition, few-shot learning, and cross-domain generalization evaluation in dialog systems.
StanfordLU [91]: StanfordLU is a multi-domain task-oriented dialogue dataset, re-annotated and expanded based on the original Stanford Dialogue Dataset. It contains 3031 multi-turn dialogues and 8038 user utterances, covering three domains: scheduling, point-of-interest navigation, and weather queries. The dialogues were primarily collected using the Wizard-of-Oz methodology to simulate natural interactions between users and in-car assistants. Each utterance is annotated with multi-label intents and aligned semantically with a knowledge base to support dynamic information retrieval.
EURLEX57K [92]: A publicly available large-scale multi-label text classification dataset comprising 57,000 English legislative documents sourced from the European Union legal portal EUR-Lex. Each document is annotated with multiple labels derived from the EUROVOC legal thesaurus, totaling 4271 unique labels, of which only 2049 appear more than 10 times, exhibiting a pronounced long-tail distribution. The dataset supports tasks such as conventional classification, few-shot learning, and zero-shot learning. The average label cardinality is 5.07. Documents are segmented into four structured sections—title, introduction, main body, and annex—facilitating studies on the impact of document structure on classification performance.
FewAsp (multi) [58]: FewAsp (multi) is a few-shot aspect category detection dataset constructed based on Yelp [93] multi-domain reviews, focusing on modeling sentences with multiple aspects. The dataset is selected from user reviews involving multiple aspect labels in scenarios such as restaurants, hotels, and beauty spas, retaining only active users who have written at least 10 reviews to ensure data quality. It is divided into training (64 categories), validation (16 categories), and test (20 categories) sets across 100 aspect categories. The data is randomly sampled to reflect the multi-label distribution encountered in real-world applications, making it well-suited for research in few-shot multi-label learning tasks.
RCV1-v2 [86]: RCV1-v2 is a revised version of the original RCV1 dataset, developed to improve data consistency and annotation quality. This version systematically cleans and optimizes the original dataset by removing documents that violate encoding standards, completing missing hierarchical labels, and correcting erroneous regional tags. For example, the hierarchical labels within the Topics category were fully expanded, and some invalid or non-compliant documents were removed, thereby enhancing the dataset’s suitability and reliability for multi-label text classification tasks.
WOS [94]: This dataset is collected from the Web of Science and comprises 46,985 published paper abstracts along with their associated disciplines and keywords. It employs a two-level label structure: the first level includes seven major academic fields (such as Computer Science and Medical Science), and the second level consists of 134 specific subfields (e.g., 17 subfields under Computer Science and 53 subfields under Medical Science). The text classification task uses paper abstracts as input, with the first-level labels representing academic fields and the second-level labels corresponding to keyword-described subfields.

6. Commonly Used Evaluation Metrics

In multi-label text classification tasks, evaluating model performance is a crucial undertaking. Unlike traditional single-label classification, multi-label classification allows each sample to be associated with multiple category labels simultaneously, necessitating more nuanced and comprehensive evaluation metrics to accurately assess model effectiveness. This need is particularly pronounced in few-shot learning scenarios, where the extremely limited training data make the model’s generalization ability and robustness key dimensions for evaluation. Conventional classification metrics such as accuracy, precision, recall, and F1-score require extension and adaptation to account for the characteristics of multi-label settings. To more comprehensively reflect model performance at different levels, researchers have proposed various evaluation methods tailored to multi-label contexts. These metrics assess both the overall prediction quality on a per-instance basis and the classification performance across individual labels. This paper categorizes commonly used multi-label classification evaluation metrics into two groups: instance-based metrics and label-based metrics. The former evaluate model performance from the perspective of each sample, while the latter measure the model’s aggregate performance on each label.

6.1. Instance-Based Evaluation Metrics

Instance-based evaluation metrics focus on the model’s predictive performance for each individual test sample. These metrics assess the degree of overlap between the predicted label set and the true label set for each sample and then average the scores across all samples to obtain an overall measure of model performance. The key advantage of this category lies in its intuitiveness and localized focus: it directly reflects the model’s prediction quality at the level of specific samples (local performance), rather than merely capturing global trends across the entire dataset. Commonly used metrics include accuracy, precision, recall, and F1-score, which are typically defined based on the intersection and union of predicted and true label sets. However, these metrics have limitations, such as sensitivity to imbalanced label distributions or situations where the model frequently predicts empty label sets.
(1) Accuracy
Accuracy measures the overall correctness of a model’s label predictions in multi-label classification tasks. It is typically defined as the average, across all samples, of the ratio between the number of correctly predicted labels and the size of the union of predicted and true labels for each sample. In few-shot settings, accuracy provides an initial indication of model performance but can be sensitive to individual samples. Therefore, it is often used in conjunction with precision, recall, and other metrics to provide a more comprehensive evaluation. Higher accuracy values indicate better model performance.
$$\mathrm{Accuracy} = \frac{1}{t}\sum_{i=1}^{t}\frac{|Z_i \cap Y_i|}{|Z_i \cup Y_i|}$$
where $t$ denotes the total number of samples in the test set, $Z_i$ represents the true label set of the $i$-th sample, and $Y_i$ denotes the predicted label set for the $i$-th sample.
(2) Precision
Precision measures the accuracy of a model’s positive label predictions in multi-label classification tasks. Specifically, for each sample, precision is calculated as the ratio of correctly predicted positive labels to the total number of predicted positive labels and then averaged over all samples. In few-shot scenarios, precision helps evaluate the model’s ability to avoid false positives and is often used alongside recall to provide a more comprehensive assessment of model performance. Higher values indicate better performance.
$$\mathrm{Precision} = \frac{1}{t}\sum_{i=1}^{t}\frac{|Z_i \cap Y_i|}{|Y_i|}$$
where $t$ denotes the total number of samples in the test set, $Z_i$ represents the true label set of the $i$-th sample, and $Y_i$ denotes the predicted label set for the $i$-th sample.
(3) Recall
Recall measures a model’s ability to identify all relevant labels in a multi-label classification task, representing the proportion of true positive labels that are successfully predicted. Specifically, for each sample, recall is calculated as the ratio of correctly predicted labels to the total number of true labels for that sample and then averaged over all samples. In few-shot settings, recall reflects the model’s capability to detect all relevant labels, helping to assess missed detections. It is typically used alongside precision to provide a comprehensive evaluation of model performance. Higher values indicate better model performance.
$$\mathrm{Recall} = \frac{1}{t}\sum_{i=1}^{t}\frac{|Z_i \cap Y_i|}{|Z_i|}$$
where $t$ denotes the total number of samples in the test set, $Z_i$ represents the true label set of the $i$-th sample, and $Y_i$ denotes the predicted label set for the $i$-th sample.
(4) F1-Score
The F1-score is the harmonic mean of precision and recall, used to comprehensively evaluate the overall performance of a model in multi-label classification tasks. This metric balances precision and recall, making it particularly suitable for scenarios with label imbalance or where both false positives and false negatives are critical. In few-shot settings, the F1-score provides a more holistic reflection of the model’s classification ability under limited data, serving as an important complement to accuracy, precision, and recall. Higher values indicate better model performance.
$$F1 = \frac{1}{t}\sum_{i=1}^{t}\frac{2\,|Z_i \cap Y_i|}{|Z_i| + |Y_i|}$$
where $t$ denotes the total number of samples in the test set, $Z_i$ represents the true label set of the $i$-th sample, and $Y_i$ denotes the predicted label set for the $i$-th sample.
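The four instance-based formulas above translate directly into set operations; a minimal sketch follows (with the convention that empty denominators contribute 0):
```python
def instance_based_metrics(true_sets, pred_sets):
    """Accuracy, precision, recall, and F1 averaged over samples, following
    the four formulas above; empty denominators contribute 0 by convention."""
    t = len(true_sets)
    acc = pre = rec = f1 = 0.0
    for Z, Y in zip(true_sets, pred_sets):
        inter = len(Z & Y)
        acc += inter / len(Z | Y) if Z | Y else 0.0
        pre += inter / len(Y) if Y else 0.0
        rec += inter / len(Z) if Z else 0.0
        f1 += 2 * inter / (len(Z) + len(Y)) if Z or Y else 0.0
    return acc / t, pre / t, rec / t, f1 / t

true_sets = [{"sports", "politics"}, {"tech"}]
pred_sets = [{"sports"}, {"tech", "health"}]
print(instance_based_metrics(true_sets, pred_sets))  # (0.5, 0.75, 0.75, 0.667)
```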
(5) Top-$k$ Precision ($P@k$)
$P@k$ is a commonly used evaluation metric in few-shot and zero-shot multi-label text classification tasks, measuring the accuracy of the model within the top $k$ predicted labels. Specifically, $P@k$ indicates the proportion of the top $k$ predicted labels with the highest confidence scores that belong to the true label set of the sample. Its value lies in $[0, 1]$, where a higher value signifies better accuracy in high-confidence predictions and better ranking quality of the predicted results.
$$P@k = \frac{1}{k}\sum_{i \in \mathrm{rank}_k(z)} y_i$$
where $y \in \{0,1\}^{L}$ represents the vector of true labels, and $\mathrm{rank}_k(z)$ represents the indices of the top $k$ elements of the predicted score vector $z$ sorted in descending order.
(6) Top-$k$ Normalized Discounted Cumulative Gain ($nDCG@k$)
$nDCG@k$ is a widely used ranking evaluation metric in multi-label classification and information retrieval tasks, designed to assess the quality of the model's predictions within the top $k$ predicted labels. This metric considers not only the relevance of the predicted labels but also their positions in the ranked list. The calculation involves two steps: first, computing the discounted cumulative gain ($DCG@k$), which assigns higher weights to relevant labels appearing at higher ranks; and second, normalizing this value by the ideal $DCG@k$ to obtain $nDCG@k$, thereby mitigating the effects of varying numbers of relevant labels across samples. The metric lies in $[0, 1]$, with higher values indicating that the model not only identifies relevant labels within high-confidence predictions but also ranks them more appropriately. $nDCG@k$ is particularly important for evaluating the ranking capability of models in multi-label prediction scenarios.
$$DCG@k = \sum_{i \in \mathrm{rank}_k(z)} \frac{y_i}{\log(i+1)}$$
$$nDCG@k = \frac{DCG@k}{\sum_{i=1}^{\min(k,\,\|y\|_0)} \frac{1}{\log(i+1)}}$$
where $y \in \{0,1\}^{L}$ represents the vector of true labels, $\mathrm{rank}_k(z)$ represents the indices of the top $k$ elements of the predicted score vector $z$ sorted in descending order, and $\|y\|_0$ denotes the number of true labels.
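Both ranking metrics follow directly from their definitions once the labels are sorted by predicted score; the sketch below adopts the common base-2 convention for the logarithmic discount:
```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scored labels that are true labels."""
    topk = np.argsort(scores)[::-1][:k]
    return y_true[topk].sum() / k

def ndcg_at_k(y_true, scores, k):
    """DCG over the top-k ranks, normalized by the ideal DCG."""
    topk = np.argsort(scores)[::-1][:k]
    dcg = (y_true[topk] / np.log2(np.arange(2, k + 2))).sum()
    ideal = (1.0 / np.log2(np.arange(2, min(k, int(y_true.sum())) + 2))).sum()
    return dcg / ideal

y_true = np.array([1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
print(precision_at_k(y_true, scores, 3))  # 0.667
print(ndcg_at_k(y_true, scores, 3))       # ~0.92
```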
(7) Top-$k$ Propensity-Scored Precision ($PSP@k$)
$PSP@k$ is a crucial evaluation metric for extreme multi-label classification tasks, designed to assess a model's ability to predict tail (rare) labels. This metric weights predictions by the inverse propensity score, a function of each label's occurrence frequency, thereby diminishing the influence of frequent head labels and emphasizing rare tail labels. By computing the weighted proportion of correct predictions, $PSP@k$ provides a more objective reflection of model performance under long-tailed label distributions, making it widely used for evaluating long-tail label effectiveness in multi-label text classification tasks.
$$PSP@k = \frac{1}{k}\sum_{i \in \mathrm{rank}_k(z)} \frac{y_i}{p_i}$$
where $p_i$ denotes the propensity score of the $i$-th ranked label; the remaining notation is as above.
(8) Top-$k$ Recall ($R@k$)
Recall at $k$ ($R@k$) is a widely used metric in few-shot and zero-shot multi-label classification tasks, designed to evaluate a model's ability to cover the true labels within its top $k$ highest-confidence predictions. Compared to precision at $k$ ($P@k$), $R@k$ emphasizes the coverage of all true labels within a limited prediction range, making it particularly suitable for scenarios with few labels or long-tailed label distributions. Additionally, $R@k$ does not rely on a fixed threshold, allowing direct performance evaluation when model outputs are ranked scores. Its value lies in $[0, 1]$, with higher values indicating a stronger capability to capture the true labels within the top $k$ predictions.
$$R@k = \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap \hat{Y}_i^{(k)}|}{|Y_i|}$$
where $N$ denotes the total number of samples in the test set, $Y_i$ represents the ground-truth label set of the $i$-th sample, and $\hat{Y}_i^{(k)}$ denotes the top $k$ predicted labels for the $i$-th sample, ranked by the model's confidence scores.
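The two top-$k$ metrics above admit equally short implementations; the propensity scores $p_i$ are assumed to be given (in practice they are typically estimated as a smooth function of label frequency, as noted above):
```python
import numpy as np

def psp_at_k(y_true, scores, propensities, k):
    """Propensity-scored precision: a correct rare label (small p_i)
    contributes more than a correct frequent one."""
    topk = np.argsort(scores)[::-1][:k]
    return (y_true[topk] / propensities[topk]).sum() / k

def recall_at_k(y_true, scores, k):
    """Fraction of the true labels recovered among the top-k predictions."""
    topk = np.argsort(scores)[::-1][:k]
    return y_true[topk].sum() / y_true.sum()
```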

6.2. Label-Based Evaluation Metrics

Label-based evaluation metrics assess model performance by analyzing predictions across the entire dataset from the perspective of each individual label. These metrics provide a comprehensive view of how well the model performs on different labels, making them particularly suitable for multi-label classification tasks. They are especially valuable in scenarios with a large number of labels or imbalanced label distributions. By computing evaluation scores for each label and aggregating them via macro- or micro-averaging, label-based metrics effectively capture the model’s ability to recognize minority labels, offering targeted insights for model refinement.
(1) Micro-F1
The micro F1-score is a widely used performance metric in multi-label classification, particularly suited for scenarios with imbalanced sample or label distributions. It treats all label predictions across the dataset as a single aggregated set, computing global precision and recall based on the total number of true positives, false positives, and false negatives. The final F1-score is then derived from these global statistics, providing a holistic measure of overall model performance.
$$\text{Micro-}F1 = \frac{2 \cdot TP_{\mathrm{micro}}}{2 \cdot TP_{\mathrm{micro}} + FP_{\mathrm{micro}} + FN_{\mathrm{micro}}}$$
where $TP$ (true positives) denotes the number of instances correctly predicted as positive, $FP$ (false positives) refers to instances incorrectly predicted as positive, and $FN$ (false negatives) indicates instances that were incorrectly predicted as negative despite being truly positive; the subscript "micro" indicates that these counts are aggregated over all labels before the score is computed.
(2) Macro-F1
The macro F1-score measures the average performance of a model across all labels by first computing the F1-score for each label individually and then taking their unweighted mean. As each label contributes equally, regardless of its frequency, this metric offers a more balanced evaluation in datasets with skewed label distributions. It is particularly useful for assessing a model’s ability to handle long-tail, infrequent, or even zero-shot labels, where conventional metrics may obscure deficiencies in minority class performance.
$$\text{Macro-}F1 = \frac{1}{L}\sum_{i=1}^{L}\frac{2 \cdot TP_i}{2 \cdot TP_i + FP_i + FN_i}$$
where $L$ denotes the total number of labels, $TP_i$ represents the number of true positives for the $i$-th label, $FP_i$ refers to the number of false positives for the $i$-th label, and $FN_i$ refers to the number of false negatives for the $i$-th label.
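The contrast between the two averaging schemes is easiest to see side by side; the following sketch pools the counts globally for Micro-F1 and averages per-label F1 scores for Macro-F1, matching the behavior of scikit-learn's f1_score with average='micro' and average='macro' up to zero-division handling:
```python
import numpy as np

def micro_macro_f1(Y_true, Y_pred):
    """Micro-F1 pools TP/FP/FN over all labels; Macro-F1 averages per-label
    F1 so rare labels weigh as much as frequent ones.
    Y_true, Y_pred: (n_samples, n_labels) binary 0/1 integer matrices."""
    tp = (Y_true & Y_pred).sum(axis=0)
    fp = ((1 - Y_true) & Y_pred).sum(axis=0)
    fn = (Y_true & (1 - Y_pred)).sum(axis=0)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)  # 0/0 treated as 0
    return micro, per_label.mean()
```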
(3) AUC
The area under the curve (AUC) quantifies a model’s overall ability to distinguish between positive and negative samples across different classification thresholds. Specifically, the AUC represents the probability that the model assigns a higher score to a randomly selected positive instance than to a randomly selected negative one. Its values range from 0 to 1, with values closer to 1 indicating superior discriminative performance. Because AUC jointly considers the true positive rate (TPR) and false positive rate (FPR) without relying on any particular classification threshold, it remains robust and generalizes well even under imbalanced class distributions.
$$AUC = \int_{0}^{1} TPR\left(FPR^{-1}(t)\right)\,dt$$
Here, the true positive rate (TPR) and false positive rate (FPR) are defined as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
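In practice, the integral is rarely evaluated directly; the AUC is usually computed from its rank interpretation, the probability that a random positive is scored above a random negative. A minimal sketch:
```python
import numpy as np

def auc(y_true, scores):
    """Rank-form AUC: the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count 0.5)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([1, 1, 0, 0, 0])
s = np.array([0.9, 0.4, 0.5, 0.3, 0.1])
print(auc(y, s))  # 0.833: 5 of the 6 positive-negative pairs are ordered correctly
```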

6.3. Summary

To provide a more comprehensive evaluation of the application of various metrics in multi-label text classification under few-shot learning scenarios, Table 14 summarizes the evaluation metrics used by all relevant models and algorithms reported in the literature, where “√” indicates that the corresponding metric is used in the cited work. In the table, Acc denotes accuracy, Pre represents precision, and R indicates recall.
As shown in Table 14, Micro-F1, Macro-F1, $P@k$, and $nDCG@k$ are the most commonly used evaluation metrics for multi-label text classification under few-shot learning scenarios. These metrics provide a more accurate reflection of the model's performance in the context of limited data, offering a more nuanced evaluation perspective, particularly when the sample size is small. Specifically, Micro-F1 and Macro-F1 offer a comprehensive performance evaluation across different label categories, $P@k$ assesses the classification accuracy for the top-$k$ labels, and $nDCG@k$ further incorporates ranking and relevance, providing a more holistic analysis of the model's performance. Therefore, selecting appropriate evaluation metrics is crucial for multi-label text classification tasks in few-shot settings, as it aids researchers in better understanding and comparing the strengths and weaknesses of different models, thereby offering more effective guidance for model optimization.

7. Conclusions

Multi-label text classification under few-shot scenarios is a critical task in natural language processing, facing the dual challenges of sample scarcity and label complexity. In recent years, significant progress has been made in this field, driven by the development of pre-trained language models and the introduction of new paradigms such as prompt learning and meta-learning. This paper systematically reviews the current technological advancements: At the data level, generative augmentation and pseudo-labeling techniques effectively address issues related to data scarcity and long-tailed label distributions, enhancing data quality and classification performance. At the model level, methods such as transfer learning, prompt learning, metric learning, meta-learning, graph neural networks, and attention mechanisms have notably improved the adaptability and generalization ability of models. Additionally, this paper provides an in-depth analysis of the core concepts, technical features, and performance of these methods across different application scenarios.
Despite notable progress, current research still faces several key challenges: (1) Data-related issues: the scarcity of training samples, long-tail label distributions, and noise problems limit the model’s ability to learn and generalize low-frequency labels. (2) Insufficient transfer and generalization capabilities: pre-trained models exhibit unstable transfer performance, especially under complex label structures, with poor performance on tail labels. (3) Inadequate label structure modeling: existing methods fail to effectively capture the hierarchical structure and semantic relationships between labels, which undermines the model’s expressive power. (4) Limited utilization of external knowledge: particularly in specialized domains such as healthcare and law, there is a lack of effective mechanisms for integrating structured knowledge with contextual information. (5) Heavy reliance on large-scale external data: the over-reliance on manual design and large-scale databases restricts the applicability of these methods in resource-constrained scenarios.
Based on the aforementioned challenges, future research can focus on the following directions: (1) Designing lightweight models with high generalization capability: integrating few-shot learning and low-resource adaptive modeling to develop efficient, concise, and interpretable model architectures, thereby enhancing the model’s adaptability and generalization ability in practical applications. (2) Multidimensional and heterogeneous knowledge fusion: integrating multiple sources of knowledge such as knowledge graphs, contextual information, and label hierarchical structures to construct a unified modeling framework that combines both structural and semantic understanding, enhancing the model’s ability to comprehend complex label relationships. (3) Fusion of prompt learning and meta-learning: combining the strong transferability of pre-trained language models with the rapid adaptability of meta-learning to build a unified few-shot multi-label learning framework suitable for applications with dynamic label distributions. (4) Fine-grained label structure modeling and hierarchical supervision mechanisms: designing label structure modeling methods with semantic hierarchy awareness to effectively capture the complex relationships between labels, and enhancing the model’s ability to recognize these relationships with the help of hierarchical supervision signals. (5) Robustness modeling against noise: introducing adversarial training, noise detection, and correction mechanisms to improve the model’s robustness and stability in noisy environments.
In conclusion, multi-label text classification under few-shot scenarios is still developing rapidly; the task is complex, in wide demand, and holds vast application prospects. Future research should further explore the fusion of pre-trained language models with label structures and domain knowledge and aim to develop more efficient, generalizable, and interpretable classification models to meet the new requirements of intelligent text classification in real-world tasks.

Author Contributions

Conceptualization, W.H. and Q.F.; Writing—original draft preparation, W.H.; Writing—review and editing, W.H., Q.F. and K.Z.; Supervision, X.X., S.H. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the Research Program of the National University of Defense Technology under Grant No. ZK23-58, and in part by the National Natural Science Foundation of China under Grant No. 62402510.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
  2. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 2020, 53, 1–34. [Google Scholar] [CrossRef]
  3. Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5149–5169. [Google Scholar]
  4. Gao, T.; Fisch, A.; Chen, D. Making pre-trained language models better few-shot learners. arXiv 2020, arXiv:2012.15723. [Google Scholar]
  5. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  6. Falis, M.; Dong, H.; Birch, A.; Alex, B. Horses to zebras: Ontology-guided data augmentation and synthesis for ICD-9 coding. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland, 26 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 389–401. [Google Scholar]
  7. Zhou, Y.; Qin, Y.; Huang, R.; Chen, Y.; Lin, C.; Zhou, Y. Self-training improves few-shot learning in legal artificial intelligence tasks. Artif. Intell. Law 2024, 1–17. [Google Scholar]
  8. Zhang, D.; Li, T.; Zhang, H.; Yin, B. On data augmentation for extreme multi-label classification. arXiv 2020, arXiv:2009.10778. [Google Scholar]
  9. Xu, P.; Xiao, L.; Liu, B.; Lu, S.; Jing, L.; Yu, J. Label-specific feature augmentation for long-tailed multi-label text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 10602–10610. [Google Scholar]
  10. Xu, P.; Song, M.; Li, Z.; Lu, S.; Jing, L.; Yu, J. Taming Prompt-Based Data Augmentation for Long-Tailed Extreme Multi-Label Text Classification. In Proceedings of the ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 9981–9985. [Google Scholar]
  11. Zhao, Z.; Alzubaidi, L.; Zhang, J.; Duan, Y.; Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Syst. Appl. 2024, 242, 122807. [Google Scholar] [CrossRef]
  12. Rios, A.; Kavuluru, R. Neural transfer learning for assigning diagnosis codes to EMRs. Artif. Intell. Med. 2019, 96, 116–122. [Google Scholar] [CrossRef]
  13. Li, K.; Jing, L. Long-tailed Multi-label Text Classification via Label Co-occurrence-Aware Knowledge Transfer. In Proceedings of the 2022 European Conference on Natural Language Processing and Information Retrieval (ECNLPIR), Hangzhou, China, 19–21 July 2022; pp. 62–68. [Google Scholar]
  14. Wang, H.; Xu, C.; Mcauley, J. Automatic multi-label prompting: Simple and interpretable few-shot classification. arXiv 2022, arXiv:2204.06305. [Google Scholar]
  15. Livernoche, V.; Sujaya, V. A Reproduction of Automatic Multi-Label Prompting: Simple and Interpretable Few-Shot Classification. ML Reprod. Chall. 2022, 9, 33. [Google Scholar]
  16. Wei, L.; Li, Y.; Zhu, Y.; Li, B.; Zhang, L. Prompt tuning for multi-label text classification: How to link exercises to knowledge concepts? Appl. Sci. 2022, 12, 10363. [Google Scholar] [CrossRef]
  17. Hu, S.; Ding, N.; Wang, H.; Liu, Z.; Wang, J.; Li, J.; Wu, W.; Sun, M. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv 2021, arXiv:2108.02035. [Google Scholar]
  18. Yang, Z.; Wang, S.; Rawat, B.P.S.; Mitra, A.; Yu, H. Knowledge injected prompt based fine-tuning for multi-label few-shot icd coding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 25 February 2023; p. 1767. [Google Scholar]
  19. Yang, Z.; Kwon, S.; Yao, Z.; Yu, H. Multi-label few-shot icd coding as autoregressive generation with prompt. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 5366–5374. [Google Scholar]
  20. Zhou, X.; Yang, L.; Wang, X.; Zhan, H.; Sun, R. Two stages prompting for few-shot multi-intent detection. Neurocomputing 2024, 579, 127424. [Google Scholar] [CrossRef]
  21. Zhuang, N.; Wei, X.; Li, J.; Wang, X.; Wang, C.; Wang, L.; Dang, J. A prompt learning framework with large language model augmentation for few-shot multi-label intent detection. In Proceedings of the ICASSP 2025–2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  22. Zhou, C.; Huang, B.; Ling, Y. A Chinese few-shot named-entity recognition model based on multi-label prompts and boundary information. Appl. Sci. 2025, 15, 5801. [Google Scholar] [CrossRef]
  23. Yang, W.; Li, J.; Fukumoto, F.; Ye, Y. HSCNN: A hybrid-siamese convolutional neural network for extremely imbalanced multi-label text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6716–6722. [Google Scholar]
  24. Csányi, G.M.; Vági, R.; Megyeri, A.; Fülöp, A.; Nagy, D.; Vadász, J.P.; Uveges, I. Can triplet loss be used for multi-label few-shot classification? A Case study. Information 2023, 14, 520. [Google Scholar] [CrossRef]
  25. Hui, B.; Liu, L.; Chen, J.; Zhou, X.; Nian, Y. Few-shot relation classification by context attention-based prototypical networks with BERT. EURASIP J. Wirel. Commun. Netw. 2020, 2020, 1–17. [Google Scholar] [CrossRef]
  26. Wang, X.; Du, Y.; Chen, D.; Li, X.; Chen, X.; Lee, Y.L.; Liu, J. Constructing better prototype generators with 3D CNNs for few-shot text classification. Expert Syst. Appl. 2023, 225, 120124. [Google Scholar] [CrossRef]
  27. Luo, S.; Zhang, R.; Pan, L.; Wu, Z. A multi-label few-shot instance-level attention prototypical network classification method. Trans. Beijing Inst. Technol. (Nat. Sci. Ed.) 2023, 43, 403–409. [Google Scholar]
  28. Xiao, L.; Xu, P.; Song, M.; Liu, H.; Jing, L.; Zhang, X. Triple alliance prototype orthotist network for long-tailed multi-label text classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2616–2628. [Google Scholar] [CrossRef]
  29. Kong, F.; Zhang, R.; Guo, X.; Chen, J.; Wang, Z. Preserving label correlation for multi-label text classification by prototypical regularizations. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 3300–3310. [Google Scholar]
  30. Rios, A.; Kavuluru, R. EMR coding with semi-parametric multi-head matching networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, LA, USA, 1–6 June 2018; p. 2081. [Google Scholar]
  31. Yuan, Z.; Tan, C.; Huang, S. Code synonyms do matter: Multiple synonyms matching network for automatic ICD coding. arXiv 2022, arXiv:2203.01515. [Google Scholar]
  32. Thrun, S.; Pratt, L. Learning to learn: Introduction and overview. In Learning to Learn; Springer: Berlin/Heidelberg, Germany, 1998; pp. 3–17. [Google Scholar]
  33. Jiang, X.; Havaei, M.; Chartrand, G.; Chouaib, H.; Vincent, T.; Jesson, A.; Chapados, N.; Matwin, S. Attentive task-agnostic meta-learning for few-shot text classification. In Proceedings of the ICLR 2019 Conference Blind Submission, New Orleans, LA, USA, 28 September 2018. [Google Scholar]
  34. Wang, R.; Su, X.A.; Long, S.; Dai, X.; Huang, S.; Chen, J. Meta-LMTC: Meta-learning for large-scale multi-label text classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8633–8646. [Google Scholar]
  35. Xiao, L.; Zhang, X.; Jing, L.; Huang, C.; Song, M. Does head label help for long-tailed multi-label text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 14103–14111. [Google Scholar]
  36. Zhou, F.; Qi, X.; Xiao, C.; Wang, J. MetaRisk: Semi-supervised few-shot operational risk classification in banking industry. Inf. Sci. 2021, 552, 1–16. [Google Scholar] [CrossRef]
  37. Teng, F.; Zhang, Q.; Zhou, X.; Hu, J.; Li, T. Few-shot ICD coding with knowledge transfer and evidence representation. Expert Syst. Appl. 2024, 238, 121861. [Google Scholar] [CrossRef]
  38. Wang, X.; Ye, Y.; Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6857–6866. [Google Scholar]
  39. Rios, A.; Kavuluru, R. Few-shot and zero-shot multi-label learning for structured label spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; p. 3132. [Google Scholar]
  40. Ge, X.; Williams, R.D.; Stankovic, J.A.; Alemzadeh, H. Dkec: Domain knowledge enhanced multi-label classification for electronic health records. arXiv 2023, arXiv:2310.07059. [Google Scholar]
  41. Lu, J.; Du, L.; Liu, M.; Dipnall, J. Multi-label few/zero-shot learning with knowledge aggregated from multiple label graphs. arXiv 2020, arXiv:2010.07459. [Google Scholar]
  42. Chen, L.; Yan, X.; Wang, Z.; Huang, H. Neural architecture search with heterogeneous representation learning for zero-shot multi-label text classification. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, QLD, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar]
  43. Chalkidis, I.; Fergadiotis, M.; Kotitsas, S.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. An empirical study on large-scale multi-label text classification including few and zero-shot labels. arXiv 2020, arXiv:2010.01653. [Google Scholar]
  44. Wang, S.; Ren, P.; Chen, Z.; Ren, Z.; Liang, H.; Yan, Q.; Kanoulas, E.; de Rijke, M. Few-shot electronic health record coding through graph contrastive learning. arXiv 2021, arXiv:2106.15467. [Google Scholar]
  45. Chen, J.; Li, X.; Xi, J.; Yu, L.; Xiong, H. Rare codes count: Mining inter-code relations for long-tail clinical text classification. In Proceedings of the 5th Clinical Natural Language Processing Workshop, Toronto, ON, Canada, 14 July 2023; pp. 403–413. [Google Scholar]
  46. Rajaonarivo, L.; Mine, T.; Arakawa, Y. Few-shot and LightGCN learning for multi-label estimation of lesser-known tourist sites using tweets. In Proceedings of the 2023 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Venice, Italy, 26–29 October 2023; pp. 103–110. [Google Scholar]
47. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  48. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  49. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  50. Vu, T.; Nguyen, D.Q.; Nguyen, A. A label attention model for ICD coding from clinical text. arXiv 2020, arXiv:2007.06351. [Google Scholar]
  51. Wang, Y.; Zhou, L.; Zhang, W.; Zhang, F.; Wang, Y. A soft prompt learning method for medical text classification with simulated human cognitive capabilities. Artif. Intell. Rev. 2025, 58, 118. [Google Scholar] [CrossRef]
  52. Yogarajan, V.; Pfahringer, B.; Smith, T.; Montiel, J. Improving predictions of tail-end labels using concatenated biomed-transformers for long medical documents. arXiv 2021, arXiv:2112.01718. [Google Scholar]
  53. Rethmeier, N.; Augenstein, I. Self-supervised contrastive zero to few-shot learning from small, long-tailed text data. In Proceedings of the ICLR 2021 Conference Blind Submission, Vienna, Austria, 4 May 2021. [Google Scholar]
  54. Yao, Y.; Zhang, J.; Zhang, P.; Sun, Y. A dual-branch learning model with gradient-balanced loss for long-tailed multi-label text classification. ACM Trans. Inf. Syst. 2023, 42, 1–24. [Google Scholar] [CrossRef]
  55. Xu, H.; Vucetic, S.; Yin, W. X-SHOT: A Single System to Handle Frequent, Few-shot and Zero-shot Labels in Classification. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7 May 2024. [Google Scholar]
  56. Schopf, T.; Blatzheim, A.; Machner, N.; Matthes, F. Efficient few-shot learning for multi-label classification of scientific documents with many classes. arXiv 2024, arXiv:2410.05770. [Google Scholar]
57. Pontiki, M.; Galanis, D.; Papageorgiou, H.; Androutsopoulos, I.; Manandhar, S.; AL-Smadi, M.; Al-Ayyoub, M.; Zhao, Y.; Qin, B.; Clercq, O.D.; et al. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 19–30. [Google Scholar] [CrossRef]
  58. Hu, M.; Zhao, S.; Guo, H.; Xue, C.; Gao, H.; Gao, T.; Cheng, R.; Su, Z. Multi-label few-shot learning for aspect category detection. arXiv 2021, arXiv:2105.14174. [Google Scholar]
  59. Liu, H.; Zhang, F.; Zhang, X.; Zhao, S.; Sun, J.; Yu, H.; Zhang, X. Label-enhanced prototypical network with contrastive learning for multi-label few-shot aspect category detection. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 1079–1087. [Google Scholar]
  60. Zhao, F.; Shen, Y.; Wu, Z.; Dai, X. Label-driven denoising framework for multi-label few-shot aspect category detection. arXiv 2022, arXiv:2210.04220. [Google Scholar]
61. Wang, Z.; Iwaihara, M. Few-shot multi-label aspect category detection utilizing prototypical network with sentence-level weighting and label augmentation. In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA 2023), Penang, Malaysia, 28–30 August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 363–377. [Google Scholar]
  62. Zhao, S.; Chen, W.; Wang, T. Learning few-shot sample-set operations for noisy multi-label aspect category detection. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; pp. 5306–5313. [Google Scholar]
  63. Peng, C.; Chen, K.; Shou, L.; Chen, G. Variational hybrid-attention framework for multi-label few-shot aspect category detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 14590–14598. [Google Scholar]
  64. Guan, C.; Bai, Y.; Zhou, X. Few-shot multi-label aspect category detection based on prompt-enhanced prototypical networks. J. Shanxi Univ. (Nat. Sci. Ed.) 2024, 47, 494–505. [Google Scholar]
  65. Guan, C.; Zhu, Y.; Bai, Y.; Wang, L. Label-guided prompt for multi-label few-shot aspect category detection. arXiv 2024, arXiv:2407.20673. [Google Scholar]
  66. Zhao, S.; Chen, W.; Wang, T.; Yao, J.; Lu, D.; Zheng, J. Less is Enough: Relation graph guided few-shot learning for multi-label aspect category detection. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
  67. Young, S.; Gašić, M.; Thomson, B.; Williams, J.D. Pomdp-based statistical spoken dialog systems: A review. Proc. IEEE 2013, 101, 1160–1179. [Google Scholar] [CrossRef]
  68. Qin, L.; Xu, X.; Che, W.; Liu, T. TD-GIN: Token-level dynamic graph-interactive network for joint multiple intent detection and slot filling. arXiv 2020, arXiv:2004.10087. [Google Scholar]
  69. Xu, P.; Sarikaya, R. Exploiting shared information for multi-intent natural language sentence classification. Interspeech 2013, 3785–3789. [Google Scholar] [CrossRef]
  70. Hou, Y.; Lai, Y.; Wu, Y.; Che, W.; Liu, T. Few-shot learning for multi-label intent detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 13036–13044. [Google Scholar]
  71. Zhang, R.; Luo, S.; Pan, L.; Ma, Y.; Wu, Z. Strengthened multiple correlation for multi-label few-shot intent detection. Neurocomputing 2023, 523, 191–198. [Google Scholar] [CrossRef]
  72. Zhang, F.; Chen, W.; Ding, F.; Wang, T. Dual class knowledge propagation network for multi-label few-shot intent detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 8605–8618. [Google Scholar]
  73. Zhang, X.; Li, X.; Liu, H.; Liu, X.; Zhang, X. Label hierarchical structure-aware multi-label few-shot intent detection via prompt tuning. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 2482–2486. [Google Scholar]
  74. Zhang, X.; Li, X.; Zhang, F.; Wei, Z.; Liu, J.; Liu, H. A coarse-to-fine prototype learning approach for multi-label few-shot intent detection. Find. Assoc. Comput. Linguist. EMNLP 2024, 2024, 2489–2502. [Google Scholar]
  75. Sun, A.; Lim, E.-P. Hierarchical text classification and evaluation. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 521–528. [Google Scholar]
  76. Chen, H.; Ma, Q.; Lin, Z.; Yan, J. Hierarchy-aware label semantics matching network for hierarchical text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Bangkok, Thailand, 1–6 August 2021; pp. 4370–4379. [Google Scholar]
  77. Ji, K.; Lian, Y.; Gao, J.; Wang, B. Hierarchical verbalizer for few-shot hierarchical text classification. arXiv 2023, arXiv:2305.16885. [Google Scholar]
  78. Ji, K.; Wang, P.; Ke, W.; Li, G.; Liu, J.; Gao, J.; Shang, Z. Domain-hierarchy adaptation via chain of iterative reasoning for few-shot hierarchical text classification. arXiv 2024, arXiv:2407.08959. [Google Scholar]
  79. Kim, G.; Im, S.; Oh, H.-S. Hierarchy-aware biased bound margin loss function for hierarchical text classification. Find. Assoc. Comput. Linguist. ACL 2024, 2024, 7672–7682. [Google Scholar]
  80. Chen, H.; Zhao, Y.; Chen, Z.; Wang, M.; Li, L.; Zhang, M.; Zhang, M. Retrieval-style in-context learning for few-shot hierarchical text classification. Trans. Assoc. Comput. Linguist. 2024, 12, 1214–1231. [Google Scholar] [CrossRef]
  81. Zhao, X.; Yan, H.; Liu, Y. Hierarchical multilabel classification for fine-level event extraction from aviation accident reports. Inf. J. Data Sci. 2025, 4, 51–66. [Google Scholar] [CrossRef]
  82. Mcauley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 165–172. [Google Scholar]
  83. Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.W.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 1–9. [Google Scholar] [CrossRef]
84. Lee, J.; Scott, D.J.; Villarroel, M.; Clifford, G.D.; Saeed, M.; Mark, R.G. Open-access MIMIC-II database for intensive care research. In Proceedings of the 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Boston, MA, USA, 30 August–3 September 2011; pp. 8315–8318. [Google Scholar]
  85. Yang, P.; Sun, X.; Li, W.; Ma, S.; Wu, W.; Wang, H. SGM: Sequence generation model for multi-label classification. arXiv 2018, arXiv:1806.04822. [Google Scholar]
  86. Lewis, D.D.; Yang, Y.; Rose, T.G.; Li, F. Rcv1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 2004, 5, 361–397. [Google Scholar]
87. Loza Mencía, E.; Fürnkranz, J. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 15–19 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 50–65. [Google Scholar]
  88. Zubiaga, A. Enhancing navigation on wikipedia with social tags. arXiv 2012, arXiv:1202.5469. [Google Scholar]
  89. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; Kleef, P.V.; Auer, S.; et al. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef]
  90. Williams, J.D.; Raux, A.; Ramachandran, D.; Black, A. Dialog State Tracking Challenge Handbook; Microsoft Research: Redmond, WA, USA, 2012. [Google Scholar]
  91. Eric, M.; Manning, C.D. Key-value retrieval networks for task-oriented dialogue. arXiv 2017, arXiv:1705.05414. [Google Scholar]
  92. Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Androutsopoulos, I. Large-scale multi-label text classification on EU legislation. arXiv 2019, arXiv:1906.02192. [Google Scholar]
  93. Bauman, K.; Liu, B.; Tuzhilin, A. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 717–725. [Google Scholar]
  94. Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.J.; Gerber, M.S.; Barnes, L.E. Hdltex: Hierarchical deep learning for text classification. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 364–371. [Google Scholar]
Figure 1. Differences between traditional multi-label text classification and few-shot multi-label text classification.
Figure 2. Overall technical framework categorizing methods for multi-label text classification under few-shot learning scenarios.
Figure 3. Overview of the PFT framework: The framework adopts a two-stage prompting approach. The left part illustrates the first stage, where the model predicts the number of intents using a prompt template (e.g., “There are [MASK]1 intents”), and the second stage, where the model dynamically generates intent-specific prompts to identify the actual intents (e.g., “that is [MASK]2, [MASK]3”). The right part shows the multi-view multi-head self-attention (MSA) module, in which (a)–(c) represent head masks used to capture the correlations among global context tokens, between label and context tokens, and among intents, respectively [20].
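To make the caption’s two-stage prompting concrete, the following minimal sketch assembles the stage-one counting template and the stage-two intent template. The utterance, the verbalizers, and the fill_masks stub are hypothetical stand-ins for a pretrained masked language model, not the authors’ implementation.

```python
# Illustrative two-stage prompting in the spirit of PFT; `fill_masks` is a
# hypothetical stub standing in for a pretrained masked language model.
def fill_masks(text, candidates):
    # A real system would rank `candidates` for every [MASK] slot with a
    # masked LM; this stub simply returns the first candidate per slot.
    return [candidates[0] for _ in range(text.count("[MASK]"))]

utterance = "Tell me Redwood City's forecast today."
count_words = ["one", "two", "three"]            # verbalizer for the intent count
intent_words = ["weather", "music", "navigate"]  # verbalizer for intent labels

# Stage 1: predict the number of intents ("There are [MASK] intents").
stage1 = f"{utterance} There are [MASK] intents."
n = count_words.index(fill_masks(stage1, count_words)[0]) + 1

# Stage 2: generate one [MASK] per predicted intent ("that is [MASK], ...").
stage2 = (f"{utterance} There are {count_words[n - 1]} intents, that is "
          + ", ".join(["[MASK]"] * n) + ".")
intents = fill_masks(stage2, intent_words)[:n]
print(intents)   # e.g., ['weather'] when one intent is predicted
```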
Figure 4. The ATAML framework [33], where θ denotes all parameters of the model, θ = {θ_W, θ_ATT, θ_E}, divided into shared parameters θ_E and task-specific parameters θ_T = {θ_W, θ_ATT}.
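The parameter split in Figure 4 can be illustrated with a short first-order sketch: only the task-specific attention and classifier parameters are adapted in the inner loop, while the query loss updates the shared embedding in the outer loop. The toy model, data, and single-episode loop below are illustrative assumptions, not the ATAML implementation.

```python
# A first-order sketch of the ATAML-style parameter split on toy data;
# illustrative only, not the authors' implementation.
import torch
import torch.nn as nn

class ToyATAML(nn.Module):
    def __init__(self, vocab=100, dim=32, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)   # shared θ_E
        self.attn = nn.Linear(dim, 1)           # task-specific θ_ATT
        self.clf = nn.Linear(dim, n_classes)    # task-specific θ_W

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h = self.embed(tokens)                  # (batch, seq_len, dim)
        a = torch.softmax(self.attn(h).squeeze(-1), dim=-1)
        doc = (a.unsqueeze(-1) * h).sum(dim=1)  # attention-pooled document vector
        return self.clf(doc)

model = ToyATAML()
task_params = list(model.attn.parameters()) + list(model.clf.parameters())
meta_opt = torch.optim.Adam(model.embed.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One meta-training episode; in a full loop, the task-specific parameters
# would be reset for every newly sampled task before inner adaptation.
sx = torch.randint(0, 100, (10, 20)); sy = torch.randint(0, 5, (10,))  # support
qx = torch.randint(0, 100, (10, 20)); qy = torch.randint(0, 5, (10,))  # query

inner_opt = torch.optim.SGD(task_params, lr=0.1)
for _ in range(3):                              # inner loop: adapt θ_W, θ_ATT
    inner_opt.zero_grad()
    loss_fn(model(sx), sy).backward()
    inner_opt.step()

meta_opt.zero_grad()                            # outer loop: update shared θ_E
loss_fn(model(qx), qy).backward()
meta_opt.step()
```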
Figure 5. Architecture of the label attention model (LAAT), primarily designed for automatic ICD coding from clinical texts. The model consists of four key layers: (1) an embedding layer that converts input tokens into pre-trained word vectors; (2) a bidirectional long short-term memory (BiLSTM) layer that captures contextual relationships within the text and extracts latent feature representations for each token; (3) a label attention layer that learns a weight vector for each label, highlighting text segments relevant to the corresponding ICD code to generate label-specific document representations; and (4) an output layer composed of label-specific binary classifiers that make predictions based on their respective document representations, each implemented as a single-layer feedforward neural network (FFNN) [50].
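A minimal sketch of the label attention computation described above, with toy dimensions, may help make the four layers concrete; it illustrates the mechanism rather than reproducing the released LAAT code.

```python
# A minimal sketch of LAAT-style label attention with toy dimensions;
# illustrative only, not the released implementation.
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    def __init__(self, vocab=5000, emb=100, hidden=128, n_labels=50, proj=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)                   # (1) embedding layer
        self.lstm = nn.LSTM(emb, hidden, batch_first=True,
                            bidirectional=True)                 # (2) BiLSTM layer
        self.W = nn.Linear(2 * hidden, proj)                    # (3) label attention:
        self.U = nn.Linear(proj, n_labels, bias=False)          #     one weight vector per label
        self.out = nn.Linear(2 * hidden, n_labels)              # (4) per-label binary classifiers

    def forward(self, tokens):                                  # tokens: (B, T)
        H, _ = self.lstm(self.embed(tokens))                    # (B, T, 2*hidden)
        A = torch.softmax(self.U(torch.tanh(self.W(H))), dim=1) # (B, T, L) attention per label
        V = A.transpose(1, 2) @ H                               # (B, L, 2*hidden) label-specific docs
        # Each label's classifier scores its own document representation.
        return (self.out.weight * V).sum(-1) + self.out.bias    # (B, L) logits

model = LabelAttention()
docs = torch.randint(0, 5000, (4, 200))        # 4 documents, 200 tokens each
probs = torch.sigmoid(model(docs))             # per-label probabilities for multi-label output
```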
Figure 6. Architecture of the Proto-AWATT framework. The left portion illustrates the main network of a meta-task with N = 3 classes and K = 2 support examples per class. Each small square within an instance denotes an aspect category: colored squares indicate target (relevant) aspects of interest, while white squares represent noisy (irrelevant) aspects. The right portion provides a detailed depiction of the support-set attention mechanism [58].
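The contrast between plain prototype averaging and attention-weighted aggregation can be sketched in a few lines; the dot-product attention below is a simplified stand-in for Proto-AWATT’s support-set attention, not a reimplementation of it.

```python
# A simplified sketch of attention-weighted prototypes for an N-way K-shot
# episode; the dot-product attention is a stand-in for Proto-AWATT's
# support-set attention mechanism.
import torch

def prototypes(support, query):
    # support: (N, K, D) embedded support instances; query: (Q, D) embedded queries
    proto_mean = support.mean(dim=1)                      # vanilla prototypes (N, D)
    sim = torch.einsum("qd,nkd->qnk", query, support)     # query-support similarity
    w = torch.softmax(sim, dim=-1)                        # weights over the K shots
    proto_att = torch.einsum("qnk,nkd->qnd", w, support)  # query-conditioned prototypes
    return proto_mean, proto_att

support = torch.randn(3, 2, 64)                           # 3-way 2-shot, 64-dim embeddings
query = torch.randn(5, 64)
_, proto_att = prototypes(support, query)
# Score each query against its conditioned prototypes by negative distance.
scores = -torch.cdist(query.unsqueeze(1), proto_att).squeeze(1)   # (5, 3)
```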
Figure 7. An example of one-shot multi-label intent detection [70].
Figure 8. An example of multi-label hierarchical text classification.
Table 1. Methods for multi-label text classification in few-shot settings.

Method Category | Models/Methods | Limitations | Addressed Challenges
Methods Based on Data Augmentation | LAIAugment [7] | Self-generated pseudo-labels may introduce noise and bias, leading to the accumulation of errors. | Insufficient labeled data.
 | GDA [8] | When data is scarce (e.g., when only 1% of the training data is available), GDA tends to produce lower-quality results due to difficulties in fine-tuning, often performing worse than rule-based methods. | Long-tail label distribution.
 | Falis et al. [6] | Relies on external NER + L tools. | Long-tail label distribution.
 | LSFA [9] | Feature transfer depends on the data quality of head labels. | Long-tail label distribution.
 | XDA [10] | Methods based on high-quality pre-trained models (e.g., T5) exhibit superior performance, but their high computational cost limits practical applicability. | Long-tail label distribution.
Transfer Learning-Based Approaches | Rios et al. [12] | Relies on two independent datasets (PubMed and EMR), increasing costs, and is unable to handle rare or unseen codes. | Data sparsity and long document handling.
 | LCOAKT [13] | Relies on the construction of label co-occurrence graphs and requires further optimization in hyperparameter tuning. | Long-tail label distribution.
Prompt Learning-Based Approaches | AMuLaP [14] | Performance is limited by fixed prompt templates. | Manual design of label mappings requires extensive trial and error.
 | PTMLTC [16] | Ignores the natural graph structure relationships between knowledge concepts, affecting classification performance. | Reduces reliance on large amounts of labeled data.
 | KPT [17] | Relies on the quality of external knowledge bases, which may introduce noise or malicious terms. | Mitigates issues of incomplete label-to-word mapping, bias, and instability in prompt learning.
 | KEPTLongformer [18] | Due to memory constraints, it cannot be directly applied to tasks with a large number of labels (e.g., 8692 ICD codes). | Long-tail label distribution and data sparsity.
 | GPsoap [19] | Generation speed is slow, and it relies on a large amount of proprietary clinical data for pretraining. | Complexity of long-tail label distribution and high-dimensional label space.
 | PFT [20] | Relies on the accuracy of intent prediction and is sensitive to prompt design and data sampling strategies in few-shot scenarios. | Difficulty in threshold estimation and insufficient capture of intent correlations.
 | PLMA [21] | Relies on LLM-generated templates and an expanded answer space, which may increase computational costs and complexity. | Alleviates data sparsity and label dependency issues.
 | MPBCNER [22] | Computational efficiency is low, and an independent decoder must be designed for each entity type. | Challenges in Chinese named entity recognition under low-resource and structurally complex conditions.
Metric Learning-Based Approaches | HSCNN [23] | Relies on sampling strategies and threshold settings, with high computational complexity. | Long-tail label distribution.
 | Csányi et al. [24] | Performs poorly in binary classification tasks, with significant label overlap reducing classification effectiveness. | Label overlap and sample scarcity.
 | Luo et al. [27] | Relies on label words as prior knowledge, without fully mitigating noise interference from multi-label samples in the support set. | Noise interference and prototype confusion issues.
 | TAPON [28] | Performance may be unstable under extreme data scarcity (e.g., when tail labels have only 1–3 documents). | Long-tail label distribution.
 | ProtoMix [29] | Performance may be limited when the number of labels is extremely large. | Label correlation and overfitting issues.
 | Match–CNN [30] | The sampling method for the support set is relatively simple, potentially affecting performance. | Label sparsity and insufficient key information in long texts.
 | MSMN [31] | Relies on external knowledge bases (e.g., UMLS) for synonym acquisition. | Diversity of ICD code expressions in electronic health records.
Meta-Learning-Based Approaches | ATAML [33] | Performance may be limited when task differences are large or data distributions are complex. | Data scarcity.
 | Meta-LMTC [34] | High computational complexity. | Long-tail label distribution.
 | HTTN [35] | Meta-knowledge learning may be insufficient when head labels are limited. | Long-tail label distribution.
 | MetaRisk [36] | Dependency on unlabeled data may introduce noise. | Scarcity of labeled data and insufficient multi-label combination samples.
 | EPEN [37] | Relies on high-quality training samples and does not fully utilize external knowledge. | Long-tail label distribution.
Graph Neural Network-Based Approaches | ZAGCNN [39] | The model performs slightly worse than ACNN on frequent labels (e.g., 0.3% lower R@10 on MIMIC-III), and its reliance on structured label information and natural language descriptions limits its generalizability. | Information dispersion in long documents and label data sparsity.
 | DKEC [40] | Performance depends on label structure and logical rule design, which may limit generalization to datasets with large label discrepancies. | Long-tail label distribution.
 | KAMG [41] | The model relies on a predefined label relationship graph, resulting in high computational complexity. | Poor classification performance for few-shot and zero-shot labels.
 | NAS-HRL [42] | High computational cost and reliance on a predefined heterogeneous search space limit flexibility. | Alleviates the issue of heterogeneous feature fusion between text and labels.
 | Chalkidis et al. [43] | Due to text truncation and term fragmentation, BERT-based models perform poorly on the MIMIC-III dataset. Some methods (e.g., GC-BIGRU-LWAN) rely on label hierarchies but show limited effectiveness when labels are sparse. | Label distribution imbalance and underutilization of label hierarchy.
 | CoGraph [44] | The model relies only on high-frequency words and entities, without fully leveraging medical knowledge or rules. | Extremely imbalanced distribution of ICD codes.
 | Chen et al. [45] | Using multiple GCN modules may lead to overparameterization and increased training difficulty. | Long-tail label distribution.
 | Rajaonarivo et al. [46] | Relies on tweet data; if a location has no related tweets, its category cannot be estimated. | Data scarcity in specialized domains.
Attention Mechanism-Based Approaches | LAAT [50] | The model is sensitive to hyperparameters (e.g., LSTM hidden layer size and projection dimension) and has high computational costs. | Data imbalance.
 | Wang et al. [51] | When training samples are insufficient, performance improvements are limited, and it struggles to distinguish between the “diagnosis” and “etiology” categories. | Lack of semantic association between pseudo-labels and original text in soft prompt learning.
Other Research Approaches | Yogarajan et al. [52] | The sequential model has lower resource requirements but performs slightly worse than long-sequence dedicated models like TransformerXL and does not fully address the zero-value issue in low-frequency label prediction. | The performance bottleneck of transformers in handling long texts.
 | Rethmeier et al. [53] | On small-scale data, the model’s robustness to noise and sparse labels still has room for improvement, and performance is limited by the quality and quantity of self-supervised signals. | Alleviates the high data dependence and poor performance of traditional methods in low-resource, long-tail scenarios.
 | DBGB [54] | The dual-branch structure may increase computational complexity and is not optimized for scenarios with extremely large label sets (e.g., millions of labels). | Long-tail label distribution.
 | X-Shot [55] | Reliance on pre-trained language models to generate weakly supervised data may introduce noise, and the method is sensitive to task type overlap. | Alleviates the need to separately optimize frequent, few-shot, and zero-shot labels.
 | FusionSent [56] | Training costs are high (requires training two models and merging parameters). | Labeled data scarcity and a large number of categories.
Table 2. Experimental results on the AmazonCat-13K dataset.

Method Category | Models/Methods | P@1 | P@3 | P@5 | nDCG@3 | nDCG@5
Methods Based on Data Augmentation | GDA [8] 2020 | 96.29 | 83.06 | 67.49 | 91.84 | 90.03
 | XDA [10] 2024 | 96.67 | / | 67.40 | / | /
Metric Learning-Based Approaches | TAPON [28] 2023 | 95.19 | 80.67 | 65.68 | 89.48 | 87.29
 | M-PON [28] 2023 | 95.65 | 81.03 | 66.19 | 90.21 | 88.01
Other Research Approaches | DBGB [54] 2023 | 96.59 | 83.61 | 68.25 | 92.34 | 90.57
 | DBGB-ens [54] 2023 | 96.66 | 83.78 | 68.49 | 92.48 | 90.80
Table 3. Experimental results on the RCV1 dataset.

Method Category | Models/Methods | P@1 | P@3 | P@5 | nDCG@3 | nDCG@5
Methods Based on Data Augmentation | LSFA [9] 2023 | 97.21 | 82.52 | 57.52 | 94.20 | 95.42
Transfer Learning-Based Approaches | LCOAKT [13] 2022 | 95.61 | 79.98 | 55.87 | 90.91 | 91.82
Metric Learning-Based Approaches | HSCNN [23] 2020 | 94.90 | 77.60 | 54.37 | 81.77 | 64.60
 | ProtoMix [29] 2025 | 97.48 | 83.24 | 57.82 | 94.12 | 94.64
 | TAPON [28] 2023 | 95.09 | 77.84 | 54.47 | 89.56 | 89.34
 | M-PON [28] 2023 | 95.89 | 78.81 | 55.23 | 89.95 | 90.69
Meta-Learning-Based Approaches | HTTN [35] 2021 | 94.70 | 77.83 | 54.21 | 88.49 | 89.05
 | EHTTN [35] 2021 | 95.86 | 78.92 | 55.27 | 89.61 | 90.86
Table 4. Experimental results on the AAPD dataset.

Method Category | Models/Methods | P@1 | P@3 | P@5 | nDCG@3 | nDCG@5
Methods Based on Data Augmentation | LSFA [9] 2023 | 86.95 | 62.88 | 43.43 | 83.96 | 87.53
Transfer Learning-Based Approaches | LCOAKT [13] 2022 | 82.83 | 59.34 | 40.51 | 78.49 | 82.24
Metric Learning-Based Approaches | ProtoMix [29] 2025 | 86.83 | 62.72 | 42.75 | 82.67 | 86.49
 | TAPON [28] 2023 | 83.34 | 59.91 | 41.01 | 79.38 | 83.12
 | M-PON [28] 2023 | 83.89 | 60.56 | 41.43 | 79.72 | 83.54
Meta-Learning-Based Approaches | HTTN [35] 2021 | 82.49 | 58.72 | 40.31 | 78.20 | 81.24
 | EHTTN [35] 2021 | 83.84 | 59.92 | 40.79 | 79.27 | 82.67
Table 5. Experimental results on the EUR-Lex dataset.

Method Category | Models/Methods | P@1 | P@3 | P@5 | nDCG@3 | nDCG@5
Methods Based on Data Augmentation | LSFA [9] 2023 | 83.75 | 70.74 | 58.95 | 74.13 | 68.25
Transfer Learning-Based Approaches | LCOAKT [13] 2022 | 81.93 | 68.89 | 57.30 | 72.32 | 66.68
Metric Learning-Based Approaches | ProtoMix [29] 2025 | 87.75 | 74.86 | 62.15 | 78.34 | 72.03
 | TAPON [28] 2023 | 80.98 | 67.70 | 56.06 | 70.65 | 64.22
 | M-PON [28] 2023 | 81.37 | 68.09 | 56.51 | 70.97 | 64.67
Meta-Learning-Based Approaches | HTTN [35] 2021 | 81.14 | 67.62 | 56.38 | 70.89 | 64.42
Other Research Approaches | DBGB [54] 2023 | 87.61 | 75.21 | 62.54 | 78.53 | 72.30
 | DBGB-ens [54] 2023 | 88.93 | 76.38 | 63.53 | 79.83 | 73.48
Table 6. Experimental results on the MIMIC-III-full dataset.

Method Category | Models/Methods | AUC Macro | AUC Micro | F1 Macro | F1 Micro | P@5 | P@8 | P@15
Prompt Learning-Based Approaches | GPsoap [19] 2023 | / | / | 13.4 | 49.8 | / | / | /
 | Reranker (MSMN + GPsoap) [19] 2023 | / | / | 14.6 | 59.1 | / | / | 60.5
 | Concater (MSMN + GPsoap) [19] 2023 | / | / | 14.0 | 55.0 | / | / | /
Metric Learning-Based Approaches | MSMN [31] 2022 | 95.0 | 99.2 | 10.3 | 58.4 | / | 75.2 | 59.9
Graph Neural Network-Based Approaches | Chen et al. [45] 2023 | 95.0 | 99.2 | 10.3 | 58.0 | / | 75.3 | 59.9
 | Chen et al. w/Enriched Descriptions [45] 2023 | 95.2 | 99.2 | 10.8 | 58.6 | / | 75.3 | 60.3
Attention Mechanism-Based Approaches | LAAT [50] 2020 | 91.9 | 98.8 | 9.9 | 57.5 | 81.3 | 73.8 | 59.1
 | JointLAAT [50] 2020 | 92.1 | 98.8 | 10.7 | 57.5 | 80.6 | 73.5 | 59.0
Table 7. Experimental results on the MIMIC-III 50 dataset.

Method Category | Models/Methods | AUC Macro | AUC Micro | F1 Macro | F1 Micro | P@5 | P@8 | P@15
Prompt Learning-Based Approaches | KEPTLongformer [18] 2022 | 92.63 | 94.76 | 68.91 | 72.85 | 67.26 | / | /
Metric Learning-Based Approaches | MSMN [31] 2022 | 92.8 | 94.7 | 68.3 | 72.5 | 68.0 | / | /
Attention Mechanism-Based Approaches | LAAT [50] 2020 | 92.5 | 94.6 | 66.6 | 71.5 | 67.5 | 54.7 | 35.7
 | JointLAAT [50] 2020 | 92.5 | 94.6 | 66.1 | 71.6 | 67.1 | 54.6 | 35.7
Table 8. A meta-task under a 3-way 2-shot setting. The left column lists the aspect category labels, while the right column presents the corresponding samples. Multiple aspects mentioned within each sample are highlighted using distinct colors: non-target (noisy) aspects are marked in grey, whereas target aspects are color-coded to indicate their relevance to the respective labels.

Support Set
staff | (1) It’s the rude staff and bland food that truly ruin the experience. (2) The staff were attentive and polite, and the food exceeded our expectations in both taste and presentation!
food | (1) It’s the rude staff and bland food that truly ruin the experience. (2) The staff were attentive and polite, and the food exceeded our expectations in both taste and presentation!
experience | (1) Unforgettable dining experience! (2) Every visit has been a pleasant experience with great food, friendly service, and a cozy atmosphere.
Query Set
experience and staff | (1) It was an awful experience. The staff was impatient, unhelpful, and completely unprofessional. I won’t be returning.
staff and food | (2) The lobby was stunning, our room was spotless, the food was outstanding, and the staff made us feel truly welcome.
food | (3) We had lunch at the rooftop restaurant, and the food impressed us with its rich flavors and beautiful presentation.
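To show how such a meta-task is assembled, the snippet below sketches one way to sample an N-way K-shot support/query split from a multi-label corpus; the toy sentences and the sampling policy are illustrative assumptions rather than the protocol of any surveyed paper.

```python
# A toy sketch of N-way K-shot episode sampling for multi-label data;
# the corpus, sentences, and sampling policy are illustrative assumptions.
import random

corpus = [
    ("It's the rude staff and bland food that truly ruin the experience.", {"staff", "food"}),
    ("The staff were attentive and the food exceeded our expectations!", {"staff", "food"}),
    ("Unforgettable dining experience!", {"experience"}),
    ("A pleasant experience with great food and friendly service.", {"experience", "food"}),
]

def sample_episode(corpus, n_way=2, k_shot=1):
    all_labels = sorted({l for _, labels in corpus for l in labels})
    classes = set(random.sample(all_labels, n_way))
    # Support: the first k_shot sentences mentioning each sampled class; note
    # that a multi-label sentence can serve as support for several classes.
    support = {c: [s for s, labels in corpus if c in labels][:k_shot] for c in classes}
    used = {s for sents in support.values() for s in sents}
    # Query: held-out sentences, keeping only their labels within the episode.
    query = [(s, labels & classes) for s, labels in corpus
             if s not in used and labels & classes]
    return support, query

support, query = sample_episode(corpus)
```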
Table 9. Comparison of few-shot multi-label aspect category detection models (bold indicates the best performance).

Model | 5-Way 5-Shot AUC | 5-Way 5-Shot F1 | 5-Way 10-Shot AUC | 5-Way 10-Shot F1 | 10-Way 5-Shot AUC | 10-Way 5-Shot F1 | 10-Way 10-Shot AUC | 10-Way 10-Shot F1
Proto-AWATT [58] 2021 | 91.45 | 71.72 | 93.89 | 77.19 | 89.80 | 58.89 | 92.34 | 66.76
LPN [59] 2022 | 95.66 | 79.48 | 96.55 | 82.81 | 94.51 | 67.28 | 95.66 | 71.87
LDF [60] 2022 | 92.62 | 73.38 | 94.34 | 78.81 | 90.87 | 62.06 | 92.93 | 68.23
FSO [62] 2023 | 96.01 | 81.04 | 96.67 | 82.22 | 94.93 | 70.26 | 95.71 | 72.46
VHAF [63] 2024 | 97.09 | 84.64 | 97.57 | 87.31 | 96.01 | 75.92 | 96.78 | 79.43
ProtPrompt [64] 2024 | 95.73 | 82.49 | 96.81 | 85.49 | 94.80 | 72.43 | 95.94 | 76.53
LGP [65] 2024 | 97.67 | 85.22 | 97.86 | 86.08 | 95.89 | 75.01 | 96.35 | 76.97
Table 10. User utterances and their corresponding intent labels in StanfordLU [73]. Brown represents verbs, while gray represents nouns.

User Utterances | Intent Labels
What is the date and time of my next lab appointment? | Request_date, Request_time
Tell me Redwood City’s forecast today. | Request_weather
Table 11. Micro-F1 scores for 1-shot and 5-shot multi-label intent detection on the TourSG dataset. It, Ac, At, Fo, and Tr denote the Itinerary, Accommodation, Attraction, Food, and Transportation domains, respectively.

Setting | Encoder | Model | It | Ac | At | Fo | Tr
1-shot | Electra-small | ALR + MCT [70] 2021 | 39.98 | 51.55 | 55.16 | 52.16 | 55.36
 | | DCKPN [72] 2023 | 44.21 | 55.91 | 59.74 | 56.55 | 57.48
 | | LHS [73] 2024 | 47.07 | 57.11 | 60.75 | 57.38 | 59.21
 | BERT-base | ALR + MCT [70] 2021 | 44.58 | 57.11 | 60.34 | 56.49 | 60.18
 | | DCKPN [72] 2023 | 48.19 | 58.32 | 60.93 | 58.22 | 61.05
 | | LHS [73] 2024 | 51.05 | 59.24 | 62.10 | 59.12 | 61.69
5-shot | Electra-small | ALR + MCT [70] 2021 | 44.21 | 51.37 | 55.76 | 54.50 | 55.37
 | | HCC-FSML [71] 2023 | 45.06 | 53.36 | 59.18 | 56.80 | 57.48
 | | DCKPN [72] 2023 | 47.76 | 55.83 | 59.48 | 60.06 | 60.23
 | | LHS [73] 2024 | 47.28 | 58.30 | 60.82 | 61.04 | 60.44
 | BERT-base | ALR + MCT [70] 2021 | 46.80 | 54.79 | 59.95 | 59.11 | 60.13
 | | HCC-FSML [71] 2023 | 48.64 | 57.60 | 60.69 | 60.78 | 60.59
 | | DCKPN [72] 2023 | 49.58 | 56.93 | 60.65 | 61.26 | 60.89
 | | LHS [73] 2024 | 52.87 | 58.30 | 61.60 | 62.35 | 61.23
Table 12. Micro-F1 scores for multi-label intent detection on the StanfordLU dataset. Sc, Na, and We denote the Schedule, Navigate, and Weather domains, respectively; Ave. denotes the average score.

Encoder | Model | 1-Shot Sc | 1-Shot Na | 1-Shot We | 1-Shot Ave. | 5-Shot Sc | 5-Shot Na | 5-Shot We | 5-Shot Ave.
Electra-small | ALR + MCT [70] 2021 | 40.61 | 40.76 | 46.16 | 42.51 | 51.83 | 46.44 | 54.17 | 50.82
 | HCC-FSML [71] 2023 | / | / | / | / | 59.08 | 58.34 | 70.65 | 62.69
 | DCKPN [72] 2023 | 52.08 | 51.37 | 66.29 | 56.58 | 55.04 | 55.64 | 75.32 | 62.00
 | LHS [73] 2024 | 64.48 | 58.24 | 73.79 | 65.50 | 67.61 | 67.53 | 78.14 | 71.09
BERT-base | ALR + MCT [70] 2021 | 42.55 | 56.95 | 53.14 | 50.88 | 52.17 | 60.36 | 59.63 | 57.39
 | HCC-FSML [71] 2023 | / | / | / | / | 54.69 | 64.41 | 68.64 | 62.58
 | DCKPN [72] 2023 | 53.81 | 58.48 | 74.02 | 62.10 | 57.81 | 63.71 | 93.83 | 71.78
 | LHS [73] 2024 | 65.98 | 67.64 | 80.12 | 71.25 | 70.37 | 75.37 | 89.79 | 78.51
 | CFPL [74] 2024 | 67.11 | 68.04 | 80.57 | 71.91 | 70.28 | 75.89 | 93.56 | 79.91
Table 13. Commonly used datasets (URLs accessed on 2 July 2025). N denotes the number of samples, L the number of labels, and L_C the average number of labels per sample.

Datasets | Brief Description | N | L | L_C | Access URL
AmazonCat-13K | Amazon product categorization dataset | 1,493,021 | 13,330 | 5.04 | http://manikvarma.org/downloads/XC/XMLRepository.html
MIMIC-III | A high-quality clinical database containing both structured and unstructured data | 39,771 | 6932 | 13.6 | https://mimic.physionet.org/
MIMIC-II | A publicly available dataset for intensive care medicine research | 21,104 | 7042 | 36.7 | https://archive.physionet.org/pn5/mimic2db
AAPD | An academic paper dataset in the field of computer science | 55,840 | 54 | 2.41 | https://github.com/lancopku/SGM
RCV1 | Reuters news article dataset | 806,791 | 103 | 3.24 | http://www.daviddlewis.com/resources/testcollections/rcv1
EUR-Lex | European Union legal documents dataset | 19,596 | 3993 | 5.37 | http://eur-lex.europa.eu/
Wiki10-31K | Wikipedia article dataset | 20,764 | 30,938 | 18.64 | http://nlp.uned.es/social-tagging ; https://github.com/yourh/AttentionXML/tree/master/data
DBPedia | A large-scale, multilingual, open knowledge graph constructed by extracting structured knowledge from Wikipedia | 381,025 | 298 | - | https://www.dbpedia.org/
TourSG | A dataset designed for multi-label intent recognition and dialogue state tracking | 25,751 | 102 | - | https://github.com/AtmaHou/FewShotMultiLabel
StanfordLU | A multi-domain task-oriented dialogue dataset | 8038 | 32 | - | https://github.com/AtmaHou/FewShotMultiLabel
EURLEX57K | European Union legislation dataset | 57,000 | 4271 | 5.07 | https://github.com/iliaschalkidis/lmtc-eurlex57k
FewAsp (multi) | Few-shot multi-label aspect category detection dataset | 40,000 | 100 | - | https://github.com/1429904852/LDF
RCV1-V2 | Reuters news articles dataset | 804,414 | 103 | 3.24 | http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
WOS | Academic paper dataset | 46,985 | 141 | 7 | http://archive.ics.uci.edu/index.php
Table 14. Common evaluation metrics for multi-label text classification under few-shot scenarios.

Model/Method | Acc | Pre | R | F1 | Micro-F1 | Macro-F1 | P@k | nDCG@k | PSP@k | R@k | AUC
LAIAugment [7]
GDA [8]
Falis et al. [6]
LSFA [9]
XDA [10]
Rios et al. [12]
LCOAKT [13]
AMuLaP [14]
PTMLTC [16]
KPT [17]
KEPTLongformer [18]
GPsoap [19]
PFT [20]
PLMA [21]
MPBCNER [22]
HSCNN [23]
Csányi et al. [24]
Luo et al. [27]
TAPON [28]
ProtoMix [29]
Match–CNN [30]
MSMN [31]
ATAML [33]
Meta-LMTC [34]
HTTN [35]
MetaRisk [36]
EPEN [37]
ZAGCNN [39]
DKEC [40]
KAMG [41]
NAS-HRL [42]
Chalkidis et al. [43]
CoGraph [44]
Chen et al. [45]
Rajaonarivo et al. [46]
LAAT [50]
Wang et al. [51]
Yogarajan et al. [52]
DBGB [54]
FusionSent [56]
Proto-AWATT [58]
LPN [59]
LDF [60]
Proto-SLWLA [61]
FSO [62]
VHAF [63]
ProtPrompt [64]
LGP [65]
Zhao et al. [66]
ALR + MCT [70]
HCC-FSML [71]
DCKPN [72]
LHS [73]
CFPL [74]
HiMatch [76]
HierVerb [77]
HierICRF [78]
H2B [79]
Chen et al. [80]
Zhao et al. [81]
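As a reference for the ranking metrics reported in Tables 2–7 and listed in Table 14, the following minimal sketch spells out the standard per-document definitions of P@k and nDCG@k with binary relevance; it is a generic illustration, not the evaluation code of any surveyed work.

```python
# Standard per-document P@k and nDCG@k with binary relevance; a generic
# sketch, not the evaluation code of any surveyed paper.
import numpy as np

def precision_at_k(scores, labels, k):
    # scores: (L,) predicted label scores; labels: (L,) binary ground truth.
    topk = np.argsort(-scores)[:k]
    return labels[topk].sum() / k

def ndcg_at_k(scores, labels, k):
    topk = np.argsort(-scores)[:k]
    discounts = np.log2(np.arange(2, k + 2))              # positions 1..k
    dcg = (labels[topk] / discounts).sum()
    idcg = (np.sort(labels)[::-1][:k] / discounts).sum()  # best possible ordering
    return dcg / idcg if idcg > 0 else 0.0

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
labels = np.array([1, 0, 1, 0, 0])
print(precision_at_k(scores, labels, 3))  # 2 relevant labels in the top 3 -> 0.667
print(ndcg_at_k(scores, labels, 3))       # relevant labels ranked first -> 1.0
```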
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
