HFGNN-Proto: Hesitant Fuzzy Graph Neural Network-Based Prototypical Network for Few-Shot Text Classification

Abstract: Few-shot text classification aims to recognize new classes with only a few labeled text instances. Previous studies mainly utilized text semantic features to model instance-level relations among a subset of samples. However, this single source of relational information makes it difficult for many models to address complicated natural language tasks. In this paper, we propose a novel hesitant fuzzy graph neural network (HFGNN) model that explores the multi-attribute relations between samples. We combine HFGNN with the Prototypical Network to achieve few-shot text classification. In HFGNN, multiple relations between texts, including instance-level and distribution-level relations, are discovered through dual graph neural networks and fused by hesitant fuzzy set (HFS) theory. In addition, we design a linear function that maps the fused relations to a more reasonable range. The final relations are used to aggregate information from neighboring instance nodes in the graph to construct more discriminative instance features. Experimental results demonstrate that the classification accuracy of the HFGNN-based Prototypical Network (HFGNN-Proto) on the ARSC, FewRel 5-way 5-shot, and FewRel 10-way 5-shot tasks reaches 88.36%, 94.45%, and 89.40%, respectively, exceeding existing state-of-the-art few-shot learning methods.


Introduction
In recent years, the great success of deep learning has promoted the development of multitudinous fields such as computer vision and natural language processing [1][2][3], but the effectiveness of deep learning models relies on a large amount of labeled data. The generalization ability of deep learning models is severely limited when labeled data are scarce. Humans, on the other hand, have the ability to learn quickly and can easily build awareness of new things with just a few examples. This significant gap in learning between machine learning models and humans inspires researchers to explore few-shot learning (FSL) [4].
Inspired by the human learning process, researchers proposed a meta-learning strategy for FSL, which utilizes the distribution of similar tasks to learn how to identify unseen classes accurately and efficiently with a small amount of training data. A cross-task meta-learner learns from multiple similar tasks and provides better initialization for unseen classes based on the knowledge acquired from prior experience. One of the typical meta-learning methods is the Prototypical Network [5], which computes the class prototype representation of the support set and classifies each query sample to the nearest prototype.
Prototypical Network [5] and its variants [6][7][8][9] have been widely used for few-shot text classification tasks. Different from the Prototypical Network, which computes class prototypes and query sample embeddings separately, MLMAN [6] interactively encodes text based on the matching information between the query and support sets at the local and instance levels. Gao et al. [7] proposed a hybrid attention-based prototypical network that employs instance-level attention and feature-level attention to highlight important instances and features, respectively. Sun et al. [8] improved the Prototypical Network with cross attention at the feature, word, and instance levels. Geng et al. [9] proposed the Induction Network, which induces a better class-level representation using a dynamic routing algorithm [10]. Although the above methods consider the intra-class similarity of the support set and the relations between support and query, they ignore inter-class dissimilarity and the relations between query samples. Furthermore, these methods only measure instance-level relations while neglecting other substantive relations. Due to the complexity and diversity of textual forms, simple relations have difficulty describing the true connections between texts and may even introduce additional noise into the model. Instead, we measure the commonalities and differences among all samples in a task from multiple aspects. Inspired by the Distribution Propagation Graph Neural Network [11], we introduce into few-shot text classification the distribution-level information of one sample relative to all other support samples. To better explore relational information, a dual-graph structure consisting of instance graphs and distribution graphs is adopted in HFGNN. The instance graph models instance-level relations and directional relations based on instance features.
The distribution graph aggregates instance-level relations to model distribution-level relations and distance relations. The relations in our model represent the similarity between samples at multiple levels. However, similarity is a fuzzy concept without a clear definition. To handle multiple fuzzy relations, we introduce HFS [12] theory, which handles multi-attribute decision-making problems well, into the HFGNN model. We design membership functions corresponding to the multi-attribute relations for a comprehensive evaluation that avoids the loss of relational information. In addition, we use a linear function that further enhances stronger relations and weakens weaker relations, helping the model generate more reasonable graph structures and providing inductive biases for the enhancement of instance features. Finally, a prototypical network takes the instance features generated by HFGNN as input and quickly classifies query samples. HFGNN-Proto adopts an episodic strategy [13] for meta-training in an end-to-end manner. It has strong generalization ability and can adapt to new classes without retraining.
In summary, our main contributions are as follows:
1. We propose an HFGNN-Proto model that comprehensively considers multiple substantive relations between texts for few-shot text classification. Subsequent ablation experiments demonstrate the effectiveness of multi-attribute relations.
2. To ensure the integrity of multi-attribute relations, we develop a new hesitant fuzzy strategy to fuse all relations into a more precise relational representation.
3. Considering the noise that a fully connected graph introduces into information transfer, we design a linear function to provide inductive biases for the transfer of relations in graph neural networks.
4. To verify the effectiveness of our model, we conduct extensive experiments on the ARSC and FewRel datasets. Experimental results demonstrate that the proposed HFGNN-Proto model achieves a significant improvement over other few-shot methods.

Few-Shot Learning
Early studies mainly applied fine-tuning [14] and data augmentation [15] to alleviate the overfitting caused by insufficient training data but achieved unsatisfactory results. In the meta-learning strategy [16][17][18][19], transferable knowledge that can guide the learning of models is extracted from various tasks, so that the model acquires the ability of learning to learn. Current meta-learning methods mainly include optimization-based methods [18][19][20] and metric learning [5,13,21]. Some representative few-shot learning methods and their corresponding descriptions are listed in Table 1.
MAML [18] MAML trains a set of initialization parameters from which one or a few gradient steps can quickly adapt the model to new tasks with only a small amount of data.
SNAIL [20] A novel combination of temporal convolution and soft attention that learns the optimal optimization strategy.
ATAML [22] ATAML facilitates task-agnostic representation learning through task-agnostic parameterization and enhances the adaptability of the model to specific tasks through an attention mechanism.

Metric Learning
In metric-based methods, instances are mapped into a feature space, the distance between the query and support sets is measured, and classification is completed using the nearest-neighbor concept.
Siamese Network [23] The Siamese network contains two parallel neural networks that are trained to extract pair-wise sample features, and the Euclidean distance between the features is measured.
Matching Network [13] It generates a weighted K-nearest neighbor classifier based on the cosine distance between sample features.
Relation Network [21] Different from the Siamese network and Matching network which adopt a single and fixed metric, Relation network compares relations with a nonlinear metric learned by a neural network.
MsKPRN [24] MsKPRN extends the Relation Network to be position-aware and integrates multi-scale features.
MSFN [25] MSFN learns a multi-scale feature space and similarities between the multi-scale and class representation are computed.
Adaptive Metric Learning Model [26] Yu et al. [26] proposed an adaptive metric learning model that is able to automatically determine the best weighted combination for emerging few-shot tasks from a set of metrics obtained by meta-learning.
Knowledge-Guided Relation Network [27] Sui et al. [27] proposed a knowledge-guided metric model that uses external knowledge to imitate human knowledge and generate relational networks that can apply different metrics to different tasks.

Other Methods
There is no unified core idea for these methods. They solve few-shot tasks in different ways, but all achieve competitive results.
BERT-PAIR [28] BERT-PAIR combines query and each support sample into a sequence and utilizes BERT [29] to predict whether each pair expresses the same class.
LsSAML [30] It utilizes the information implied by class labels to assist pretrained language models in extracting more discriminative features.
SALNet [31] This method trains a classifier from labeled data through an attention mechanism, collects lexicons of important words for each category, and then uses new data labeled by the combination of classifier and lexicons to guide the learning of the classifier.

Graph Neural Network
Graph neural networks (GNNs) were originally designed to process graph-structured data. GNNs can efficiently handle data structures containing complex relations and discover potential connections between data through their ability to transform and aggregate neighbors. Some of the current GNN models for few-shot tasks and their descriptions are listed in Table 2.

Table 2. Some of the current GNN models for few-shot tasks and their descriptions.

Model Description
Simple GNN [32] Garcia et al. [32] constructed a graph model in which the query and all support samples are closely connected and used a node-focused GNN to transfer instance-level relations and label information.
TPN [33] This method further considers the relations among query samples.
EGNN [34] It adopts an edge-labeling framework to explicitly model the intra-class similarity and inter-class dissimilarity of samples, and dynamically update node and edge features to achieve complex information interactions.
However, these models are all designed for image classification tasks and only transfer instance-level relations in GNNs, which makes it difficult to handle elusive NLP tasks. In contrast, the HFGNN model proposed in this study considers relations between samples from multiple perspectives, and the accurate and sufficient relations help the model construct more discriminative features.

Multi-Criteria Decision-Making
Zadeh [35] proposed fuzzy set theory to address problems involving fuzzy, subjective, and imprecise judgments. However, this theory lacks the ability to solve multi-criteria decision-making (MCDM) problems. In this regard, Torra [12] proposed HFS, which determines the corresponding evaluation index and membership function according to the different attributes of the elements in the universe. HFS is a powerful tool for solving problems involving many uncertainties.
In recent years, more efficient MCDM methods have been proposed. Deveci et al. [36] explored a novel approach that integrates the Combined Compromise Solution (CoCoSo) with type-2 neutrosophic numbers to overcome the challenging decision process in urban freight transportation tasks. Pamucar et al. [37] developed a novel integrated decision-making model based on Measuring Attractiveness by a Categorical Based Evaluation TecHnique (MACBETH) for calculating the criteria weights and Weight Aggregated Sum Product ASsessment (WASPAS) methods under a fuzzy environment with Dombi norms.
Considering the operating efficiency of the graph neural network model and the simplicity and effectiveness of the HFS theory, we introduce the HFS theory instead of other complex MCDM methods into the dual graph neural networks to fuse the relations between few-shot examples.

Problem Definition
The few-shot classification task trains a classifier that can accommodate new classes not seen in training, for which only a few examples are available in each class. Usually, the dataset used for this task is divided into two parts: a large training set C_train containing a series of categories and a target test set with a disjoint set of new classes C_test. C_test contains a support set S with a few labeled samples. If the support set contains N categories and each category contains K samples, the target problem is called an N-way K-shot problem. In principle, the classification model could be trained directly on the support set, but the performance would be poor because K is too small. Therefore, it is necessary to perform meta-learning on the training set. Meta-learning extracts transferable knowledge from the training set to help the model perform better few-shot learning on the support set, thereby improving the classification accuracy on the query samples in the test set.
In meta-learning, the episodic strategy [13] constructs training episodes to simulate the FSL test scenario so that the classifier can perform well with a small number of annotations. More specifically, a training episode is formed by first randomly selecting N categories from the training set and then choosing K samples within each selected class to act as the support set, as well as a fraction of the remaining samples to serve as the query set Q = {(x_i, y_i)}_{i=1}^{n}, where x_i represents a sample and y_i ∈ {1, 2, . . . , N} is the corresponding label. During the meta-learning process, the support set is used to train the model to minimize the prediction loss on the query set. Meta-training performs this procedure iteratively, episode by episode, until the model converges.
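As a concrete illustration of the episodic strategy, the sketch below samples one N-way K-shot episode from a labeled dataset. The function name and the dict-based data layout are hypothetical conveniences, not from the paper:

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query):
    """Build one N-way K-shot training episode.

    dataset: dict mapping class label -> list of samples (illustrative layout).
    Returns (support, query) lists of (sample, episode_label) pairs, where
    labels are re-indexed to 0..N-1 within the episode.
    """
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = random.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in picks[:k_shot]]
        query += [(x, label) for x in picks[k_shot:]]
    return support, query
```

Meta-training then repeats this sampling every iteration, so the classifier is always optimized under the same N-way K-shot conditions it will face at test time.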

Overview
As shown in Figure 1, HFGNN-Proto consists of three parts: the text embedding component, the HFGNN component, and the prototypical network component.

Text Embedding
To better reflect the semantic information of the text, the pre-trained language model BERT [29] is used to extract text semantic feature representations. BERT utilizes a masked language model for pre-training and employs a deep bidirectional transformer capable of generating language representations that contain contextual information. Given the input text x = [w_1, w_2, . . . , w_n], a special token [CLS] used for classification is inserted at the beginning of the word sequence, and the output of this token in the last transformer layer is the text embedding. The output of BERT contains information about the context of the text x and can be represented as h = f_emb(x | θ_emb), where h ∈ R^d, d represents the output dimension of BERT, and θ_emb represents the parameters of the BERT encoder, which are fine-tuned during training.

HFGNN
This section introduces the HFGNN model in detail. As shown in Figure 1b, the HFGNN model contains three modules: a relation generator, a relation fusion module, and an instance node updater. In each layer, the multi-attribute relations in task T are learned by the relation generator and fused by HFS theory in the relation fusion module. The fused relation is refined by the linear function and then passed to the instance node updater to update the instance features.
More specifically, in the relation generator, the initial features extracted by BERT are used to compute instance-level relations E^inst_l as well as the orientation difference N^inst_l of instance features in the instance graph. For the distribution graph, the instance-level relations E^inst_l are aggregated to construct distribution features V^dist_l of the samples, and distribution-level relations E^dist_l and distances D^dist_l are computed between distribution features. The above relations are transmitted to the relation fusion module to obtain the hesitant fuzzy relation R^h_l. A linear function F(·) converts R^h_l into the final relation representation R_l. The instance node updater combines R_l and the instance features V^inst_{l−1} to construct a hesitant fuzzy graph G^h_l = (V^inst_{l−1}, R_l, T) and updates the instance features by aggregating information from neighbors. This process is repeated in HFGNN to fully explore the multi-attribute relations between samples in the task. The process of relation transfer is shown in Figure 2.

Relation Generator
Before the iteration process starts, we need to initialize the nodes in the dual graph. The instance node is initialized by the output of BERT, and the initial node of text x_i in the instance graph is denoted as v^inst_{0,i} = f_emb(x_i). The node in the distribution graph is an (N × K)-dimensional vector in which each element represents the instance-level relation between the sample and one of the support samples. These instance-level relations are aggregated to reflect the overall distribution of the sample in the support set. Since the instance-level relations are unknown at this point, the distribution graph node features are initialized according to the following rule: support nodes start from normalized label agreement, v^dist_{0,i} = (1/K)[δ(y_i, y_1), . . . , δ(y_i, y_T)] for x_i ∈ S, and query nodes start uniform, v^dist_{0,i} = (1/T) · 1_T for x_i ∈ Q, where T = N × K represents the number of support samples in the task, δ(y_i, y_j) is the Kronecker delta function, which outputs one when the labels y_i = y_j and zero otherwise, and S and Q represent the support set and query set, respectively.
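The initialization described above can be sketched as follows. The exact normalization (dividing support rows by K and query rows by T) is an assumption in the style of DPGN-like initialization, since the original equation is not reproduced here:

```python
def init_distribution_nodes(support_labels, n_query):
    """Initialize distribution-graph node features, one per sample.

    Each node is a T-dimensional vector over the T = N*K support samples.
    Support nodes start from label agreement (Kronecker delta, normalized);
    query nodes start uniform because their labels are unknown.
    """
    T = len(support_labels)
    nodes = []
    for yi in support_labels:                      # support samples
        row = [1.0 if yi == yj else 0.0 for yj in support_labels]
        s = sum(row)
        nodes.append([v / s for v in row])
    for _ in range(n_query):                       # query samples
        nodes.append([1.0 / T] * T)
    return nodes
```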

Instance-level Relations
In the first layer, the edge features representing instance-level relations in the instance graph are first computed by the instance-edge-compute function, where f^inst_{e,l}: R^m → R is a neural network that maps m-dimensional edge features to one-dimensional values in a fixed range, consisting of a linear-BN-LeakyReLU block, a single linear layer, and a sigmoid layer with parameters θ^inst_{e,l}.
Direction Relations
Then, the cosine similarity between instance features, which reflects the difference in the direction of the features, is calculated. The directional relations range over [−1, 1] and are inversely proportional to the magnitude of the directional difference.
Distribution-level Relations
Next, in the distribution graph, instance-level relations are aggregated and distribution-level feature representations are generated through the dist-node-update function, where ‖ represents the concatenation operator and f^dist_{v,l}: R^{2T} → R^T is composed of a linear layer and LeakyReLU with parameters θ^dist_{v,l}. The dist-edge-compute function takes the distribution features as input and computes edge features representing the distribution-level relations of the samples, where f^dist_{e,l}: R^T → R transforms the distribution features using the combination of a linear-BN-LeakyReLU block, a single linear layer, and a sigmoid layer, with parameters θ^dist_{e,l}.
Distance Relations
In HFGNN, the distance relations between distribution nodes are measured by summing the per-dimension distance values, where SUM(·) represents the sum of the distance values in each dimension.
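As a minimal illustration, the two geometric relations produced by the generator, cosine similarity for direction and a summed per-dimension distance for distribution nodes, can be sketched in plain Python. The use of absolute difference inside the sum is an assumption, since the paper only specifies that SUM(·) adds the per-dimension distance values:

```python
import math

def cosine_similarity(u, v):
    """Directional relation: cosine of the angle between two instance features."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def distance_relation(p, q):
    """Distance relation between two distribution nodes: SUM(.) over the
    per-dimension distances (absolute difference assumed here)."""
    return sum(abs(a - b) for a, b in zip(p, q))
```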

Relations Fusion Module
First, we determine the membership function corresponding to each relation. Since the instance-level and distribution-level relations have already been standardized by the sigmoid layer to one-dimensional values, f^inst_{e,l} and f^dist_{e,l} can be considered the membership functions of the instance-level and distribution-level relations, and e^inst_{l,ij} and e^dist_{l,ij} are the corresponding membership values. The membership functions of the remaining two relations are defined analogously: in the l-th iteration, one membership function maps the directional relations between instance features into [0, 1], and another maps the distribution-feature distance relations into [0, 1].
The membership values obtained from the relation membership functions defined above all lie in the range [0, 1] and are proportional to the strength of the relations. The membership functions convert the relations between the samples into the corresponding hesitant fuzzy sets, and all the membership values of the relations between a sample and itself are set to 1 to construct an ideal HFS h_P, which serves as a standard against which to measure the similarity between different samples. For example, the HFS between x_i and itself can be represented as {1, 1, 1, 1}, while the HFS between x_i and x_j collects the four membership values of the relations between them. The similarity between an HFS and the ideal set can be measured by a distance metric and is inversely proportional to the computed distance. Here, the hesitant standard Euclidean distance is used. Assuming h_{l,ij} is the HFS between x_i and x_j, the hesitant standard Euclidean distance between h_{l,ij} and h_P can be expressed as d(h_{l,ij}, h_P) = [ (1/l_h) Σ_{β=1}^{l_h} | h^{σ(β)}_{l,ij} − h^{σ(β)}_P |^2 ]^{1/2}, where l_h represents the largest number of elements in the HFSs h_{l,ij} and h_P, and h^{σ(β)}_{l,ij} and h^{σ(β)}_P are the β-th largest values in h_{l,ij} and h_P, respectively. The similarity between the HFSs, that is, the hesitant fuzzy relation, is then r^h_{l,ij} = 1 − d(h_{l,ij}, h_P). Hesitant fuzzy relations contain multi-attribute similarity information. Considering that the nodes in the graph are easily affected by noise from adjacent nodes with low correlation, the hesitant fuzzy relations are further adjusted by a linear function F(·), which strengthens the relations with high similarity (r^h_{l,ij} > β) and weakens the relations with extremely low similarity (r^h_{l,ij} < α). The transformed final relation r_{l,ij} carries strong inductive biases.
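The fusion step can be sketched as follows, assuming four membership values per sample pair and an ideal set of all ones. The piecewise-linear form of F(·) below is an illustrative assumption consistent with the stated behavior (cutting off relations below α, saturating those above β); the paper's exact formula may differ:

```python
import math

def hesitant_fuzzy_relation(memberships):
    """Fuse multi-attribute membership values into one relation in [0, 1].

    Measures the hesitant standard Euclidean distance between the pair's HFS
    and the ideal set {1, 1, ..., 1}, then takes similarity r = 1 - d.
    """
    l = len(memberships)
    d = math.sqrt(sum((1.0 - m) ** 2 for m in memberships) / l)
    return 1.0 - d

def linear_refine(r, alpha=0.3, beta=0.7):
    """Assumed piecewise-linear F(.): zero out weak relations (< alpha),
    saturate strong ones (> beta), and rescale the middle band."""
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)
```

With the thresholds α = 0.3 and β = 0.7 chosen in the experiments, weakly related node pairs are disconnected entirely, which is what turns the fully connected graph into a sparser hesitant fuzzy graph.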

Instance-Node Updater
The instance node updater combines r_{l,ij} and V^inst_{l−1} to construct a hesitant fuzzy graph. Note that the hesitant fuzzy graph is no longer fully connected, since part of the relations R_l are zero. The inst-node-update function in the hesitant fuzzy graph aggregates neighbors to update the instance features, where ‖ represents the concatenation operator and f^h_{v,l} is a linear-BN-LeakyReLU block with parameters θ^h_{v,l}. The completion of the update marks the completion of one iteration. In HFGNN, the instance feature v^inst_{l,i} output from the current layer is taken as the input of the next layer to start a new round of iteration. This process is repeated until more discriminative instance features are constructed.
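A simplified version of this aggregation, with the learned transform f^h_{v,l} omitted for illustration, might look like the following; the relation-weighted mean and the concatenation with the node's own feature mirror the update described above:

```python
def update_instance_nodes(features, relations):
    """One inst-node-update step: each node aggregates its neighbors'
    features weighted by the refined relations, then concatenates the
    aggregate with its own feature. The learned block f_v is replaced
    by the identity here for illustration."""
    T = len(features)
    updated = []
    for i in range(T):
        w = [relations[i][j] for j in range(T)]
        s = sum(w) or 1.0                     # guard against isolated nodes
        agg = [sum(w[j] * features[j][d] for j in range(T)) / s
               for d in range(len(features[0]))]
        updated.append(features[i] + agg)     # list concatenation = ||
    return updated
```

Because relations zeroed out by F(·) contribute nothing to the weighted sum, weakly related neighbors are effectively excluded from the update.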

Prototypical Network
The Prototypical Network embeds the instance features enhanced by HFGNN into the prototypical space, computes the class prototype representations, and classifies query samples. The class prototype is the mean of the instance embeddings of each class in the support set, c_k = (1/|S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(v_i), where v_i represents the instance feature from the last layer of the HFGNN, f_φ is a linear layer with learnable parameters θ_φ, and S_k represents the support samples labeled as category k. The probability that query x belongs to category k in support set S can be calculated as p(y = k | x) = exp(−d(f_φ(v_x), c_k)) / Σ_{k′} exp(−d(f_φ(v_x), c_{k′})), where d(·) is the distance function between vectors. The cross-entropy loss function is used to train HFGNN-Proto, and the parameters θ_emb, θ^inst_{e,l}, θ^dist_{v,l}, θ^dist_{e,l}, θ^h_{v,l}, and θ_φ in the model are optimized by minimizing this loss.
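The prototype computation and distance-softmax classification can be sketched as follows; squared Euclidean distance is assumed for d(·), and the linear embedding f_φ is omitted for brevity:

```python
import math

def prototypes(support_feats, support_labels, n_way):
    """Class prototype = mean of the support embeddings of each class."""
    protos = []
    for k in range(n_way):
        members = [f for f, y in zip(support_feats, support_labels) if y == k]
        protos.append([sum(col) / len(members) for col in zip(*members)])
    return protos

def classify(query_feat, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    dists = [sum((a - b) ** 2 for a, b in zip(query_feat, p)) for p in protos]
    exps = [math.exp(-d) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]
```

The query is assigned to the class whose prototype is nearest, and the cross-entropy of these probabilities against the true episode labels is the training loss.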

Datasets
We conduct experiments on two widely used few-shot text classification datasets to evaluate HFGNN-Proto.
The Amazon Review Sentiment Classification (ARSC) dataset contains English reviews for 23 domains of products on Amazon. For each product domain, Yu et al. [26] constructed three binary classification tasks with different scoring thresholds. These buckets form 23 × 3 = 69 tasks in total. Following previous works, 12 tasks in 4 domains (Books, DVD, Electronics, and Kitchen) are selected as the target test set, and each category in the test set contains only five labeled support samples. We create a 2-way 5-shot classification task on this dataset.
The Few-Shot Relation Classification (FewRel) dataset [38] contains 100 relations from Wikipedia, and each relation consists of 700 instances. The numbers of relations in FewRel used for the training, validation, and test sets are 64, 16, and 20, respectively. Following the settings used by Sun et al. [6], we conduct 5-way 5-shot and 10-way 5-shot experiments on the FewRel dataset. It should be noted that the labels of the test samples in this dataset have not been released to the public. The classification results are submitted to the FewRel evaluation website [39] provided by Gao et al. [38] for online evaluation.

Meta-Training and Meta-Testing
The experiments on ARSC consist of 20,000 iterations with eight episodes randomly selected for meta-training in each iteration, while FewRel consists of 10,000 iterations with four episodes per iteration. Our model is trained with the Adam optimizer with an initial learning rate of 1 × 10^−5 and a weight decay rate of 5 × 10^−8. The learning rate decays by 0.01 every 100 iterations.
Following previous work, few-shot classification accuracy is used as the evaluation metric for performance. For FewRel, every 1000 iterations, the model is evaluated on 600 randomly drawn validation episodes to find the best parameters. The model with the best parameters is finally applied to the test set, which contains 10,000 test episodes.
The classification results are uploaded and evaluated online. For ARSC, tests are performed every 100 iterations. Note that the support set for testing in ARSC was determined by Yu et al. [26]. Consequently, we just need to sample the query from the target task to form the test episode. The average classification accuracy of the 12 target tasks in ARSC is used as the final accuracy.

Parameters Setting
The text encoder adopts Hugging Face's implementation of BERT (base version) [40], and the parameters of the encoding layer are initialized with the pre-trained model publicly provided by Google. BERT-base converts text into 768-dimensional feature vectors. The HFGNN-Proto model uses two GNN layers (l = 2) to prevent the features from being over-smoothed. To balance the degrees of weakening and strengthening applied by the linear function F(·), we keep the sum of α and β fixed at 1. We perform small-scale experiments on a small subset of ARSC, following the experimental setup above, to determine the values of α and β. According to the results shown in Figure 3, the thresholds α and β are set to 0.3 and 0.7, respectively.
The linear layer f φ (·) in the prototypical network embeds features into a 128-dimensional prototypical space.

Results and Analysis
We compare HFGNN-Proto with many baseline models on the ARSC and FewRel datasets. The experimental results are shown in Tables 3 and 4, respectively.

Table 3. Comparison of the average classification accuracy (%) on ARSC.

Table 4. Comparison of the classification accuracy (%) on FewRel.

Model | 5-Way 5-Shot | 10-Way 5-Shot
Finetune [36] | 68.66 | 55.04
kNN [36] | 68.77 | 55.87
Meta Network [41] | 80.57 | 69.23
Graph Network [32] | 81.28 | 64.02
SNAIL [20] | 79.40 | 68.33
Prototypical Network [5] | 89.05 | 81.46
HATT-Proto [7] | 90.12 | 83.05
HAPN [8] | 91.02 | 84.16
EGNN-Proto [42] | 92.29 | 86.09
BERT-PAIR [28] | 93.22 | 87.12
HFGNN-Proto (ours) | 94.45 | 89.40

Results on ARSC
The results in Table 3 show that HFGNN-Proto achieves a classification accuracy of 88.36% and outperforms most previous baseline models. Some existing FSL methods, such as Matching Network [13], MAML [18], and Relation Network [21], perform poorly on ARSC despite their outstanding performance in the vision domain. Induction Network [9] uses a routing mechanism to induce better class representations, which raises the classification accuracy to a new level, but HFGNN-Proto still far exceeds it. Compared with these methods, HFGNN-Proto improves the classification accuracy because it fully learns the potential relations of the samples in the meta-task at the multi-attribute level.
Results on FewRel
The results in Table 4 show that the 5-way 5-shot and 10-way 5-shot classification accuracies of HFGNN-Proto on the FewRel dataset are 94.45% and 89.40%, respectively, both exceeding the state-of-the-art method BERT-PAIR [28]. The classification performance in the 5-way setting is 1.23% higher than that of BERT-PAIR, and the improvement in the 10-way setting is even more significant, at 2.28%. EGNN-Proto [42] also combines GNNs with the Prototypical Network, but its effect is far below that of our model. EGNN-Proto uses a fully connected graph structure to transmit instance-level relations and edge label information; this process suffers from a single source of relational information and the noise induced by a fully connected structure. The relations in HFGNN-Proto are more sufficient, and the function F(·) provides a more reasonable graph structure for the update of the instance features and effectively avoids the influence of irrelevant noise.

Comparison with PLMs
We also compare the performance of HFGNN-Proto with the pretrained language model GPT and its variants on the real-world few-shot classification datasets constructed by Alex et al. [43]. The results are shown in Table 5. We can see that the performance of GPT-2 [44] and GPT-Neo [43] is far inferior to that of our model, while GPT-3 [45], with up to 175 billion parameters, achieves the highest performance. It is worth noting that our proposed HFGNN-Proto achieves performance close to GPT-3 with several orders of magnitude fewer parameters. Table 5. Comparison of HFGNN-Proto and PLMs. ADE and NIS are datasets constructed by Alex et al. [43].

Ablation Experiment
To analyze the influence of different modules in HFGNN-Proto on the performance, we conduct ablation experiments on ARSC.
The abundant information in multi-attribute relations builds more reasonable feature representations, and measuring multiple relations is crucial to the performance of the model. To support this idea, we compare the classification performance of HFGNN-Proto under different relation combinations. The experimental results in Table 6 show that the accuracy is greatly improved after HFGNN-Proto learns distribution-level relations on top of the initial instance-level relations, and the best performance is achieved when all relations mentioned in Section 4.3.2 are considered, corresponding to the best result reported in Table 3. We further report the effects of the HFS strategy, the linear function F(·), and the number of GNN layers on model performance, as shown in Table 7. The best performance is achieved with two GNN layers, corresponding to the result reported in Table 3; more layers do not further improve performance. The table also demonstrates the effectiveness of HFS and the linear function F(·). HFGNN-Proto performs poorly when we replace HFS with an averaging strategy to process the relations, which proves the contribution of HFS to the performance. On the other hand, the performance of the model without F(·) is significantly lower, indicating that our designed function generates a more reasonable graph structure and provides inductive biases for the transfer of relations.

Visualization
To further analyze the benefit of HFGNN on instance features, we randomly select an episode from the FewRel 10-way 5-shot test set and visualize the support set features before and after HFGNN transformation through t-SNE, as shown in Figure 4. The initial features extracted by BERT diverge in space and are interleaved with each other. After the transformation of the first layer in the HFGNN, the features belonging to the same class are aggregated, and the features of different classes are far away from each other. After the second transformation, the distribution of features in the space is further improved. These results fully demonstrate the effectiveness of the HFGNN in discovering relations between samples and enhancing instance features.

Limitations
General GNNs can efficiently perform edge classification: the edges between nodes represent the similarity between samples, and classification results can be generated directly from the edges. However, our proposed HFGNN does not support edge classification, because we modify the edges in the hesitant fuzzy graph according to Equation (11). Edges with weak relations are cut off directly, which prevents HFGNN from performing efficient edge classification.
Another limitation of HFGNN-Proto is that the model can only handle English tasks currently, and it is still a challenge for our model to handle few-shot tasks in other languages.

Conclusions
In this paper, we propose the HFGNN-Proto model, which fully explores the multi-attribute relations between samples for few-shot text classification. Abundant relational information helps the model better handle complex NLP tasks. Relations are transmitted in a dual graph and integrated by HFS theory. The use of HFS effectively avoids the loss of information and improves the accuracy of the overall relation representation. Moreover, the linear function further improves the rationality and accuracy of the relations, which helps the model construct a more accurate hesitant fuzzy graph for message transmission and provides strong inductive biases for feature enhancement. Finally, a prototypical network performs quicker and more efficient classification based on the enhanced features. HFGNN-Proto achieves better generalization on unseen tasks given its ability to discover precise potential connections between texts. Experimental results demonstrate that HFGNN-Proto outperforms existing state-of-the-art few-shot models. In the future, we will focus on exploring more substantial relations among few-shot samples, trying other MCDM methods to handle relations, and generalizing HFGNN-Proto to few-shot tasks in other domains.

Conflicts of Interest:
The authors declare no conflict of interest.