Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm which uses a large autoencoder trained to map the large label space to a reduced size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal in a large portion of the MEDLINE biomedical document collection which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments we propose and evaluate several document representation approaches and different label autoencoder configurations.


Introduction
In Large-Scale Text Categorization (LSTC) we are confronted with textual classification problems where a very large and structured set of possible classes is employed. For the general case, not limited exclusively to text, the term eXtreme Multi-Label categorization (XML) is also often used. Usually, in these cases we are dealing with multi-label learning problems, where models learn to predict more than one class or label to be assigned to a given input text.
Conventional approaches in multi-label learning either convert the original multi-label problem into a set of single-label problems or adapt well-known single-label classification algorithms to handle multi-label datasets. In the context of LSTC and XML research, evolutions of both types of method have recently emerged that employ what has been called label embedding (LE) or label compression (LC), trying to take advantage of label dependencies to improve categorization performance. The starting premise of LE is to convert the large label space to a reduced-dimensional representation space (the embedding space) where the actual classification is performed, the results of which are then transformed back to the original label space.
Autoencoders (AEs) are a classical unsupervised neural network architecture able to learn compressed feature representations from the original features. Usually AEs are symmetrical networks with a series of layers that learn to transform their input to a latent space of lower dimension (the encoder) and another series of layers that learn to regenerate that input from the latent space (the decoder), both connected by a small layer that acts as an information bottleneck. Training is carried out in an unsupervised way, presenting the same training vector at the input layer and at the output layer. AEs are typically employed in data pre-processing, discarding the decoder and using the learned encoder to create rich representations of the input data that are useful in further processing.
Automatic semantic indexing is often modeled as an LSTC or XML problem. This task seeks to automate assigning to a given input document a set of descriptors or indexing terms taken from a controlled vocabulary in order to improve subsequent search tasks. MeSH (Medical Subject Headings) is a large semantic thesaurus commonly used in the management of biomedical literature. MeSH labels are semantic descriptors arranged into 16 overlapping concept sub-hierarchies, which are employed to index MEDLINE, a collection of millions of biomedical abstracts.
Given this context, in this paper a multi-label lazy learning approach is presented to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. This method is an evolution of the traditional k-Nearest Neighbors (k-NN) algorithm which exploits an AE trained to map the large label space to a reduced size latent space and to regenerate the output labels from this latent space. Our contributions are as follows:

• We have employed MEDLINE as a huge labeled collection to train large label-AEs able to map MeSH descriptors onto a semantic latent space where label interdependence is coded.

• Our proposal adapts classical k-NN categorization to work in the semantic latent space learned by these AEs and employs the decoder subnet to predict the final candidate labels, instead of applying the simple voting schemes of traditional k-NN.

• Additionally, we have evaluated different document representation approaches, using both sparse textual features and dense contextual representations. We have studied their effect on the computation of the inter-document distances that are the starting point for finding the set of neighbor documents employed in k-NN classification.
The remainder of this article is organized as follows. Section 2 presents the background and context of this paper. Section 3 describes the details of the proposed method and its components. Finally, Section 4 discusses the experimental results obtained by our proposals and Section 5 summarizes the main conclusions of this work.

Related Work
This work is framed at the confluence of three research fields: (1) large-scale multi-label categorization, (2) autoencoders and (3) semantic indexing. This section provides a brief review of the most relevant contributions in the state of the art of these topics in relation to our label autoencoder proposal.

Multi-Label Categorization
In multi-label learning [1] examples can be assigned simultaneously to several not mutually exclusive classes. This task differs from single-label learning (binary or multiclass) and has its own characteristics that make it more complex, while being able to model many relevant real-world problems. Formally, given L = {l_1, l_2, ..., l_l}, the finite set of labels in a multi-label learning task, and D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, the set of multi-label training instances, where x_i is the feature vector of the i-th example and y_i ⊆ L is the set of labels for that example, the multi-label categorization task aims to build a multi-label predictor f : x′ → y′, with y′ ⊆ L, able to produce good classifications on incoming test instances from T = {x′_1, x′_2, ..., x′_m}. The scientific literature on multi-label learning [2,3] usually identifies two main groups of approaches when dealing with this problem: algorithm adaptation methods and problem transformation methods. Algorithm adaptation approaches extend and customize single-label machine learning algorithms in order to handle multi-label data directly. Several adaptations of traditional learning algorithms have been proposed in the literature, such as boosting (AdaBoost.MH) [4], support vector machines (RankSVM) [5], multi-label k-nearest neighbors (ML-kNN) [6] and neural networks [7]. On the other hand, problem transformation methods transform a multi-label learning problem into a series of single-label problems which already have well-established resolution methods. The solutions of these problems are then combined to solve the original multi-label learning task. For example, Binary Relevance (BR) [8], Label Powerset (LP) [9] and Classifier Chains (CC) [10] transform multi-label learning problems into binary classification problems.
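The problem transformation idea behind Binary Relevance can be sketched in a few lines: one independent binary classifier is trained per label, and a prediction is the union of the labels whose classifier fires. This is a minimal illustration, not the implementation used in the cited works; the trivial MajorityClassifier is a toy stand-in for any real base learner (SVM, logistic regression, etc.).

```python
# Minimal Binary Relevance (BR) sketch: one binary classifier per label.
# MajorityClassifier is a toy stand-in for a real base learner.
class MajorityClassifier:
    def fit(self, X, y):
        # Remember the most frequent binary class seen during training.
        self.majority = max(set(y), key=y.count)
        return self

    def predict(self, x):
        return self.majority


class BinaryRelevance:
    def __init__(self, base_factory):
        self.base_factory = base_factory
        self.models = {}

    def fit(self, X, Y, labels):
        # Y is a list of label sets; build one binary problem per label.
        for label in labels:
            y_bin = [1 if label in y else 0 for y in Y]
            self.models[label] = self.base_factory().fit(X, y_bin)
        return self

    def predict(self, x):
        # Union of the labels whose binary classifier predicts 1.
        return {lbl for lbl, m in self.models.items() if m.predict(x) == 1}


X = [[0], [1], [2], [3]]
Y = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}]
br = BinaryRelevance(MajorityClassifier).fit(X, Y, labels=["a", "b", "c"])
print(br.predict([5]))  # the majority baseline assigns only the frequent label 'a'
```

Note that each per-label classifier is trained in isolation, which is exactly the label-independence assumption criticized below.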
A relevant aspect in multi-label learning approaches is the treatment given to inter-label dependencies. The simplest methods, such as BR, do not take into account the correlation between labels, assuming label independence and neglecting the fact that some labels are more likely to co-exist. This assumption brings advantages in parallelization and training efficiency, but at the cost of lower performance in many real-world tasks that exhibit complex inter-label dependencies. Other approaches, like LP and CC, try to capture the dependencies between labels using different strategies. For example, CC sequentially creates a set of binary classifiers where the labels predicted by previous classifiers are part of the features employed in successive classifications.
Recent research in multi-label learning proposes a label embedding (LE) or label compression (LC) approach that tries to properly exploit the correlation between labels by transforming the label space into a latent label space of reduced dimensionality. The actual categorization is performed in that latent space, where correlation between labels is implicitly exploited, and a proper decoding process maps the projected data back onto the original label space. Early work in LE [11,12] typically considered linear embedding functions and worked with fairly small label space sizes. Other approaches overcome the limitations of linear assumptions by evolving to non-linear embeddings [13][14][15], including several methods based on conventional or deep neural networks [16][17][18].
Finally, a prominent field in multi-label learning that has been attracting a great deal of research in recent times is eXtreme Multi-label Classification (XML) [19,20]. XML is a multi-label classification task in which learned models automatically label a data sample with the most relevant subset of labels from a large label set, with sizes ranging from thousands to millions. This is a challenging problem due to the scale of the classification task, label sparsity and complex label correlations. The ability to handle label correlations and the scalability of LE approaches [14] have shown many advantages in XML, making embeddings one of the most popular approaches for tackling XML problems.

Autoencoders in Multi-Label Learning
Autoencoders (AEs) [21,22] are a family of unsupervised feedforward Neural Network architectures that jointly learn an encoding function, which maps an input to a latent space representation, and a decoding function, which maps from the latent space back onto the original space. Figure 1 shows this symmetric encoder-decoder structure, with:

• An encoder function Enc : X → Z, which maps the input vectors into a latent (often lower-dimensional) representation through a set of hidden layers.

• A decoder function Dec : Z → X, which acts as an interpreter of the latent representation and reconstructs the input vectors through a set of hidden layers, usually symmetric with the encoding layers.

• A middle hidden layer representing, in the latent space Z, an encoding of the input data.
By training the model to reproduce the input data at its output, an AE jointly optimizes the parameters of the encoder Enc and decoder Dec functions and can learn in its hidden layers rich non-linear encoding features that represent complex input data in a reduced-dimensionality latent space. Most practical applications of AEs exploit this latent representation in (1) data compression or hashing tasks, (2) classification tasks, using the AE to reduce input feature dimensionality with minimal information loss, (3) anomaly detection, by analyzing outliers and abnormal patterns in the generated embeddings, (4) visualization tasks on the encoded space or (5) data reconstruction and noise reduction in image processing. The usage of AEs in single-label and multi-label learning, including XML, has already been reported in many research works, and AEs are frequently part of pre-processing steps performing dimensionality reduction in order to improve categorization performance and speed. Methods like AE-kNN [23] train an AE on the high-dimensional input features from a training dataset and employ the encoder sub-net as an input feature compressor, transforming the original input space into a lower-dimensional one where a conventional instance-based k-NN algorithm does the labeling.
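The encoder-bottleneck-decoder training idea can be illustrated with a deliberately simple linear autoencoder trained by gradient descent. This is a sketch only: the label-AEs used later in this paper are larger, non-linear, multi-layer networks, and the dimensions here are arbitrary toy values.

```python
import numpy as np

# Toy linear autoencoder: encode 8-dimensional inputs into a 2-dimensional
# latent space and reconstruct them, training both maps jointly by
# minimizing the mean squared reconstruction error.
rng = np.random.default_rng(0)
n, n_features, n_latent = 200, 8, 2
X = rng.normal(size=(n, n_features))

W_enc = rng.normal(scale=0.1, size=(n_features, n_latent))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_latent, n_features))  # decoder weights

def loss():
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial = loss()
lr = 0.05
for _ in range(1000):
    Z = X @ W_enc                        # Enc: project to latent space
    err = (Z @ W_dec - X) / n            # reconstruction error (scaled)
    W_dec -= lr * Z.T @ err              # gradient step on the decoder
    W_enc -= lr * X.T @ (err @ W_dec.T)  # gradient step on the encoder
final = loss()
print(initial > final)  # reconstruction error decreases during training
```

After training, `X @ W_enc` plays the role of the compressed latent representation that methods like AE-kNN would hand to a downstream classifier.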
The use of AEs over the label space has been less common in the literature. With the advent and explosion of XML methods, research proposals have appeared that try to take advantage of the capabilities of AEs to capture non-linear dependencies among the labels.
Wicker et al. [16], in a pioneering work on the use of label space AEs, introduce MANIC, a multi-label classification algorithm following the problem transformation approach which extracts non-linear dependencies between labels by compressing them using an AE. MANIC uses the encoder part to replace the training labels with a reduced-dimension version and then trains a base classifier (a BR model in their proposal) using the compressed labels as new target variables. C2AE (Canonical-Correlated AutoEncoder) [17] and Rank-AE (Ranking-based Auto-Encoder) [18] follow a similar idea, which was later generalized in [24]. These approaches perform joint embedding learning by deriving a compressed feature space shared by input features and labels. An input space encoder and a label space encoder, sharing the same hidden space, and a decoder that converts this hidden space back to the original label space are trained together to create a deep latent space that embeds input features and labels simultaneously.

Semantic Indexing in the Biomedical Domain
Controlled vocabularies provide an efficient way of accessing and organizing large collections of textual documents, especially in domains where a simple text-based representation of information is too ambiguous, like the biomedical or legal domains. Automatic semantic indexing seeks to build systems able to annotate an arbitrary piece of text with relevant controlled vocabulary terms. Aside from pure natural language processing (NLP) based methods, most of them following Named Entity Recognition (NER) or Entity Linking strategies, many approaches to semantic indexing model the assignment of controlled vocabulary terms as a multi-label categorization problem.
The Medical Subject Headings (MeSH) thesaurus [25] is a controlled and hierarchically organized vocabulary, developed and maintained by the National Library of Medicine (https://www.nlm.nih.gov/), which was created for categorizing and searching citations in MEDLINE and the PubMed database. MEDLINE (https://www.nlm.nih.gov/medline/medline_overview.html) is an NLM bibliographic database that contains more than 29 million references to journal articles in the life sciences published from 1966 to the present, in more than 5200 journals worldwide in about 40 languages. Each MEDLINE citation contains the title and abstract of the original article, author information, several metadata items (journal name, publishing dates, etc.) and a set of MeSH descriptors that describe the content of the citation and were assigned by NLM annotators to help in indexing and searching MEDLINE records. The task of identifying the MeSH terms that best represent a MEDLINE article is manually performed by human experts. The average number of descriptors per citation in the MEDLINE 2021 edition was 12.68.
The MeSH vocabulary is composed of MeSH subject headings (commonly known as descriptors), which describe the subject of each article, and a set of standard qualifiers (subheadings), which narrow down the MeSH heading topic. Additionally, Check Tags are a special subset of 32 MeSH descriptors that cover concepts mentioned in almost every article (human age groups, sex, research animals, etc.). MeSH descriptors are arranged in a hierarchy with 16 top-level categories that constitute a collection of overlapping topic sub-thesauri. A given descriptor may appear at several locations in the hierarchical tree and can have several broader terms and several narrower terms. The 2021 edition of the MeSH thesaurus is composed of 29,917 unique headings, hierarchically arranged in one or more of the 16 top-level categories, with the distribution shown in Table 1. Automatic indexing with MeSH poses great research challenges. (1) Beyond its large size, the distribution of descriptors follows a power-law, where a few labels (Check Tags and very general descriptors) appear in a large number of citations, whereas most descriptors are employed to annotate very few abstracts. According to statistics in [26], the mere 1% of MeSH headings that have more than 5000 occurrences contribute to more than 40% of all index assignments. (2) Simultaneously indexing within 16 top-level overlapping topic sub-hierarchies leads to complex label interdependency.
Over the years several proposals have attempted to tackle the problem of automatic MeSH indexing. The Medical Text Indexer (MTI) [27] is a tool in permanent development by the NLM for internal usage as a preliminary annotation tool for incoming MEDLINE citations. MTI is based on a combination of NLP-based concept finding performed with MetaMap [28], k-NN prediction using descriptors from PubMed-related citations, several hand-crafted rules and state-of-the-art machine learning techniques that have been incorporated over years of development.
Semantic indexing with MeSH descriptors has also been boosted in recent years by competitions such as the BioASQ challenge [29], which, since 2013, has been organizing an annual shared task dedicated to semantic indexing in MEDLINE. Several state-of-the-art methods for MeSH indexing were introduced by teams participating in this challenge, most of them modeling the task as a multi-label learning problem [30]. Some relevant recent developments in MeSH indexing are MeSHLabeler [31], DeepMeSH [32], MeSH Now [33], AttentionMeSH [34], MeSHProbeNet [35], FullMeSH [26], BERTMeSH [36] and k-NN methods using ElasticSearch and MTI such as [37].

Materials and Methods
Our proposal models semantic indexing over MeSH as a multi-label categorization problem with the following specific characteristics:

• MEDLINE provides us with an extensive collection of manually annotated documents to train our models.

• MeSH is a rich hierarchical thesaurus with a large set of descriptors and complex label co-occurrence.
The approach that we describe in this work tries to take advantage of these characteristics through the use of label autoencoders (label-AEs). Our method starts by training a label-AE using the historical information available in MEDLINE. Once trained, the components of this AE allow us to (1) convert the MeSH descriptors assigned to a MEDLINE citation to an embedded semantic space using the encoder part and (2) use the decoder part and a simple threshold scheme to return from that reduced-dimensional space back to the MeSH descriptor space.
The proposed multi-label classification follows a label embedding approach. Our method aims to take advantage of the reduced-dimensional semantic space learned by the label-AE so that a lazy learning scheme operating on it performs the actual classification. This results in a k-NN classifier enriched with the compact semantic information provided by the AE components. This section details the elements that make up our proposal.

Similarity Based Categorization (k-NN)
The k-Nearest Neighbor (k-NN) algorithm [38] is a lazy learning method which classifies new samples using previous classifications of similar samples, assuming the new ones will fall into the same or similar categories. For a given test instance, x, the k most similar instances (the k-nearest neighbors), denoted as N(x), are taken from the training set according to a certain similarity measure. Votes on the labels of the instances in N(x) are taken to determine the predicted labels for that test instance x.
Approaches based on k-NN have been widely used in large-scale multi-label categorization in many domains, including MEDLINE documents [37,39,40]. The preference for this lazy learning method is mainly due to its scalability, its minimal parameter tuning and, despite its simplicity, its ability to deliver acceptable results when a large number of training samples are available.
The basic k-NN method we employ in our proposal follows these steps:

1. Create an indexable representation from the textual contents of every document (MEDLINE citations in our case) in the training dataset. Two different approaches for creating these indexable representations, dense and sparse, were evaluated in our study, as shown in Section 3.2.

2. Index these representations in a proper data structure in order to efficiently query it to retrieve sets of similar documents.

3. For each new document to annotate, the created index is queried using the indexable representation of the new document. The list of similar documents retrieved in this step, together with their corresponding similarity measures, is used to determine the following results: (a) the expected number of labels to assign to the new document and (b) a ranked list of predicted labels (MeSH descriptors in our case).

The first aspect conforms to a regression problem, which aims to predict the number of labels to be included in the final list, depending on the number of labels assigned to the most similar documents identified during the query phase and on their respective similarity scores. In our method the number of labels to be assigned is predicted by simply averaging the lengths of the label lists of the neighbor samples.
The other task is a multi-label classification problem, which aims to predict an output label list based on the labels manually assigned to the most similar documents. Our method creates the ranked list of labels using a simple majority voting scheme. Since this is a multi-label categorization task, there are as many voting tasks as there are candidate labels extracted from the neighboring documents retrieved by the indexing data structure. For each candidate label, positive votes come from similar documents annotated with it and negative votes come from neighbors not including it. The topmost candidate labels are returned as the classification output.
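The two predictions described above (the expected label count obtained by averaging, and the ranked label list obtained by majority voting) can be sketched as follows; the function and the toy MeSH-like neighbor label sets are illustrative, not the exact production code.

```python
def knn_predict(neighbor_labels):
    """Predict labels from the label sets of the k retrieved neighbors:
    (a) estimate the number of labels as the (rounded) average length of
    the neighbor label lists, and (b) rank candidates by majority vote
    (+1 per neighbor carrying the label, -1 per neighbor without it)."""
    k = len(neighbor_labels)
    n_labels = round(sum(len(y) for y in neighbor_labels) / k)
    candidates = set().union(*neighbor_labels)
    votes = {c: sum(1 if c in y else -1 for y in neighbor_labels)
             for c in candidates}
    ranked = sorted(candidates, key=lambda c: -votes[c])
    return ranked[:n_labels]

neighbors = [{"Humans", "Female"},
             {"Humans", "Male"},
             {"Humans", "Female", "Aging"}]
print(knn_predict(neighbors))  # ['Humans', 'Female']: 2 labels expected, top votes win
```

Here the average label-list length (7/3) rounds to 2, so only the two most voted candidates are returned.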

Document Representation
As noted in the preceding section, our proposal indexes representations of the training documents in order to implement the similarity function that provides the set of neighbors and their similarity scores. In this work we have evaluated two different approaches to document representation, each of which determines its own document preprocessing, indexing and query scheme.

• Sparse representations, created by means of several NLP-based, linguistically motivated index term extraction methods, employed as discrete index terms in an Apache Lucene index (https://lucene.apache.org/).

• Dense representations, created by using contextual sentence embeddings based on Deep Learning language models, stored in a numeric vector index.

Sparse Representations
This approach is essentially a large multi-label k-NN classifier backed by an Apache Lucene index. Lucene is an open-source indexing and searching engine that implements a vector space model for Information Retrieval, providing several similarity ranking functions, such as BM25 [41]. The textual content of the training documents is preprocessed in order to extract a set of discrete index terms which Lucene stores in an inverted index. When labeling, the text from new documents is preprocessed and the extracted index terms are treated as query terms, linked together using a global OR operator to conform the final query sent to the indexing engine, which retrieves the most similar documents and their corresponding similarity scores.
In our case, we have employed the BM25 similarity function. The scores provided by the indexing engine are similarity measures resulting from the engine's internal computations and the weighting scheme being employed, and they do not have a uniform and predictable upper bound. In order for these similarity scores to behave like a real distance metric, we have applied a normalization procedure that transforms them into a pseudo-distance in [0, 1].
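The text does not spell out the exact normalization formula, but one simple scheme consistent with the description is to divide each BM25 score by the best score in the result list and subtract from 1, so that the most similar document gets pseudo-distance 0:

```python
def scores_to_pseudo_distance(scores):
    """Map unbounded engine similarity scores (higher = more similar)
    to pseudo-distances in [0, 1] (lower = more similar). Dividing by
    the maximum score in the result list is one possible normalization;
    the exact scheme used in the paper is an assumption here."""
    top = max(scores)
    return [1.0 - s / top for s in scores]

print(scores_to_pseudo_distance([12.5, 10.0, 2.5]))  # ≈ [0.0, 0.2, 0.8]
```

Any monotone mapping of this kind preserves the neighbor ranking while giving the weighting schemes of Section 3.4 a bounded distance to work with.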
Regarding sparse document representation, we have evaluated several linguistically motivated index term extraction approaches, as introduced in [40] for a similar problem in Spanish. We employed the following methods:

Stemming based representation (STEMS).
This is the simplest approach, which employs stop-word removal, using a standard stop-word list, and the default stemmer from the Snowball project (http://snowball.tartarus.org).

Morphosyntactic based representation (LEMMAS).
In order to deal with morphosyntactic variation we have employed a lemmatizer to identify lexical roots, and we also replaced stop-word removal with a content-word selection procedure based on part-of-speech (PoS) tags.
We have delegated the linguistic processing tasks to the tools provided by the spaCy Natural Language Processing (NLP) toolkit (https://spacy.io/). In our case we have employed the PoS tagging and lemmatization information provided by spaCy, using the biomedical English models from the ScispaCy project (https://allenai.github.io/scispacy/).
Only lemmas from tokens tagged as a noun, verb, adjective, adverb or as unknown words are taken into account to constitute the final document representation, since these PoSs are considered to carry most of the sentence meaning.

Noun phrases based representation (NPS).
In order to evaluate the contribution of more powerful NLP techniques, we have employed a surface parsing approach to identify syntactically motivated noun phrases from which meaningful multi-word index terms can be extracted.
The Noun Phrase (NP) chunks identified by spaCy are selected and the lemmas of their constituent tokens are joined together to create a multi-word index term.

Dependencies based representation (DEPS).
We have also employed as index terms the dependency-head-modifier triples extracted by the dependency parser provided by spaCy.

Named entities representation (NERS).
Another type of multi-word representation taken into account is named entities. We have employed the NER module in spaCy and the ScispaCy models to extract general and biomedical named entities from the document content.

Keywords representation (KEYWORDS).
The last kind of multi-word representation we have included is keywords, extracted with statistical methods from the textual content of the articles. We have employed the implementation of the TextRank algorithm [42] provided by the textacy library (https://textacy.readthedocs.io).
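As an illustration of how these variants produce different index terms from the same text, the following sketch uses pre-computed (lemma, PoS) annotations and a noun chunk in place of real spaCy/ScispaCy output, which would normally supply them:

```python
# Content-word PoS tags kept for the LEMMAS representation.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "X"}

# (surface form, lemma, PoS) for: "Aspirin reduces the inflammation markers";
# in practice these come from spaCy's token.lemma_ and token.pos_ attributes.
tokens = [("Aspirin", "aspirin", "NOUN"),
          ("reduces", "reduce", "VERB"),
          ("the", "the", "DET"),
          ("inflammation", "inflammation", "NOUN"),
          ("markers", "marker", "NOUN")]
# Noun-phrase chunk (as lemmas), as spaCy's doc.noun_chunks would yield it.
noun_chunks = [["inflammation", "marker"]]

# LEMMAS: lemmas of content words only (the determiner is filtered out).
lemmas = [lemma for _, lemma, pos in tokens if pos in CONTENT_POS]
# NPS: noun-phrase lemmas joined into a single multi-word index term.
nps = ["_".join(chunk) for chunk in noun_chunks]

print(lemmas)  # ['aspirin', 'reduce', 'inflammation', 'marker']
print(nps)     # ['inflammation_marker']
```

The remaining variants (DEPS, NERS, KEYWORDS) follow the same pattern, differing only in which spaCy or textacy annotation supplies the terms.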

Dense Representations
The recent rise of powerful contextual language models, such as BERT and similar approaches, has boosted the performance of multiple language processing tasks, and Transformer-based solutions dominate the state of the art in many NLP areas. A natural evolution of these contextual word embeddings is to move them towards embeddings at the sentence level, with approaches such as those in the Sentence Transformers [43] project (https://www.sbert.net/), which provides pre-trained models to convert natural language sentences into fixed-size dense vectors with enriched semantics.
We have taken advantage of dense semantic representations of whole sentences as a basis for converting a search for similar documents into a search for similar vectors in the dense vector space where documents from the training dataset are represented.
We have employed the sentence-transformers/allenai-specter model (https://huggingface.co/sentence-transformers/allenai-specter) to represent a given MEDLINE abstract as a dense vector. This is a conversion to Sentence Transformers of the AllenAI SPECTER model [44], originally trained to estimate the similarity of two publications, which exploits the citation graph to generate document-level embeddings of scientific documents. This model returns a 768-dimension vector from inputs in the form paper[title] + '[SEP]' + paper[abstract].
Once we have the dense representations of the training documents using this procedure, we use the FAISS [45] library (https://github.com/facebookresearch/faiss) to create a searchable index of these dense vectors. This index allows us to efficiently compute distances between dense vectors and to determine, for the dense vector associated with a given test document (our query vector), the list of the k closest training dense vectors using the Euclidean distance or other similarity metrics.
With this mechanism of similarity between dense vectors we can apply the k-NN classification procedure described previously. In this case we can directly use the real distances provided by the FAISS library between the query vector generated from the text to be annotated and the k most similar dense vectors.
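The dense retrieval step can be pictured with a brute-force Euclidean search standing in for the FAISS index (FAISS performs the same computation with optimized index structures; the 768-dimension size matches the SPECTER embeddings, but the vectors here are random stand-ins):

```python
import numpy as np

def knn_search(index_vectors, query, k):
    """Return the indices and Euclidean distances of the k training
    vectors closest to the query (brute-force stand-in for FAISS)."""
    dists = np.linalg.norm(index_vectors - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(1)
docs = rng.normal(size=(1000, 768))             # stand-in document embeddings
query = docs[42] + 0.01 * rng.normal(size=768)  # near-duplicate of document 42
ids, dists = knn_search(docs, query, k=5)
print(ids[0])  # 42: the near-duplicate is retrieved first
```

Unlike the BM25 case, the returned Euclidean distances can be fed to the neighbor weighting schemes directly, without any normalization step.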

Label Autoencoders
Our proposal is a special case of eXtreme Multi-Label categorization (XML) using a label embedding approach. In our case a lazy learning method works on a low-dimensional projection of the label space built with a label autoencoder (label-AE).
Our method is similar to MANIC [16]. Both of them learn a conventional AE. In MANIC the encoder is applied to the entire label sets of the training examples, and thresholds are used to convert their embeddings to a smaller binary label space in which Binary Relevance classifiers are trained. In our case the encoder only acts on the subset of training examples that are part of the neighbor set, N(x). The embedded vectors of the neighbors are averaged and the decoder transforms that average vector back to the original label space. The AEs used by C2AE [17] and Rank-AE [18] are very different from ours. They jointly train two input subnets that share the inner embedding layer: one generates embeddings from the input features and the other one generates the same embedding space from the labels. These two subnets are trained together with an output subnet that decodes the reduced embedding space to the actual label space. In the annotation phase, only the subnet that creates the embedding from input features and the decoding subnet are employed.
The first step of our proposal involves training a label-AE using the sets of labels taken from the training samples. In the experiments reported in this paper, those labels are the lists of MeSH descriptors assigned to the MEDLINE citations in our training dataset. For MeSH, this results in a very large label-AE, with more than 29 K units in the input layer and another 29 K+ output neurons. Also, the input and output vectors are extremely sparse, with an average of 12 values set to 1. On the other hand, the set of training samples is very large and can reach several million if the entire MEDLINE collection is used.
A tentative preliminary study was performed on a portion of the MEDLINE and MeSH datasets. In those preliminary runs we assessed the reconstruction capability of the trained AEs. As a result, the topology and main parameters of the label-AE scheme used in the experiments reported in this paper were defined as follows:

• An encoder with 2 hidden layers of decreasing size.

The second step of our method is to extract the internal representations for the training documents and store them in the corresponding index. As shown in Section 3.2, an Apache Lucene textual index is employed for the NLP-based sparse representations and a FAISS index stores the dense contextual vector representations.
Once we have trained our label-AE and a properly indexed version of the training dataset is available, to annotate a new MEDLINE citation x we apply the following procedure, illustrated in Figure 2:

1. The index is queried and the set N(x) = {n_1, n_2, ..., n_k} with the k documents closest to x is retrieved, along with their respective distances to x (d_i for each n_i ∈ N(x)).

• Depending on the representation being used, the title and abstract of x are converted into a sparse set of Lucene indexing terms or into a dense vector.

• Once the respective index (Lucene or FAISS) is queried, an ordered list of the most similar citations is available, together with an estimate of their distances to the query document x: BM25 scores converted to a pseudo-distance in [0, 1] with the Lucene index, or the Euclidean distance between dense representations with the FAISS index.

2. The encoder is applied to translate the set of labels assigned to each neighbor n_i ∈ N(x) into the reduced semantic space, computing z_i = Enc(y_{n_i}) for all n_i ∈ N(x), with y_{n_i} the set of labels of neighbor n_i.

3. We create the weighted average vector z′ = Σ_{i=1}^{k} (w_i / w_TOTAL) · z_i in the embedding space, where w_TOTAL = Σ_{j=1}^{k} w_j. Several distance weighting schemes have been discussed in the k-NN literature [38]. In our case we have employed two: (1) weighting neighbors by 1 minus their distance (w_i = 1 − d_i) and (2) weighting neighbors by the inverse of their squared distance.

4.
The decoder is used to convert this average vector ⃗ z ′ from the embedding space to the original label space as y ′ = Dec(⃗ z ′ ) Various cutting and thresholding schemes can be used to binarize this vector and return the list of predicted labels.
• Estimate the number of labels to return, r, from the sizes of label sets of documents in N(x), as described in citecual, and return the r predicted labels with the highest score.• Apply a threshold on the activation of decoder output neurons to decide which labels have an excitation level high enough to be part of the final prediction.
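Steps 2-4 of the procedure above can be sketched as follows. This is a minimal NumPy sketch assuming the trained encoder and decoder are available as plain functions; `encode` and `decode` are placeholders for the label-AE components.

```python
import numpy as np

def knn_label_ae_predict(neighbor_labels, distances, encode, decode,
                         weighting="square", threshold=0.75):
    """Encode the neighbors' label sets, average them in the embedding
    space with distance weights, then decode and binarize.

    neighbor_labels: (k, n_labels) binary matrix with each neighbor's labels.
    distances: length-k distances d_i of the neighbors to the query.
    encode / decode: the trained label-AE components (placeholders here).
    """
    d = np.asarray(distances, dtype=float)
    # Distance weighting: w_i = 1 - d_i, or w_i = 1 / d_i^2.
    w = (1.0 - d) if weighting == "difference" else 1.0 / np.maximum(d, 1e-9) ** 2
    # Step 2: map each neighbor's label set into the reduced semantic space.
    Z = np.stack([encode(y) for y in neighbor_labels])
    # Step 3: weighted average vector z' in the embedding space.
    z_avg = (w[:, None] * Z).sum(axis=0) / w.sum()
    # Step 4: reconstruct the label space and keep labels above the threshold.
    y_scores = decode(z_avg)
    return np.flatnonzero(y_scores >= threshold)
```

With an identity encoder and decoder this reduces to a distance-weighted vote over the neighbors' labels, which makes the role of the AE explicit: it replaces that vote with an average taken in the learned latent space.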

Results and Discussion
This section presents an exhaustive set of experiments on a large portion of the MEDLINE collection. In these experiments we validate the effectiveness of our proposal of a multi-label k-NN text classifier assisted by a label-AE in a complex semantic indexing task. Different parameters and options were evaluated on the test dataset in order to determine the best settings for our system, with the aim of answering the following research questions:
• What is the effect of the choice of training document representation on classification performance? Are there substantial differences between sparse term-based similarity and dense vector-based similarity?
• What are the best parameterizations for label-AEs (size of the embedding representation layer, sizes of the encoder and decoder layers, etc.)?
• What are the effects of retrieving different numbers of neighbor documents on classification performance, and how does the weighting scheme employed when creating the average embedded vector affect it?
In this section we provide a description of our evaluation data and the performance metrics being employed and discuss the experimental results.The source code used to carry out the reported experiments is available at https://github.com/fribadas/labelAE-MeSH.

Dataset Details and Evaluation Metrics
Our experiments were conducted on a large textual multi-label dataset created as a subset of the 2021 edition of the MEDLINE/PubMed baseline files (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline), which comprises over 6 million citations from 2010 onwards. For convenience, the actual dataset was retrieved from the BioASQ challenge [29] repository (http://www.bioasq.org/) rather than from the original sources. The BioASQ organizers retrieved the citations from the MEDLINE sources, extracting the relevant elements (PMID, ARTICLETITLE, ABSTRACTTEXT, MESHHEADINGSLIST, JOURNALNAME and YEAR), and distributed them conveniently formatted as JSON documents. Table 2 summarizes the most relevant characteristics of the resulting dataset. In our study we have employed two complementary sets of evaluation metrics that are commonly used in evaluating multi-label and XML problems.

• The evaluation of binary classifiers typically employs Precision (P), which measures how many predicted labels are correct, Recall (R), which counts how many correct labels the evaluated model is able to predict, and F-score (F), which combines both metrics as the harmonic mean of P and R. In multi-class and multi-label problems these metrics are generalized by calculating their Macro-averaged and Micro-averaged variants. A Macro-averaged measure computes a class-wise average of the corresponding measure, while a Micro-averaged one computes the corresponding measure on all examples at once and, in the general case, has the advantage of adequately handling class imbalance. In our evaluation we followed the BioASQ challenge proposal [29], which employs the Micro-averaged versions of Precision (MiP), Recall (MiR) and F-score (MiF) as main performance metrics, using MiP as a ranking criterion.
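As a minimal illustration of the difference between the two averaging variants, the following NumPy sketch computes micro- and macro-averaged P/R/F over binary indicator matrices (rows are documents, columns are labels):

```python
import numpy as np

def prf(tp, n_pred, n_true):
    """Precision, recall and F-score from raw counts."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_true if n_true else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def micro_prf(y_true, y_pred):
    """Micro-averaging: pool all label decisions before computing P/R/F."""
    tp = int(np.logical_and(y_true, y_pred).sum())
    return prf(tp, int(y_pred.sum()), int(y_true.sum()))

def macro_prf(y_true, y_pred):
    """Macro-averaging: compute P/R/F per label (column), then average."""
    scores = [prf(int(np.logical_and(y_true[:, j], y_pred[:, j]).sum()),
                  int(y_pred[:, j].sum()), int(y_true[:, j].sum()))
              for j in range(y_true.shape[1])]
    return tuple(float(np.mean(s)) for s in zip(*scores))
```

Because micro-averaging pools all decisions, frequent labels dominate the score; macro-averaging gives every label equal weight regardless of its frequency.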

• In XML, where the number of candidate labels is very large, metrics that focus on evaluating the effectiveness in predicting correct labels and in generating an adequate ranking of the predicted label set are frequently used. Precision at top k (P@k) computes the fraction of correct predictions among the top k predicted labels. Normalized Discounted Cumulative Gain at top k (nDCG@k) [46] is a measure of the ranking quality of the top k predicted labels, which evaluates the usefulness of a predicted label according to its position in the result list. In our experimental results, we report the average P@k and nDCG@k on the testing set with k = 5 and k = 10, in order to provide a measure of prediction effectiveness.
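A minimal sketch of both ranking metrics, following their standard definitions (logarithmic position discount for nDCG):

```python
import numpy as np

def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k predicted labels that are correct."""
    return sum(1 for lab in ranked_labels[:k] if lab in relevant) / k

def ndcg_at_k(ranked_labels, relevant, k):
    """Normalized discounted cumulative gain over the top-k predicted labels:
    each correct label at rank i contributes 1 / log2(i + 2)."""
    dcg = sum(1.0 / np.log2(i + 2)
              for i, lab in enumerate(ranked_labels[:k]) if lab in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

The normalization term is the DCG of an ideal ranking that places all correct labels first, so a prediction that ranks every correct label at the top scores 1.0.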

Experimental Results
In the first place, we evaluated the performance of the different approaches for document representation described in Section 3.2. Secondly, another set of experiments evaluated the performance of the k-NN method assisted with label-AEs described in Section 3.3, using different AE configurations.

Dense vs. Sparse Representations
In order to evaluate the influence of the document representation on categorization performance, we performed a battery of experiments comparing the use of dense representations with contextual vectors (runs DENSE) against different alternatives for extracting sparse representations. In particular, we evaluated the performance of single terms extracted by stemming (runs STEMS) and lemmatization (runs LEMMAS), the combination of the different methods for extracting compound terms (runs MULTI, where we combine NERS, NPS and KEYWORDS) and the joint use of the terms extracted using all the methods described in Section 3.2 (runs ALL). The effect of the number of neighbors considered in each case has also been evaluated, taking k values in {5, 10, 20, 30, 50, 100}.
As can be seen from the results shown in Table 4 and summarized in Figure 3, in these experiments the dense representation performs worse than most sparse representations in all performance metrics considered and for all values of k. The contribution of multi-word terms to the sparse representations is very limited. Although the best results are obtained by combining all term extraction methods (runs ALL), in all metrics the results obtained using single-word terms of type STEMS and LEMMAS dominate. We hypothesize that when applying this kind of k-NN method to a relatively large dataset (>6 M documents in our case) the contribution of more sophisticated representation methods is diluted. In smaller datasets the use of very specific and precise multi-word terms can help to greatly improve the representation of a document when searching for similar ones. In this context it is surprising that an a priori simpler approach, such as the extraction of sparse representations combined with the Apache Lucene similarity, performs better than the transformer-based contextual representations that currently dominate NLP research. An in-depth review of this phenomenon is beyond the scope of this paper; it may be due to the lack of a prior fine-tuning phase on the employed MEDLINE dataset, to the poor suitability of the Euclidean distance computed by the FAISS library as a similarity metric, or to an inherent limitation of large pre-trained language models based on transformers, as discussed in [47].

With respect to the number of neighbors to consider in the k-NN classification, the best results are usually obtained with k = 20 and k = 30, which is in line with previous publications [39] in MeSH semantic indexing.

k-NN Prediction with Label Autoencoders
Regarding the experiments evaluating the performance of our proposal of a label-AE as a mechanism for improving k-NN classification, our objective has been to evaluate three aspects: (1) the performance of different label-AE topologies, (2) the effect of the distance weighting scheme used to create the average vectors feeding the decoder, and (3) the most appropriate threshold values for generating the list of predicted labels from the reconstruction of the label space provided by the decoder.
Table 3 shows the characteristics of the label-AEs used in this series of experiments. We have employed a fixed neural network architecture, using two fully connected layers in both encoder and decoder and one fully connected layer as embedding layer. We have trained and evaluated an AE, named SMALL label-AE, that uses a 64-dimensional embedding vector and an initial encoder and final decoder layer with 1024 neurons. We have also employed two AE architectures with a 128-dimensional embedding space and two encoder-decoder sizes: one with layers of 2048 and 256 neurons, called MEDIUM label-AE, and another with encoder-decoder layers of 4096 and 512 neurons, denoted as LARGE label-AE. We aimed to evaluate the effect of the size and the number of parameters of the learned label-AEs on their quality in the label encoding and reconstruction tasks.
The detailed results obtained with the SMALL label-AE are shown in Table 5, those for the MEDIUM label-AE in Table 6 and those for the LARGE label-AE in Table 7. Regarding the thresholds to be applied on the decoder output to create the list of predicted labels, two values have been evaluated, selecting those labels whose output activation exceeds the value 0.5 in one case and the value 0.75 in the other. In this way we intended to evaluate the effect of considering more or less demanding selection criteria in conforming the predicted label list. Finally, the effect of the two distance weighting schemes introduced in Section 3.3 to combine the embedded vectors has been evaluated in the different scenarios. In both cases, weighting by 1 minus the distance (DIFFERENCE) and weighting by the inverse of the distance squared (SQUARE), the DENSE representation and the SPARSE representation using all of the term extraction methods have been employed, using a number of neighbors k ∈ {5, 10, 20, 30, 50, 100}. Figure 4 summarizes the MiF, MiP and MiR results for the best configurations of label-AE, threshold, distance weighting scheme and k. As can be seen in Table 6, the best results in all metrics are obtained with the MEDIUM label-AE, the 64-dimensional embedding space of the SMALL label-AE being apparently incapable of adequately capturing relationships between MeSH descriptors and reconstructing them later. The comparison between the performances of the MEDIUM and LARGE label-AEs apparently confirms that 2048 dimensions in the input layer of the encoder are sufficient to provide an embedded representation capable, once reconstructed by the decoder, of offering a performance in terms of MiF similar to that of a basic k-NN method, improving its precision values at the cost of a slightly reduced recall. Regarding the P@k values and the measurement of ranking quality using nDCG@k, the label-AE results are also able to equal those of the basic k-NN method. However, in this case it is
noteworthy that the label-AE method does not exceed the basic k-NN approach despite its good performance with respect to the MiP metric. With respect to the thresholds, both values have similar performance without great differences, the stricter output criterion provided by the value 0.75 being slightly preferable. Performance with sparse representations is still better than with dense contextual vectors, and there is a slight tendency to obtain better results using fewer neighbors than with the basic k-NN method. Finally, the results using the inverse of the distance squared as the distance weighting scheme are superior in all scenarios, because it boosts the contribution of the most similar examples when constructing the average embedded vector. When comparing the best results of the MEDIUM label-AE from Table 6 with the best basic k-NN results from Table 4, we can see that they show very similar MiF values, which in the case of the MEDIUM label-AE model are obtained with relatively high values of MiP at the expense of lower values in MiR, whereas the basic k-NN method offers more uniform values in both metrics. After a detailed analysis of the predictions made by both models, we found that the number of labels predicted by the MEDIUM label-AE model is substantially smaller. In our study, the average number of labels predicted per document by the basic k-NN method is 13.34 for the sparse representation and 13.13 for the dense representation. For the MEDIUM label-AE model with a threshold of 0.50, the average length is 10.44 labels with the sparse representation and 10.61 with the dense representation, whereas with a threshold of 0.75 we have, respectively, 8.65 and 8.69 labels. This behavior gives the basic k-NN method an initial advantage in providing better values of MiR.
We hypothesize that the proposed label-AE assisted k-NN method is capable of (1) providing faithful embedded vectors via its encoder and (2) acceptably reconstructing the output labels from the averaged embedded vectors using its decoder, hence offering high values of MiP, but it leaves behind the less frequent labels, which have too few training examples to assert their presence in the encoder and decoder weights. On the other hand, the results seem to indicate that the basic k-NN method is able to satisfactorily circumvent the treatment of infrequent labels, at least in large datasets such as the one we are dealing with. This is probably due to the very nature of the k-NN method: infrequent labels appear in documents with very specific contents, which leads to a very particular set of neighbors that steer the k-NN classifier toward these rare labels.
In order to combine the best aspects of both approaches, namely the high MiP of our proposed k-NN with label-AE and the better recall capabilities of the classical k-NN method, we have carried out a battery of additional tests. For this purpose, we have taken as a starting point the labels predicted by the label-AE method and combined them with the predictions provided by the basic k-NN method. To build the final set of output labels, we start from the labels predicted by the label-AE based method and add labels taken from the basic k-NN prediction until the number of output labels predicted by the basic k-NN method is reached.
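The combination scheme just described can be sketched as follows; `ae_labels` and `knn_labels` are assumed to be ranked lists of predicted labels (highest-scoring first), and the target length is the size of the basic k-NN prediction.

```python
def combine_predictions(ae_labels, knn_labels):
    """Start from the label-AE prediction and pad it with basic k-NN labels
    (in their ranked order) until the k-NN prediction length is reached."""
    combined = list(ae_labels)
    for label in knn_labels:
        if len(combined) >= len(knn_labels):
            break
        if label not in combined:  # skip labels both methods predicted
            combined.append(label)
    return combined
```

This keeps the high-precision label-AE predictions at the front of the list while recovering part of the recall that the basic k-NN method contributes through its longer predictions.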
Table 8 shows the results obtained by combining, according to the described scheme, the predictions of the basic k-NN model with the predictions provided by the MEDIUM label-AE. The MiF, MiP and MiR metrics are all substantially improved with respect to the values obtained by these methods separately. In contrast, the values of P@k and nDCG@k are penalized.
Although these results improve those provided by the basic k-NN method and those of the k-NN method assisted with our label-AE, they are far from those offered by the best state-of-the-art semantic indexing systems for MeSH.If we take as a reference the latest editions of the BioASQ challenge [29,48], which proposes an evaluation scenario very similar to the one presented in this work, we see that the best systems are capable of reaching MiF scores somewhat higher than 70%, while the Default MTI (Medical Text Indexer) reached values between 53% in the first edition of the challenge and values in the range 62-64% in the last two editions.The baseline used in the first editions of this challenge, which performed a simple string match of the label text, reached values around 26%.

Conclusions and Future Work
In this paper we propose a novel multi-label text categorization method able to deal with a very large and structured label space, which makes it suitable for semantic indexing tasks using controlled vocabularies, as is the case of the Medical Subject Headings (MeSH) thesaurus. The proposed method trains a large label-AE capable of simultaneously learning an encoder function that transforms the original label space into a reduced-dimensional space, along with a decoder function that transforms vectors from that space back into the original label space. The proposal adapts classical k-NN categorization to work in the semantic latent space learned by this label-AE.
We have proposed and evaluated several document representation approaches, using both sparse textual features and dense contextual representations.We have evaluated their contribution in finding neighboring documents employed in the k-NN classification.
An exhaustive study on a large portion of the MEDLINE collection has been carried out to evaluate different strategies in the definition and training of label-AEs for the MeSH thesaurus and to verify the suitability of the proposed classification method. The results obtained confirm the ability of the learned label-AEs to capture the latent semantics of MeSH thesaurus descriptors and to leverage that representation space in the k-NN classification.
As future work, a direct application of the method described in this paper is to test the usefulness of the label-AEs learned for MeSH on related thesauri in other languages. An example of such a thesaurus is the DeCS (Descriptores en Ciencias de la Salud, Health Sciences Descriptors) controlled vocabulary (http://decs.bvsalud.org/), which is a trilingual (Portuguese, Spanish and English) version of MeSH, retaining its structure and adding a collection of specific descriptors. We hypothesize that it is possible to leverage the semantic information about MeSH condensed in the learned encoders and decoders and take advantage of it in multilingual biomedical environments.

Figure 3 .
Figure 3. Summary of performance metrics with sparse vs. dense representations for values of k with best MiF values.

Figure 4 .
Figure 4. Summary of MiF, MiP, MiR metrics for values of k and distance weighting with best MiF values in each label-AE configuration (SMALL, MEDIUM, LARGE).

Table 1 .
Descriptor distribution in MeSH top-level categories.

Table 2 .
Evaluation dataset statistics and MeSH descriptor distribution.

Table 3 .
Configuration of label autoencoders in our experiments.

Table 4 .
Performance metrics with sparse vs. dense representations.

Table 5 .
Performance with SMALL label-AE.

Table 6 .
Performance with MEDIUM label-AE.

Table 7 .
Performance with LARGE label-AE.

Table 8 .
Performance mixing results from basic k-NN with MEDIUM label-AE.