Computers
  • Article
  • Open Access

20 January 2025

Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis

1 Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, 46022 València, Spain
2 E.T.S. Ingeniería, Universitat de València, 46100 València, Spain
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling

Abstract

We present a new word embedding technique in a (non-linear) metric space based on the shared membership of terms in a corpus of textual documents, where the metric is naturally defined by the Boolean algebra of all subsets of the corpus and a measure μ defined on it. Once the metric space is constructed, a new term (a noun, an adjective, a classification term) can be introduced into the model and analyzed by means of semantic projections, which in turn are defined as indexes using the measure μ and the word embedding tools. We formally define all necessary elements and prove the main results about the model, including a compatibility theorem for estimating the representability of semantically meaningful external terms in the model (which are written as real Lipschitz functions in the metric space), proving the relation between the semantic index and the metric of the space (Theorem 1). Our main result proves the universality of our word-set embedding, proving mathematically that every word embedding based on linear space can be written as a word-set embedding (Theorem 2). Since we adopt an empirical point of view for the semantic issues, we also provide the keys for the interpretation of the results using probabilistic arguments (to facilitate the subsequent integration of the model into Bayesian frameworks for the construction of inductive tools), as well as in fuzzy set-theoretic terms. We also show some illustrative examples, including a complete computational case using big-data-based computations. Thus, the main advantages of the proposed model are that the results on distances between terms are interpretable in semantic terms once the semantic index used is fixed and, although the calculations could be costly, it is possible to calculate the value of the distance between two terms without the need to calculate the whole distance matrix. “Wovon man nicht sprechen kann, darüber muss man schweigen”. Tractatus Logico-Philosophicus. L. Wittgenstein.

1. Introduction

Nowadays, the fast improvement of natural language processing (NLP) can be seen in how the applications of this discipline have reached impressive heights in the handling of natural language in many applied fields, ranging from translation from one language to another to the generalization of the use of generative artificial intelligence [1,2,3,4]. A fundamental tool for these applications is the so-called word embedding. Broadly speaking, word embeddings allow representations of semantic structures as subsets of linear spaces endowed with norms. The distance provided by such a norm measures the relations among terms in such a way that two items being close means that they are close with respect to their meaning. The main criticism of this technique when it was first introduced was that, once a word embedding is created, its representation remains static and cannot be adapted to different contexts [1,3]. This is a problem for applications, as there are words whose meaning clearly depends on the context: take the word “right”, for example. However, since the late 2010s, these models have evolved towards more flexible architectures, incorporating structural elements and methods that allow the integration of contextual information [5,6,7].
On the other hand, let us recall that the technical tools for natural language processing have their roots in some broad ideas about the semantic relationships between meaningful terms, but they are essentially computed from “experimental data”, that is, searching information in concrete semantic contexts provided by different languages. Thus, one of the basic ideas underlying language analysis methods could be called the “empirical approach”, which states that the meaning of a term is reflected in the words that appear in its contextual environment. An automatic way of applying this principle and transforming it into a mathematical rule is to somehow measure which sets of semantically non-trivial documents in a given corpus share two given terms. This is the essence of the method based on the so-called semantic projections used in this paper [8,9].
The aim of this article is to present a new mathematical formalism to support all NLP methods based on the general idea of contextual semantics. Thus, we propose an alternative to the standard word embeddings, which is based on a different mathematical structure (instead of normed spaces) developed using algebras of subsets endowed with the usual set-theoretic operations (Figure 1). In a given environment and for a given specific analytical purpose, the meaning of a term may be given by the extent to which it shares semantic context with other terms, and this can be measured by means of the relations among the sets representing all the semantic elements involved [10,11]. The value of the expression “the term v shares the semantic environment with the term w” can be measured by a real number in the interval [ 0 , 1 ] given by a direct calculation, which is called the semantic projection of v onto w . This representation by an index belonging to [ 0 , 1 ] is also practical when we think about the interpretation of what a semantic projection is, since it can be understood as a probability—thus facilitating the use of Bayesian tools for further uses of the models—as well as a fuzzy measure of how a given term belongs to the meaning of another term—thus facilitating the use of arguments related to fuzzy metrics.
Figure 1. Word embedding versus set-word embedding.
Thus, we present here a mathematical paper, in which we define our main concepts and obtain the main results on the formal structure that supports empirical approaches to NLP based on the analysis of a corpus of semantically non-trivial texts, and how meaningful terms share their membership in these texts. The main outcome of the paper is a theoretical result (Theorem 2) which aims to demonstrate that our proposed method is universal, in the sense that any semantic structure which is representable through conventional word embeddings can also be represented as a set-word embedding, and vice versa. However, our approach to constructing representations of semantic concepts, while limited to the idea that semantic relationships are expressed through distances, is much more intuitive. This is because it is grounded in the collection of semantic data associated with a specific context. Thus, the key advantage of our model lies in its adaptability to contextual information, allowing it to be applied across various semantic environments.
This paper is structured as follows. In the subsequent subsections of this Introduction, we outline the specific context of our research, covering standard word embeddings, families of subsets in classical NLP models, and relevant mathematical tools. We will also give an overview of the word embedding methods that have appeared recently and are closely related to the purpose we bring here. After the introductory Section 1 and Section 2, Section 3 introduces the fundamental mathematical components of our semantic representations, demonstrating key structural results and illustrating how these formal concepts relate to concrete semantic environments. To ensure a comprehensive understanding of our modeling tools, Section 4 presents three particular examples and applications. Following this motivational overview, Section 5 details the main result of the paper (Theorem 2), explaining its significance and providing further examples. Section 6 is devoted to the discussion of the result, built around a concrete example that allows the comparison of our tool with, firstly, the Word2Vec word embedding, and in a second part, with a more recent selection of word embeddings that we have considered suitable for their relation to our ideas. Finally, Section 7 offers our conclusions.

3. Set-Word Embeddings: The Core

Word embeddings are understood to take values in vector spaces. However, as we explain in the Introduction, our contextual approach to automatic meaning analysis suggests that perhaps a different embedding procedure based on set algebra would better fit the idea of what a word embedding should be. Let us show how to achieve this.
We primarily consider a set of words W, and a σ algebra Σ(S) of subsets of another set S.
Definition 1.
A set-word embedding is a one-to-one function ι : W → Σ(S).
In this setting, it is natural to consider the context provided by the Jaccard (or Tanimoto) index, a similarity measure with applications in many areas of machine learning, which is given by the expression
J(A, B) = |A ∩ B| / |A ∪ B|,
where A and B are finite subsets of a set W . It is also relevant that the complement of this index,
D(A, B) = 1 − J(A, B) = ( |A ∪ B| − |A ∩ B| ) / |A ∪ B|,
is a metric [19] called the Jaccard distance. The general version of this measure is the so-called Steinhaus distance, which is given by
S_μ(A, B) = ( μ(A ∪ B) − μ(A ∩ B) ) / μ(A ∪ B),
where μ is a positive finite measure on a σ algebra Σ , and A , B are measurable sets ([20], S.1.5). The distance proposed in [21] is a generalization of these metrics.
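To fix ideas, the three measures just defined can be sketched in a few lines of Python, with the measure μ given point-by-point as a dictionary of weights; the toy document sets and weight values below are illustrative choices of ours, not taken from the paper.

```python
# Sketch of the Jaccard index, the Jaccard distance, and the Steinhaus
# distance for finite sets; mu is given point-by-point as a dict of weights.

def jaccard_index(A, B):
    """J(A, B) = |A ∩ B| / |A ∪ B| for finite sets A and B."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def jaccard_distance(A, B):
    """D(A, B) = 1 - J(A, B), the Jaccard metric."""
    return 1.0 - jaccard_index(A, B)

def steinhaus_distance(A, B, mu):
    """S_mu(A, B) = (mu(A ∪ B) - mu(A ∩ B)) / mu(A ∪ B)."""
    A, B = set(A), set(B)
    m = lambda C: sum(mu[x] for x in C)
    union = m(A | B)
    return (union - m(A & B)) / union if union else 0.0

docs_v = {"d1", "d2", "d3"}   # documents in which a term v occurs
docs_w = {"d2", "d3", "d4"}   # documents in which a term w occurs
weights = {"d1": 1.0, "d2": 2.0, "d3": 1.0, "d4": 1.0}

print(jaccard_index(docs_v, docs_w))                 # 0.5
print(jaccard_distance(docs_v, docs_w))              # 0.5
print(steinhaus_distance(docs_v, docs_w, weights))   # 0.4
```

Note that when μ is the counting measure, the Steinhaus distance reduces to the Jaccard distance.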
Let us define a semantic index characteristic of the embedding ι, which is based on a kind of asymmetric version of the Jaccard index. In our context, and because of the role it plays for the purposes of this article, we call it a semantic index of one meaningful element onto another in a given context. In the rest of this section, we assume that μ is a finite positive measure on a set S, acting on a certain σ algebra Σ(S) of subsets of S.
Definition 2.
Let A , B Σ ( S ) . The semantic index of B on A is defined as
P_A(B) := μ(A ∩ B) / μ(A).
Therefore, once a measurable set A is fixed, the semantic index is a function P_A : Σ(S) → R^+.
Those functions are called semantic projections in [9]. Roughly speaking, this rate provides information about which is the “proportion of the meaning” of A that is explained/shared by the meaning of B . But this is a mathematical definition, and the word “meaning” associated with a subset A Σ is just the evaluation of the measure μ on it. As usual, μ plays the role of measuring the size of the set A with respect to a fixed criterion.
In [9], the definition of semantic projection was made in order to represent how a given conceptual item (canonically, a noun) is represented in a given universe of concepts. With the notation of this paper, given a term A and a finite universe of words U, the semantic projection (α_u(A))_{u ∈ U} is defined as a vector ([9], S.3.1) in which each coordinate is given by
α_u(A) = p(A)|_B = |A ∩ B| / |A|,
where B is the subset that represents the noun u of the universe in the word embedding ([9], S.3.1). These coefficients essentially represent a particular case of the same idea as the semantic index defined above, as for μ being the counting measure,
α_u(A) = p(A)|_B = |A ∩ B| / |A| = P_A(B).
We call it semantic index instead of semantic projection to avoid confusion.
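The semantic index is straightforward to compute once μ is given explicitly. The following sketch (with an illustrative toy corpus of ours) implements P_A(B) = μ(A ∩ B)/μ(A); with the counting measure it recovers the coefficients α_u(A) of [9], and the example exhibits the non-symmetry of the index.

```python
# Sketch of the semantic index P_A(B) = mu(A ∩ B) / mu(A); with mu = None
# the counting measure is used, recovering alpha_u(A) = |A ∩ B| / |A|.

def semantic_index(A, B, mu=None):
    A, B = set(A), set(B)
    if mu is None:
        mu = {x: 1.0 for x in A | B}   # counting measure
    m = lambda C: sum(mu[x] for x in C)
    return m(A & B) / m(A)

A = {"d1", "d2", "d3", "d4"}   # documents containing a first term
B = {"d3", "d4", "d5"}         # documents containing a second term

print(semantic_index(A, B))    # 0.5: half of the meaning of A is shared by B
print(semantic_index(B, A))    # 2/3: the index is not symmetric
```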
Let us now define the framework of non-symmetric distances for the model.
Definition 3.
Let Σ ( S ) be a σ algebra of subsets of a given set S . Let μ be a finite positive measure on S . We define the functions
q_μ^L(A, C) := μ(A ∩ C^c),   q_μ^R(A, C) := μ(A^c ∩ C),   A, C ∈ Σ(S),
as well as
q_μ := q_μ^L + q_μ^R.
Let us see that the expressions so defined provide the (non-symmetric) metric functions that are needed for our construction. As in the case of the Jaccard index mentioned above, there is a fundamental relationship between q_μ and the semantic indices P_A, which is given in (iii) of the following result.
Lemma 1.
Fix a σ-algebra Σ ( S ) on S , and let μ be a finite positive measure on the measurable sets of S . Then,
(i)
q μ L and q μ R are conjugate quasi-metrics on the set of the classes of μ a.e. equal subsets in Σ ( S ) .
(ii)
q μ is a metric on this class.
(iii)
For every A , B Σ ( S ) ,
q_μ^L(A, B) = μ(A) ( 1 − P_A(B) )   and   q_μ^R(A, B) = μ(B) ( 1 − P_B(A) ).
Proof. 
Note that, for every A , B , C Σ ( S ) ,
A ∩ C^c = (A ∩ C^c ∩ B^c) ∪ (A ∩ C^c ∩ B) ⊆ (A ∩ B^c) ∪ (B ∩ C^c).
Consider
q_μ^L(A, C) = μ(A ∩ C^c)   and   q_μ^R(A, C) = μ(A^c ∩ C).
Clearly,
q_μ^L(A, B) = q_μ^R(B, A),
and so both expressions define conjugate functions. Then,
q_μ^L(A, C) = μ(A ∩ C^c) ≤ μ(A ∩ B^c) + μ(B ∩ C^c) = q_μ^L(A, B) + q_μ^L(B, C),
and the same happens for q_μ^R, and so both q_μ^L and q_μ^R are quasi-pseudo-metrics. To see that they are quasi-metrics, we have to prove (ii), that is, that the symmetrization
q_μ = q_μ^L + q_μ^R
is a distance (on the class of μ-a.e. equal sets). But if q_μ(A, B) = 0, we have that both
μ(A ∩ B^c) = 0   and   μ(A^c ∩ B) = 0.
This can only happen if μ(A) = μ(A ∩ B) = μ(B), and so A = B μ-a.e. As q_μ is symmetric and satisfies the triangle inequality, we obtain (ii), and so (i).
(iii) For every A , B Σ ( S ) ,
q_μ^L(A, B) = μ(A ∩ B^c) = μ(A) − μ(A ∩ B) = μ(A) ( 1 − P_A(B) ),
and q_μ^R(A, B) = q_μ^L(B, A) = μ(B) ( 1 − P_B(A) ).
It can be seen that the metric q_μ that we have defined coincides with the so-called symmetric difference metric (the Fréchet–Nikodym–Aronszajn distance), which is given by d(A, B) = μ( (A ∪ B) ∖ (A ∩ B) ) (see [20], S.1.5). However, our understanding of relations between sets is essentially non-symmetric, since semantic relations are, in general, not symmetric: to use a classical example, in a set-based word representation, it makes sense for the word “king” to contain the word “man”, but not vice versa. It seems natural that this is reflected in the distance-related functions for comparing the two words. This is why we introduce metric notions using quasi-metrics as primary distance functions. It must be said that, in the context of the abstract theory of quasi-metrics, the distance q_{μ,max} = max{ q_μ^L, q_μ^R } is used instead of q_μ (see, for example, [22]). It is equivalent to q_μ and plays exactly the same role from the theoretical point of view; we use q_μ for its computational convenience, as we will see later.
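The properties stated in Lemma 1 and in the remark above can be checked numerically. The following Python sketch (our illustration; the ground set and the random weights are arbitrary) verifies the quasi-triangle inequality for q_μ^L and the identity q_μ(A, B) = μ(A △ B) on randomly generated subsets.

```python
import itertools
import random

# Numerical check that q_L(A, C) = mu(A ∩ C^c) and q_R(A, C) = mu(A^c ∩ C)
# satisfy the quasi-triangle inequality, and that q = q_L + q_R = mu(A △ B).

def make_metrics(mu):
    m = lambda C: sum(mu[x] for x in C)
    qL = lambda A, B: m(A - B)            # mu(A ∩ B^c)
    qR = lambda A, B: m(B - A)            # mu(A^c ∩ B)
    q = lambda A, B: qL(A, B) + qR(A, B)
    return qL, qR, q

random.seed(0)
S = range(6)
mu = {x: random.uniform(0.1, 1.0) for x in S}
qL, qR, q = make_metrics(mu)

sets = [frozenset(x for x in S if random.random() < 0.5) for _ in range(15)]
for A, B, C in itertools.product(sets, repeat=3):
    assert qL(A, C) <= qL(A, B) + qL(B, C) + 1e-9   # quasi-triangle inequality
for A, B in itertools.product(sets, repeat=2):
    assert abs(q(A, B) - sum(mu[x] for x in A ^ B)) < 1e-9  # q = mu(A △ B)
print("quasi-triangle inequality and q = mu(A △ B) verified")
```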
Summing up the information presented in this section, we have that the fundamental elements of a set-word embedding are the following.
  • A set of terms (words, short expressions, …) W , on which we want to construct our contextual model for the semantic relations among them.
  • A finite measure space (S, Σ(S), μ), in which the σ algebra of subsets has to contain the subsets representing each element of W; S can be equal to W or not.
  • The word embedding itself: an injective map ι : W → Σ(S), which is well defined as a consequence of the requirement above.
  • The quasi-metrics q_μ^L and q_μ^R, which, together with the metric q_μ, give the metric space (Σ(S), q_μ) that supports the model.
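These fundamental elements can be collected in a minimal Python sketch of a set-word embedding; the class name, the animal terms, and their property sets are illustrative choices of ours, not prescribed by the model.

```python
class SetWordEmbedding:
    """Minimal sketch: a set of terms W, a finite measure space
    (S, Sigma(S), mu) given point-by-point, and an injective map
    iota: W -> Sigma(S). All concrete names here are illustrative."""

    def __init__(self, iota, mu):
        self.iota = {w: frozenset(A) for w, A in iota.items()}
        if len(set(self.iota.values())) != len(self.iota):
            raise ValueError("iota must be injective")
        self.mu = dict(mu)

    def measure(self, A):
        return sum(self.mu[x] for x in A)

    def qL(self, v, w):                       # q_mu^L(iota(v), iota(w))
        return self.measure(self.iota[v] - self.iota[w])

    def qR(self, v, w):                       # q_mu^R(iota(v), iota(w))
        return self.qL(w, v)

    def q(self, v, w):                        # the metric q_mu
        return self.qL(v, w) + self.qR(v, w)

    def semantic_index(self, v, w):
        """P_{iota(v)}(iota(w)) = mu(A ∩ B) / mu(A)."""
        A, B = self.iota[v], self.iota[w]
        return self.measure(A & B) / self.measure(A)

# Usage: three animals embedded into sets of (illustrative) properties.
emb = SetWordEmbedding(
    iota={"lion":  {"carnivore", "four_legs", "big"},
          "horse": {"herbivore", "four_legs", "big"},
          "hen":   {"two_legs"}},
    mu={"carnivore": 1, "herbivore": 1, "four_legs": 1,
        "two_legs": 1, "big": 1},
)
print(emb.q("lion", "horse"))               # 2: one differing property each
print(emb.semantic_index("lion", "horse"))  # two of lion's three properties
```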
Remark 1.
In principle, it is required that μ be positive. However, this requirement is not necessary in general for the construction to make sense, and it is even mandatory to extend the definition for general (non-positive) finite measures, as will be shown in the second part of the paper. What is actually required is that the symmetric differences of the elements of the range of the set-word embedding have positive measure. We will see that this is, in fact, weaker than being a positive measure.
The next step is to introduce a systematic method for representing the features we need to use about the elements of W that would help to enrich the mathematical structure of the representations. This is achieved in the next subsection.

3.1. Features as Lipschitz Functions on Set-Word Embeddings

So far, we have introduced the basic mathematical tools to define a set-word embedding. The method we used to construct them derives from a standard index, which we call the semantic index of a word w_2 ∈ W with respect to another word w_1 ∈ W. It has a central meaning in the model. We prove that it is a Lipschitz function in Theorem 1 below. In a sense, we think of it as a canonical index that shows the path to model any new features we want to introduce into the representation environment.
Our idea is also influenced by the way some authors propose to represent the “word properties” in the case of vector-valued word embeddings i : W → R^n (we use “i” for this map to follow the standard notation for inclusion maps, but the reader should be careful because it can sometimes be confusing). Linear operations on the supporting vector spaces are widely used for this purpose [1,8,16,23,24]. Thus, linear functions are sometimes used to define what are called semantic projections, which can be written as translations of linear functionals of the dual space V* of the vector space V. For example, to represent the feature “size” in a word embedding of animals, it is proposed in [8] to use a regression line, and this defines what they call a “semantic projection”. The same idea is used in [9]; in this case, the semantic projections are given by the coordinate functionals of the corresponding vector-valued representation.
This opens the door to our next definition. But in our case, we do not have any kind of linear structure, so we have to understand our functions as elements of the “dual” of the metric space (Σ(S), q_μ). If we take the empty set ∅ ∈ Σ(S) as a distinguished point 0 in (Σ(S), q_μ), this dual space is normally defined as the Banach space (Lip_0(Σ(S)), Lip(·)) of real-valued Lipschitz functions that are equal to 0 at ∅. If we do not assume that φ is 0 at ∅, the norm is defined as Lip(φ) + |φ(∅)|, and then Lip(Σ(S)) is also a Banach space (see, e.g., [25,26] for more information on these spaces).
Definition 4.
An index representing a word-feature on the set-word embedding is a real-valued Lipschitz function φ : (Σ(S), q_μ) → R. In the case that there is a constant Q > 0 such that φ also satisfies
q_μ(A, B) ≤ Q ( φ(A) + φ(B) )   for all A, B ∈ Σ(S),
we will say that φ is a q_μ-Katětov (or q_μ-normalized) function (index). We write Kat(φ) for the infimum of such constants Q.
Lipschitz functions with Lip(φ) = 1 satisfying the second condition with Q = 1 are often called Katětov functions (see, for example, ([20], S.1)). Under the term q_μ-normalized index, the second requirement in this definition is used in [27] in the context of a general index theory based on Lipschitz functions. More information on real functions satisfying both requirements above can be found in that paper.
A standard case of q_μ-Katětov Lipschitz functions (sometimes called Kuratowski functions, or standard indices in [27]) are the ones given by Σ(S) ∋ B ↦ φ_A(B) := q_μ(A, B) for a fixed A ∈ Σ(S). Indeed, note that for A ∈ Σ(S),
|φ_A(B) − φ_A(C)| = |q_μ(A, B) − q_μ(A, C)| ≤ q_μ(B, C)
and
q_μ(B, C) ≤ q_μ(A, B) + q_μ(A, C) = φ_A(B) + φ_A(C)
for every B, C ∈ Σ(S). This, together with a direct computation using φ_A(A) = 0, shows that Lip(φ_A) = 1 and Kat(φ_A) = 1. Indeed, note that under the assumption that S is finite, we can easily see the following:
(1)
For every Lipschitz function φ : Σ(S) → R, we can find a real number r, a non-negative Lipschitz function φ* : Σ(S) → R^+, and a set A ∈ Σ(S) such that φ* = φ + r and φ*(A) = 0.
(2)
Lip(φ) = Lip(φ*) and Lip(φ*) · Kat(φ*) ≥ 1; that is, φ can be translated to obtain a non-negative function attaining the value 0 at a certain set and preserving the Lipschitz constant. Also, a direct calculation, as the one above for φ_A, shows that the product Lip(φ*) · Kat(φ*) can never be smaller than 1.
The functions described above are a particular application of the so-called Kuratowski embedding [28], which is defined as the map k : X → C_b(X) given by the formula k(x)(y) = d(x, y) − d(x_0, y), x, y ∈ X, where (X, d) is a metric space, C_b(X) is the Banach space of all bounded continuous real-valued functions on X, and x_0 is a fixed point of X. If we take x = x_0, we obtain our functions.
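On a finite Σ(S), the constants Lip(φ) and Kat(φ) can be computed by brute force over all pairs of subsets. The following sketch (with an arbitrary three-point S of ours, and weights chosen as halves and integers so that all sums below are exact in floating point) confirms that a Kuratowski-type function φ_A satisfies Lip(φ_A) = Kat(φ_A) = 1.

```python
from itertools import combinations

# Brute-force computation of Lip(phi) and Kat(phi) on a finite sigma-algebra.

def powerset(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1)
            for c in combinations(S, r)]

def q_mu(A, B, mu):
    return sum(mu[x] for x in A ^ B)   # q_mu(A, B) = mu(A △ B)

def lip_constant(phi, sets, mu):
    """Smallest L with |phi(A) - phi(B)| <= L * q_mu(A, B), A != B."""
    return max(abs(phi(A) - phi(B)) / q_mu(A, B, mu)
               for A, B in combinations(sets, 2))

def kat_constant(phi, sets, mu):
    """Smallest Q with q_mu(A, B) <= Q * (phi(A) + phi(B))."""
    return max(q_mu(A, B, mu) / (phi(A) + phi(B))
               for A, B in combinations(sets, 2) if phi(A) + phi(B) > 0)

mu = {"a": 1.0, "b": 2.0, "c": 0.5}
sets = powerset(mu)
A0 = frozenset({"a", "b"})
phi_A0 = lambda B: q_mu(A0, B, mu)     # Kuratowski-type function

print(lip_constant(phi_A0, sets, mu))  # 1.0
print(kat_constant(phi_A0, sets, mu))  # 1.0
```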
The above notion motivates the following definition of the compatibility index. For a Lipschitz index φ, the compatibility index defined below measures how close φ is to the given metric of the space: a small value (it is always greater than or equal to 1) indicates a close correlation between the relative values of φ at two generic subsets A and B and the value of the distance q_μ(A, B). Conceptually, a small value of Com(φ) represents the (desired) deep relationship between the semantic index and the metric of the space, thus functioning as a quality parameter for the model.
Definition 5.
Let φ : Σ(S) → R be a Lipschitz q_μ-Katětov positive function attaining the value 0 only at one set A ∈ Σ(S). We call the constant Com(φ) given by
Com(φ) = Lip(φ) · Kat(φ)
the compatibility index of φ with respect to the metric space (Σ(S), q_μ). We have already shown that Com(φ) ≥ 1.
The importance of the compatibility index in the proposed metric model is clear: the smaller the constant Com(φ), the better the index φ fits the metric structure. That is, whenever we use φ to model any semantic feature in the context of the set-word embedding ι : W → Σ(S), we can expect the feature represented by φ to “follow the relational pattern between the elements of W” more closely the smaller Com(φ) is. Let us illustrate these ideas with the next simple example.
Example 1.
Two different features concerning a set of nouns (for example, two adjectives) do not necessarily behave in the same fashion. For example, take W = { w_1 = lion, w_2 = horse, w_3 = elephant }, and two properties: φ_1, which represents the adjective “big”, and φ_2, which represents “fierceness”. We can define the degree of each property by
φ_1(w_1) = 1,   φ_1(w_2) = 2,   φ_1(w_3) = 3,
and
φ_2(w_1) = 3,   φ_2(w_2) = 0,   φ_2(w_3) = 1.
Consider the trivial set-word embedding ι : W → Σ(D_3), where D_3 = {1, 2, 3}, ι(w_i) = {i} for i = 1, 2, 3, and μ is the counting measure on the σ algebra of subsets of D_3. Then,
q_μ({i}, {j}) = μ({i, j}) = 2 for i, j = 1, 2, 3 with i ≠ j, and q_μ({i}, {j}) = 0 otherwise.
Obviously, both indices are Lipschitz, with Lip(φ_1) = 1 and Lip(φ_2) = 3/2. On the other hand, both of them are Katětov functions with Kat(φ_1) = 2/3 and Kat(φ_2) = 2.
Let us compute the compatibility indices associated with both φ_1 and φ_2. First, note that φ_1 has to be translated to attain the value 0; let us define φ_1* = φ_1 − 1. We have that Kat(φ_1*) = 2, so
Com(φ_1*) = Lip(φ_1*) · Kat(φ_1*) = 1 · 2 = 2,
and
Com(φ_2) = Lip(φ_2) · Kat(φ_2) = (3/2) · 2 = 3.
Following the explanation we have given for Com, we have that φ_1 fits q_μ better than φ_2.
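The computations of Example 1 can be reproduced mechanically. The following Python lines (our verification of the example, with q_μ ≡ 2 between distinct singletons, as computed above) recover the stated Lipschitz, Katětov, and compatibility constants.

```python
from itertools import combinations

# Reproducing Example 1: the three singletons are at mutual distance
# q_mu = 2 (counting measure on D_3), and phi1, phi2 are the two indices.

q = lambda i, j: 0 if i == j else 2
phi1 = {1: 1, 2: 2, 3: 3}   # "big"
phi2 = {1: 3, 2: 0, 3: 1}   # "fierceness"

def lip(phi):
    return max(abs(phi[i] - phi[j]) / q(i, j)
               for i, j in combinations(phi, 2))

def kat(phi):
    return max(q(i, j) / (phi[i] + phi[j])
               for i, j in combinations(phi, 2) if phi[i] + phi[j] > 0)

phi1_star = {k: v - 1 for k, v in phi1.items()}   # phi1* = phi1 - 1

print(lip(phi1), kat(phi1))              # 1.0 and 2/3
print(lip(phi2), kat(phi2))              # 1.5 and 2.0
print(lip(phi1_star) * kat(phi1_star))   # 2.0 = Com(phi1*)
print(lip(phi2) * kat(phi2))             # 3.0 = Com(phi2)
```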
Note that we cannot expect any kind of linear dependence between the representation provided by ι and the functions representing these properties. For example, the index concerning size, φ_1, can be given by the linear formula φ_1(w_i) = i, while the second one, φ_2, does not satisfy any linear formula. In fact, it does not make sense to try to write it as a linear relation, as there is no linear structure on the representation by the metric space (Σ(D_3), q_μ).
From the conceptual point of view, this is the main difference between our proposed set-word embedding and the usual vector-word embedding. The union or intersection of two terms has a semantic interpretation in the model: the semantic object created by considering two terms together in the first case, and the semantic content shared by the two terms, respectively. However, the addition of two terms in their vector space representations, or the multiplication of the vector representation of a given word by a scalar, have dubious meanings in those models, although they are widely used [8,17,23].
Let us show now that the semantic index P_A is a Lipschitz index, and even 1-Lipschitz and q_μ-Katětov with Kat(P_A) = 1 when μ is a probability measure and A = S. The main idea underlying the following result, which is fundamental to our mathematical construction, is the existence of a deep relationship between semantic indices and the metric defined by a given measure μ. Essentially, this means that the model can be structured around the notion of semantic index, which has the clear role of a measure of the shared meaning between two terms, and an associated metric, which allows the comparison and measurement of distances between semantic terms.
Theorem 1.
For every A ∈ Σ(S), the function Σ(S) ∋ D ↦ P_A(D) ∈ R is Lipschitz with constant Lip(P_A) ≤ 1/μ(A). That is,
|P_A(C) − P_A(B)| ≤ (1/μ(A)) q_μ(C, B),   C, B ∈ Σ(S).
Moreover, for A = S , we also have
q_μ(C, B) ≤ μ(S) ( P_S(C) + P_S(B) ),   C, B ∈ Σ(S),
and then Com(P_S) = 1.
Proof. 
As μ(A) ( 1 − P_A(B) ) = q_μ^L(A, B), taking into account that
q_μ^L(A, B) − q_μ^L(A, C) ≤ q_μ^L(C, B)
(the same for q_μ^R), we have that
P_A(C) − P_A(B) = ( 1 − P_A(B) ) − ( 1 − P_A(C) )
= (1/μ(A)) [ μ(A) ( 1 − P_A(B) ) − μ(A) ( 1 − P_A(C) ) ]
= (1/μ(A)) [ q_μ^L(A, B) − q_μ^L(A, C) ] ≤ (1/μ(A)) q_μ^L(C, B).
The symmetric calculations give
P_A(B) − P_A(C) ≤ (1/μ(A)) q_μ^L(B, C) = (1/μ(A)) q_μ^R(C, B).
Therefore,
|P_A(C) − P_A(B)| = max{ P_A(C) − P_A(B), P_A(B) − P_A(C) }
≤ (1/μ(A)) ( q_μ^L(C, B) + q_μ^R(C, B) ) = (1/μ(A)) q_μ(C, B).
For the last statement, just note that
q_μ(C, B) = μ(C ∩ B^c) + μ(C^c ∩ B) ≤ μ(C ∩ S) + μ(B ∩ S)
= μ(S) ( μ(C ∩ S)/μ(S) + μ(B ∩ S)/μ(S) ) = μ(S) ( P_S(C) + P_S(B) ).
In particular, if μ is a probability measure, the function P S above satisfies the inequalities
|P_S(C) − P_S(B)| ≤ q_μ(C, B) ≤ P_S(C) + P_S(B),   C, B ∈ Σ(S),
and so it is a Katětov function such that Com(P_S) = Lip(P_S) · Kat(P_S) = 1 · 1 = 1.
This result is fundamental to the interpretation of the model. Roughly speaking, it states that the main index on which we have relied conforms completely to the metric structure. When μ is a probability measure, the semantic index P_A for the case A = S represents the rate (the score per one) of information contained in any information set B ∈ Σ(S), which is measured by μ(B). Thus, Theorem 1 (Com(P_S) = 1) means that this fundamental quantity absolutely fits the space (Σ(S), q_μ), the main tool of our embedding procedure.
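The two inequalities of Theorem 1 can also be tested numerically. The sketch below (our illustration; the ground set, the seed, and the sampled subsets are arbitrary) draws random sets under a random probability measure and checks the Lipschitz bound for arbitrary A and the Katětov bound for A = S.

```python
import random
from itertools import combinations

# Numerical check of the inequalities of Theorem 1 on random subsets of S
# under a random probability measure mu.

random.seed(1)
S = list(range(8))
w = [random.random() for _ in S]
mu = {x: w[x] / sum(w) for x in S}      # probability measure: mu(S) = 1

m = lambda C: sum(mu[x] for x in C)
q = lambda A, B: m(A ^ B)               # q_mu(A, B) = mu(A △ B)
P = lambda A, B: m(A & B) / m(A)        # semantic index P_A(B)

sets = [frozenset(random.sample(S, random.randint(1, len(S))))
        for _ in range(25)]
full = frozenset(S)
for C, B in combinations(sets, 2):
    for A in sets:                      # Lipschitz bound, arbitrary A
        assert abs(P(A, C) - P(A, B)) <= q(C, B) / m(A) + 1e-9
    # Katetov bound for A = S (mu being a probability measure):
    assert q(C, B) <= P(full, C) + P(full, B) + 1e-9
print("Theorem 1 inequalities verified on random sets")
```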

3.2. How to Apply a Set-Word Embedding for Semantic Analysis: An Example

To finish this section, let us sketch how the proposed model can be used for semantic analysis. We only intend to show some general ideas in this paper, and compare them with the ones that are usual tools in the context of the vector-word embeddings.
Let us focus our attention on a binary property concerning a set of words representing nouns of a certain language. Write (Σ(S), q_μ) for the metric space defined by all the subsets of S, with (S, Σ(S), μ) a probability measure space, and let ι : W → Σ(S) be the set-word embedding on which we base our model (note that this “ι” is not the usual “i” used before to denote a standard word embedding). The set S could be, for example, a class of properties of the animals: average size, taxonomic distance, common color, eating grass or not…, and ι embeds every animal into the set of the properties that it has. The measure μ quantifies the relevance that each of the properties in S has in the model.
The studied feature is described by the values 0 or 1 ; so we call the Lipschitz functional with values in { 0 , 1 } representing the feature a classifier. For example, in a given set of animals W, the property of having two legs is represented by 0 , and having four legs, by 1 . Let us write ϕ for the Lipschitz map representing the property “having two/four legs”. Let us consider a specific situation, and how to solve it using the proposed set-word embedding.
  • Suppose that we know the value of the classifier ϕ on a subset ι(W_0) ⊆ ι(W), but it is unknown outside W_0.
  • The value of ϕ on ι(W) ∖ ι(W_0) can be estimated using the evaluation on some terms of the original universe and then extending using a McShane–Whitney type formula. This extension provides the equation
    ϕ̂(a) = (1/2) sup_{b ∈ ι(W_0)} ( ϕ(b) − Lip(ϕ) d(a, b) ) + (1/2) inf_{b ∈ ι(W_0)} ( ϕ(b) + Lip(ϕ) d(a, b) ),
    which gives an estimate of the value of ϕ(a), with values in the interval [0, 1], for elements that do not belong to ι(W_0).
  • Therefore, the structure of the metric space, together with the explained Lipschitz regression tool, provides information about the expected values of ϕ on the set ι(W). Since ϕ̂ takes values in the interval [0, 1], but not necessarily in {0, 1}, we can interpret the values provided by ϕ̂ in probabilistic terms: ϕ̂ gives the probability that a given element of W ∖ W_0 has two or four legs. Also, we can interpret it as the fuzzy coefficient of that element belonging to the set of four-legged animals.
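The estimation step described above can be sketched as follows; the property names, the two known animals, and the use of q_μ as the metric d are illustrative assumptions of ours, not data from the paper.

```python
# Sketch of the Lipschitz regression step: the average of the McShane and
# Whitney extensions of a classifier known on a finite set of embedded terms.

def mcshane_whitney_extension(phi0, lip, d):
    """phi0: dict of known values on iota(W_0); lip: Lipschitz constant."""
    def phi_hat(a):
        lower = max(phi0[b] - lip * d(a, b) for b in phi0)   # McShane part
        upper = min(phi0[b] + lip * d(a, b) for b in phi0)   # Whitney part
        return 0.5 * (lower + upper)
    return phi_hat

mu = {"p1": 1, "p2": 1, "p3": 1, "p4": 1}        # illustrative properties
d = lambda A, B: sum(mu[x] for x in A ^ B)       # q_mu as the metric
known = {frozenset({"p1", "p2"}): 1.0,           # a four-legged animal
         frozenset({"p3", "p4"}): 0.0}           # a two-legged animal
lip_phi = max(abs(known[A] - known[B]) / d(A, B)
              for A in known for B in known if A != B)
phi_hat = mcshane_whitney_extension(known, lip_phi, d)

unknown = frozenset({"p1", "p3"})                # shares one property with each
print(phi_hat(unknown))                          # 0.5: read as a probability
```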

5. Set-Based Word Embeddings Versus Vector-Valued Word Embeddings: General Set Representation Procedure

The notion of set-word embedding is designed to be as simple as possible. The complexity of the model, and all the features that can be added to it, are meant to be implemented by means of Lipschitz functions acting on the set metric space. However, it can be easily seen that both constructions are essentially equivalent, although our set-based construction aims to be simpler. In this section, we prove an equivalence result that shows a canonical procedure to pass from one class of models to the other, and vice versa.
The following result also provides explicit formulae for the transformation. It is, therefore, the main result of the present work, as it allows us to identify any word embedding with a set-word embedding using a canonical procedure. The main advantage of this idea is that the basic information in the set algebra model lies in the measure of the size of the shared meaning between two terms, whereas a standard word embedding in a Euclidean space is usually obtained automatically, and there is no possibility to interpret how the distances are obtained and what they mean. Thus, set-word embeddings are interpretable, while standard word embeddings are not. We will return to this central point in the Conclusions section of the article.
In addition to its simple formal structure and the advantages of defining word embedding features as pure metric objects (Lipschitz maps, without the need for any kind of linear relation), there is a computational benefit that makes it, in a sense, better. We will explain it after the theorem. As usual, we assume that an isometry between metric spaces is, in particular, bijective.
Theorem 2.
Let W be a (finite) set of word/semantic terms of a certain language. Then, the following statements hold.
(i)
If i : W → X is a metric word embedding into a finite metric space (X, d), there is a set-word embedding ι : W → Σ(S) into a σ algebra of subsets Σ(S) and a (not necessarily positive) measure μ on S such that (i(W), d) and (ι(W), q_μ) are isometric.
(ii)
Conversely, if there is a finite set-word embedding ι : W → Σ(S), there is a metric word embedding i : W → X into a metric space (X, d) such that (i(W), d) and (ι(W), q_μ) are isometric.
Moreover, every metric word embedding can be considered as normed-space-valued.
Proof. 
Let us first prove (i). Consider the set of terms W and the set i(W) ⊆ X. Write |W| = n. We can assume that n > 2; for n = 1 there is nothing to prove, and for n = 2 the result is trivially proved by a direct construction with a set S of two elements.
Number the elements of W and identify i(W) with the set {1, …, n}, endowed with the same metric d as i(W). Write D = (d_{i,j})_{i,j=1}^n, with d_{i,j} = d_{j,i}, for the metric matrix of d. We want to construct a measure space (S, Σ(S), μ) and a word embedding ι : W → Σ(S) such that (i(W), d) and (ι(W), q_μ) are isometric.
Set S = {(i, j) : 1 ≤ i ≤ j ≤ n}, let Σ(S) be the σ-algebra of all subsets of S, and consider the word embedding ι : W → Σ(S) given by
$$\iota(w_i) = \{(j, i) : 1 \le j \le i\} \cup \{(i, j) : i < j \le n\}, \qquad i = 1, \dots, n,$$
where w_i ∈ W is the word with number i. We have to find a measure μ on S such that, for i ≠ j,
$$\tau_{i,j} = \mu\big(\iota(w_i) \cup \iota(w_j)\big) - \mu\big(\iota(w_i) \cap \iota(w_j)\big) = d_{i,j} = d_{j,i}.$$
That is, for 1 ≤ i, j ≤ n, i ≠ j,
$$\tau_{i,j} = \sum_{k=1,\, k \ne j}^{i} \mu(\{(k, i)\}) + \sum_{k=i+1,\, k \ne j}^{n} \mu(\{(i, k)\}) + \sum_{k=1,\, k \ne i}^{j} \mu(\{(k, j)\}) + \sum_{k=j+1,\, k \ne i}^{n} \mu(\{(j, k)\}) = d_{i,j}.$$
Note that τ_{i,i} = 0 for all 1 ≤ i ≤ n. Write T for the symmetric matrix T = (τ_{i,j})_{i,j=1}^n, where τ_{i,j} = τ_{j,i}. Let us write all the equations above using a matrix formula. Consider the matrix M = (m_{i,j})_{i,j=1}^n, with m_{i,j} = 1 if i ≠ j, and m_{i,i} = 0, for all 1 ≤ i, j ≤ n.
Write N for the symmetric matrix of the coefficients x_{i,j} = μ({(i, j)}) = x_{j,i}, 1 ≤ i ≤ j ≤ n, that we want to determine. Take any diagonal matrix Δ and define the symmetric matrix D* = D + Δ. Note that we can write an equation that coincides with T = D for all the elements off the diagonal as
$$M \cdot N + N \cdot M = D^*,$$
in which the elements of the diagonal take arbitrary values that can be used to normalize the coefficients.
Now we claim that M^{-1} = (M − (n − 2) I_n)/(n − 1), where I_n is the identity matrix of order n. Indeed, note that M² has all the elements of the diagonal equal to n − 1, and all the elements out of the diagonal equal to n − 2. Thus,
$$M \cdot \big(M - (n-2) I_n\big) = \big(M - (n-2) I_n\big) \cdot M = M^2 - (n-2) M = (n-1) I_n.$$
Now, consider the equations
$$M^{-1} M N M^{-1} + M^{-1} N M M^{-1} = N M^{-1} + M^{-1} N = M^{-1} D^* M^{-1},$$
that give N · (M − (n − 2) I_n) + (M − (n − 2) I_n) · N = (n − 1) M^{-1} D* M^{-1}. Then, N M − (n − 2) N + M N − (n − 2) N = (n − 1) M^{-1} D* M^{-1}, and so, using that M N + N M = D*, we obtain the symmetric matrix
$$N = \frac{1}{2(n-2)} \Big( D^* - (n-1)\, M^{-1} D^* M^{-1} \Big).$$
This gives a result that is not unique, as it depends on the diagonal matrix Δ. Note that, due to the required additivity of the measure μ, the set of all the values μ({(i, j)}) determines a measure μ, which is not necessarily positive.
The proof of (ii) is straightforward: take a finite measure space (S, Σ(S), μ) and a set-word embedding ι : W → Σ(S), |W| = n. Suppose that we have a numbering of W. By definition, the set X = {x_1 = ι(w_1), …, x_n = ι(w_n)} ⊆ Σ(S) is a metric space with the distance q_μ, so the map w_i ↦ x_i ∈ (X, q_μ) is the required word embedding.
We only need to prove that we can assume that (X, q_μ) is a subset of a normed space (E, ‖·‖_E), so that we can define the word embedding ι to have values in (E, ‖·‖_E). Many different representations can be used; probably the simplest one is given by the identification of the metric space (X, q_μ) with the Arens–Eells space (AE(X), ‖·‖_{AE}), which is explained in the Introduction. It is well known (see, for example, [26], Ch. 1) that there is an isometric Lipschitz inclusion h : (X, q_μ) → (AE(X), ‖·‖_{AE}) given by h(x) = m_{x,0}, where 0 is a distinguished (possibly arbitrarily chosen) element of X, and m_{x,0} is the molecule defined by x and 0. Therefore, h ∘ ι : W → AE(X) is the desired vector word embedding. Of course, given that h ∘ ι(W) is a finite set in the Banach space (AE(X), ‖·‖_{AE}), we can represent it with coordinates to obtain a vector-like representation that the reader may identify with a usual vector word embedding. This finishes the proof. □
Remark 2.
The proof of Theorem 2 gives useful information, which is, in fact, its main contribution to the representation of word embeddings by means of subsets. It gives an explicit formula to compute, given a vector-valued word embedding i, a measure space whose standard metric space structure provides a representation isometric to the one given by i. Indeed, for n > 2, it is given by the formula
$$N = \frac{1}{2(n-2)} \Big( (D + \Delta) - (n-1)\, M^{-1} (D + \Delta)\, M^{-1} \Big),$$
where Δ is a diagonal matrix with free parameters,
$$M = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix}, \qquad \text{and} \qquad M^{-1} = \frac{M - (n-2) I_n}{n-1} = \frac{1}{n-1} \begin{pmatrix} 2-n & 1 & \cdots & 1 \\ 1 & 2-n & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 2-n \end{pmatrix}.$$
This equation gives the values of the measure μ for the atoms of the class
$$\{(1,1), (1,2), \dots, (i,j), \dots, (n,n) : 1 \le i \le j \le n\}$$
that are represented in the symmetric matrix N. They can be used to compute the values of the distances between the elements of the representation,
$$\iota(w_i) = \{(k, i) : 1 \le k \le i\} \cup \{(i, k) : i+1 \le k \le n\}, \qquad i = 1, \dots, n,$$
which are given by
$$q_\mu(\iota(w_i), \iota(w_j)) = \mu\big(\iota(w_i) \cup \iota(w_j)\big) - \mu\big(\iota(w_i) \cap \iota(w_j)\big).$$
In the model, the diagonal matrix Δ and the metric q_μ are the key elements that shape its behavior. Adjusting the diagonal entries of Δ (which act as parameters of the model) tunes the metric, modifying the relative weights of dimensions to meet specific constraints or properties. The metric q_μ defines the distance measure, guiding how distances between terms are evaluated by means of their representation as classes of subsets. Together, Δ and the construction of q_μ allow the model to adapt flexibly to problem-specific requirements, ensuring robustness and alignment with the desired outcomes.
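The formulas of this remark can be checked numerically. The following sketch (plain NumPy; the function names are ours, and the free diagonal matrix Δ is taken to be zero, as in the examples below) computes the measure matrix N from a metric matrix D and verifies that the induced set distances reproduce D:

```python
import numpy as np

def set_word_measure(D):
    """Measure matrix N from an n x n metric matrix D (n > 2), taking
    Delta = 0:  N = (D - (n - 1) M^{-1} D M^{-1}) / (2 (n - 2))."""
    n = D.shape[0]
    M = np.ones((n, n)) - np.eye(n)              # 0 on the diagonal, 1 elsewhere
    M_inv = (M - (n - 2) * np.eye(n)) / (n - 1)  # inverse of M, as in the remark
    return (D - (n - 1) * M_inv @ D @ M_inv) / (2 * (n - 2))

def atoms(i, n):
    """Atoms of iota(w_i): {(k, i) : k <= i} together with {(i, k) : k > i},
    written with 0-based indices as (min(i, k), max(i, k)) over all k."""
    return {(min(i, k), max(i, k)) for k in range(n)}

def q_mu(N, i, j):
    """q_mu(iota(w_i), iota(w_j)): measure of the symmetric difference."""
    n = N.shape[0]
    return sum(N[a, b] for (a, b) in atoms(i, n) ^ atoms(j, n))

# Metric matrix of Example 2 ({man, king, queen}):
D = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
N = set_word_measure(D)                          # diag(1, 0, 1)
recovered = [[q_mu(N, i, j) for j in range(3)] for i in range(3)]
```

Running the same check with the 4 × 4 matrix of Example 3 below recovers its distances as well, including the atoms that receive negative measure.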
Example 2.
Take n = 3, and consider the word embedding i : W → W, where W = {w_1, w_2, w_3}. These terms can be, for example, three nouns in a model that we want to develop, such as {man, king, queen}. The distance matrix suggested below is coherent with the idea that “man” is, in a sense, close to “king” (a king is a man), and “king” is close to “queen” (both belong to royalty), but, on a comparative scale, “man” and “queen” are not so close.
  • Endow W with the metric d given by the matrix
$$D = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix}.$$
We proceed as in the proof of Theorem 2 and obtain the set
$$X = \{(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)\}.$$
We consider the set-word embedding ι : W → Σ(X) defined as in the proof of the theorem; that is, for instance, ι(w_1) = {(1,1), (1,2), (1,3)} (and below for the other terms). This means, for example, that the word “man” is represented by a set of three (vector) indices, {(1,1), (1,2), (1,3)}, and so on. These indices could represent different characteristics of the semantic term but, from a formal point of view, they are only distinguished indices. Take the free-parameter diagonal matrix Δ to be the 3 × 3 zero matrix. Then,
$$N = \frac{1}{2} \left( \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix} - 2 \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix} \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix} \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix} \right)$$
$$= \frac{1}{2} \left( \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{pmatrix} - 2 \begin{pmatrix} -1 & \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 1 & \tfrac{1}{2} & -1 \end{pmatrix} \right) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
    This matrix gives the values of the measure μ for the atoms of the measure space: μ({(1,1)}) = 1, μ({(1,2)}) = 0, μ({(1,3)}) = 0, and so on. In the model, they give the measure (understood as weight) of the different characteristics represented by each of the atomic indices (k, j). Using this, it is straightforward to verify that the calculations coincide with the values of the original distance. For example,
$$q_\mu(\iota(w_1), \iota(w_2)) = \mu(\{(1,1), (1,2), (1,3), (2,2), (2,3)\}) - \mu(\{(1,2)\})$$
$$= \mu(\{(1,1)\}) + \mu(\{(1,3)\}) + \mu(\{(2,2)\}) + \mu(\{(2,3)\}) = 1 + 0 + 0 + 0 = 1 = d(i(w_1), i(w_2)),$$
    while
$$q_\mu(\iota(w_1), \iota(w_3)) = \mu(\{(1,1), (1,2), (1,3), (2,3), (3,3)\}) - \mu(\{(1,3)\})$$
$$= \mu(\{(1,1)\}) + \mu(\{(1,2)\}) + \mu(\{(2,3)\}) + \mu(\{(3,3)\}) = 1 + 0 + 0 + 1 = 2 = d(i(w_1), i(w_3)),$$
    which, of course, coincide with the corresponding coefficients of the original metric matrix D. Note that the measure μ is equal to 0 for most of the atoms, so it cannot distinguish between certain non-empty sets of the generated σ-algebra. However, the formula of our distance defined using the set algebra separates all the elements of the canonical set-word embedding ι : {w_1, w_2, w_3} → Σ(X), given in this case, as we said, by (see Figure 3)
$$\iota(w_1) = \{(1,1), (1,2), (1,3)\}, \qquad \iota(w_2) = \{(1,2), (2,2), (2,3)\},$$
$$\text{and} \qquad \iota(w_3) = \{(1,3), (2,3), (3,3)\}.$$
    Figure 3. Sets defining the representation of the set-word embedding.
    The subsets represented in Figure 3 can be understood as subsets of different features (the indices (k, j)) that allow the representation of the original semantic environment {man, king, queen}, in the sense that each subset of features represents a different semantic object with its own semantic role in the model.
  • Let us now show the same computations for a different metric matrix D, such that all the distances between the three points are different. In this case, we obtain
$$N = \frac{1}{2} \left( \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 2 & 3 & 0 \end{pmatrix} - 2 \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix} \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 2 & 3 & 0 \end{pmatrix} \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix} \right)$$
$$= \frac{1}{2} \left( \begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 2 & 3 & 0 \end{pmatrix} - 2 \begin{pmatrix} 0 & \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & -1 & \tfrac{3}{2} \\ 1 & \tfrac{3}{2} & -2 \end{pmatrix} \right) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$
Example 3.
Now take n = 4 and W = {w_1, w_2, w_3, w_4}. Then, for
$$D = \begin{pmatrix} 0 & 1 & 2 & 2 \\ 1 & 0 & 3 & 2 \\ 2 & 3 & 0 & 2 \\ 2 & 2 & 2 & 0 \end{pmatrix},$$
the same computations as in the previous examples give the following matrix, defined by the values of the measure μ for the atoms of the representation, which are, in this case,
$$S = \{(1,1), (1,2), (1,3), (1,4), (2,2), (2,3), (2,4), (3,3), (3,4), (4,4)\}.$$
The measure matrix N is
$$N = \frac{1}{4} \left( \begin{pmatrix} 0 & 1 & 2 & 2 \\ 1 & 0 & 3 & 2 \\ 2 & 3 & 0 & 2 \\ 2 & 2 & 2 & 0 \end{pmatrix} - 3 \cdot \frac{1}{3} \begin{pmatrix} -2 & 1 & 1 & 1 \\ 1 & -2 & 1 & 1 \\ 1 & 1 & -2 & 1 \\ 1 & 1 & 1 & -2 \end{pmatrix} \begin{pmatrix} 0 & 1 & 2 & 2 \\ 1 & 0 & 3 & 2 \\ 2 & 3 & 0 & 2 \\ 2 & 2 & 2 & 0 \end{pmatrix} \frac{1}{3} \begin{pmatrix} -2 & 1 & 1 & 1 \\ 1 & -2 & 1 & 1 \\ 1 & 1 & -2 & 1 \\ 1 & 1 & 1 & -2 \end{pmatrix} \right)$$
$$= \frac{1}{4} \left( \begin{pmatrix} 0 & 1 & 2 & 2 \\ 1 & 0 & 3 & 2 \\ 2 & 3 & 0 & 2 \\ 2 & 2 & 2 & 0 \end{pmatrix} - \begin{pmatrix} -2 & 0 & 2 & 3 \\ 0 & -4 & 4 & 2 \\ 2 & 4 & -6 & 1 \\ 3 & 2 & 1 & -4 \end{pmatrix} \right) = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{4} & 0 & -\tfrac{1}{4} \\ \tfrac{1}{4} & 1 & -\tfrac{1}{4} & 0 \\ 0 & -\tfrac{1}{4} & \tfrac{3}{2} & \tfrac{1}{4} \\ -\tfrac{1}{4} & 0 & \tfrac{1}{4} & 1 \end{pmatrix}.$$
For example,
$$q_\mu(\iota(w_2), \iota(w_3)) = \mu\big(\iota(w_2) \cup \iota(w_3)\big) - \mu\big(\iota(w_2) \cap \iota(w_3)\big)$$
$$= \mu(\{(1,2), (2,2), (2,3), (2,4), (1,3), (3,3), (3,4)\}) - \mu(\{(2,3)\})$$
$$= \mu(\{(1,2)\}) + \mu(\{(2,2)\}) + \mu(\{(2,4)\}) + \mu(\{(1,3)\}) + \mu(\{(3,3)\}) + \mu(\{(3,4)\})$$
$$= \tfrac{1}{4} + 1 + 0 + 0 + \tfrac{3}{2} + \tfrac{1}{4} = 3,$$
which coincides with the coefficient of D in position (2, 3), that is, d(i(w_2), i(w_3)). Note that, according to the measure matrix N, the measure μ is not positive: there are atoms for which the measure is negative, and others for which the measure equals 0. However, the standard formula for the associated set distance gives a proper distance matrix, as it coincides with the metric matrix D.
It should be noted that we need the most abstract notion of set-word embedding to obtain that, in general, any (metric) word embedding can be written as a set-word embedding: the measure μ obtained for the representation is not necessarily positive, but it must be positive on all the sets defined as symmetric differences of elements of the σ-algebra in the range of the representation ι : W → Σ(S), that is, on the sets (ι(w_i) ∪ ι(w_j)) \ (ι(w_i) ∩ ι(w_j)), for which we need positive measure evaluations to obtain a suitable metric.
Note also that the standard application of the procedure, provided by the equations explained above, lies in the identification of semantic terms with elements of a σ-algebra of subsets. The size of such subset structures increases exponentially with the number of semantic terms, which could compromise the scalability of the method. This means that it might be necessary to devise a specific representation procedure using subsets for each particular problem, which would detract from the generality of our technique.
Remark 3.
Direct applications of word embeddings, such as word similarity and analogy completion, should be approached from a different point of view than in the case of vector-valued word embeddings. In the case of word similarity, the linear space underlying the representation facilitates the task, as the scalar product inherent in Euclidean space provides the correlation as well as the cosine. In our case, however, a different approach is necessary, since, a priori, there is no associated linear space other than the one given by our representation theorem, which is often too abstract, as we just said. But it is possible to obtain a quantification of the notion of word similarity by associating with each word the vector of its computed semantic indices with respect to the other semantic terms of the model; the correlation of this vector with the vector associated with any other word can then be calculated. Again, the advantage of our method compared with embedding words in a Euclidean space is interpretability: each coordinate of the vector thus constructed represents the extent to which the word and the corresponding term of the model share their meaning, in the sense indicated by the semantic index used. Analogy completion could also be better realized in the case of set-word embeddings, since the representation is based on the extent to which two terms share their meaning, and then the logical relationships are supported by the mathematical formalism.
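The word-similarity procedure just described can be sketched in a few lines. The sketch below assumes the proportional form P_A(B) = μ(A ∩ B)/μ(A) for the semantic index (the exact index is the one fixed in Section 3), and all the counts are invented for illustration; each coordinate of an index vector remains interpretable as a proportion of shared meaning.

```python
import math

# Hypothetical measures (e.g., document counts): mu(A) per term and
# mu(A ∩ B) per pair. All numbers are invented for illustration.
mu = {"gold": 100.0, "silver": 60.0, "medal": 40.0, "coin": 30.0}
mu_pair = {frozenset(k): v for k, v in {
    ("gold", "silver"): 25.0, ("gold", "medal"): 20.0, ("gold", "coin"): 12.0,
    ("silver", "medal"): 10.0, ("silver", "coin"): 9.0, ("medal", "coin"): 2.0,
}.items()}

def P(a, b):
    """Semantic index of b with respect to a (assumed form mu(A∩B)/mu(A))."""
    return 1.0 if a == b else mu_pair[frozenset((a, b))] / mu[a]

def index_vector(a, terms):
    """Vector of semantic indices of a word against every term of the model."""
    return [P(a, t) for t in terms]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

terms = sorted(mu)
similarity = cosine(index_vector("gold", terms), index_vector("silver", terms))
```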

6. Discussion: Comparing Multiple Word Embeddings with the Set-Word Embedding in an Applied Context

In this section, we use the tools developed throughout the paper to obtain a set-word embedding related to the semantic indices (semantic projections) explained in Section 3 (Definition 2). We follow the model presented in Section 4.3 to define a word embedding, associating with each term a class of subsets of documents in which the word appears. To facilitate reproducibility for the reader, we opted to use the Google search engine for the calculations rather than relying on a scientific document repository, which may have access restrictions. For this application, we utilize the results provided by Google searches, meaning that the measure of the set representing a given term corresponds to the number of documents indexed by Google that contain the word in question.
Let us fix the term “gold” and consider the set of words given by the Word2vec embedding (Figure 4). As usual, the associated vector space is endowed with the Euclidean norm. To compare our set-word embedding with this one, we use as working environment the 10 words that are closest to the fixed term with respect to the distance provided by the embedding, including “gold” itself. The terms are {gold, silver, medal, blue, reserves, coin, rubber, diamonds, tin, timber}, and the distances to the term “gold” can be found in the second column of Table 1.
Figure 4. A 3D representation of the embedding of the closest words to the term “gold”, according to the Euclidean distance, using Word2vec Google News (71291x200) (http://projector.tensorflow.org/ (accessed on 3 January 2025)).
Table 1. Table with the words, the values of the Euclidean distance with the word “gold”, and the corresponding values of the elements of the set-word embedding.
The other values of Table 1 complete the information to compute the metric q μ provided by the set-word embedding using Google search. Following the theoretical development explained in Section 3, the basic elements to understand our model are the semantic indices P A and P B , which are shown in Figure 5 and Figure 6, respectively.
Figure 5. Semantic indices P A for A = gold ( q = q μ ).
Figure 6. Conjugate semantic indices P B for B for each of the selected words.
Recall that we have q_μ = q_μ^L + q_μ^R, and using the formulas
$$q_\mu^L(A, B) = \mu(A)\big(1 - P_A(B)\big) \qquad \text{and} \qquad q_\mu^R(A, B) = \mu(B)\big(1 - P_B(A)\big),$$
given by Lemma 1, we can compute Table 1. In this table, the measures μ are written in billions (1,000,000,000). To simplify the calculations, we divide by this number in the distance computations below, reducing unnecessary complexity.
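In this document-count model, a single distance can be computed from just three search counts. A minimal sketch, with hypothetical counts (not those of Table 1) and assuming the proportional index P_A(B) = μ(A ∩ B)/μ(A): by inclusion-exclusion, μ(A ∪ B) = hits(A) + hits(B) − hits(A and B), while μ(A ∩ B) = hits(A and B), so the two parts of Lemma 1 reduce to μ(A \ B) and μ(B \ A).

```python
def q_mu_from_hits(hits_a, hits_b, hits_ab):
    """q_mu(A, B) = mu(A ∪ B) - mu(A ∩ B), from the number of documents
    containing A, containing B, and containing both."""
    return (hits_a + hits_b - hits_ab) - hits_ab

def left_right_parts(hits_a, hits_b, hits_ab):
    """Lemma 1 split, q^L = mu(A)(1 - P_A(B)) and q^R = mu(B)(1 - P_B(A)),
    under the assumed proportional form of the semantic index."""
    qL = hits_a * (1.0 - hits_ab / hits_a)   # = hits_a - hits_ab = mu(A \ B)
    qR = hits_b * (1.0 - hits_ab / hits_b)   # = hits_b - hits_ab = mu(B \ A)
    return qL, qR

# Hypothetical counts in billions of indexed documents:
gold, medal, both = 5.2, 1.1, 0.6
qL, qR = left_right_parts(gold, medal, both)
distance = q_mu_from_hits(gold, medal, both)     # 5.2 + 1.1 - 1.2 = 5.1
```

Note that no other pair of terms enters the computation, which is the independence property mentioned in the Abstract: a single entry of the distance matrix can be obtained without computing the whole matrix.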
To facilitate the comparison of the metrics, in Figure 7 and Figure 8 we divide the distance associated with the set-word embedding by eight. Obviously, this change of scale does not affect the properties relevant to the comparison. As can be seen, the distributions obtained for the two metrics are similar, although there are significant differences. Figure 8 shows the same information as Figure 7, but in it we have removed the term “blue” to facilitate the comparison of the values of the rest of the terms, as its large values disturb the overall view.
Figure 7. Comparison of the results provided by Word2vec and the set-word embedding defined by Google search for the term “gold”.
Figure 8. Comparison of the results provided by Word2vec and the set-word embedding excluding the term “blue” for a better visualization.
The primary reason for these differences is the influence of context. Traditional early word embeddings rely on a fixed, universally applicable mapping of semantic distances, while the set-word embedding allows the incorporation of contextual information derived from the large variety of documents indexed by Google. Although the boundaries of this contextual information are not entirely clear, it is evident that the relationships between semantic terms are influenced by the volume of information available across internet documents. This explains, for instance, why the term “blue” is quite distant from “gold”, as the various meanings of “blue” are not statistically related to the meaning of “gold” in a significant proportion.
The term “medal” is closer to “gold” than to “silver” in the information repository queried by Google. This makes sense when we consider that “gold medal” is a much more prevalent phrase on the internet, highlighting common usage patterns. In contrast, while gold and silver are similar in their inherent nature as metals, their association in language is less frequent compared to the popularized use of “gold medal”. Hence, usage, rather than inherent similarity, drives this result. The same can be argued regarding the terms “reserves”, “rubber”, or “coin”. The relationships provided by the set-word embedding with the other words, “diamonds”, “tin”, and “timber”, are more conventional, as can be seen by comparing them with the results provided by Word2vec.

6.1. Advanced Models

Let us now introduce into the discussion other word embeddings that have been designed following different ideas. We compare some advanced word embedding models, focusing on their ability to capture semantic relationships between precious metals, other commodities, and related terms. The evaluation includes performance metrics and similarity analysis using cosine similarity and L2 distance. As this information is quite extensive, we have moved part of it to Appendix A.
Microsoft Research’s E5 models, including E5-small (384 dimensions) and E5-base (768 dimensions), represent significant advances. Trained by contrastive learning on diverse datasets, they are versatile in their applications. We also examined Microsoft’s MiniLM models, which offer effective alternatives through knowledge distillation. The L6 variant emphasizes speed, while L12-v2 offers deeper semantic understanding. Finally, ByteDance’s BGE-small model, optimized for retrieval tasks, combines contrastive learning and masked language modeling to deliver high performance in a compact form factor. These models offer different trade-offs between efficiency and semantic accuracy. More information can be found in Appendix A (see the bibliographic references to related papers in this appendix).
Table 2 provides the results of the distances of the term “gold” to the other terms using these word embeddings, which can be compared with the results given by our set-word embedding and by Word2Vec. Figure 9 gives a clear picture of how these other embeddings behave in comparison with the previous ones. In order to compare in a better way, all the models are normalized to have an average value of about eight.
Table 2. L2 distances of the term “gold” to the other terms using the word embeddings E5-small L2 distance, E5-base L2 distance, MiniLM-L6 L2 distance, BGE-small L2 distance, and MiniLM-L12-v2 L2 distance.
Figure 9. Comparison of the results provided by Word2vec and the set-word embedding including the other word embeddings of Table 2.
Figure 9 presents the results of the seven models included in the comparison, with the set-word embedding highlighted in purple. To facilitate a clearer comparison, we scaled the results by appropriate factors to ensure that the representations are centered. In Figure 10, we show only two selected models alongside the set-word embedding for a more focused analysis. As observed in both figures, the set-word embedding produces smoother results overall (the example was chosen so that this happens), although it still follows the primary trends exhibited by the other models, albeit with some slight deviations.
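The rescaling used for these comparisons (bringing every model's distance vector to a common average, here eight, so that the curves are directly comparable) amounts to multiplying by a single factor, which preserves all ratios between distances. A minimal sketch, with placeholder values rather than those of Table 2:

```python
def rescale_to_mean(distances, target_mean=8.0):
    """Scale a vector of distances by one factor so its average equals
    target_mean; all ratios between distances are preserved."""
    factor = target_mean * len(distances) / sum(distances)
    return [d * factor for d in distances]

scaled = rescale_to_mean([0.5, 1.0, 1.5, 2.0])   # mean becomes exactly 8.0
```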
Figure 10. Comparison of the results provided by the set-word embedding and the models E5-small and MiniLM-L6, which are the ones that have a similar distribution.
For the three models selected for Figure 10 (set-word embedding, E5-small, and MiniLM-L6), the trends are more clearly visible, with the set-word embedding shown in purple. This allows a more focused comparison among these specific models, highlighting the distinctive patterns in their performance.
As can be seen, the comparison of the models in this section reveals that, while they all produce different outputs, the variations are not substantial, and some common trends emerge, especially after adjusting the scale for better visualization. The set-word embedding provides a more stable result compared to the other models. However, when considering the combined contributions of the other models, the set-word embedding generally aligns with the overall averages of their performance.

6.2. Remarks on the Comparison of Semantic Models

In summary, the results of this analysis suggest that if the meanings of words are derived from their relationships within a given semantic environment, those meanings fundamentally depend on the measurement tool used to establish these connections. In fact, it seems more practical for many applications to assume that no isolated relationships between terms fully define a word’s meaning. These relationships are always shaped by context, and—this being the main theoretical contribution of this paper—by the measurement tool used to uncover these connections.
Let us give some hints about possible future applications of our ideas. Let us provide some observations on how other commonly used NLP tools can be adapted to our formal context. Techniques such as prompt engineering and few-shot learning, which are fundamental to improve the performance of linguistic models, could be effectively integrated with set-word embeddings. Prompt engineering [30,31,32] involves crafting specific instructions or queries to guide models like GPT toward generating precise and contextually relevant responses. In our context, the construction of these prompts could be informed by the semantic indices used in set-word embeddings, with logical relationships between terms directly translating into formal properties of the model, thereby simplifying the prompt creation process. A similar approach applies to few-shot learning [33,34], which enables models to perform tasks with minimal examples by leveraging their generalization capabilities within the context of the provided prompt.
Additionally, set-word embeddings can serve as a foundation for developing new procedures for LLM-based semantic search (see, for example, ref. [35] and the references therein). This advanced search methodology employs large language models (LLMs) to move beyond traditional keyword matching, identifying implicit relationships and deeper meanings in text. By incorporating set-word embeddings as the underlying semantic framework, this approach could achieve greater interpretability and semantic precision compared to other word embedding options. The integration of these interpretable embeddings would enhance the relevance and contextual accuracy of retrieved information, offering a more robust and insightful framework for semantic search.
Regarding the limitations of the proposed set-word embeddings, we have shown that our procedure is general from a theoretical point of view, in the sense that it covers all situations where standard vector word embeddings are applied (Theorem 2). However, computing the equations relating both models could be computationally expensive. As can be seen, the size of the matrices involved in the computations could be huge in real cases, and the number of computations could increase exponentially, since the basic elements of the model are subsets of the initial set of indices, so their number grows as 2^N with the size N of the original set of terms. This problem could also carry over to the usual applications of our model, as the construction of the power set underlying the representation could break the potential scalability of the procedure. Other methods of identifying semantic terms with elements of a σ-algebra of subsets would have to be used, adapting them to specific contexts, as shown in the examples presented in Section 4. This could restrict their use, as a concrete representation would have to be devised for each application. The development of alternative systematic approaches for defining set-word embeddings depending on the context, along with their analysis and comparison with existing word embedding techniques, represents a key focus of our future work.

7. Conclusions

Set-word embeddings represent a more general approach—though fundamentally distinct from standard methods—to defining word embeddings on structured spaces. Rather than relying on a linear space, which can sometimes introduce confusion in the representation, the set-word embedding associates each term with a subset within a class of subsets, S, where the class itself has some structural properties. This set-based approach offers a more flexible and intuitive framework for capturing semantic relationships. Furthermore, this original set can always be embedded within a richer set structure, such as the topology generated by S. This allows the application of topological tools to better understand the embedding process. In this paper, we illustrated the case where S is embedded in the σ-algebra Σ(S) generated by S itself. In this context, a canonical structure can be established, treating it first as a measure space (S, Σ(S), μ), and subsequently as a metric space. The embedding representation, then, is viewed as an embedding in a measure space, offering a new perspective on word embeddings that allows for richer and more flexible analysis.
This set-based embedding framework not only enables a deeper understanding of the structural properties underlying word representations but also provides a means to apply advanced mathematical techniques, such as measure theory and topology, to the problem. By leveraging these tools, we can gain insights into the continuity, convergence, and general behavior of word embeddings in a more formalized and rigorous way. This approach also opens the door to the exploration of new types of embeddings, which could potentially capture more complex relationships between words and their meanings.
The main limitation of the proposed technique lies in its practical performance, particularly in terms of scalability. Enhancing the algorithms for real-time applications may be challenging. However, there is an advantage over other methods: the computation of distances between two semantic terms can be performed independently of the other coefficients in the metric matrix. This is because the semantic indices are defined using external information sources that are specific to each pair of terms. Therefore, if only a small subset of distances is needed, the method could remain competitive, even in the context of large and demanding information resources.
Finally, it is worth noting that this methodology offers a flexible way of defining word embeddings and introduces a shift in the way we understand them. As demonstrated in the example in Section 6, the widely accepted assumption in NLP that context shapes word relationships can be taken a step further from our perspective. Not only does context influence the relational meaning of words, but the measurement tool used to capture these relationships also defines a specific way of interpreting them. Set-word embeddings provide access to a variety of these measurement tools: for instance, each document repository creates a relational structure that reflects how terms interact within that particular context. The distance q μ , determined by a given measure μ on the set-word embedding, thus defines a unique interpretation of the words’ meanings.
Let us conclude the article by mentioning what is possibly one of the main benefits of the proposed set-word embeddings. Unlike the vast majority of word embeddings, the set-word embedding method offers clear advantages in terms of interpretability and explainability compared to contemporary methods. Although modern language models often rely on post hoc explanation methods such as feature importance, saliency maps, concept attribution, prototypes, and counterfactuals [36], the set-based approach provides inherent interpretability through its fundamentally distinct mathematical foundation, where each term is associated with a subset within a class of subsets, S, which can be embedded in a measure space (S, Σ(S), μ), and subsequently in a metric space. This transparent mathematical structure addresses the lack of formalism in problem formulation and of clear and unambiguous definitions identified as a key challenge in XAI research [37]. The method's representation through structured spaces rather than a traditional linear space, which can sometimes introduce confusion, provides a clearer framework for semantic analysis, contrasting with current embedding approaches that face faithfulness issues and often require complex post hoc explanations [38]. Furthermore, as noted in this article, the topology generated by S allows the application of topological tools to better understand the embedding process, offering a level of theoretical interpretability that aligns with the call for more attention to interpreting ML models [37]. This mathematical rigor addresses the need for faithful-by-construction approaches to model interpretability, though with more solid theoretical foundations than some current self-explanatory models that fall short due to obstacles like label leakage [38].

Author Contributions

Conceptualization, P.F.d.C. and E.A.S.P.; methodology, E.A.S.P.; software, C.A.R.P.; validation, C.S.A.; formal analysis, C.A.R.P. and E.A.S.P.; investigation, P.F.d.C., C.A.R.P. and E.A.S.P.; data curation, C.S.A.; writing—original draft preparation, E.A.S.P. and C.A.R.P.; writing—review and editing, P.F.d.C.; visualization, C.S.A.; supervision, E.A.S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Generalitat Valenciana (Spain), grant number PROMETEO CIPROM/2023/32.

Data Availability Statement

Data is contained within the article.

Acknowledgments

We would like to acknowledge the support of Instituto Universitario de Matemática Pura y Aplicada and Universitat Politècnica de València.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix presents selected examples of state-of-the-art word embedding models, focusing on their ability to capture semantic relationships between precious metals, commodities, and related terms. The evaluation includes both performance metrics and a detailed similarity analysis using cosine similarity and L2 distance measures.
Let us first give an overview of the models we present in the Discussion section of the paper. The E5 family of models, developed by Microsoft Research [39], represents a significant advancement in text embeddings. We evaluate both E5-small (384 dimensions) and E5-base (768 dimensions). These models were trained using contrastive learning on diverse datasets, including web content, academic papers, and domain-specific documentation. MiniLM models (L6 and L12-v2) are lightweight alternatives developed by Microsoft [40]. Using knowledge distillation techniques, they compress BERT-like architectures while maintaining competitive performance. The L6 variant emphasizes efficiency, while L12-v2 provides more nuanced semantic understanding. BGE-small, developed by BAAI [41], is optimized for efficient retrieval tasks. It employs a combined training strategy of contrastive learning and masked language modeling, achieving strong performance despite its compact architecture. The results are shown in the following subsections.

Appendix A.1. Performance Metrics

Table A1 shows the computational performance of each model. The metrics include model loading time, average embedding time per input, and embedding dimensionality.
Table A1. Model performance comparison.
Model | Load Time (ms) | Avg Embed Time (ms) | Dimension
E5-small | 3015.22 | 177.77 | 384
E5-base | 4201.97 | 255.57 | 768
MiniLM-L6 | 1688.44 | 112.07 | 384
BGE-small | 2864.15 | 233.54 | 384
MiniLM-L12-v2 | 4450.30 | 213.06 | 384
Regarding the performance characteristics of the models, it can be said that MiniLM-L6 demonstrates superior efficiency with the lowest load and embedding times. On the other hand, E5-base requires significant computational resources but provides higher dimensionality. MiniLM-L12-v2 shows unexpectedly high load time despite having similar architecture to L6, while BGE-small maintains balanced performance metrics.
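The load and embedding times of Table A1 can be obtained with a standard wall-clock measurement; a hedged sketch follows, in which `timed` is a hypothetical helper of our own and the stand-in function replaces an actual model-loading or encoding call, whose API depends on the embedding library used.

```python
# Hypothetical timing helper for metrics like those in Table A1.
# Any model call (loading, encoding a batch) can be passed as fn.
import time

def timed(fn, *args, repeats=1):
    """Return (mean elapsed time in ms, last result) of calling fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000 / repeats
    return elapsed_ms, result

# Stand-in for a real embedding call such as model.encode(texts):
ms, value = timed(sum, range(1_000))
print(f"{ms:.3f} ms -> {value}")  # value == 499500
```

Averaging over several repeats (the `repeats` parameter) is advisable for the per-input embedding times, since single measurements of short calls are noisy.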

Appendix A.2. Similarity Analysis

We analyze both cosine similarity and L2 distance matrices for the ten key terms involved in the example that we follow in Section 6: Gold, Silver, Medal, Blue, Reserves, Coin, Rubber, Diamonds, Tin, and Timber. For cosine similarity, higher values (closer to 1.0) indicate greater similarity. For L2 distance, lower values indicate greater similarity.
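For reference, the two measures reported in the matrices below can be computed as follows (a NumPy sketch with toy vectors of our own choosing); note that for unit-normalized embeddings the two measures are monotonically related, which explains why the cosine and L2 rankings largely agree in the tables.

```python
# The two measures used in the appendix tables, computed on toy vectors.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def l2_distance(u, v):
    return float(np.linalg.norm(u - v))

u = np.array([0.6, 0.8, 0.0])  # unit-norm toy "embedding"
v = np.array([0.8, 0.6, 0.0])

cos = cosine_similarity(u, v)  # 0.96
d = l2_distance(u, v)          # sqrt(0.08) ~ 0.283

# For unit-normalized embeddings: ||u - v||^2 = 2 * (1 - cos(u, v)).
assert np.isclose(d ** 2, 2 * (1 - cos))
```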

Appendix A.2.1. E5-Small Results

The E5-small model shows strong relationships between precious metals and related terms (see cosine similarity and L2 distance in Table A2 and Table A3, respectively). Gold–Silver–Medal form a highly cohesive cluster (cosine > 0.938), and precious metals show a strong correlation (Gold–Silver: 0.951). Consistently low L2 distances between related terms are obtained, and Timber shows relatively weaker relationships across all terms.
Table A2. E5-small cosine similarity matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 1.000 | 0.951 | 0.952 | 0.916 | 0.925 | 0.942 | 0.922 | 0.941 | 0.915 | 0.908
Silver | 0.951 | 1.000 | 0.938 | 0.928 | 0.922 | 0.936 | 0.911 | 0.928 | 0.932 | 0.899
Medal | 0.952 | 0.938 | 1.000 | 0.931 | 0.944 | 0.950 | 0.942 | 0.952 | 0.936 | 0.931
Blue | 0.916 | 0.928 | 0.931 | 1.000 | 0.934 | 0.908 | 0.922 | 0.931 | 0.929 | 0.908
Reserves | 0.925 | 0.922 | 0.944 | 0.934 | 1.000 | 0.930 | 0.924 | 0.939 | 0.935 | 0.912
Coin | 0.942 | 0.936 | 0.950 | 0.908 | 0.930 | 1.000 | 0.926 | 0.934 | 0.911 | 0.909
Rubber | 0.922 | 0.911 | 0.942 | 0.922 | 0.924 | 0.926 | 1.000 | 0.927 | 0.906 | 0.914
Diamonds | 0.941 | 0.928 | 0.952 | 0.931 | 0.939 | 0.934 | 0.927 | 1.000 | 0.916 | 0.911
Tin | 0.915 | 0.932 | 0.936 | 0.929 | 0.935 | 0.911 | 0.906 | 0.916 | 1.000 | 0.893
Timber | 0.908 | 0.899 | 0.931 | 0.908 | 0.912 | 0.909 | 0.914 | 0.911 | 0.893 | 1.000
Table A3. E5-small L2 distance matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 0.000 | 0.499 | 0.487 | 0.652 | 0.620 | 0.540 | 0.634 | 0.552 | 0.656 | 0.685
Silver | 0.499 | 0.000 | 0.562 | 0.610 | 0.636 | 0.575 | 0.684 | 0.614 | 0.590 | 0.723
Medal | 0.487 | 0.562 | 0.000 | 0.590 | 0.534 | 0.505 | 0.548 | 0.500 | 0.572 | 0.592
Blue | 0.652 | 0.610 | 0.590 | 0.000 | 0.583 | 0.687 | 0.637 | 0.602 | 0.606 | 0.689
Reserves | 0.620 | 0.636 | 0.534 | 0.583 | 0.000 | 0.602 | 0.631 | 0.566 | 0.582 | 0.677
Coin | 0.540 | 0.575 | 0.505 | 0.687 | 0.602 | 0.000 | 0.622 | 0.589 | 0.678 | 0.685
Rubber | 0.634 | 0.684 | 0.548 | 0.637 | 0.631 | 0.622 | 0.000 | 0.621 | 0.700 | 0.671
Diamonds | 0.552 | 0.614 | 0.500 | 0.602 | 0.566 | 0.589 | 0.621 | 0.000 | 0.664 | 0.685
Tin | 0.656 | 0.590 | 0.572 | 0.606 | 0.582 | 0.678 | 0.700 | 0.664 | 0.000 | 0.744
Timber | 0.685 | 0.723 | 0.592 | 0.689 | 0.677 | 0.685 | 0.671 | 0.685 | 0.744 | 0.000

Appendix A.2.2. E5-Base Results

E5-base shows extremely high cosine similarities across all terms, with notable patterns (Table A4). All cosine similarities are above 0.96, suggesting potential overgeneralization. Gold–Diamonds shows the strongest relationship (cosine: 0.992) and L2 distances show more differentiation than cosine similarities (Table A5). Finally, Rubber–Timber shows a surprisingly strong relationship (L2: 1.898).
Table A4. E5-base cosine similarity matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 1.000 | 0.983 | 0.987 | 0.969 | 0.988 | 0.987 | 0.981 | 0.992 | 0.983 | 0.986
Silver | 0.983 | 1.000 | 0.969 | 0.986 | 0.984 | 0.975 | 0.991 | 0.986 | 0.985 | 0.990
Medal | 0.987 | 0.969 | 1.000 | 0.960 | 0.979 | 0.984 | 0.967 | 0.981 | 0.971 | 0.974
Blue | 0.969 | 0.986 | 0.960 | 1.000 | 0.975 | 0.960 | 0.978 | 0.969 | 0.965 | 0.976
Reserves | 0.988 | 0.984 | 0.979 | 0.975 | 1.000 | 0.979 | 0.985 | 0.985 | 0.978 | 0.988
Coin | 0.987 | 0.975 | 0.984 | 0.960 | 0.979 | 1.000 | 0.976 | 0.982 | 0.986 | 0.980
Rubber | 0.981 | 0.991 | 0.967 | 0.978 | 0.985 | 0.976 | 1.000 | 0.984 | 0.987 | 0.995
Diamonds | 0.992 | 0.986 | 0.981 | 0.969 | 0.985 | 0.982 | 0.984 | 1.000 | 0.984 | 0.988
Tin | 0.983 | 0.985 | 0.971 | 0.965 | 0.978 | 0.986 | 0.987 | 0.984 | 1.000 | 0.988
Timber | 0.986 | 0.990 | 0.974 | 0.976 | 0.988 | 0.980 | 0.995 | 0.988 | 0.988 | 1.000
Table A5. E5-base L2 distance matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 0.000 | 3.262 | 2.804 | 4.327 | 2.741 | 2.817 | 3.559 | 2.243 | 3.296 | 3.020
Silver | 3.262 | 0.000 | 4.427 | 3.066 | 3.227 | 3.981 | 2.401 | 2.997 | 3.131 | 2.527
Medal | 2.804 | 4.427 | 0.000 | 4.873 | 3.585 | 3.120 | 4.650 | 3.431 | 4.324 | 4.090
Blue | 4.327 | 3.066 | 4.873 | 0.000 | 3.986 | 4.893 | 3.909 | 4.422 | 4.699 | 3.961
Reserves | 2.741 | 3.227 | 3.585 | 3.986 | 0.000 | 3.656 | 3.123 | 3.083 | 3.759 | 2.779
Coin | 2.817 | 3.981 | 3.120 | 4.893 | 3.656 | 0.000 | 3.926 | 3.328 | 2.954 | 3.530
Rubber | 3.559 | 2.401 | 4.650 | 3.909 | 3.123 | 3.926 | 0.000 | 3.237 | 2.872 | 1.898
Diamonds | 2.243 | 2.997 | 3.431 | 4.422 | 3.083 | 3.328 | 3.237 | 0.000 | 3.196 | 2.738
Tin | 3.296 | 3.131 | 4.324 | 4.699 | 3.759 | 2.954 | 2.872 | 3.196 | 0.000 | 2.770
Timber | 3.020 | 2.527 | 4.090 | 3.961 | 2.779 | 3.530 | 1.898 | 2.738 | 2.770 | 0.000

Appendix A.2.3. MiniLM-L6 Results

MiniLM-L6 shows more differentiated relationships than the E5 models (see cosine similarity and L2 distance in Table A6 and Table A7, respectively). Clear clustering of precious metals (Gold–Silver: 0.832) is observed, and lower overall similarities suggest better discrimination. Timber consistently shows the lowest similarities with metals, and L2 distances align well with semantic relationships.
Table A6. MiniLM-L6 cosine similarity matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 1.000 | 0.832 | 0.763 | 0.685 | 0.694 | 0.776 | 0.609 | 0.696 | 0.570 | 0.542
Silver | 0.832 | 1.000 | 0.776 | 0.647 | 0.563 | 0.682 | 0.546 | 0.660 | 0.648 | 0.449
Medal | 0.763 | 0.776 | 1.000 | 0.643 | 0.653 | 0.688 | 0.581 | 0.655 | 0.572 | 0.511
Blue | 0.685 | 0.647 | 0.643 | 1.000 | 0.585 | 0.619 | 0.569 | 0.606 | 0.601 | 0.424
Reserves | 0.694 | 0.563 | 0.653 | 0.585 | 1.000 | 0.723 | 0.577 | 0.593 | 0.438 | 0.470
Coin | 0.776 | 0.682 | 0.688 | 0.619 | 0.723 | 1.000 | 0.644 | 0.664 | 0.610 | 0.453
Rubber | 0.609 | 0.546 | 0.581 | 0.569 | 0.577 | 0.644 | 1.000 | 0.615 | 0.562 | 0.582
Diamonds | 0.696 | 0.660 | 0.655 | 0.606 | 0.593 | 0.664 | 0.615 | 1.000 | 0.560 | 0.468
Tin | 0.570 | 0.648 | 0.572 | 0.601 | 0.438 | 0.610 | 0.562 | 0.560 | 1.000 | 0.404
Timber | 0.542 | 0.449 | 0.511 | 0.424 | 0.470 | 0.453 | 0.582 | 0.468 | 0.404 | 1.000
Table A7. MiniLM-L6 L2 distance matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 0.000 | 0.786 | 0.849 | 1.026 | 1.009 | 0.838 | 1.112 | 1.031 | 1.217 | 1.225
Silver | 0.786 | 0.000 | 0.901 | 1.157 | 1.284 | 1.076 | 1.287 | 1.154 | 1.170 | 1.436
Medal | 0.849 | 0.901 | 0.000 | 1.091 | 1.072 | 0.987 | 1.150 | 1.097 | 1.213 | 1.265
Blue | 1.026 | 1.157 | 1.091 | 0.000 | 1.217 | 1.139 | 1.217 | 1.210 | 1.213 | 1.427
Reserves | 1.009 | 1.284 | 1.072 | 1.217 | 0.000 | 0.969 | 1.201 | 1.228 | 1.435 | 1.366
Coin | 0.838 | 1.076 | 0.987 | 1.139 | 0.969 | 0.000 | 1.073 | 1.092 | 1.171 | 1.353
Rubber | 1.112 | 1.287 | 1.150 | 1.217 | 1.201 | 1.073 | 0.000 | 1.172 | 1.246 | 1.188
Diamonds | 1.031 | 1.154 | 1.097 | 1.210 | 1.228 | 1.092 | 1.172 | 0.000 | 1.296 | 1.397
Tin | 1.217 | 1.170 | 1.213 | 1.213 | 1.435 | 1.171 | 1.246 | 1.296 | 0.000 | 1.473
Timber | 1.225 | 1.436 | 1.265 | 1.427 | 1.366 | 1.353 | 1.188 | 1.397 | 1.473 | 0.000

Appendix A.2.4. BGE-Small Results

BGE-small demonstrates balanced semantic understanding. There are strong relationships between related metals (Silver–Coin: 0.794), and moderate cross-category similarities. The model provides a clear distinction between metal and non-metal terms. Also, we find consistent L2 distances supporting cosine similarity patterns (see cosine similarity and L2 distance in Table A8 and Table A9, respectively).
Table A8. BGE-small cosine similarity matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 1.000 | 0.783 | 0.718 | 0.637 | 0.621 | 0.733 | 0.646 | 0.750 | 0.654 | 0.626
Silver | 0.783 | 1.000 | 0.711 | 0.751 | 0.635 | 0.794 | 0.633 | 0.736 | 0.742 | 0.629
Medal | 0.718 | 0.711 | 1.000 | 0.619 | 0.618 | 0.646 | 0.566 | 0.668 | 0.613 | 0.606
Blue | 0.637 | 0.751 | 0.619 | 1.000 | 0.554 | 0.678 | 0.698 | 0.696 | 0.633 | 0.643
Reserves | 0.621 | 0.635 | 0.618 | 0.554 | 1.000 | 0.644 | 0.535 | 0.568 | 0.593 | 0.643
Coin | 0.733 | 0.794 | 0.646 | 0.678 | 0.644 | 1.000 | 0.646 | 0.669 | 0.731 | 0.562
Rubber | 0.646 | 0.633 | 0.566 | 0.698 | 0.535 | 0.646 | 1.000 | 0.622 | 0.605 | 0.615
Diamonds | 0.750 | 0.736 | 0.668 | 0.696 | 0.568 | 0.669 | 0.622 | 1.000 | 0.627 | 0.619
Tin | 0.654 | 0.742 | 0.613 | 0.633 | 0.593 | 0.731 | 0.605 | 0.627 | 1.000 | 0.619
Timber | 0.626 | 0.629 | 0.606 | 0.643 | 0.643 | 0.562 | 0.615 | 0.619 | 0.619 | 1.000
Table A9. BGE-small L2 distance matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 0.000 | 1.076 | 1.244 | 1.425 | 1.475 | 1.199 | 1.363 | 1.156 | 1.369 | 1.449
Silver | 1.076 | 0.000 | 1.258 | 1.179 | 1.446 | 1.051 | 1.385 | 1.187 | 1.180 | 1.440
Medal | 1.244 | 1.258 | 0.000 | 1.475 | 1.497 | 1.397 | 1.528 | 1.350 | 1.466 | 1.505
Blue | 1.425 | 1.179 | 1.475 | 0.000 | 1.629 | 1.345 | 1.288 | 1.304 | 1.439 | 1.444
Reserves | 1.475 | 1.446 | 1.497 | 1.629 | 0.000 | 1.433 | 1.618 | 1.573 | 1.536 | 1.461
Coin | 1.199 | 1.051 | 1.397 | 1.345 | 1.433 | 0.000 | 1.366 | 1.334 | 1.210 | 1.572
Rubber | 1.363 | 1.385 | 1.528 | 1.288 | 1.618 | 1.366 | 0.000 | 1.408 | 1.447 | 1.457
Diamonds | 1.156 | 1.187 | 1.350 | 1.304 | 1.573 | 1.334 | 1.408 | 0.000 | 1.420 | 1.462
Tin | 1.369 | 1.180 | 1.466 | 1.439 | 1.536 | 1.210 | 1.447 | 1.420 | 0.000 | 1.469
Timber | 1.449 | 1.440 | 1.505 | 1.444 | 1.461 | 1.572 | 1.457 | 1.462 | 1.469 | 0.000

Appendix A.2.5. MiniLM-L12-v2 Results

MiniLM-L12-v2 shows the most distinctive differentiation with the strongest distinction between metal and non-metal terms. Gold–Silver maintains the highest similarity (0.732) among all pairs, and there are very low similarities for unrelated pairs (Timber–Coin: 0.144) (Table A10). L2 distances show the largest range, indicating strong discrimination capability (Table A11).
Summarizing all these comments, we can see that each of these advanced models provides different features, and none of them can be considered "better" than the others. Our set-word embedding provides another point of view, as Word2Vec does, but the main advantage of our proposal is that we can ultimately give an interpretation of why the observed numerical values arise.
Table A10. MiniLM-L12-v2 cosine similarity matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 1.000 | 0.732 | 0.480 | 0.439 | 0.395 | 0.645 | 0.208 | 0.643 | 0.422 | 0.292
Silver | 0.732 | 1.000 | 0.451 | 0.418 | 0.397 | 0.539 | 0.264 | 0.457 | 0.491 | 0.297
Medal | 0.480 | 0.451 | 1.000 | 0.278 | 0.350 | 0.409 | 0.187 | 0.353 | 0.432 | 0.226
Blue | 0.439 | 0.418 | 0.278 | 1.000 | 0.249 | 0.371 | 0.300 | 0.276 | 0.433 | 0.394
Reserves | 0.395 | 0.397 | 0.350 | 0.249 | 1.000 | 0.401 | 0.254 | 0.341 | 0.433 | 0.194
Coin | 0.645 | 0.539 | 0.409 | 0.371 | 0.401 | 1.000 | 0.335 | 0.376 | 0.471 | 0.144
Rubber | 0.208 | 0.264 | 0.187 | 0.300 | 0.254 | 0.335 | 1.000 | 0.291 | 0.516 | 0.406
Diamonds | 0.643 | 0.457 | 0.353 | 0.276 | 0.341 | 0.376 | 0.291 | 1.000 | 0.439 | 0.268
Tin | 0.422 | 0.491 | 0.432 | 0.433 | 0.433 | 0.471 | 0.516 | 0.439 | 1.000 | 0.438
Timber | 0.292 | 0.297 | 0.226 | 0.394 | 0.194 | 0.144 | 0.406 | 0.268 | 0.438 | 1.000
Table A11. MiniLM-L12-v2 L2 distance matrix.
Term | Gold | Silver | Medal | Blue | Res. | Coin | Rubber | Diam. | Tin | Timber
Gold | 0.000 | 1.693 | 2.328 | 2.366 | 2.641 | 2.047 | 3.303 | 2.399 | 2.208 | 2.879
Silver | 1.693 | 0.000 | 2.353 | 2.366 | 2.601 | 2.303 | 3.152 | 2.884 | 2.036 | 2.829
Medal | 2.328 | 2.353 | 0.000 | 2.593 | 2.668 | 2.574 | 3.275 | 3.107 | 2.092 | 2.930
Blue | 2.366 | 2.366 | 2.593 | 0.000 | 2.805 | 2.602 | 3.001 | 3.226 | 2.009 | 2.545
Reserves | 2.641 | 2.601 | 2.668 | 2.805 | 0.000 | 2.711 | 3.261 | 3.227 | 2.271 | 3.134
Coin | 2.047 | 2.303 | 2.574 | 2.602 | 2.711 | 0.000 | 3.108 | 3.165 | 2.242 | 3.261
Rubber | 3.303 | 3.152 | 3.275 | 3.001 | 3.261 | 3.108 | 0.000 | 3.554 | 2.455 | 2.933
Diamonds | 2.399 | 2.884 | 3.107 | 3.226 | 3.227 | 3.165 | 3.554 | 0.000 | 2.790 | 3.412
Tin | 2.208 | 2.036 | 2.092 | 2.009 | 2.271 | 2.242 | 2.455 | 2.790 | 0.000 | 2.288
Timber | 2.879 | 2.829 | 2.930 | 2.545 | 3.134 | 3.261 | 2.933 | 3.412 | 2.288 | 0.000
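As an illustration of how these matrices are read, the following Python sketch ranks the neighbors of Gold using an excerpt of the cosine values from Table A10; the code itself is our own illustrative addition.

```python
# Ranking nearest neighbours of "Gold" from an excerpt of the
# MiniLM-L12-v2 cosine similarity matrix (Table A10, Gold row).
terms = ["Silver", "Medal", "Coin", "Rubber", "Diamonds", "Timber"]
gold_row = [0.732, 0.480, 0.645, 0.208, 0.643, 0.292]  # cosine vs. Gold

ranked = sorted(zip(terms, gold_row), key=lambda t: t[1], reverse=True)
print(ranked[:3])  # [('Silver', 0.732), ('Coin', 0.645), ('Diamonds', 0.643)]
```

The ranking reproduces the clustering observed in the text: Silver, Coin, and Diamonds are Gold's closest neighbors, while Rubber and Timber are farthest.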

References

1. Clark, S. Vector space models of lexical meaning. In The Handbook of Contemporary Semantics; Lappin, S., Fox, C., Eds.; Blackwell: Malden, MA, USA, 2015; pp. 493–522.
2. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Volume 1, pp. 4171–4186.
3. Incitti, F.; Urli, F.; Snidaro, L. Beyond word embeddings: A survey. Inf. Fusion 2023, 89, 418–436.
4. Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
5. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
6. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. Preprint. 2018. Available online: https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035 (accessed on 23 December 2024).
7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
8. Grand, G.; Blank, I.A.; Pereira, F.; Fedorenko, E. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nat. Hum. Behav. 2022, 6, 975–987.
9. Manetti, A.; Ferrer-Sapena, A.; Sánchez-Pérez, E.A.; Lara-Navarra, P. Design Trend Forecasting by Combining Conceptual Analysis and Semantic Projections: New Tools for Open Innovation. J. Open Innov. Technol. Mark. Complex. 2021, 7, 92.
10. Zadeh, L.A. Quantitative fuzzy semantics. Inf. Sci. 1971, 3, 159–176.
11. Zadeh, L.A. A Fuzzy-Set-Theoretic Interpretation of Linguistic Hedges. J. Cybern. 1972, 2, 4–34.
12. Saranya, M.; Amutha, B. A Survey of Machine Learning Technique for Topic Modeling and Word Embedding. In Proceedings of the 2024 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 14–15 March 2024; Volume 1, pp. 1–6.
13. Cao, H. Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark. arXiv 2024, arXiv:2406.01607.
14. Wang, Y.; Mishra, S.; Alipoormolabashi, P.; Kordi, Y.; Mirzaei, A.; Arunkumar, A.; Khashabi, D. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv 2022, arXiv:2204.07705.
15. Georgila, K. Comparing Pre-Trained Embeddings and Domain-Independent Features for Regression-Based Evaluation of Task-Oriented Dialogue Systems. In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan, 18–20 September 2024; pp. 610–623.
16. Baroni, M.; Zamparelli, R. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 1183–1193.
17. Erk, K. Vector space models of word meaning and phrase meaning: A survey. Lang. Linguist. Compass 2012, 6, 635–653.
18. Arens, R.F.; Eels, J.J. On embedding uniform and topological spaces. Pac. J. Math. 1956, 6, 397–403.
19. Kosub, S. A note on the triangle inequality for the Jaccard distance. Pattern Recognit. Lett. 2019, 120, 36–38.
20. Deza, M.M.; Deza, E. Encyclopedia of Distances, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2009.
21. Gardner, A.; Kanno, J.; Duncan, C.A.; Selmic, R. Measuring distance between unordered sets of different sizes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 137–143.
22. Cobzaş, C. Functional Analysis in Asymmetric Normed Spaces; Springer Science & Business Media: Berlin, Germany, 2012.
23. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3111–3119.
24. Mikolov, T.; Yih, W.T.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 746–751.
25. Candido, L.; Cúth, M.; Doucha, M. Isomorphisms between spaces of Lipschitz functions. J. Funct. Anal. 2019, 277, 2697–2727.
26. Cobzaş, C.; Miculescu, R.; Nicolae, A. Lipschitz Functions; Springer: Berlin, Germany, 2019.
27. Erdogan, E.; Ferrer-Sapena, A.; Jimenez-Fernandez, E.; Sánchez Pérez, E. Index spaces and standard indices in metric modelling. Nonlinear Anal. Model. Control 2022, 27, 1–20.
28. Kuratowski, C. Quelques problèmes concernant les espaces métriques non-séparables. Fundam. Math. 1935, 25, 534–545.
29. Ruas, T.; Grosky, W. Keyword extraction through contextual semantic analysis of documents. In Proceedings of the 9th International Conference on Management of Digital EcoSystems, Bangkok, Thailand, 7–10 November 2017; pp. 150–156.
30. Shi, F.; Qing, P.; Yang, D.; Wang, N.; Lei, Y.; Lu, H.; Lin, X.; Li, D. Prompt space optimizing few-shot reasoning success with large language models. arXiv 2023, arXiv:2306.03799.
31. Wan, X.; Sun, R.; Dai, H.; Arik, S.O.; Pfister, T. Better zero-shot reasoning with self-adaptive prompting. arXiv 2023, arXiv:2305.14106.
32. Zheng, C.T.; Liu, C.; Wong, H.S. Corpus-based topic diffusion for short text clustering. Neurocomputing 2018, 275, 2444–2458.
33. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 1–40.
34. Xu, S.; Pang, L.; Shen, H.; Cheng, X.; Chua, T.S. Search-in-the-Chain: Interactively Enhancing Large Language Models with Search for Knowledge-intensive Tasks. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024; pp. 1362–1373.
35. Xiong, H.; Bian, J.; Li, Y.; Li, X.; Du, M.; Wang, S.; Helal, S. When search engine services meet large language models: Visions and challenges. IEEE Trans. Serv. Comput. 2024, 17, 4558–4577.
36. Bodria, F.; Giannotti, F.; Guidotti, R.; Naretto, F.; Pedreschi, D.; Rinzivillo, S. Benchmarking and Survey of Explanation Methods for Black Box Models. Data Min. Knowl. Disc. 2023, 37, 1719–1778.
37. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
38. Lyu, Q.; Apidianaki, M.; Callison-Burch, C. Towards Faithful Model Explanation in NLP: A Survey. Comput. Linguist. 2024, 50, 657–723.
39. Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Multilingual E5 Text Embeddings: A Technical Report. arXiv 2024, arXiv:2402.05672.
40. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 5776–5788.
41. Lin, Y.; Ding, B.; Jagadish, H.V.; Zhou, J. SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions. arXiv 2023, arXiv:2309.07856.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
