Article

Measuring Semantic Stability: Statistical Estimation of Semantic Projections via Word Embeddings

by Roger Arnau 1, Ana Coronado Ferrer 2, Álvaro González Cortés 1, Claudia Sánchez Arnau 1 and Enrique A. Sánchez Pérez 1,*
1 Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain
2 Departamento de TICS, Florida Universitaria, 46470 Catarroja, Spain
* Author to whom correspondence should be addressed.
Axioms 2025, 14(5), 389; https://doi.org/10.3390/axioms14050389
Submission received: 25 April 2025 / Revised: 17 May 2025 / Accepted: 19 May 2025 / Published: 21 May 2025
(This article belongs to the Special Issue New Perspectives in Mathematical Statistics)

Abstract

We present a new framework to study the stability of semantic projections based on word embeddings. Roughly speaking, semantic projections are indices taking values in the interval $[0,1]$ that measure how terms share contextual meaning with the words of a given universe. Since there are many ways to define such projections, it is important to establish a procedure for verifying whether a group of them behaves similarly. Moreover, when fixing one particular projection, it is important to assess whether the average projections remain consistent when replacing the original universe with a similar one describing the same semantic environment. The aim of this paper is to address the lack of formal tools for assessing the stability of semantic projections (that is, their invariance under formal changes which preserve the underlying semantic context) across alternative but semantically related universes in word embedding models. To address these problems, we employ a combination of statistical and AI methods, including correlation analysis, clustering, chi-squared distance measures, weighted approximations, and Lipschitz-based estimators. The methodology provides theoretical guarantees under mild mathematical assumptions, ensuring bounded errors in projection estimations based on the assumption of Lipschitz continuity. We demonstrate the practical applicability of our approach through two case studies involving agricultural terminology across multiple data sources (DOAJ, Scholar, Google, and Arxiv). Our results show that semantic stability can be quantitatively evaluated and that the careful modeling of projection functions and universes is crucial for robust semantic analysis in NLP.
MSC:
62P25; 68T50

1. Introduction

In recent years, semantic tools based on word embeddings have become among the most powerful and widely adopted instruments in Natural Language Processing (NLP) [1]. These embeddings map semantic terms to vectors in high-dimensional spaces, enabling the computation of semantic similarity through distances such as the Euclidean norm or the cosine similarity. The assumption behind these models is that semantic similarity is captured by proximity; that is, two words are semantically related if their vector representations are close in the embedding space.
In this framework, semantic projections have emerged as essential tools for modeling specific features within these semantic environments. Formally, they are functions that quantify how a term expresses a certain property, often modeled as linear functionals over the embedding space [2]. This allows projecting high-dimensional semantic information onto lower-dimensional features such as size, age, or sentiment [3]. However, linearity imposes structural constraints that may not align well with the inherent complexity of language, especially when dealing with the non-linear or context-dependent relationships between terms.
The alternative, proposed in [4], overcomes these limitations by redefining the word embedding process in a fundamentally different mathematical context. Instead of relying solely on linear algebra, this new framework embeds terms into algebras of subsets endowed with a measure and a family of (possibly non-symmetric) metrics. This leads to what is called a set-word embedding, where the semantic representation of a word is given by a measurable subset and its semantic projection is given via a semantic index, which is defined as a proportion of overlap between these sets. This approach yields a more flexible, context-sensitive representation, while remaining mathematically robust through the use of the theory of Lipschitz functions, changing the usual low dimensional linear spaces used in the word embeddings by the more sophisticated Arens–Eells spaces [3,4].
In the present work, our aim is to explore two closely related problems concerning the stability of semantic projections across different methodologies. These questions arise naturally in empirical language analysis, where the results of semantic modeling are expected to remain consistent when small changes are introduced into the system. Concretely, two main problems are analyzed. In the first one, stability under different semantic projections is studied. Semantic projections are real-valued functions that measure the intensity of the relationship between two terms, one belonging to a universe of words that describes a given semantic environment, and the other being an external linguistic expression. However, these projections can be defined in different ways, and as a result, their values generally differ even when applied to the same pair of terms. Thus, given a fixed universe of words $U$ and a term $t$, we analyze how consistent different semantic projections are. In other words, we ask whether the internal “distribution of meaning” through a set of projections $P^i$ is invariant with respect to the index $i$, thus indicating coherence across multiple interpretative perspectives.
The second one is stability under changes in the sets of words that we call universes. Given two universes $U_1$ and $U_2$ that are assumed to describe the same semantic environment, we evaluate to what extent the projection vectors $(P_{u_i^1}(t))_{i=1}^{n_1}$ and $(P_{u_j^2}(t))_{j=1}^{n_2}$ are compatible, in the sense that the overall information that these projections contain is essentially the same. This involves defining methods to compare universes via metrics and estimating how projection values can be transferred or approximated from one universe to another.
Our approach relies on the existence of suitable word embeddings, and it builds upon recent ideas on Natural Language Processing (NLP). The development of distributional semantics has provided a powerful framework for modeling lexical meaning, grounded in the hypothesis that words occurring in similar contexts tend to have similar meanings [5,6,7]. Word embeddings give computational form to this idea by representing terms as vectors in high-dimensional spaces, where spatial proximity captures semantic similarity [8,9,10], and they are widely used across scientific domains to analyze deep relationships among linguistic constructs [1,11]. Early models such as Word2Vec [8] and GloVe [9] showed that relational and analogical properties can emerge from purely co-occurrence-based representations [12,13], paving the way for more sophisticated models of compositionality [14], contextual variation [15], and concept-based semantic dimensions via projection techniques [2]. Note that although this is of primary interest, here, we are not discussing how deep a study based on word embeddings can be or how far-reaching the conclusions might be through their use, despite their well-known limitations [16,17]. However, we must at least require that the proposed mathematical methods satisfy some minimal conditions of coherence and compatibility, and this is what we are analyzing here.
Projection-based approaches recover interpretable conceptual features (such as size, gender, or animacy) by aligning word vectors with semantically meaningful axes derived from empirical contrasts. These techniques have been successfully applied in diverse domains, including cross-modal learning [18], few-shot segmentation [19], and theoretical linguistics. At a foundational level, this line of research resonates with early insights by Zadeh on fuzzy categories, where linguistic meaning is modeled as graded and context-sensitive rather than crisp and discrete [20].
In both frameworks that we have considered (vector-valued and set-theoretic) the notion of distance plays a central role. For example, we construct matrices of pairwise distances between elements of two semantic universes and use them to build estimators that transfer projection values across different contexts. These estimators include point-wise approximations (through convex combinations weighted by semantic distance) and global error assessments based on root-mean-square deviation between the estimated and actual projection values.
The theoretical foundation of this methodology is reinforced by formal results showing that, under Lipschitz-type assumptions, the estimation error is bounded and controlled by the projection differences between two terms, $t$ and $t'$. This mathematical grounding provides a robust way to evaluate how stable and coherent semantic projections are under shifts in the projection function or the semantic universe.
Thus, the contributions of this paper are threefold. First, we formalize the comparison of projection structures under different projections and semantic universes, grounding it in metric geometry, statistical analysis, and AI methods. Following the theoretical background explained in Section 2, this is done in Section 3. An example of such an analysis is shown in Section 3.2, in which we propose computable estimators for semantic projections and their associated correlations, thus offering practical tools for evaluating semantic coherence. A comparison between universes is presented in Section 4, where a procedure based on the computation over a particular set of terms that are considered relevant is explained. Finally, in Section 4.4, we illustrate the utility of these tools through a concrete example in the field of agriculture.
Let us recall some basic definitions. If $(X, d)$ is a metric space, a function $f: X \to \mathbb{R}$ is said to be Lipschitz if there exists $L > 0$ such that
$$|f(a) - f(b)| \le L \, d(a, b) \quad \text{for all } a, b \in X.$$
The infimum of such constants is denoted by $\mathrm{Lip}(f)$ [21]. We will work in the examples with the Euclidean norm, but all the results are valid for any distance on a metric space.
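As a quick, purely illustrative check of this definition, the following Python sketch estimates the largest ratio $|f(a) - f(b)|/d(a, b)$ over sampled pairs for a function whose Lipschitz constant is known in advance; the function and the sample are arbitrary choices for illustration, not part of the original formulation.

```python
import numpy as np

# Empirical lower bound for Lip(f): the largest ratio |f(a) - f(b)| / d(a, b)
# over sampled pairs. The function f(x) = sum(sin(x_k)) has Lip(f) <= sqrt(3)
# for the Euclidean distance on R^3, so the printed value stays below 1.732.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
vals = np.sin(X).sum(axis=1)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
i, j = np.triu_indices(len(X), k=1)
print((np.abs(vals[i] - vals[j]) / D[i, j]).max())
```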
Several statistical and AI methods will be used, with the necessary information provided at the point of application.

2. Theoretical Background

The concepts of NLP and the mathematical tools we use in this paper are standard, but some of them come from fields not directly connected to statistics or data analysis, such as Lipschitz continuity. Let us start with a brief summary of the main definitions.
  • A metric space model is a mathematical structure (in our case, for semantics) that represents semantic terms as elements of a metric space $S$ (the semantic universe), where the distance between these elements represents the semantic distance between the corresponding terms. Recall that a metric is a subadditive and symmetric function $d: S \times S \to \mathbb{R}^+$ such that $d(A, B) = 0$ if and only if $A = B$. This is the basic assumption in the definition of word embeddings, in which semantic structures are embedded in high-dimensional Euclidean spaces. Thus, semantic similarity between terms (that is, how semantically close two language items are) is measured in the model by the distance $d$.
  • In this context, the notion of set-word embedding is not as well known as the usual linear space-valued word embeddings. Set-word embeddings are also metric models for NLP, but the semantic items are identified with subsets of a fixed set rather than with vectors of a linear space [4].
  • As mentioned in the Introduction, a function is Lipschitz continuous if there exists a constant such that the difference in function values between any two points is bounded by that constant times the distance between the points. This condition ensures that the function does not change too quickly, controlling the variation with respect to the metric.
  • Given a set $S$ and a class of subsets $\Sigma$ of $S$ that is closed under intersections, we say that a function $P: \Sigma \to \mathbb{R}$ is a semantic projection (or semantic index, as defined below) if it is given as a proportion $P(B) = m(A \cap B) / m(A)$ for another function $m: \Sigma \to \mathbb{R}$. Broadly speaking, in terms of semantic notions, this definition aims to provide a numerical estimate (a sort of probability) of how much a given (semantic) term B shares the meaning of another given term A, normalized by the second term.
In information management, it is usual to consider certain similarity functionals on finite sets that are already classical in this setting. The Jaccard (or Tanimoto) index, a similarity measure applied in many areas of Machine Learning, is defined by the expression $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ for $A$ and $B$ finite subsets of a set $W$, where $|C|$ denotes the cardinality of the set $C$. The so-called Jaccard distance [22], which satisfies the requirements for being a metric, is given by $D(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$. It is a particular case of a more general version, the Steinhaus distance for measurable sets $A, B$ with respect to a positive finite measure $\mu$ (that is, a countably additive set function), given by [23] (§1.5)
$$S_\mu(A, B) = \frac{\mu(A \cup B) - \mu(A \cap B)}{\mu(A \cup B)}.$$
The distance proposed in [24] is a generalization of these classical metrics, aiming to extend their applicability in broader mathematical and computational settings.
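As a concrete illustration of these indices under the counting measure (for which the Steinhaus distance coincides with the Jaccard distance), here is a minimal Python sketch; the document sets are hypothetical placeholders.

```python
def jaccard_index(A: set, B: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| for finite sets."""
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def jaccard_distance(A: set, B: set) -> float:
    """D(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|."""
    return 1.0 - jaccard_index(A, B)

# Hypothetical document-id sets for two terms.
docs_crop = {"d1", "d2", "d3", "d5"}
docs_soil = {"d2", "d3", "d4"}
print(jaccard_index(docs_crop, docs_soil))     # 0.4
print(jaccard_distance(docs_crop, docs_soil))  # 0.6
```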
Let us now define a semantic index characteristic of the embedding, based on an asymmetric version of the Jaccard index. In our context, and due to the role it plays in the framework of this work, we refer to it as the semantic index of one meaningful element with respect to another within a given context.
Throughout the rest of this section, we assume that $\mu$ is a finite positive measure defined on a $\sigma$-algebra $\Sigma$ of subsets of a set $\Omega$.
Definition 1.
Let $A, B \in \Sigma$. The semantic index of B on A is defined as
$$P_A(B) := \frac{\mu(A \cap B)}{\mu(A)}. \quad (1)$$
Thus, once a measurable set A is fixed, the semantic index becomes a function $P_A: \Sigma \to \mathbb{R}^+$.
Such functions are referred to as semantic projections in [3,4,25]. Informally, this ratio captures the “proportion of meaning” of A that is explained or shared by the meaning of B. However, it is important to emphasize that this is a purely mathematical definition. The term “meaning” associated with a set $A \in \Sigma$ is simply given by the evaluation of the measure $\mu$ on A, where $\mu$ plays the conventional role of quantifying the size of the set according to a fixed criterion.
In [25], the notion of semantic projection was introduced to formalize, by means of probabilistic indices, how a given conceptual item (typically a noun) is represented within a universe of concepts. Thus, given a term t and a finite universe of words $U$, the semantic projection $P_u(t)$ with respect to an element $u \in U$ is a real value contained in the interval $[0,1]$. It is defined as a vector of $|U|$ coordinates in [25] (§3.1), where each coordinate is given by $P_{u_i}(t)$ for $u_i \in U$, $i = 1, \ldots, |U|$.
A natural way of defining a particular case of such a semantic projection is the generalization of the non-symmetric version of the Jaccard index defined above. It is based on assigning a set from the $\sigma$-algebra $\Sigma$ to each element of $U$ and to the term t. Indeed, let $T$ be the set of all terms under consideration (with $t \in T$), which contains the universe $U$. Then, we define the assignment $t \mapsto A(t) \in \Sigma$, and if $\mu$ is the counting measure, we define
$$P_{u_i}(t) = \frac{\mu(A(t) \cap A(u_i))}{\mu(A(t))}.$$
A typical example is the case where $\Sigma$ is the family of all subsets of a collection of scientific papers $\Omega$, and the assignment $t \mapsto A(t)$ associates to each term t the subset of documents in $\Omega$ that contain it. In the practical applications presented in the following sections, semantic projections are defined precisely in this manner. By considering the counting measure $\mu$ over the $\sigma$-algebra of subsets of a given document repository, we compute the ratio between the number of documents in which both terms t and $u_i$ co-occur and the number of documents in which t appears.
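A minimal sketch of this co-occurrence construction, assuming a hypothetical inverted index mapping each term to the set of documents that contain it:

```python
def semantic_index(docs_t: set, docs_u: set) -> float:
    """P_{A(t)}(A(u)) = |A(t) ∩ A(u)| / |A(t)| under the counting measure."""
    if not docs_t:
        raise ValueError("the term t must occur in at least one document")
    return len(docs_t & docs_u) / len(docs_t)

# Hypothetical inverted index over a small repository.
index = {
    "plowing": {"d1", "d2", "d4"},
    "soil":    {"d1", "d2", "d3"},
    "tractor": {"d4"},
}
# Projection vector of t = "plowing" onto the universe U = {soil, tractor}.
projections = {u: semantic_index(index["plowing"], index[u])
               for u in ("soil", "tractor")}
print(projections)  # {'soil': 0.666..., 'tractor': 0.333...}
```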
To summarize the conceptual structure introduced in this section, we outline below the essential components that define our analytical framework.
  • A lexical domain, that is, a finite set of linguistic items, such as words, short expressions, or terms, denoted by $W$, which serves as the base vocabulary for constructing a contextual semantic model.
  • A measurable semantic space, which is a finite measure space $(\Omega, \Sigma, \mu)$, where $\Sigma$ is a $\sigma$-algebra of subsets of $\Omega$, and each element of $W$ is associated with a measurable subset in $\Sigma$. In this paper, $\Omega$ will be finite, $\Sigma$ the class of all subsets, and $\mu$ the counting measure.
  • An embedding function, which is an injective mapping $\iota: W \to \Sigma$ that assigns to each term in $W$ a unique measurable subset of $\Omega$. This mapping is well defined due to the assumption that all images lie within the $\sigma$-algebra $\Sigma$.
  • A word embedding $I: W \to \mathbb{R}^d$, which allows us to identify each term with a vector of a finite-dimensional real linear space.

3. Coherence Analysis of Semantic Projections: Statistical Approach

As we said, in the rest of the paper, we work with a finite set and the class of all its subsets as the measurable space $(\Omega, \Sigma)$, and with the counting measure as $\mu$. Consider the vector that defines the semantic projection of a term t on a universe $U = \{u_1, \ldots, u_n\}$ of semantic items. That is, given a concrete semantic projection (formally defined from a measure on a finite set as in (1) of Section 2), we fix the vector $(P_{u_1}(t), \ldots, P_{u_n}(t))$ that represents the term t in $U$ with respect to the projection $P$. We follow the notation introduced in the last section. It is supposed that we have different projections $P^1, \ldots, P^m$, and the objective of this section is to assess to what extent all of them provide similar information, that is, to what extent they jointly confirm the information on the “real value” of the sharing degree contained in each coordinate of such a vector. Consequently, we use some statistical concepts to analyze this question.
Thus, we have to deal with a finite set of semantic projections $P^1, \ldots, P^m$ of a term t into a given universe $U = \{u_1, \ldots, u_n\}$. In the applications of the model, each projection is provided by a search engine, such as Google, or a database of scientific documents, such as DOAJ. The order of the terms of $U$ is fixed.
If $T$ is the set of all terms, we have to consider several functions
$$P^i : T \times U \to [0,1]^n, \qquad (t, u) \mapsto \left( P^i_{u_1}(t), \ldots, P^i_{u_n}(t) \right),$$
where $P^i$ denotes each of the semantic projections taken into account.
Therefore, once a universe is fixed, each term t and each projection $P^i$ produce a vector $P^i(t) \in \mathbb{R}^n$ with coordinates between 0 and 1.
In this section, we consider the normalized vectors associated with a fixed term t. The motivation behind this normalization is to evaluate whether the relative weights of the coordinates of each projection vector $P^i(t)$ are independent of the index $i \in \{1, \ldots, m\}$. That is, we are interested in studying the internal distribution of information within each vector, regardless of its magnitude.

3.1. Analytic Procedure

Given a fixed term t, we define the normalized projection vectors as follows:
$$N^i(t) := \frac{P^i(t)}{\left\| P^i(t) \right\|}, \quad \text{for } i = 1, \ldots, m,$$
where $\| \cdot \|$ denotes the standard Euclidean norm in $\mathbb{R}^n$. To interpret the structure and similarity of these normalized vectors, our method involves the computation and visualization of the following items.
(1)
Correlation analysis.
We compute the correlation matrix of the vectors as follows:
$$\mathrm{Corr}_t := \left( \left\langle N^i(t), N^j(t) \right\rangle \right)_{i,j=1}^{m},$$
where $\langle \cdot, \cdot \rangle$ denotes the standard inner product in $\mathbb{R}^n$. This matrix quantifies the angular similarity between the normalized projections. We represent this matrix visually using a heatmap. Our aim is to provide a clear measure of how different the vectors are when representing various sources used to compute the semantic projections. Since correlation analysis is a standard procedure in experimental science and data analysis, it offers clear and direct information to potential users of our method.
Additionally, we have computed a chi-square similarity matrix based on the coordinate-wise comparison of N i ( t ) and N j ( t ) , and we also visualize it via a heatmap. This alternative emphasizes component-wise differences when interpreting the normalized weights as probability distributions. This approach provides a visual tool to represent the proximity between different sources of semantic projections, offering an even more straightforward understanding than correlation analysis.
(2)
Clustering of semantic profiles.
We treat the normalized vectors $\{N^i(t)\}_{i=1}^{m}$ as points in $\mathbb{R}^n$, and we apply clustering algorithms to explore grouping patterns. In particular, we perform the following:
- Apply Principal Component Analysis (PCA) to project the data into two dimensions for visualization.
- Perform k-means clustering for $k = 2, 3, 4$, showing the resulting clusters with graphical representations in the PCA plane.
- Use the Elbow method to determine the optimal number of clusters.
With these procedures, we intend to identify which semantic projection methods can be considered equivalent as well as appropriate. The main cluster is used to discard the sources that do not provide adequate responses from the interpretational point of view.
(3)
Representative or average result.
For applications, and following the identification of clusters or highly correlated groups of vectors, we can compute the mean of the original (non-normalized) vectors $P^i(t)$ within the main cluster, or of those with high correlation to a selected reference vector:
$$\bar{P}_C(t) := \frac{1}{|C|} \sum_{i \in C} P^i(t),$$
where $C \subseteq \{1, \ldots, m\}$ is the set of indices corresponding to the selected cluster or correlated group. This average could serve as a representative projection pattern for the term t.
This provides a full and comprehensive description of the properties of the semantic projections, allowing us to assess whether the results are coherent (in the sense that they follow similar trends) and, if so, to compute a meaningful average value that synthesizes the overall outcome. Note that the proposed groupings depend on a fixed term. A consolidated sketch of these computations follows.
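The analytic procedure above can be sketched in a few lines of Python. The projection vectors used here are hypothetical placeholders, and NumPy and scikit-learn are assumed to be available; this is a minimal illustration, not the exact pipeline behind the tables and figures below.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Rows: hypothetical projection vectors P^i(t) for m = 4 sources over n = 3 terms.
P = np.array([[0.60, 0.30, 0.10],
              [0.55, 0.35, 0.10],
              [0.50, 0.30, 0.20],
              [0.10, 0.30, 0.60]])

# (1) Normalized vectors, correlation matrix, and chi-squared distances.
N = P / np.linalg.norm(P, axis=1, keepdims=True)
corr = N @ N.T                                    # Corr_t[i, j] = <N^i, N^j>
diff2 = (N[:, None, :] - N[None, :, :]) ** 2
tot = N[:, None, :] + N[None, :, :]
chi2 = np.divide(diff2, tot, out=np.zeros_like(diff2), where=tot > 0).sum(axis=2)

# (2) PCA coordinates for plotting, k-means for several k, and the inertia
# (total within-cluster variance) used by the Elbow method.
coords = PCA(n_components=2).fit_transform(N)
labels, inertia = {}, {}
for k in (2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(N)
    labels[k], inertia[k] = km.labels_, km.inertia_

# (3) Representative (non-normalized) average over the main cluster for k = 2.
main = np.flatnonzero(labels[2] == np.bincount(labels[2]).argmax())
P_bar = P[main].mean(axis=0)
print(np.round(corr, 3), labels[2], np.round(P_bar, 3), sep="\n")
```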

3.2. A Case Study on Agricultural Terms Using Four Semantic Projections

Let us show an example of our methodology. We consider the universe U = {crop, farmland, farming, harvest, irrigation, orchard, soil, tractor} and the term t = “plowing” to compute the semantic projections onto U.

3.2.1. Data Collection and Sources

We focus our attention on the following four search engines: DOAJ [26], Google Scholar [27], Google [28], and Arxiv [29]. The normalized results obtained for each source are shown in Table 1.
As can be seen, terms such as “soil” and “crop” are most associated with “plowing” across all platforms, although the importance varies by source. Google tends to distribute semantic weight more evenly, while DOAJ and Scholar show a greater emphasis on a few agricultural terms, indicating a more focused context.

3.2.2. Semantic Projection

The raw data obtained in the previous step are now graphically analyzed using explicitly the concept of semantic projection. The results can be seen in Figure 1. The representation of the curves gives visual proof that DOAJ, Scholar, and Arxiv follow the same trends, while Google shows a different picture.

3.2.3. Statistical and AI Methods

The relations between these semantic sources are presented in the correlation matrix shown in Table 2. As expected, DOAJ and Scholar show a very high positive correlation (0.92), suggesting that both repositories contain articles with agricultural content in a very similar way. Arxiv also shows a positive, although somewhat weaker, correlation with these sources, probably due to the low representation of such articles in this collection. In contrast, Google exhibits strong negative correlations with DOAJ and Scholar, highlighting how differently it behaves.
This behavior is explained graphically in Figure 2, where a heatmap of the correlations is shown. While the other searches are conducted within a “closed” set of scientific documents, such as peer-reviewed articles or curated repositories, Google indexes a much wider range of sources. As a result, it may lead to connections that are not necessarily meaningful or directly related to the actual co-occurrence of the specific terms being investigated.
A chi-squared distance heatmap (Figure 3) also provides complementary information. As can be seen, smaller distances in blue confirm the closeness between DOAJ and Scholar, while Google appears more distant from the academic repositories.
A Principal Component Analysis is also developed. The results summarized in Table 3 reveal that more than 72 % of the variance is captured by the first principal component. This indicates that the main semantic variation between sources can be mostly explained by a single underlying factor.
Next, and confirming the previous results, Figure 4 presents the optimal clustering into two groups (the first one containing three elements; the second, one element), clearly separating Google from the academic sources.
Figure 5 provides a representation of the total within-cluster variance as a function of the number of clusters. The “elbow” at $k = 2$ confirms that two clusters are the most natural choice for this dataset.
The effectiveness of the method depends on whether the volume of data is large enough to ensure a certain statistical stability. In very specific contexts, the procedure we propose could easily fail, depending on whether the relevant terms share a sufficiently large set of documents to make the corresponding ratios non-trivial or avoid a strong sensitivity to small variations in term frequency.

3.2.4. Evaluation Metrics and Analysis Procedures

Let us now sum up the results of the application of our analytical procedure. All the information provided in the previous steps can now be integrated to offer a global view of the results. Thus, in general, Scholar, DOAJ, and Arxiv show a strong consistency in their treatment of agricultural and water-related vocabulary, while Google presents a significantly different semantic pattern. This divergence is expected, since Google indexes a vast number of documents, from highly specialized scientific papers to more informal or journalistic content. Therefore, although some agricultural terms are present, their semantic connections to specialized activities such as “plowing” become less consistent. In contrast, sources like DOAJ, Scholar, and Arxiv are restricted to academic and technical articles, where vocabulary tends to be more focused and contextually coherent. This explains why, across the semantic projections, Google exhibits negative correlations with Scholar and DOAJ (around −0.85), as shown in Table 2 and Figure 2. The chi-squared distance heatmap in Figure 3 confirms this fact.
These results suggest an important methodological point: when analyzing specific technical contexts using semantic projections, the choice of repositories or search engines strongly influences the resulting structure. This fact has to be taken into account in relational analyses based on multi-source semantic projections.
Finally, the average result can be computed as just the average value of the vectors defined by DOAJ, Google Scholar, and Arxiv. This could be useful when a unified result is needed for further analysis.

4. Comparison of Universes: Semantic Stability in Metric-Based Models

In this section, we consider two universes of terms, $U_1$ and $U_2$, containing $n_1$ and $n_2$ elements, respectively. Our goal is to introduce and analyze different methods for comparing these universes. The underlying assumption is that both aim to describe the same semantic environment, and we seek to determine whether they actually do so. To this end, we propose comparing the semantic projections of certain reference terms within each universe. These comparisons serve as the basis for assessing whether both universes effectively represent the same semantic content.
The comparison method we propose follows a similarity-based approach, organized through the following steps.
(1)
Fix a finite set $A$ of testing terms, and select a concrete semantic projection $P$ defined on a set $X$ that contains both $U_1$ and $U_2$.
(2)
Compute the semantic projections of all elements $t \in A$ onto both universes $U_1$ and $U_2$.
(3)
Using one of the methods proposed in the following sections, compute estimates of the semantic projections $P_{U_2}(t)$ based on $P_{U_1}(t)$ and the similarity between the elements of $U_1$ and $U_2$, where similarity is measured by the distance $d$ between embedded vectors.
(4)
Compare the direct values of the semantic projections $P_{U_2}(t)$ with their corresponding estimates for all $t \in A$.
(5)
Evaluate the performance by calculating the quadratic error between the actual and estimated values of the projections.
The methods proposed for this comparison are discussed in separate subsections. In both approaches, the elements of $U_1$ and $U_2$ are embedded into a metric space via a word embedding $I: U \to \mathbb{R}^d$. This allows the use of standard distance metrics in $\mathbb{R}^d$, such as the Euclidean norm, to compute similarities or differences between the universes. It is important to note that we do not assume that $U_1$ and $U_2$ contain the same number of elements.
Fix a word embedding $I$ into $\mathbb{R}^d$ such that all terms in both universes $U_1$ and $U_2$ can be embedded. Instead of relying solely on the Euclidean norm, we consider a general metric $d$ defined on $\mathbb{R}^d$, which allows for a more flexible notion of distance between elements in the semantic space.
The metric model is then based on the matrix of pairwise distances between embedded terms from each universe,
$$M_{U_1, U_2} = \left( d\left( I(u_i^1), I(u_j^2) \right) \right)_{i=1, j=1}^{n_1, n_2}.$$
Using this matrix, we define a proximity function $D$ between the two universes as the sum of all the pairwise distances,
$$D(U_1, U_2) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d\left( I(u_i^1), I(u_j^2) \right).$$
This function is symmetric, but it is not a true distance, since $D(U, U) = 0$ only in trivial cases. However, when the context is fixed, it can serve as a meaningful measure of how far apart the universes $U_1$ and $U_2$ are in semantic terms.
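A sketch of the distance matrix $M_{U_1, U_2}$ and the proximity function $D$, assuming the two universes are already embedded as rows of NumPy arrays E1 (of shape $n_1 \times d$) and E2 ($n_2 \times d$); the function name and the random data are illustrative additions, not part of the original formulation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def universe_proximity(E1: np.ndarray, E2: np.ndarray) -> float:
    """D(U1, U2): sum of all pairwise Euclidean distances between embeddings."""
    M = cdist(E1, E2)          # M[i, j] = d(I(u_i^1), I(u_j^2))
    return float(M.sum())

# Hypothetical embedded universes (n1 = 3, n2 = 2 terms in d = 4 dimensions).
rng = np.random.default_rng(0)
E1, E2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))
print(universe_proximity(E1, E2))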
In addition to this metric-based estimate of proximity, we are also interested in estimating the semantic projection $\hat{P}_{u_j^2}(t)$ of a given term t at each concept $u_j^2 \in U_2$, using only the known values of $P_{u_i^1}(t)$ for $u_i^1 \in U_1$.
To do this, we define $\hat{P}_{u_j^2}(t)$ for a fixed term t as a weighted average (convex combination) of the values $P_{u_i^1}(t)$, where the weights are calculated in terms of the distance between $u_j^2$ and each $u_i^1$ in the embedding space. We propose three procedures.
Remark 1.
Notice that we propose a metric model. Accordingly, we expect that the semantic projections, as described in other works, preserve the distance-related parameters. Thus, semantic projections, regarded as real-valued functions, are assumed to preserve the distances in the metric space where they are defined. The natural requirement in this direction is that they are Lipschitz functions with controlled Lipschitz constants.

4.1. Uniform Distribution of the Weights in the Extension

First, for each fixed $u_j^2 \in U_2$, define the normalization factor
$$D_j = \sum_{i=1}^{n_1} d\left( I(u_i^1), I(u_j^2) \right). \quad (2)$$
Then, the estimated projection $\hat{P}_{u_j^2}(t)$ is given by
$$\hat{P}_{u_j^2}(t) = \sum_{i=1}^{n_1} \frac{1}{n_1 - 1} \left( 1 - \frac{d\left( I(u_i^1), I(u_j^2) \right)}{D_j} \right) P_{u_i^1}(t).$$
This expression yields a convex combination where each weight increases as the distance $d\left( I(u_i^1), I(u_j^2) \right)$ decreases, assigning greater influence to the semantically closer elements. As we will see later, other weight values with the same property can be implemented to yield alternative estimations.
The error committed when approximating $P_{u_j^2}(t)$ by $\hat{P}_{u_j^2}(t)$ is measured by the following:
$$E(U_2 \mid U_1)(t) = \left( \sum_{j=1}^{n_2} \left( P_{u_j^2}(t) - \hat{P}_{u_j^2}(t) \right)^2 \right)^{1/2}.$$
This gives a point-wise error for each term t. By averaging this error over a sufficiently large and representative set of terms, we obtain a global estimate of the stability of the projection across the universes. A small value of this quantity suggests that $P_{U_1}(t)$ and $P_{U_2}(t)$ are semantically compatible and yield consistent projection behavior. Under the assumption that the projection function $P_{u_i^1}$ is Lipschitz continuous with respect to the metric d, it is possible to derive a theoretical upper bound for the error $E(U_2 \mid U_1)(t)$, further supporting the validity of the approximation.
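A NumPy sketch of this estimator and of the error $E(U_2 \mid U_1)(t)$; E1 and E2 are the embedded universes as before, p1 holds the known values $P_{u_i^1}(t)$, and the function names are illustrative additions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def estimate_uniform(E1, E2, p1):
    """Uniform-weight estimator: each column of W is the convex combination above."""
    p1 = np.asarray(p1, dtype=float)
    n1 = len(p1)
    M = cdist(E1, E2)                  # M[i, j] = d(I(u_i^1), I(u_j^2))
    D = M.sum(axis=0)                  # normalization factors D_j of (2)
    W = (1.0 - M / D) / (n1 - 1)       # weights w_{i,j}; each column sums to 1
    return W.T @ p1                    # one estimate per u_j^2 in U2

def projection_error(p2_true, p2_est):
    """E(U2|U1)(t): root of the sum of squared deviations over U2."""
    p2_true, p2_est = np.asarray(p2_true), np.asarray(p2_est)
    return float(np.sqrt(((p2_true - p2_est) ** 2).sum()))
```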
Let us now establish the result concerning the Lipschitz-type properties of the proposed model. We assume the existence of a word embedding that allows us to compute the values of $d(u_i^1, u_j^2)$. However, there are no restrictions on how the metric d is defined. In standard settings, d is typically the Euclidean distance. It is worth noting, however, that the cosine distance (although frequently used) cannot be applied in this context, as it does not satisfy the axioms of a true metric.
Theorem 1
(Lipschitz continuity of the projection estimator). Let t and $t'$ be two terms, and let $U_1 = \{u_1^1, \ldots, u_{n_1}^1\}$ and $U_2 = \{u_1^2, \ldots, u_{n_2}^2\}$ be two semantic universes. Assume that the semantic projections $P_{u_i^1}$ are Lipschitz continuous in t with constant $L > 0$, that is,
$$\left| P_{u_i^1}(t) - P_{u_i^1}(t') \right| \le L \cdot d(t, t'), \quad \text{for all } i = 1, \ldots, n_1.$$
Then, for each $u_j^2 \in U_2$, the estimated projection satisfies
$$\left| \hat{P}_{u_j^2}(t) - \hat{P}_{u_j^2}(t') \right| \le L \cdot d(t, t').$$
In particular, the estimator $\hat{P}_{u_j^2}$ is Lipschitz continuous with respect to the term t, and its Lipschitz constant is bounded above by L.
Proof.
Fix $u_j^2 \in U_2$ and consider the difference
$$\hat{P}_{u_j^2}(t) - \hat{P}_{u_j^2}(t') = \sum_{i=1}^{n_1} w_{i,j} \left( P_{u_i^1}(t) - P_{u_i^1}(t') \right).$$
The weights $w_{i,j}$ defined as before are given by
$$w_{i,j} = \frac{1}{n_1 - 1} \left( 1 - \frac{d\left( I(u_i^1), I(u_j^2) \right)}{D_j} \right),$$
where the $D_j$ are the normalization factors defined in (2). We claim that the sum of all these weights with a fixed j equals 1. Indeed,
$$(n_1 - 1) \sum_{i=1}^{n_1} w_{i,j} = \sum_{i=1}^{n_1} \left( 1 - \frac{d(I(u_i^1), I(u_j^2))}{D_j} \right) = n_1 - \frac{\sum_{i=1}^{n_1} d(I(u_i^1), I(u_j^2))}{D_j},$$
which is equal to $n_1 - 1$ by the definition of $D_j$. Since the weights are non-negative and sum to 1, the triangle inequality gives
$$\left| \hat{P}_{u_j^2}(t) - \hat{P}_{u_j^2}(t') \right| \le \sum_{i=1}^{n_1} w_{i,j} \left| P_{u_i^1}(t) - P_{u_i^1}(t') \right| \le L \cdot d(t, t'). \; \square$$
Note that this approach does not define an extension of the original function. That is, if we take as search term an element of the original universe $U_1$, the result is not necessarily the projection on this element. Furthermore, as will be shown in the final example of the paper, this extension tends to assign an average value to all estimates, making it difficult to clearly distinguish between the semantic projections of different elements of the universe. To address these issues, in the next section, we propose a different method for defining the weights, modifying them so that the elements of the universe closer to the target vector receive stronger weighting.

4.2. Hierarchical Weight Distribution

This approximation is computed separately for every element $u_j^2 \in U_2$. So let us fix an index j (that is, an element $u_j^2$), which will be considered fixed throughout the following steps. Here, the distance d is the one provided by the Euclidean norm. Consider the map $\phi: U_1 \to \mathbb{R}$ given by
$$\phi(u_i^1) = \left\| u_i^1 - u_j^2 \right\|, \quad i = 1, \ldots, n_1,$$
and the associated ordering given by $u_i^1 \preceq_1 u_k^1$ if and only if $\phi(u_i^1) \le \phi(u_k^1)$ (note that, for the sake of simplicity, we do not explicitly refer to the index j in the notation of $\phi$).
We define the weights of the convex combination as follows. Reorder the set $U_1$ using the ordering $\preceq_1$ and reindex $U_1$ increasingly with respect to this ordering as $U_1 = \{u_{i_1}^1, \ldots, u_{i_{n_1}}^1\}$. Let us write $\max \phi$ for $\max\{\phi(u_k^1) : k = 1, \ldots, n_1\} = \phi(u_{i_{n_1}}^1)$. Thus, we define the weights as
$$w_{i_1} = 1 - \frac{\phi(u_{i_1}^1)}{\max \phi},$$
$$w_{i_2} = \frac{\phi(u_{i_1}^1)}{\max \phi} \left( 1 - \frac{\phi(u_{i_2}^1)}{\max \phi} \right),$$
$$w_{i_3} = \frac{\phi(u_{i_1}^1)}{\max \phi} \cdot \frac{\phi(u_{i_2}^1)}{\max \phi} \left( 1 - \frac{\phi(u_{i_3}^1)}{\max \phi} \right),$$
and so on until the last index, which is different, given by
$$w_{i_{n_1}} = \frac{\phi(u_{i_1}^1)}{\max \phi} \cdot \frac{\phi(u_{i_2}^1)}{\max \phi} \cdots \frac{\phi(u_{i_{n_1}}^1)}{\max \phi}.$$
Let us show that the sum of these weights is in fact equal to 1 .
Theorem 2.
Fix a point $u_j^2 \in U_2 \subseteq X$. Let $(X, d)$ be a metric space, and let $S = \{u_1^1, \ldots, u_{n_1}^1\} = U_1 \subseteq X$ be a finite set. Define
$$\phi(u_i^1) = d\left( I(u_i^1), I(u_j^2) \right),$$
where I is a given embedding into the metric space X. Order the elements $u_i^1$ increasingly, according to $\phi(u_i^1)$ (i.e., the closest first), and normalize the distances by setting
$$\tilde{\phi}(u_i^1) = \frac{\phi(u_i^1)}{\phi_{\max}}, \quad \text{where } \phi_{\max} = \max\left\{ \phi(u_i^1) : 1 \le i \le n_1 \right\}.$$
Define recursively the weights $w_{i_k}$ as
$$w_{i_k} = \left( \prod_{\ell=1}^{k-1} \tilde{\phi}(u_{i_\ell}^1) \right) \left( 1 - \tilde{\phi}(u_{i_k}^1) \right)$$
for $k = 1, \ldots, n_1 - 1$, and for $k = n_1$,
$$w_{i_{n_1}} = \prod_{\ell=1}^{n_1 - 1} \tilde{\phi}(u_{i_\ell}^1).$$
Then, $\sum_{k=1}^{n_1} w_{i_k} = 1$, and the approximation formula obtained with these weights is an extension of the function $\psi: U_1 \to \mathbb{R}$ given by $u_i^1 \mapsto \psi(u_i^1) = P_{u_i^1}(t)$, $i = 1, \ldots, n_1$.
Proof.
The proof is the result of a direct computation. Note that the required sum gives
$$\sum_{k=1}^{n_1 - 1} \left( \prod_{\ell=1}^{k-1} \tilde{\phi}(u_{i_\ell}^1) \right) \left( 1 - \tilde{\phi}(u_{i_k}^1) \right) + \prod_{\ell=1}^{n_1 - 1} \tilde{\phi}(u_{i_\ell}^1).$$
If we group the terms recursively, each product accounts for the proportion of the unit mass not yet assigned, and the last term collects the remaining part. Note that this expression is a telescoping-type expansion, $(1 - x_1) + x_1(1 - x_2) + x_1 x_2 (1 - x_3) + \cdots + x_1 x_2 \cdots x_{n_1 - 1}$, which always sums to 1. Thus, the total sum of the weights equals 1. □
As in the other cases analyzed before, it can easily be seen that this approximation preserves the Lipschitz constant of the projection $P_{U_1}$, following the same arguments as in the proof of Theorem 1. Finally, note that the estimated projection will in this case be represented by
$$\hat{P}_{u_j^2}(t) = \sum_{i=1}^{n_1} w_i \, P_{u_i^1}(t).$$
This expression yields a convex combination in which greater influence is assigned to the semantically closer elements, since the cascade allocates most of the available weight to the nearest ones.
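A sketch of the cascade construction for a single target element $u_j^2$ (embedded as the vector e2), with E1 and p1 as in the previous sketch; again, a minimal illustration under the stated assumptions rather than a reference implementation.

```python
import numpy as np

def estimate_hierarchical(E1, e2, p1):
    """Cascade (closer-weight) estimator of Theorem 2 for one target u_j^2."""
    phi = np.linalg.norm(E1 - e2, axis=1)    # distances phi(u_i^1) to u_j^2
    order = np.argsort(phi)                  # reindex U1: closest element first
    x = phi[order] / phi.max()               # normalized distances in [0, 1]
    w = np.empty_like(x)
    carry = 1.0                              # mass not yet assigned
    for k in range(len(x) - 1):
        w[k] = carry * (1.0 - x[k])          # w_{i_k} = (product of previous x) * (1 - x_k)
        carry *= x[k]
    w[-1] = carry                            # remaining mass; weights sum to 1
    return float(w @ np.asarray(p1, dtype=float)[order])
```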

4.3. Lipschitz Extensions and Regression Estimation

The classical formulas of McShane and Whitney for extending Lipschitz functions can also be used to obtain estimates of suitable extensions of the projections, in the context of what is called Lipschitz regression (see, e.g., [25,30,31] and the references therein). The advantage of this technique of regression is that, as in the previous cases, only a metric is required on the space, without restrictions involving the use of linear structures.
Real Lipschitz functions on any metric subspace of a metric space can always be extended to the whole space with the same Lipschitz constant. Suppose that we have a function f defined on a subset $S \subseteq X$ and we aim to extend it to all of X while preserving $\mathrm{Lip}(f)$.
Although there are many possible extensions, there are two classical formulas that also have the advantage of being the minimal/maximal extensions preserving $\mathrm{Lip}(f)$. That is, any Lipschitz extension F preserving the constant satisfies
$$F^M(x) \le F(x) \le F^W(x), \quad x \in X.$$
These formulas are the minimal (sometimes called the McShane) extension,
$$F^M(x) = \sup_{s \in S} \left\{ f(s) - L \, d(x, s) \right\},$$
and the maximal (Whitney) extension,
$$F^W(x) = \inf_{s \in S} \left\{ f(s) + L \, d(x, s) \right\}.$$
Moreover, for any $\alpha \in [0, 1]$,
$$F^\alpha(x) = \alpha F^W(x) + (1 - \alpha) F^M(x)$$
is a valid Lipschitz extension with constant L. The usual choice for the interpolation parameter is $\alpha = 1/2$.
As with the previously explained techniques, we can use the McShane–Whitney formulas to predict the unknown projections $\hat{P}_{u_j^2}(t)$ based on the known values $P_{u_i^1}(t)$. Basically, the method follows the same steps as the previous ones. First, we compute the distances $d(u_j^2, u_i^1)$ for all i, and for each $u_j^2$, we calculate
$$F^M(u_j^2) = \max_i \left\{ P_{u_i^1}(t) - L \, d(u_j^2, u_i^1) \right\},$$
as well as
$$F^W(u_j^2) = \min_i \left\{ P_{u_i^1}(t) + L \, d(u_j^2, u_i^1) \right\}.$$
Then, we estimate the projection by
$$\hat{P}_{u_j^2}(t) = \frac{F^M(u_j^2) + F^W(u_j^2)}{2}, \quad (3)$$
or we adapt the value of $\alpha$ if needed.
Remark 2.
Note that the average value (3) always belongs to the interval $[0, 1]$, since
$$\min\left\{ P_{u_i^1}(t) : i = 1, \ldots, n_1 \right\} \le \hat{P}_{u_j^2}(t) \le \max\left\{ P_{u_i^1}(t) : i = 1, \ldots, n_1 \right\}.$$
Indeed, we know that
$$\min_i P_{u_i^1}(t) = \frac{1}{2} \min_i \left[ \left( P_{u_i^1}(t) + L \, d(u_j^2, u_i^1) \right) + \left( P_{u_i^1}(t) - L \, d(u_j^2, u_i^1) \right) \right] \le \frac{1}{2} \min_i \left( P_{u_i^1}(t) + L \, d(u_j^2, u_i^1) \right) + \frac{1}{2} \max_i \left( P_{u_i^1}(t) - L \, d(u_j^2, u_i^1) \right) \le \max_i P_{u_i^1}(t).$$
Note also that the Lipschitz constant L has to be previously estimated as the maximum ratio between differences of known projections and their distances,
$$L = \max_{i, k} \frac{\left| P_{u_i^1}(t) - P_{u_k^1}(t) \right|}{d(u_i^1, u_k^1)}.$$
In this case, by definition and by taking into account the nature of the McShane and Whitney formulas, the proposed method is an extension of the projection defined on $U_1$. Note that these formulas are not restricted to values in $[0, 1]$ when $\alpha \neq 1/2$, so this condition must be enforced if needed.
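A sketch of the McShane–Whitney estimator with the Lipschitz constant $L$ estimated from the known projections as above; it assumes the embedded elements of $U_1$ are pairwise distinct so that the ratios are well defined, and the function name is an illustrative addition.

```python
import numpy as np
from scipy.spatial.distance import cdist

def estimate_mcshane_whitney(E1, E2, p1, alpha=0.5):
    """alpha * F^W + (1 - alpha) * F^M, with L estimated from the known data."""
    p1 = np.asarray(p1, dtype=float)
    # L = max_{i,k} |P_{u_i^1}(t) - P_{u_k^1}(t)| / d(u_i^1, u_k^1).
    D11 = cdist(E1, E1)
    diffs = np.abs(p1[:, None] - p1[None, :])
    L = float((diffs[D11 > 0] / D11[D11 > 0]).max())
    D12 = cdist(E1, E2)                          # d(u_i^1, u_j^2), shape (n1, n2)
    F_M = (p1[:, None] - L * D12).max(axis=0)    # minimal (McShane) extension
    F_W = (p1[:, None] + L * D12).min(axis=0)    # maximal (Whitney) extension
    return alpha * F_W + (1.0 - alpha) * F_M
```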

4.4. Universe Similarity Analysis: A Case Study

In this section, we conclude the explanation of the comparison method we propose, based on a set of single-term estimates of a chosen semantic projection. An agricultural context is also considered in this example. The analysis that follows proceeds through the following steps.
(1)
First, we fix two universes U 1 and U 2 , which are assumed to describe similar semantic contexts, and a semantic projection to measure its stability under the change from U 1 to U 2 .
(2)
Fix a finite set of terms S for comparison. For each t S , compute the values of the semantic projection P u i 1 ( t ) and P u i 2 ( t ) , where u i 1 U 1 and u i 2 U 2 .
(3)
Use the methods explained in the previous section to obtain an estimate P ^ u i 2 ( t ) of the value of P u i 2 ( t ) for every t S and u i 2 U 2 .
(4)
Compare the estimates with the true values by measuring the quadratic error between them.
We now explain the procedure through a concrete example, whose details are given in the following subsections. Note that the words have been chosen to describe a single semantic context.

4.4.1. Data Collection and Sources

We choose a simplified semantic environment related to water resources and agriculture, and the universes
$U_1$ = {water source, reservoir, irrigation, crop, orchard} and
$U_2$ = {wetland, farming, water resources, watering},
which we consider to be similar for describing this semantic context. We consider just one term for comparison in the set $S$, which is assumed to make sense in the corresponding environment: we choose the term t = “rice”. The distance matrix shown in Table 4 has been computed using GloVe (small-size version GloVe 6B 50d [32]).
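The distance matrix of Table 4 can be reproduced along the following lines, assuming the GloVe 6B 50d file referenced in [32] has been downloaded and unzipped. Multi-word expressions such as “water source” are handled here by averaging their word vectors, which is a common convention but an assumption on our part.

```python
import numpy as np
from scipy.spatial.distance import cdist

def load_glove(path="glove.6B.50d.txt"):
    """Parse the plain-text GloVe file: one word plus 50 floats per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.array(vals, dtype=float)
    return vectors

def embed(term, vectors):
    """Average the word vectors of a (possibly multi-word) term."""
    return np.mean([vectors[w] for w in term.split()], axis=0)

glove = load_glove()
U1 = ["water source", "reservoir", "irrigation", "crop", "orchard"]
U2 = ["wetland", "farming", "water resources", "watering"]
E1 = np.array([embed(t, glove) for t in U1])
E2 = np.array([embed(t, glove) for t in U2])
print(np.round(cdist(E1, E2), 2))   # Euclidean distance matrix, cf. Table 4
```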

4.4.2. Semantic Projection Techniques

We fix the semantic projection defined by the computation of the joint occurrences of two terms within the documents retrieved through the DOAJ (Directory of Open Access Journals) search engine. The results of all projections are written in Table 5 and Table 6.
These results are represented in Figure 6 and Figure 7.

4.4.3. Statistical and AI Methods

Next, we compute all the estimates of the values of $\hat{P}_{u_i^2}(\text{rice})$ following the three methods explained in the previous section, in order to compare them with the real values calculated using the direct semantic projection computation. Here, for the sake of clarity, we call them the Averaged Weights Method (Averaged W.), the Closer Weights Method (Closer W.), and the McShane–Whitney Method (Mc-W). The results are given in Table 7.
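Combining the previous sketches (the arrays E1 and E2 and the three estimator functions), the comparison behind Table 7 and Table 8 can be reproduced along the following lines; the values of p1 and p2 below are placeholders, not the actual projections of Table 5 and Table 6.

```python
import numpy as np

# Placeholder projection values; substitute the DOAJ co-occurrence
# ratios of Table 5 (for U1) and Table 6 (for U2) to reproduce the paper.
p1 = np.array([0.20, 0.15, 0.30, 0.25, 0.10])    # P_{u_i^1}("rice"), i = 1..5
p2 = np.array([0.12, 0.28, 0.22, 0.18])          # P_{u_j^2}("rice"), j = 1..4

estimates = {
    "Averaged W.": estimate_uniform(E1, E2, p1),
    "Closer W.": np.array([estimate_hierarchical(E1, e2, p1) for e2 in E2]),
    "Mc-W": estimate_mcshane_whitney(E1, E2, p1),
}
for name, est in estimates.items():
    print(f"{name:12s}", np.round(est, 3),
          "error:", round(projection_error(p2, est), 3))
```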

4.4.4. Evaluation Metrics and Analysis Procedures

These data are represented in Figure 8. First, it is important to note that, due to the construction of the proposed methods, all estimates remain within the same value range as the original data. As shown, the method that best reproduces the trends observed in the real values is the McShane–Whitney extension, although the numerical accuracy is still limited. The Averaged Weights technique also provides a reasonably good approximation; however, the resulting values are rather uniform, which could be desirable if the goal is to obtain an average estimate of the representation of “rice” in the universe $U_2$ based on the real values in $U_1$. Finally, and probably due to the fact that there are no large differences among the distances from “rice” to the other words in both universes $U_1$ and $U_2$, the Closer Weights method produces the poorest results. A closer look at the construction process reveals that this technique assigns a disproportionately large weight to the nearest element, resulting in a strong deviation in contexts where all contributions are expected to be relatively similar.
The quadratic errors are presented in Table 8, confirming through their numerical values the overall accuracy of the proposed estimation methods.
The example developed in this section shows how the proposed metric-based estimation methods behave when applied to a real semantic context. Although the agricultural example we used is relatively simple, it allows us to see important differences between the methods. The McShane–Whitney extension reproduces the overall trend of the real data more accurately, while the Averaged Weights method provides reasonable results, although more uniform and less sensitive to specific variations. The results of the Closer Weights method, however, are not as good, mainly because it relies too heavily on the nearest neighbor. This is not an appropriate strategy when the distances between elements are all quite similar, as is the case in the example.
These results show that the choice of the estimation method should depend on the specific characteristics of the semantic universe and on what we want to achieve. If capturing fine differences between terms is important, methods like the McShane–Whitney extension are preferable. If a general average behavior is enough, the Averaged Weights method can work well.

4.5. Limitations and Future Directions

Although the examples we have shown can be reasonably interpreted as successful, the method we propose is still incomplete regarding the levels of reliability we can recognize in them. It depends heavily on how we can extract meaningful terms to initiate the analysis, how we can determine whether the universe of words we have fixed is broad enough to represent a semantic context, and how far we can trust the datasets, AI tools, and search engines for detecting the coincidences needed to compute the semantic projections.
The proposed technique includes elements that could be adapted to other contexts, as long as the assumptions about the occurrence of significant terms are maintained and the calculated ratios are sufficiently far from zero to avoid excessive sensitivity to small variations. However, its application to certain specific scenarios, such as the analysis of populations of certain insects in complex ecosystems (for which the number of related documents could be extremely small), would not be appropriate without further adjustments.
Thus, further investigations into AI instruments and the design of better tools for detecting relevant information in the databases must be carried out. Additionally, the construction and refinement of mathematical notions adapted to the problem would be necessary.

5. Conclusions

We have proposed a general analytical methodology for controlling the semantic projections of terms onto universes that describe a given semantic context. First, several statistical and AI methods are used to analyze how closely the semantic projections associated with different search methods (Google, Arxiv, DOAJ, and Google Scholar) align. This analysis allows us to decide, based on experimental criteria, when any of the search procedures should be discarded.
Second, we present a methodology to compare universes that have been designed to represent the same semantic environments, using semantic projections and metric space tools. Our approach aims to evaluate how stable the description of a given semantic structure remains when different (but similar) universes are used to represent it. By developing and testing several estimation methods (weighted averages, closer-weight-based models, and Lipschitz extensions), we provide a flexible framework that can be adapted to different needs depending on the required degree of precision and sensitivity.
Through various examples in the agricultural context, we have firstly demonstrated a coherence analysis among different projection procedures, and secondly, we have shown how our techniques are capable of capturing both general trends and fine variations in the semantic relationships between terms. The McShane–Whitney extension, for instance, appears particularly useful when preserving subtle semantic differences is important, whereas the averaging methods offer a simpler approach that may be preferable in contexts where only a general approximation is sufficient.
Taken together, the two parts of this paper provide two complementary analytical tools that can help improve the study of contextual relationships across different descriptions of semantic environments. To enrich the discussion and provide a more prospective approach, it is worth considering the broader implications and possible future applications of the proposed framework. Beyond the application in the agricultural context we have explained, this methodology could be extended to other fields within Natural Language Processing (NLP) and Artificial Intelligence (AI). For example, it could be adapted to improve semantic analysis in specialized fields such as healthcare, legal studies, or environmental sciences, where understanding the nuanced relationships between terms is critical. In addition, the framework’s ability to compare and stabilize semantic representations across universes could be leveraged to improve multilingual NLP tasks such as machine translation or multilingual information retrieval, ensuring semantic consistency across languages.
Furthermore, the integration of this methodology with emerging AI tools, such as large language models or knowledge graphs, could open new avenues for refining semantic understanding and improving the interpretability of AI systems. For example, it could be used to validate or refine the semantic outputs of generative models, ensuring that they fit domain-specific contexts. The flexibility of the framework also suggests potential applications in dynamic environments, such as real-time data analysis or adaptive learning systems, where semantic relationships may evolve over time.

Author Contributions

Validation, A.C.F. and C.S.A.; Formal analysis, R.A., C.S.A. and E.A.S.P.; Conceptualization, A.C.F., Á.G.C. and E.A.S.P.; Investigation, R.A., A.C.F. and E.A.S.P.; Writing—review and editing, R.A. and Á.G.C.; Writing—original draft, E.A.S.P. and C.S.A.; Supervision, Á.G.C. and E.A.S.P.; Visualization, A.C.F. and C.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

We would like to acknowledge funding from the Generalitat Valenciana (Spain) through the PROMETEO 2024 CIPROM/2023/32 grant.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to acknowledge the support of the Instituto Universitario de Matemática Pura y Aplicada (IUMPA-UPV), Universitat Politècnica de València.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pak, A.; Ziyaden, A.; Saparov, T.; Akhmetov, I.; Gelbukh, A. Word Embeddings: A Comprehensive Survey. Comput. Sist. 2024, 28, 2005–2029. [Google Scholar] [CrossRef]
  2. Grand, G.; Blank, I.A.; Pereira, F.; Fedorenko, E. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nat. Hum. Behav. 2022, 6, 975–987. [Google Scholar] [CrossRef] [PubMed]
  3. Fernández de Córdoba, P.; Reyes Pérez, C.A.; Sánchez Pérez, E.A. Mathematical features of semantic projections and word embeddings for automatic linguistic analysis. AIMS Math. 2025, 10, 3961–3982. [Google Scholar] [CrossRef]
  4. Fernández de Córdoba, P.; Reyes Pérez, C.A.; Sánchez Arnau, C.; Sánchez Pérez, E.A. Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis. Computers 2025, 14, 30. [Google Scholar] [CrossRef]
  5. Lenci, A. Distributional models of word meaning. Annu. Rev. Linguist. 2018, 4, 151–171. [Google Scholar] [CrossRef]
  6. Boleda, G. Distributional semantics and linguistic theory. Annu. Rev. Linguist. 2020, 6, 213–234. [Google Scholar] [CrossRef]
  7. Erk, K. Vector space models of word meaning and phrase meaning: A survey. Lang. Linguist. Compass 2012, 6, 635–653. [Google Scholar] [CrossRef]
  8. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
  9. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  10. Levy, O.; Goldberg, Y. Neural word embedding as implicit matrix factorization. Adv. Neural Inf. Process. Syst. 2014, 27, 2177–2185. [Google Scholar]
  11. Wulff, D.U.; Mata, R. Semantic Embeddings Reveal and Address Taxonomic Incommensurability in Psychological Measurement. Nat. Hum. Behav. 2025, 1–11. [Google Scholar] [CrossRef] [PubMed]
  12. Mikolov, T.; Yih, W.t.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 746–751. [Google Scholar]
  13. Lu, H.; Wu, Y.N.; Holyoak, K.J. Emergence of analogy from relation learning. Proc. Natl. Acad. Sci. USA 2019, 116, 4176–4181. [Google Scholar] [CrossRef] [PubMed]
  14. Baroni, M.; Zamparelli, R. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 1183–1193. [Google Scholar]
  15. Clark, S. Vector space models of lexical meaning. In The Handbook of Contemporary Semantics; Lappin, S., Fox, C., Eds.; Blackwell: Malden, MA, USA, 2015; pp. 493–522. [Google Scholar]
  16. Arseniev-Koehler, A. Theoretical Foundations and Limits of Word Embeddings: What Types of Meaning Can They Capture? Sociol. Methods Res. 2024, 53, 1753–1793. [Google Scholar] [CrossRef] [PubMed]
  17. Boutyline, A.; Arseniev-Koehler, A. Meaning in Hyperspace: Word Embeddings as Tools for Cultural Measurement. Annu. Rev. Sociol. 2025, 51. [Google Scholar] [CrossRef]
  18. Dai, J.; Zhang, Y.; Lu, H.; Wang, H. Cross-view semantic projection learning for person re-identification. Pattern Recognit. 2018, 75, 63–76. [Google Scholar] [CrossRef]
  19. Xian, Y.; Choudhury, S.; He, Y.; Schiele, B.; Akata, Z. Semantic projection network for zero- and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8256–8265. [Google Scholar]
  20. Zadeh, L.A. A Fuzzy-Set-Theoretic Interpretation of Linguistic Hedges. J. Cybern. 1972, 2, 4–34. [Google Scholar] [CrossRef]
  21. Cobzaş, S.; Miculescu, R.; Nicolae, A. Lipschitz Functions; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  22. Kosub, S. A note on the triangle inequality for the Jaccard distance. Pattern Recognit. Lett. 2019, 120, 36–38. [Google Scholar] [CrossRef]
  23. Deza, M.M.; Deza, E. Encyclopedia of Distances, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
  24. Gardner, A.; Kanno, J.; Duncan, C.A.; Selmic, R. Measuring distance between unordered sets of different sizes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 137–143. [Google Scholar] [CrossRef]
  25. Manetti, A.; Ferrer-Sapena, A.; Sánchez-Pérez, E.A.; Lara-Navarra, P. Design Trend Forecasting by Combining Conceptual Analysis and Semantic Projections: New Tools for Open Innovation. J. Open Innov. Technol. Mark. Complex. 2021, 7, 92. [Google Scholar] [CrossRef]
  26. Directory of Open Access Journals (DOAJ). Available online: https://www.doaj.org/ (accessed on 3 April 2025).
  27. Google Scholar. Available online: https://scholar.google.com/ (accessed on 5 April 2025).
  28. Google. Available online: https://www.google.com/ (accessed on 1 April 2025).
  29. arXiv. arXiv e-Print Archive. Available online: https://arxiv.org/ (accessed on 7 April 2025).
  30. Arnau, R.; Calabuig, J.M.; González, A.; Sánchez Pérez, E.A. Moduli of Continuity in Metric Models and Extension of Livability Indices. Axioms 2024, 13, 192. [Google Scholar] [CrossRef]
  31. Erdogan, E.; Ferrer-Sapena, A.; Jimenez-Fernandez, E.; Sánchez Pérez, E. Index spaces and standard indices in metric modelling. Nonlinear Anal. Model. Control 2022, 27, 1–20. [Google Scholar] [CrossRef]
  32. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. Pre-Trained Embeddings, 50 Dimensions, Trained on 6B Tokens (Wikipedia + Gigaword). 2014. Available online: http://nlp.stanford.edu/data/glove.6B.zip (accessed on 17 December 2024).
Figure 1. Normalized ratios (semantic projections) across different sources (DOAJ, Scholar, Google, and Arxiv). Each line represents one source.
Figure 2. Correlation heatmap between normalized semantic source vectors (DOAJ, Scholar, Google, and Arxiv). Positive correlations are shown in red and negative correlations in blue.
Figure 3. Chi-squared distance heatmap between normalized semantic source vectors (DOAJ, Scholar, Google, Arxiv), along with a hierarchical grouping. Colors closer to blue indicate lower distances, while red indicates higher distances.
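For readers who wish to reproduce distances of this kind, the following is a minimal sketch in Python (not the authors' code; the factor 1/2 is one common convention for the chi-squared distance and may differ from the definition used in the paper):

```python
import numpy as np

def chi2_distance(x, y, eps=1e-12):
    """Chi-squared distance between two non-negative vectors.

    Convention assumed here: d(x, y) = 0.5 * sum_i (x_i - y_i)^2 / (x_i + y_i);
    eps guards against division by zero when both entries vanish.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

# Example: distance between two of the normalized source vectors of Table 1.
doaj   = [0.43140071, 0.16838752, 0.05160329, 0.18129108,
          0.39032364, 0.16625340, 0.71211573, 0.25081501]
google = [0.2563693, 0.4035290, 0.4129493, 0.3920603,
          0.3103834, 0.4089449, 0.1591425, 0.3978700]
print(chi2_distance(doaj, google))
```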
Figure 4. K-means clustering result with k = 2 applied to the normalized semantic vectors. The two clusters are clearly separated, with Dim1 explaining 72.4% and Dim2 explaining 20.8% of the total variance.
Figure 5. Elbow method applied to the clustering analysis: total within-cluster sum of squares (WSS) versus number of clusters k. The optimal number of clusters appears to be k = 2, since k = 3 does not produce a meaningful decrease in the within-cluster sum of squares.
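A minimal sketch of the clustering and elbow computation with scikit-learn; the matrix X below is placeholder data standing in for the normalized semantic vectors, and scikit-learn's `inertia_` attribute is the within-cluster sum of squares:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: rows = observations, columns = normalized semantic vectors.
X = np.random.default_rng(0).random((8, 4))

# Elbow method: WSS (inertia) for a range of cluster counts; choose the k
# after which the decrease flattens (k = 2 in the analysis above).
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}
print(wss)

# Final clustering with the selected k.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```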
Figure 6. Bar plot (logarithmic scale) of the term/universe ratios for the term rice based on the co-occurrence in the DOAJ document repository with the terms in universe U_1 (alphabetical order).
Figure 7. Bar plot (logarithmic scale) of the term/universe ratios for the term rice based on the co-occurrence in the DOAJ document repository with the terms in universe U_2 (alphabetical order).
Figure 8. Comparison of the real semantic projection values for “rice” on the elements of U_2 with the estimates obtained using different methods (Averaged Weights, Closer Weights, and McShane–Whitney extension) for the agricultural water-related terms of U_2.
Table 1. Normalized values of the results for the term t = “plowing” using DOAJ, Scholar, Google, and Arxiv for agricultural terms.

| Term | DOAJ | Scholar | Google | Arxiv |
|---|---|---|---|---|
| crop | 0.43140071 | 0.57878055 | 0.2563693 | 0.5189 |
| farming | 0.16838752 | 0.13862709 | 0.4035290 | 0.5189 |
| farmland | 0.05160329 | 0.02601003 | 0.4129493 | 0.0000 |
| harvest | 0.18129108 | 0.23225145 | 0.3920603 | 0.2531 |
| irrigation | 0.39032364 | 0.47708966 | 0.3103834 | 0.2531 |
| orchard | 0.16625340 | 0.04189903 | 0.4089449 | 0.0000 |
| soil | 0.71211573 | 0.58409779 | 0.1591425 | 0.5189 |
| tractor | 0.25081501 | 0.14365947 | 0.3978700 | 0.2531 |
Table 2. Correlation matrix between normalized semantic source vectors.

| | DOAJ | Scholar | Google | Arxiv |
|---|---|---|---|---|
| DOAJ | 1.00 | 0.92 | −0.84 | 0.64 |
| Scholar | 0.92 | 1.00 | −0.86 | 0.65 |
| Google | −0.84 | −0.86 | 1.00 | −0.48 |
| Arxiv | 0.64 | 0.65 | −0.48 | 1.00 |
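For illustration, correlations of this kind can be computed directly from source columns such as those of Table 1 (a sketch assuming pandas; whether Table 2 was computed from exactly these eight terms is not stated here, so the output need not reproduce it verbatim):

```python
import pandas as pd

# The four normalized source vectors from Table 1 (term "plowing").
df = pd.DataFrame(
    {
        "DOAJ":    [0.43140071, 0.16838752, 0.05160329, 0.18129108,
                    0.39032364, 0.16625340, 0.71211573, 0.25081501],
        "Scholar": [0.57878055, 0.13862709, 0.02601003, 0.23225145,
                    0.47708966, 0.04189903, 0.58409779, 0.14365947],
        "Google":  [0.2563693, 0.4035290, 0.4129493, 0.3920603,
                    0.3103834, 0.4089449, 0.1591425, 0.3978700],
        "Arxiv":   [0.5189, 0.5189, 0.0000, 0.2531,
                    0.2531, 0.0000, 0.5189, 0.2531],
    },
    index=["crop", "farming", "farmland", "harvest",
           "irrigation", "orchard", "soil", "tractor"],
)

# Pairwise Pearson correlations between the source vectors (cf. Table 2).
print(df.corr().round(2))
```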
Table 3. Principal component analysis (PCA) results: standard deviations, variance proportions, and cumulative variance proportions for the first four principal components. Note that PC1 explains more than 72% of the variance.

| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Standard Deviation | 2.407 | 1.290 | 0.738 | 0.000 |
| Proportion of Variance | 0.724 | 0.208 | 0.068 | 0.000 |
| Cumulative Proportion | 0.724 | 0.932 | 1.000 | 1.000 |
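The summary format of Table 3 matches the output of standard PCA tools; a sketch with scikit-learn on standardized placeholder data (with only four observations the fourth component is degenerate, consistent with the zero PC4 in the table):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder: four source vectors (rows) over eight terms (columns).
X = np.random.default_rng(1).random((4, 8))
Xs = StandardScaler().fit_transform(X)

pca = PCA().fit(Xs)
print(np.sqrt(pca.explained_variance_).round(3))           # standard deviations
print(pca.explained_variance_ratio_.round(3))              # proportion of variance
print(np.cumsum(pca.explained_variance_ratio_).round(3))   # cumulative proportion
```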
Table 4. Euclidean distance matrix between selected water- and agriculture-related terms.

| | water s. | reserv. | irrigat. | crop | orchard | wetland | farming | resour. | watering | rice |
|---|---|---|---|---|---|---|---|---|---|---|
| water s. | 0.00 | 4.78 | 4.30 | 4.90 | 5.57 | 5.79 | 5.43 | 4.77 | 5.03 | 5.41 |
| reservoir | 4.78 | 0.00 | 5.38 | 5.57 | 5.25 | 5.48 | 5.75 | 4.25 | 5.84 | 4.98 |
| irrigation | 4.30 | 5.38 | 0.00 | 4.43 | 6.09 | 5.05 | 6.04 | 5.50 | 5.43 | 6.35 |
| crop | 4.90 | 5.57 | 4.43 | 0.00 | 5.56 | 5.98 | 4.62 | 5.08 | 5.13 | 5.87 |
| orchard | 5.57 | 5.25 | 6.09 | 5.56 | 0.00 | 5.54 | 4.41 | 5.18 | 5.67 | 4.56 |
| wetland | 5.79 | 5.48 | 5.05 | 5.98 | 5.54 | 0.00 | 5.30 | 6.08 | 5.17 | 4.92 |
| farming | 5.86 | 6.06 | 4.51 | 5.14 | 5.78 | 5.52 | 0.00 | 5.04 | 5.35 | 5.40 |
| resources | 5.43 | 5.75 | 6.04 | 4.62 | 4.41 | 5.30 | 5.04 | 0.00 | 6.62 | 5.52 |
| watering | 4.77 | 4.25 | 5.50 | 5.08 | 5.18 | 6.08 | 5.29 | 5.04 | 0.00 | 6.62 |
| rice | 5.03 | 5.84 | 5.43 | 5.13 | 5.67 | 5.17 | 5.58 | 5.35 | 5.87 | 0.00 |
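A sketch of how a distance matrix of this kind can be obtained from the pre-trained GloVe embeddings of ref. [32] (assumptions: the file glove.6B.50d.txt has been downloaded locally, and multiword terms such as “water source” would require an aggregation rule, e.g., averaging, which is not specified here; whether Table 4 was produced exactly this way is also an assumption):

```python
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Load GloVe vectors into a dict {word: np.ndarray}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

glove = load_glove()
terms = ["reservoir", "irrigation", "crop", "orchard", "wetland",
         "farming", "watering", "rice"]
emb = {t: glove[t] for t in terms}

# Euclidean distance matrix between the term embeddings (cf. Table 4).
D = np.array([[np.linalg.norm(emb[a] - emb[b]) for b in terms] for a in terms])
print(np.round(D, 2))
```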
Table 5. Semantic projection using the DOAJ of the term “rice” associated with each element of U_1.

| | water source | reservoir | irrigation | crop | orchard |
|---|---|---|---|---|---|
| P_{u_i^1}(rice) | 0.012 | 0.004 | 0.055 | 0.096 | 0.008 |
Table 6. Semantic projection using the DOAJ of the term “rice” associated with each element of U_2.

| | wetland | farming | water resources | watering |
|---|---|---|---|---|
| P_{u_i^2}(rice) | 0.036 | 0.064 | 0.016 | 0.037 |
Table 7. Comparison of the real semantic projection values and approximations obtained with different estimation methods.

| Term | Real Value | Averaged Weights | Closer Weights | McShane–Whitney |
|---|---|---|---|---|
| wetland | 0.0360 | 0.0349 | 0.0762 | 0.0550 |
| farming | 0.0640 | 0.0352 | 0.0457 | 0.0394 |
| water resources | 0.0160 | 0.0346 | 0.0394 | 0.0040 |
| watering | 0.0370 | 0.0353 | 0.0175 | 0.0480 |
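As background for the McShane–Whitney column, the following is a minimal sketch of a McShane–Whitney-type estimate for a K-Lipschitz function known on finitely many anchor terms. The midpoint combination of the two extensions and the data-driven choice of K below are assumptions for illustration, not necessarily the authors' exact estimator:

```python
import numpy as np

def lipschitz_constant(values, D):
    """Smallest K consistent with the known values: max |f(a) - f(b)| / d(a, b)."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    return max(abs(v[i] - v[j]) / D[i][j] for i in range(n) for j in range(i + 1, n))

def mcshane_whitney(dists, values, K):
    """Midpoint of the McShane (lower) and Whitney (upper) Lipschitz extensions."""
    d, v = np.asarray(dists, dtype=float), np.asarray(values, dtype=float)
    lower = np.max(v - K * d)  # McShane: max_a f(a) - K d(x, a)
    upper = np.min(v + K * d)  # Whitney: min_a f(a) + K d(x, a)
    return 0.5 * (lower + upper)

# Known projections on U_1 (Table 5) and distances among the U_1 terms (Table 4).
values_U1 = [0.012, 0.004, 0.055, 0.096, 0.008]
D_U1 = [[0.00, 4.78, 4.30, 4.90, 5.57],
        [4.78, 0.00, 5.38, 5.57, 5.25],
        [4.30, 5.38, 0.00, 4.43, 6.09],
        [4.90, 5.57, 4.43, 0.00, 5.56],
        [5.57, 5.25, 6.09, 5.56, 0.00]]
K = lipschitz_constant(values_U1, D_U1)

# Distances from "wetland" to the U_1 terms (Table 4); the resulting estimate
# need not coincide with Table 7, whose estimator may choose K differently.
print(mcshane_whitney([5.79, 5.48, 5.05, 5.98, 5.54], values_U1, K))
```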
Table 8. Root Mean Square Error (RMSE) for each estimation method.

| Method | RMSE |
|---|---|
| Averaged Weights | 0.01718 |
| Closer Weights | 0.02682 |
| McShane–Whitney Extension | 0.01754 |
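The RMSE values follow from Table 7 by the usual formula RMSE = sqrt(mean((estimate − real)^2)); a short check in Python (the outputs, approximately 0.01717, 0.02683, and 0.01754, agree with Table 8 up to the rounding of the Table 7 entries):

```python
import numpy as np

real = np.array([0.0360, 0.0640, 0.0160, 0.0370])  # Table 7, "Real Value" column

estimates = {
    "Averaged Weights":          [0.0349, 0.0352, 0.0346, 0.0353],
    "Closer Weights":            [0.0762, 0.0457, 0.0394, 0.0175],
    "McShane-Whitney Extension": [0.0550, 0.0394, 0.0040, 0.0480],
}

for method, est in estimates.items():
    rmse = np.sqrt(np.mean((np.array(est) - real) ** 2))
    print(f"{method}: {rmse:.5f}")  # cf. Table 8
```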