2.1. Concepts and Mathematical Notation Relevant to the Proposed Algorithms
For a better understanding of the algorithms and methods proposed in this research, the mathematical notation used is presented below. These elements are common in publications on linguistic data summarization [7,8,9,10]. A linguistic summary is generally represented by a grammatical structure called a “protoform”, whose general form is “QRy’s are S”, as shown below:
QRy’s are S: Linguistic summaries composed of a quantifier “Q”, a set of filters “R”, and summarizers “S” that describe the behavior of objects “y”. For example:
- Qy’s are S: Unfiltered linguistic summaries in which the statement has the form “Q objects are S”. For example: “Most workers are punctual”, where Q = “Most” and S = “are punctual” characterize the objects y = “workers”.
- QRy’s are S: Filtered linguistic summaries in which the statement has the form “Q objects R are S”, where R can be formed by one or more attributes that qualify the object. Example: “Most young workers are unpunctual”. This second example is similar to the previous one, but it incorporates R = “young”, which filters the set of analyzed objects.
Q: A quantifier is a fuzzy set with the universe of discourse in the interval [0, 1] expressing a quantity, for example, “most”, “60%”, or “more than half”.
R: A qualifier or filter, which is another attribute that determines a fuzzy subset of the objects yi, for example, “young” for the attribute “age”.
S: A summarizer is an attribute with a linguistic value (fuzzy predicate) defined in the domain of attribute Aj, for example, “low salary” for the attribute “salary”.
Linguistic summaries learned through the application of linguistic data summarization techniques constitute learning objects that can be consulted through the independent analysis of each of their parts. In this sense, different protoforms are established for the construction of linguistic summaries and the retrieval of information from them.
The structures of linguistic summaries have been conceptualized as protoforms, as shown in Table 1. This table presents a categorization of the protoforms, indicating in the “Protoform” column the grammatical structure of the summary; in the “Known” column, the knowledge available at a given time or about the objects of study; and in the “Doubt” column, the question or information sought.
The following examples illustrate different situations associated with the protoforms in Table 1:
- The term SStructure means that the variables that make up the summary are known, while SValue denotes that the value of the summarizer is unknown.
- Protoform 0: Corresponds to the structure QRy’s are S. It has the lowest level of abstraction, as it assumes that all the elements that make up the summary are known. What we want to know is their truth value T.
- Protoform 1: Has a Qy’s are S structure; the summarizer (S) is known, and we want to discover the quantifier (Q). Protoforms 1 and 2 can generate summaries from SQL statements or their various extensions [10].
- Protoform 2: Has a QRy’s are S structure; the summarizer (S) and the filter (R) are known, and we want to discover the quantifier (Q). As with protoforms 0 and 1, these summaries can be obtained through database queries.
- Protoform 3: Has a Qy’s are S structure; the quantifier (Q) and the summary structure are known, and the goal is to discover the summarizer (S).
- Protoform 4: Has a QRy’s are S structure; the quantifier (Q) and the summary structure are known, and the goal is to discover the summarizer (S) and the filter (R).
- Protoform 5: Has a QRy’s are S structure. This is the highest level of abstraction, as no elements are known, and therefore, the goal is to discover everything. These are the most complex summaries and are the main focus of this work.
The following notations will be applied in the process of generating the summaries:
U: Database used in this research to represent the oceanographic dataset.
Y = {y1, …, yn}: Set of objects (records) used in the investigation.
A = {A1, …, Am}: Set of attributes that describe objects. In our case, these constitute the variables of the oceanographic study.
ALV: Set of linguistic variables (see Definition 2) that describe the attributes A.
D = {[ALV1(y1), ALV2(y1), …, ALVm(y1)], …, [ALV1(yn), ALV2(yn), …, ALVm(yn)]}: Dataset resulting from the fuzzification process of universe U.
Aj(yi) ∈ D: Denotes the value of the attribute Aj for the object yi, for example, “young” for the attribute “age”. In this research, for example, the value “High” represents the linguistic value of a measurement of one of the variables, such as vgos_mean.
Let Y = {y1, …, yn} be a database of objects (e.g., “oceanographic measurements”) described by a set of m attributes A = {A1, …, Am} (e.g., “vgos_mean”), where each attribute Aj takes values in a domain Xj, for example, {−0.2, …, 0.2}. Then, di = Aj(yi) represents the value of the attribute Aj for the object yi, which takes values in the domain Xj.
Linguistic summaries are evaluated as follows.
Let the summarizer S = {S1, S2, …, Sm} be composed of several attributes Sj; then, μS(yi), i = 1, 2, …, n, represents the certainty grade of the summarizer for the object yi and can be defined as minj∈{1, 2, …, m}[μSj(Aj(yi))].
If the summary contains filters, it is defined as:
minj∈{1, 2, …, m}[μSj(Aj(yi)) ∧ μRg(Ag(yi))], where “∧” is a t-norm, and μRg(Ag(yi)) represents the certainty grade of the summary filters.
As part of the notation and elements relevant to the understanding of the proposal, a set of quality indicators for linguistic summaries is presented [7,9,10,12]; these indicators are useful for the subsequent processing of the generated knowledge.
T1: Degree of Truth: Determines the validity of a summary, a criterion introduced by Yager [8]. Equation (1) shows how to calculate this indicator for unfiltered summaries, and Equation (2) gives the calculation of indicator T1 with filters R.
where
μR(yi) is the degree of membership of the object yi in the filter of the summary.
μS(yi) is the degree of membership of the object yi in the summarizer of the summary.
∩ represents a t-norm, which could be the “minimum” operation or a product.
n is the number of objects in the database.
µQ[r] is the membership function that represents the linguistic quantifier Q [9,10,12].
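As an illustration of the Degree of Truth, the following minimal Python sketch assumes the form commonly attributed to Yager in the literature: T1 = µQ[(1/n) Σ μS(yi)] for unfiltered summaries and T1 = µQ[Σ (μR(yi) ∩ μS(yi)) / Σ μR(yi)] for filtered ones, with the minimum as t-norm. The piecewise-linear quantifier `most` and all function names are illustrative assumptions, not definitions taken from this work.

```python
from typing import Callable, Sequence


def most(r: float) -> float:
    """Hypothetical membership function for the quantifier 'most' (assumption)."""
    if r >= 0.8:
        return 1.0
    if r <= 0.3:
        return 0.0
    return 2.0 * r - 0.6


def truth_unfiltered(mu_s: Sequence[float],
                     mu_q: Callable[[float], float]) -> float:
    """T1 for 'Q y's are S': quantifier applied to the mean summarizer membership."""
    r = sum(mu_s) / len(mu_s)
    return mu_q(r)


def truth_filtered(mu_s: Sequence[float], mu_r: Sequence[float],
                   mu_q: Callable[[float], float]) -> float:
    """T1 for 'Q R y's are S', using the minimum as the t-norm."""
    denominator = sum(mu_r)
    if denominator == 0:
        return 0.0  # no object satisfies the filter to any degree
    r = sum(min(s, f) for s, f in zip(mu_s, mu_r)) / denominator
    return mu_q(r)


# Membership degrees of four objects in the summarizer and in the filter.
summarizer_degrees = [1.0, 0.8, 0.6, 0.2]
filter_degrees = [1.0, 1.0, 0.0, 0.0]
t1_plain = truth_unfiltered(summarizer_degrees, most)
t1_filtered = truth_filtered(summarizer_degrees, filter_degrees, most)
```

In this toy case, the filter selects the two objects that fit the summarizer best, so the filtered summary obtains a higher truth value than the unfiltered one.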
T2: Degree of Imprecision: This is an important validity criterion [9,10,12], since a highly imprecise linguistic summary with a high Degree of Truth is not very useful. For example, the summary “on almost every winter day, the temperature is quite cold” has a very high truth value (T1 would tend towards 1); however, it is so imprecise that it conveys little valuable information. Equations (6) and (7) are used to calculate this indicator. As can be seen in Equation (7), this indicator depends on the summarizer of the summary and not on the database; that is, the records are not traversed to calculate it, since only the cardinality of each linguistic set of the summarizer, as well as its domain, is considered.
where
m is the number of fuzzy sets in the summarizer S = {S1, S2, …, Sj, …, Sm}.
denotes the cardinality of the corresponding fuzzy set.
T3: Coverage Degree: Indicates how many of the objects in the database that meet the filter R are covered by the summary whose summarizer is S. Its interpretation is simple; for example, a value of 0.15 means that 15% of the objects are consistent with the summary in question. The value of this indicator clearly depends on the content of the database [9,10,12] and is calculated as shown in Equations (8)–(10).
This indicator is based on the conditional probability of the summarizers occurring given that the filters hold, and on the concept of the support of a fuzzy set from fuzzy logic.
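The support-based reading of the Coverage Degree can be sketched as follows. The sketch assumes the ratio commonly used in the literature: the number of objects with non-zero membership in both the summarizer and the filter, divided by the number of objects with non-zero membership in the filter. The function name and the sample values are hypothetical.

```python
from typing import Sequence


def coverage_degree(mu_s: Sequence[float], mu_r: Sequence[float]) -> float:
    """Share of filter-supporting objects that also support the summarizer."""
    # h_i = 1 when the object belongs to the support of the filter.
    in_filter = [r > 0 for r in mu_r]
    # t_i = 1 when the object supports both the filter and the summarizer.
    in_both = [s > 0 and r > 0 for s, r in zip(mu_s, mu_r)]
    if not any(in_filter):
        return 0.0  # the filter is supported by no object
    return sum(in_both) / sum(in_filter)


# Three objects support the filter; two of them also support the summarizer.
t3 = coverage_degree([0.9, 0.0, 0.4, 0.7], [1.0, 0.5, 0.0, 0.2])
```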
T4: Degree of Suitability or Appropriateness: This is the most relevant indicator according to [9,10,12], see Equations (11)–(13). Let us assume that the summary contains a description (fuzzy set) S = (S1, S2, …, Sm), which is partitioned into m summarizers composed of attributes A1, A2, …, Am, such that each summarizer corresponds to a fuzzy set; then, it is denoted as
T5: Length of a Summary: This is an important indicator, because the longer the summary, the harder it is to understand, see Equation (15):
T6: Strength of Discovered Dependencies [7,12]: Measures the algorithms’ ability to detect summaries with strong filter–summarizer relationships. This indicator is measured by analyzing the frequency of occurrence of quantifiers that indicate strong dependency relationships, such as “most”, “almost all”, and “many”. The following steps were applied to calculate this indicator:
- Step 1: For each database, the relative frequency of occurrence of each quantifier in the linguistic summaries generated by each algorithm is calculated.
- Step 2: For each database, the weighted sum of the relative frequencies is calculated using the following weights: 0 for the quantifiers “very few”, “few”, and “some”, and 0.08, 0.1, 0.4, and 0.42 for the quantifiers “approximately half”, “many”, “most”, and “almost all”, respectively.
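The two steps above can be sketched in Python as follows. The weights are the ones stated in Step 2; the function name and the sample list of quantifiers are hypothetical.

```python
from collections import Counter
from typing import Sequence

# Weights from Step 2: weak quantifiers contribute nothing.
QUANTIFIER_WEIGHTS = {
    "very few": 0.0,
    "few": 0.0,
    "some": 0.0,
    "approximately half": 0.08,
    "many": 0.1,
    "most": 0.4,
    "almost all": 0.42,
}


def dependency_strength(quantifiers: Sequence[str]) -> float:
    """Weighted sum of the relative frequencies of the quantifiers (T6)."""
    if not quantifiers:
        return 0.0
    total = len(quantifiers)
    counts = Counter(quantifiers)  # Step 1: frequency of each quantifier
    # Step 2: weight each relative frequency and add the results.
    return sum(QUANTIFIER_WEIGHTS[q] * c / total for q, c in counts.items())


# Quantifiers of four summaries generated for one database by one algorithm.
t6 = dependency_strength(["most", "most", "few", "almost all"])
```

For this sample, the relative frequencies are 0.5 (“most”), 0.25 (“few”), and 0.25 (“almost all”), so the weighted sum is 0.5 × 0.4 + 0.25 × 0.42 = 0.305.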
T7: Evaluation Integrated (CWW): An integrated summary of T1–T6, calculated as an average using computing with words (CWW) via a two-tuple method. It integrates all previous indicators into a single value to globally assess the quality of the summaries or algorithms, and it is used as the final metric for ranking [7,12], see Equation (16):
There are also other relevant notations for algorithms.
Definition 1. Controlled natural language (CNL) is a constructed language based on a specific natural language (e.g., Spanish, English, Arabic, Japanese, etc.). It is more restrictive in terms of vocabulary, syntax, and/or semantics while preserving most of its natural properties, so that speakers of the base language can intuitively and accurately understand texts in the controlled natural language, at least to a substantial degree. It includes a CNLGrammar and a CNLDictionary with simple phrases that describe the variables and attributes of the problem at hand.
Definition 2. A linguistic variable is defined by a quintuple (x, T(x), X, G, M), where x is the name of the variable, T(x) is the set of linguistic terms, X is the universe of discourse, M is a semantic rule that associates each linguistic value Z with its meaning M(Z) and where M(Z) denotes a fuzzy set in X, and G is the set of syntactic rules for generating compound terms from the atomic terms that make up the sentences that give rise to each linguistic value. An example of linguistic variables (LVs) for the variable “vgos” is shown below.
X is the universe of discourse, defined on the interval [−0.8, 0.8].
T(X) = {Extreme Low, Very Low, Low, Medium, High, Very High, Extreme High}.
M(Z) denotes the fuzzy sets, represented in this proposal by triangular membership functions:
Extreme Low (−0.8, −0.8, −0.5);
Very Low (−0.75, −0.5, −0.25);
Low (−0.5, −0.25, 0);
Medium (−0.25, 0, 0.25);
High (0, 0.25, 0.5);
Very High (0.25, 0.5, 0.75);
Extreme High (0.5, 0.8, 0.8).
G is the syntactic rule that describes the relationship between the fuzzy sets, which can be represented by triangular functions and their overlaps.
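The example above can be made concrete with a short Python sketch. It assumes that each triple (a, b, c) gives the left foot, the peak, and the right foot of a triangular membership function, and it selects the label with the highest membership (the maximum membership principle). The function names are illustrative, and the input is assumed to lie within [−0.8, 0.8].

```python
# Triangular parameters (a, b, c) of the linguistic terms listed above.
VGOS_TERMS = {
    "Extreme Low": (-0.8, -0.8, -0.5),
    "Very Low": (-0.75, -0.5, -0.25),
    "Low": (-0.5, -0.25, 0.0),
    "Medium": (-0.25, 0.0, 0.25),
    "High": (0.0, 0.25, 0.5),
    "Very High": (0.25, 0.5, 0.75),
    "Extreme High": (0.5, 0.8, 0.8),
}


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership degree of x in the triangle with feet a, c and peak b."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0  # also covers the shoulder cases a == b and b == c
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def linguistic_label(x: float) -> str:
    """Label whose fuzzy set gives x the highest membership degree."""
    return max(VGOS_TERMS, key=lambda term: triangular(x, *VGOS_TERMS[term]))


label_for_zero = linguistic_label(0.0)
label_for_030 = linguistic_label(0.3)
```

A value of 0.3 belongs to “High” with degree 0.8 and to “Very High” with degree 0.2, so it is labeled “High”.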
2.3. Algorithm Description
The proposed algorithm uses probabilistic graph learning techniques to discover relationships between variables [11]. From the graph learning, a set of probabilistic trees is generated that represent the different relationships between the variables (Algorithm 1).
| Algorithm 1. Generation of linguistic summaries combining probabilistic models and GenAI |
Input:
Dataset U = {x1, x2, …, xn}, where each xi is an object (row) described by p attributes, and n = |U| is the total number of objects.
Controlled natural language definition CNL = (CNLGrammar), where CNLGrammar denotes the grammar of the linguistic summaries and contains simple phrases that describe the variables and attributes of the problem (see Section 2.1).
List of indicators T = (T1, T2, …, T7) for the quality assessment of linguistic summaries (detailed information about the indicators in Section 2.1).
Output:
Set of linguistic summaries S
Quality evaluation scores Q(S)
Begin
- Step 1. Candidates = ∅
- Step 2. D = build_fuzzy_dataset(U) // Algorithm 2
- Step 3. prob_graph = build_probabilistic_graph(D)
- Step 4. For each tree prob_graphi in prob_graph do
- Step 4.1. If prob_graphi has more than one vertex
- Step 4.2. candidatesi = do_candidate_from_branches(prob_graphi)
- Step 4.3. Candidates ← Candidates ∪ candidatesi
End of the conditional statement started in step 4.1
End of the cycle started in step 4
- Step 5. Summaries = agent_generate_summaries(U, Candidates, CNL)
- Step 6. calculate_T(U, Summaries, T)
- Step 7. return Summaries
End |
Essentially, step 2 of Algorithm 1 (build_fuzzy_dataset) receives the dataset U as input and produces the fuzzified dataset D. See Algorithm 2, build_fuzzy_dataset(U).
| Algorithm 2. build_fuzzy_dataset(U) |
Input:
U: dataset
Begin
- Step 1. U’ = clean_DataSet(U)
- Step 2. ALV = ∅
- Step 3. For each numeric attribute Ai in U’ do
- Step 4. LVi = discretize_build_linguistic_variables(Ai, U’) // Algorithm 3
- Step 5. ALV = ALV ∪ LVi // obtain the linguistic variable for each attribute (Definition 2)
- Step 6. D = {[ALV1(y1), ALV2(y1), …, ALVm(y1)], …, [ALV1(yn), ALV2(yn), …, ALVm(yn)]} // dataset obtained from the fuzzification of set U’: each numeric value in U’ is transformed into a linguistic label by applying the maximum membership principle, considering its membership in the fuzzy sets of the linguistic variable in ALV that represents the attribute in question
- Step 7. return D
End |
In the first step of Algorithm 2, the preprocessing of dataset U is performed, and the dataset is cleaned to reduce information uncertainty. The output is dataset U’.
In the context of this research, uncertainty refers to the lack of certainty or precise knowledge about a phenomenon, process, or prediction that can affect the decision-making process [9,16]. It can be caused by different factors, including:
Data noise: This can be caused by errors in measurement processes, including calibration errors associated with measurement technologies or equipment, distorting the actual signal being analyzed. In this particular study, there is noise present in the satellite data used in the experiment. However, this is acceptable given the proposed objectives associated with the macro-analysis of the area; an excessively high level of precision is not required.
Vagueness of concepts: This prevents the definition of clear boundaries between linguistic categories (e.g., “high”, “low”, and “acceptable”). These terms depend on the context and even on the personal preferences of those evaluating them, introducing subjectivity into the interpretation.
Data incompleteness: This reflects the absence of relevant information, whether due to omission, lack of measurement, or the inability to record all necessary variables. This leads to conclusions that are always provisional and subject to revision.
Inconsistencies: This refers to internal contradictions in the available information, such as when two reliable sources offer opposing data. In real-world settings, these inconsistencies are frequent and add an additional layer of uncertainty.
The data are checked for problems such as incompleteness, inconsistencies, or outliers, and the records in set U that could affect data quality for any of these three reasons are removed. However, it is acknowledged that the data used, available in [17], are of high quality. Removing the problematic records does not significantly affect the sample given the amount of data available.
In the next stage of Algorithm 2 (Steps 3 to 5), the data are discretized, and a fuzzy set is constructed for each resulting interval or cluster. The following strategies can be applied in the proposed approach:
Follow a simple attribute discretization process, using the following strategy: constructing intervals of equal size or intervals with equal frequency. The advantages of this approach are its low computational complexity and the fact that it does not require experts. A disadvantage is that the constructed intervals may not accurately represent the natural grouping of the data.
Another alternative is to carry out an attribute-level clustering process to construct the intervals. The main advantages of this method are that it does not require experts and that the discovered intervals better represent the natural grouping of the data (see Algorithm 3). In this case, we work with seven clusters for each attribute. Seven is chosen because it is the number recommended for the construction of valid linguistic structures, as it guarantees a balance between the level of granularity and the level of semantic coherence and comprehension by human experts [7,9,10].
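The first strategy above, simple discretization into intervals of equal size or equal frequency, can be sketched in plain Python as follows. The function names are hypothetical, and the equal-frequency edges are taken directly from the sorted sample, which is only one of several possible quantile conventions.

```python
from typing import List, Sequence


def equal_width_edges(values: Sequence[float], k: int) -> List[float]:
    """k + 1 edges that split [min, max] into k intervals of equal size."""
    low, high = min(values), max(values)
    step = (high - low) / k
    return [low + i * step for i in range(k + 1)]


def equal_frequency_edges(values: Sequence[float], k: int) -> List[float]:
    """k + 1 edges such that each interval holds about len(values) / k points."""
    ordered = sorted(values)
    n = len(ordered)
    # Edge i is the sample value at rank i * n / k (clamped to the last item).
    return [ordered[min(n - 1, (i * n) // k)] for i in range(k + 1)]


sample = [1, 2, 3, 4, 5, 6, 7, 8]
width_edges = equal_width_edges(sample, 4)
frequency_edges = equal_frequency_edges(sample, 4)
```

Both functions only sort or scan the values once, which reflects the low computational cost mentioned above; neither adapts the edges to the natural grouping of the data.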
| Algorithm 3. Discretize_build_linguistic_variables (A, U’) |
// The algorithm applies to numerical data, as in the proposed scenario
Input:
A: numeric attribute to be discretized.
U’: cleaned dataset.
K = 7
Begin
- Step 1. V = Retrieve the values of attribute A from the records in U’
- Step 2. Randomly select K points of V as the initial centroids (centers of the clusters C)
- Step 3. Calculate the Euclidean distance between each data point and each centroid.
- Step 4. Assign each point to the nearest centroid.
- Step 5. Recalculate the position of each centroid by taking the average of all the points assigned to that cluster.
- Step 6. Repeat steps 3 to 5 until the centroids no longer change significantly or a maximum number of iterations is reached.
- Step 7. C is the set of clusters obtained from steps 2 to 6.
- Step 8. LV = linguistic variable associated with the attribute, constructed so that each resulting cluster in C constitutes a fuzzy set. See Definition 2.
- Step 9. Assign each fuzzy set a semantic meaning expressed by a linguistic label in the set LBTL = {Extreme Low, Very Low, Low, Medium, High, Very High, Extreme High}. The cluster with the lowest values receives the linguistic label Extreme Low, and so on successively.
- Step 10. return LV
End |
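The clustering loop of Algorithm 3 and the assignment of labels in ascending order can be sketched in plain Python for a single attribute. This is a minimal one-dimensional k-means under stated assumptions: the names `kmeans_1d` and `label_centroids`, the tolerance, the fixed seed, and the choice to keep the previous centroid for an empty cluster are illustrative and not taken from this work. The construction of the fuzzy sets themselves from the clusters is not covered.

```python
import random
from typing import Dict, List, Sequence

LABELS = ["Extreme Low", "Very Low", "Low", "Medium",
          "High", "Very High", "Extreme High"]


def kmeans_1d(values: Sequence[float], k: int = 7, max_iter: int = 100,
              tol: float = 1e-9, seed: int = 0) -> List[float]:
    """Return the k centroids, sorted ascending (needs >= k distinct values)."""
    rng = random.Random(seed)
    centroids = rng.sample(sorted(set(values)), k)  # random initial centroids
    for _ in range(max_iter):
        clusters: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            # In one dimension the Euclidean distance is the absolute difference.
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # New centroid = mean of its points; an empty cluster keeps its centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift <= tol:  # centroids no longer change significantly
            break
    return sorted(centroids)


def label_centroids(centroids: Sequence[float]) -> Dict[str, float]:
    """Attach the labels in ascending order: lowest centroid is Extreme Low."""
    return dict(zip(LABELS, sorted(centroids)))


# Synthetic values grouped around seven levels (illustrative data only).
values = [c + d for c in (-0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6)
          for d in (-0.01, 0.0, 0.01)]
centroids = kmeans_1d(values, k=7, seed=1)
labeled = label_centroids(centroids)
```

Because k-means depends on the random initialization, the centroids may correspond to a local optimum; fixing the seed only makes a given run repeatable.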
Then, in step 3 of Algorithm 1, the dataset is used to learn the probabilistic model that best approximates the behavior of the data [18,19]. The following alternative algorithms can be used, given their ability to learn probabilistic graphs from data:
The Chow Liu algorithm was initially proposed for constructing trees from data [20,21].
The Rebane–Pearl algorithm [22,23] extends the Chow Liu algorithm and enables the learning of polytrees, allowing for the description of higher-order interactions.
The Polytree Approximation Algorithm (PA) [24] bases its learning on the calculation of marginal and conditional Mutual Information.
The Learning Polytree Algorithm (LPA) [24] calculates marginal and conditional dependencies using computational methods based on the concept of entropy.
The algorithms in step 3, which learn the probabilistic models, generate a polytree. Figure 1 shows an example of a generated polytree.
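To illustrate the idea behind the Chow Liu algorithm, the following minimal sketch computes the empirical mutual information between every pair of discretized attributes and keeps the edges of a maximum-weight spanning tree (Kruskal's method with a union-find structure). It returns only the undirected skeleton; orienting the edges from a root is omitted. The function names and the toy columns are assumptions made for illustration and do not reproduce the implementation used in the experiments.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def mutual_information(xs: Sequence[str], ys: Sequence[str]) -> float:
    """Empirical mutual information (in nats) between two label columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def chow_liu_edges(data: Dict[str, Sequence[str]]) -> List[Tuple[str, str]]:
    """Edges of the maximum-weight spanning tree over pairwise mutual information."""
    names = list(data)
    scored = sorted(((mutual_information(data[a], data[b]), a, b)
                     for a, b in combinations(names, 2)), reverse=True)
    parent = {name: name for name in names}

    def find(v: str) -> str:
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    edges = []
    for _, a, b in scored:  # strongest dependencies first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding the edge does not create a cycle
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


# Toy fuzzified columns (hypothetical): "adt" copies "sla", "noise" does not.
columns = {
    "sla": ["Low", "Low", "High", "High", "Low", "High"],
    "adt": ["Low", "Low", "High", "High", "Low", "High"],
    "noise": ["Low", "High", "Low", "High", "High", "Low"],
}
tree_edges = chow_liu_edges(columns)
```

The pair with the highest mutual information is connected first, and a tree over m attributes always ends with m − 1 edges.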
Then, in step 4 of Algorithm 1, candidate objects are generated for each tree, such that the leaf node is identified as the summarizing attribute and the nodes on the branches are identified as filters.
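One possible reading of this step is sketched below: every root-to-leaf branch of a tree yields one candidate whose summarizer is the leaf and whose filters are the nodes along the branch. The representation of the tree as a dictionary from parent to children, the function name, and the attribute names are assumptions made for illustration.

```python
from typing import Dict, List


def candidates_from_branches(children: Dict[str, List[str]],
                             root: str) -> List[dict]:
    """One candidate per root-to-leaf branch: leaf = summarizer, path = filters."""
    candidates = []
    stack = [(root, [])]  # (node, nodes visited before reaching it)
    while stack:
        node, path = stack.pop()
        kids = children.get(node, [])
        if kids:
            # Push in reverse so that children are visited in listed order.
            for kid in reversed(kids):
                stack.append((kid, path + [node]))
        elif path:
            # A leaf with ancestors; a single-vertex tree yields no candidate.
            candidates.append({"filters": path, "summarizer": node})
    return candidates


# Hypothetical tree: sla -> adt -> vgos, and sla -> ugos.
tree = {"sla": ["adt", "ugos"], "adt": ["vgos"]}
branch_candidates = candidates_from_branches(tree, "sla")
```

For this tree, the branches produce the candidates (filters sla and adt, summarizer vgos) and (filter sla, summarizer ugos).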
In step 5 of Algorithm 1, candidate summaries are generated from the list of candidate objects by transforming each object into a summary object whose attributes are the filters and the summarizers. This step involves agents supported by GenAI (the algorithm “agent_generate_summaries”) that construct linguistic summaries from the pre-constructed candidate summaries.
The fundamental contribution of generative artificial intelligence in this step is its ability to generate content in natural language, the explainability of the models, and its multilingual approach [25,26,27,28]. Essentially, prompting techniques are applied, and the desired structure of the summaries is precisely designed by specifying a grammar in a natural language. Then, these prompts, specially designed for generating summaries from the submitted candidate summaries, are executed using LLMs or SLMs, and the final summaries are obtained. The multilingual capabilities incorporated into the models facilitate the use of the proposed algorithms by researchers and professionals from different countries and languages.
The following techniques are used to carry out this task:
Below is one of the English-language templates used as linguistic molds to ensure that the generated summaries follow a consistent, semantically valid format and are aligned with the proposed protoform theory. Other similar templates for the rest of the languages are presented in [1].
<Linguistic Summary>::= <D> <data descriptor connector> <Q> <quantifier connector> <y’> <filter connector> <R> <summary connector> <S> | <Q> <y’> <filter connector> <R> <summary connector> <S> | <Q> <y’> <summary connector> <S>
<D>::= “almost all” | “most” | “many” | “around a half of” | “some” | “few” | “very few”
<data descriptor connector>::= “records show, that”
<quantifier connector>::= “of times”
<filter connector>::= “with”
<summary connector>::= “have”
<y’>::= <subject>
<subject>::= <simple phrase>
<Q>::= <linguistic quantifier> | <numeric quantifier> | <mixed quantifier> | <percent quantifier>
<linguistic quantifier>::= “almost all” | “most” | “many” | “around a half of” | “some” | “few” | “very few”
<numeric quantifier>::= “more than 95% of” | “around 85% of” | “around 75% of” | “approximately 50%” | “close to 33%” | “less than 17%” | “less than 5%”
<mixed quantifier>::= “very few (less than 5%)” | “few (around of 15%)” | “some of (close to 33%)” | “around a half of” | “many (close to 65%)” | “most of (around of 83%)” | “almost all of”
<percent quantifier>::= <percent connective> <percent numeric value>
<percent connective>::= “in”
<S>::= <phrase>
<R>::= <phrase>
<phrase>::= <phrase> <logical operator> <simple phrase> | <simple phrase>
<logical operator>::= <conjunction> | <disjunction>
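As a small illustration of how the second and third alternatives of the template read once filled in, the following sketch assembles a sentence from its parts using the connectors “with” and “have”. It covers only the linguistic quantifiers, and the function name and the example phrases are hypothetical; in the proposed approach, the sentences are produced by the GenAI agent guided by the grammar, not by string concatenation.

```python
from typing import Optional

# Linguistic quantifiers allowed by the template above.
LINGUISTIC_QUANTIFIERS = ("almost all", "most", "many", "around a half of",
                          "some", "few", "very few")


def render_summary(quantifier: str, subject: str, summarizer: str,
                   filter_phrase: Optional[str] = None) -> str:
    """Build '<Q> <y'> [with <R>] have <S>' for a linguistic quantifier."""
    if quantifier not in LINGUISTIC_QUANTIFIERS:
        raise ValueError(f"unsupported quantifier: {quantifier!r}")
    words = [quantifier, subject]
    if filter_phrase is not None:
        words += ["with", filter_phrase]  # filter connector
    words += ["have", summarizer]  # summary connector
    return " ".join(words)


unfiltered = render_summary("most", "measurements", "high vgos_mean")
filtered = render_summary("most", "measurements", "high vgos_mean",
                          filter_phrase="low sla")
```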
In the experimentation with the AI models, the following hyperparameters were established to help control the models’ hallucination:
Temperature = 0.0;
top_p = 0.95;
top_k = 40 (Gemini);
max_output_tokens (Gemini) = 8192;
max_completion_tokens (GPT-5 and GPT-5 Mini) = 16,384;
response_mime_type (Gemini) = “application/json”;
response_format (GPT-5 and GPT-5 Mini) = {“type”: “json_object”};
stop_sequences (Gemini) = not used;
presence_penalty (GPT-5 and GPT-5 Mini) = 0.0;
frequency_penalty (GPT-5 and GPT-5 Mini) = 0.0;
stop (GPT-5 and GPT-5 Mini) = null;
Seed = 42;
logit_bias (GPT-5 and GPT-5 Mini) = not applied;
logprobs = true;
top_logprobs (GPT-5 and GPT-5 Mini) = 5;
reasoning_effort (GPT-5) = “high”;
safety_settings (Gemini) = BLOCK_NONE.
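For reference, the settings listed above can be grouped per model family as plain Python dictionaries. This is only an organizational sketch: treating the settings listed without a model qualifier (temperature, top_p, seed, and logprobs) as shared by all models is an assumption, the dictionaries are not passed to any SDK here, and each provider's client may expect a different structure or parameter names.

```python
# Settings listed without a model qualifier; sharing them is an assumption.
COMMON = {"temperature": 0.0, "top_p": 0.95, "seed": 42, "logprobs": True}

# Settings listed for Gemini.
GEMINI = {
    **COMMON,
    "top_k": 40,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json",
    "stop_sequences": None,  # not used
    "safety_settings": "BLOCK_NONE",
}

# Settings listed for GPT-5 (logit_bias is not applied, so it is omitted).
GPT5 = {
    **COMMON,
    "max_completion_tokens": 16384,
    "response_format": {"type": "json_object"},
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "stop": None,
    "top_logprobs": 5,
    "reasoning_effort": "high",
}

# GPT-5 Mini shares the GPT-5 settings except reasoning_effort (GPT-5 only).
GPT5_MINI = {k: v for k, v in GPT5.items() if k != "reasoning_effort"}
```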
Based on the definition of the above elements, an intelligent agent, specialized in constructing linguistic summaries, was built. It was established that this agent would be supported by a GenAI model and the elements declared above.
Next, in step 6, the quality indicators for each linguistic summary are calculated. The indicators Degree of Truth (T1), Degree of Imprecision (T2), Degree of Coverage (T3), Degree of Suitability or Appropriateness (T4), Length (T5), Strength of Discovered Dependencies (T6), and Evaluation Integrated (CWW) (T7) are explained in Section 2.1. Essentially, this step assesses how well each linguistic summary covers or represents the objects of the dataset D produced in step 2 of the algorithm.
The following section presents the validation results of the proposed algorithm. The proposed algorithm is applied to the analysis of kinetic energy along the coasts of Northern Chile.
2.4. Implementation Details and Reproducibility
All experiments were implemented in Python 3.14. The data processing and analysis pipeline was developed using the libraries Xarray, Pandas, and Streamlit, all in versions compatible with Python 3.14.
The construction of probabilistic models was based on the Chow Liu algorithm, applied over oceanographic variables obtained from the Copernicus Marine Service dataset [17] (last accessed on 29 January 2026). The dataset includes Sea Level Anomaly (SLA), Absolute Dynamic Topography (ADT), and geostrophic velocities (absolute and anomalies) derived from multi-mission satellite altimetry using optimal interpolation techniques.
Given the high quality of the Copernicus data, preprocessing was minimal and consisted primarily of data structuring using Xarray. Standardization and normalization procedures were considered and partially applied depending on the variable, although no aggressive cleaning or filtering was required. No pruning strategies were applied to the probabilistic trees.
The generation of linguistic summaries was performed using GenAI models. Small models were deployed locally using OpenWebUI v0.6, while larger models were accessed externally. Default model configurations were used, including temperature and maximum token parameters, as defined by each model and the OpenWebUI framework.
Each experiment was repeated 11 times to ensure the stability and consistency of the results. Statistical analyses were conducted using PSPP, applying non-parametric tests (Friedman and Wilcoxon) with Holm–Bonferroni correction at a significance level of α = 0.05.
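The Holm–Bonferroni step-down rule applied to the pairwise comparisons can be sketched as follows. The statistical analysis itself was carried out in PSPP; this plain Python function and the sample p-values are hypothetical and only illustrate the rule: the p-values are sorted in ascending order, the one at rank i (starting from 0) among m is compared with α / (m − i), and testing stops at the first non-rejection.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each hypothesis, whether it is rejected under Holm's rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# Four hypothetical pairwise comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Here the thresholds are 0.0125, 0.0167, 0.025, and 0.05: the p-values 0.005 and 0.01 lead to rejection, while 0.03 exceeds 0.025, so it and 0.04 are retained.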
All experiments were executed on a local computing environment with the following specifications: 13th Gen Intel® Core™ i5-13420H CPU (2.10 GHz), 16 GB RAM, GPU with 6 GB VRAM, and 477 GB of storage, running a 64-bit operating system.
The probabilistic trees, generated linguistic summaries, and experimental outputs are available and can be shared via Google Drive upon reasonable request to facilitate reproducibility.
Figure 2 illustrates the overall architecture of the proposed framework. This figure represents, in IDEF0 format, all the processes that occur, including their inputs, outputs, and mechanisms. The diagram more clearly illustrates the following elements that contribute to the reproducibility of the proposal:
In the linguistic summary generation process, the combined use of generative AI models with controlled natural language grammars is key. This combination is achieved through best practices in prompting.
An important element is the evaluation of the quality of the generated linguistic summaries using indicators T1 through T7.
Later in the experimentation process, to compare the effectiveness and efficiency of different generative AI models, linguistic summaries are generated for each model and evaluated using the proposed indicators T1 through T7.
Finally, the generated summaries are evaluated by external human evaluators following Human-in-the-Loop principles for final decision-making.