Article

HyEWCos: A Comparative Study of Hybrid Embedding and Weighting Techniques for Text Similarity in Short Subjective Educational Text

by Hendry Hendry 1,*, Tukino Tukino 1,2,*, Eko Sediyono 1, Ahmad Fauzi 2 and Baenil Huda 2
1 Faculty of Information Technology, Satya Wacana Christian University, Salatiga 50711, Indonesia
2 Faculty of Computer Science, Buana Perjuangan Karawang University, Karawang 41361, Indonesia
* Authors to whom correspondence should be addressed.
Information 2025, 16(11), 995; https://doi.org/10.3390/info16110995
Submission received: 15 September 2025 / Revised: 15 October 2025 / Accepted: 27 October 2025 / Published: 17 November 2025

Abstract

This study evaluates and contrasts the performance of different combinations of embedding algorithms and weighting schemes in measuring perception-based text similarity using the Cosine Similarity approach. Within a structured experimental design, a hybrid model referred to as HyEWCos (Hybrid Embedding and Weighting for Cosine Similarity) was built, incorporating conventional embedding models (Word2Vec, FastText), transformer-based models (BERT, GPT), and statistical and linguistic word-weighting schemes (TFIDF, BM25, POS-weighting, and N-weighting). The test results indicate that Word2Vec combined with the CBOW architecture and TFIDF weighting consistently returned the most reliable performance, with the lowest error values (RMSE and MAE of 0.9868) and the highest ranking correlation with expert judgment (Pearson’s, 0.524; Spearman’s, 0.543). These results show that contextually conditioned distributional representation approaches preserve the semantic subtlety of short and subjective texts better than transformer models that are not fine-tuned. This work is unique in its evaluation framework because it integrates embedding and weighting approaches that have hitherto been examined mostly in isolation. The main contribution of the study is an experimental framework that serves as a foundation for building more stable and accurate text-based assessment systems. The research also demonstrates the need to base decisions on representation methods on the data type and domain, and opens a door for continuing research into adaptive hybrid models and how their potential can be realized by combining the strengths of various approaches.

1. Introduction

Perception measurement plays an essential role in evaluation processes across various domains, including education, artificial intelligence, and machine learning. One common approach to perception assessment is text similarity measurement, which relies on the integration of embedding methods and word-weighting schemes to represent textual meaning in a contextual manner. In this study, we intentionally employ BERT and GPT as static embedding encoders, without any domain-specific fine-tuning, in order to examine their baseline representational capability within hybrid weighting–embedding configurations. This methodological decision avoids task-specific adaptation and allows a neutral comparison, consistent with previous comparative studies where transformer-based embeddings were evaluated in their general, pre-trained form [1,2]. At the same time, classical weighting models—such as TFIDF, BM25, and Part-of-Speech (POS) weighting—are integrated to emphasize statistically and linguistically important terms, providing a complementary perspective on feature relevance in text similarity measurement.
Word2Vec has proved useful in emotion classification tasks because of its ability to preserve semantic meaning and is still widely applied in sentiment analysis and text classification [3]. FastText has shown a stronger capacity to capture contextual subtlety, particularly in multi-aspect sentiment analysis, compared to conventional techniques such as TFIDF [4]. Contextual embedding methods tend to improve real-time text detection accuracy [5]. Moreover, BERT-based models have surpassed earlier embeddings in representing contextual word senses across diverse fields [6]. TFIDF weighting has also been shown consistently to perform well in separating crucial words from frequent words [7,8], and its performance is enhanced further when used with embeddings such as Word2Vec for classification and sentiment analysis tasks [8]. For large-scale assessment texts, BM25 has likewise been observed to be highly effective and accurate in determining relevance scores. POS-weighting also enhances text similarity modeling because it incorporates grammatical categories, such as nouns, verbs, and adjectives, enabling weights to be differentiated by word class.
Hybrid approaches that intertwine embedding and weighting techniques enable models to represent text comprehensively, encompassing both statistical and semantic features [9]. This combination supports the representation of structural relationships in text and enhances similarity assessment by incorporating inter-entity associations into the information [10]. For example, merging Word2Vec and TFIDF improves text summarization effectiveness through relevance assessment based on cosine similarity [11]. Recent studies have also investigated how standard weighting methods (TFIDF, BM25, POS, and N-weighting) can be integrated with modern embedding methods (Word2Vec, FastText, BERT, and GPT) and have proved effective for text processing and perception-based assessment tasks [12].
Cosine similarity has been a standard metric for semantic similarity between text representations [13]. The metric computes the angle between vectors and has been utilized extensively in recent studies [12,13,14]. The combination of cosine similarity with text representation methods has proven useful in quantifying the similarity of students’ answers to reference answers, enhancing objectivity and efficiency in measuring text similarity [15]. Its use is particularly significant in expert opinion-based testing, where semantic similarity matters more than lexical similarity. In addition, the use of cosine similarity in measuring the alignment of students’ answers to curriculum standards has highlighted its applicability in educational assessment and planning [16]. It has also been used effectively in text-based conversational systems, such as university entrance chatbots, that rely on semantic relevance between request and response [17].
Despite these advancements, hybrid approaches that include weighting and embedding techniques within a single unified framework for text similarity measurement using cosine similarity have not been taken up extensively in prior work. Much of the previous work has centered on a single technique, either an embedding such as Word2Vec for retaining semantic meaning or a weighting technique such as TFIDF for emphasizing feature importance [18,19]. This decoupling restricts the ability to fully account for data complexity, particularly in tasks that require sensitivity to contextual semantics and feature weighting, such as perception-based text assessment. Embedding techniques tend to disregard explicit feature weights, while weighting techniques disregard semantic dimensions [19]. Additionally, the restricted interpretability of such models limits their practical adoption. The increased computational cost of combining both approaches also imposes constraints, and therefore most studies prefer separate techniques or low-quality approximations. A sophisticated and efficient methodology that systematically utilizes weighting and embedding strategies for text similarity measurement is therefore still lacking, leaving an essential knowledge gap in the literature.
To address this gap, the present research synthesizes and compares various embedding and weighting methods that have not been comprehensively explored in the context of perception-based measures using text similarity. The aim is to experimentally compare the performance of varying combinations of embedding and weighting for measuring text perception similarity using cosine similarity, in order to find the optimal configuration.
The primary contribution of this study is its systematic comparative examination of hybrid approaches that combine embedding and weighting techniques for the evaluation of perception similarity in test texts. The novelty of this research lies in applying a hybrid approach to text similarity evaluation using cosine similarity, which has not been extensively or systematically addressed in the present literature.

2. Related Work

Text similarity measurement, particularly in the field of perception-based text assessment, is the central topic of the current research. Although several comparative studies have evaluated text similarity methods using various embedding techniques, most remain limited to isolated analyses of either embeddings or weighting schemes. Previous works typically focused on enhancing representation accuracy through embedding optimization alone, without examining how weighting mechanisms could interact with or influence semantic performance. This gap highlights the need for integrative evaluation frameworks that systematically analyze the interplay between embedding and weighting methods, an issue this study directly addresses through the proposed HyEWCos framework.
The application of the cosine similarity algorithm for determining content similarity in information systems has been explored through comparative paradigms that merge various embedding models, such as Word2Vec, FastText, BERT, and GPT, with weighting approaches such as TFIDF, BM25, and POS-weighting [20]. This mixed method seeks to measure the performance of such models in order to improve the precision of perception-based text tests, with cosine similarity as the metric, validated statistically and through correlation analysis [21].
Various studies have established that cosine similarity is effective in measuring document similarity to detect content overlap in research and community service information systems [22]. These tests confirmed the capability of cosine similarity to identify inter-document relations in academic literature, which parallels the objective of the present study to identify semantic relations in perception-based tests. Other studies have applied case-based reasoning to detect document similarity, further proving the applicability of cosine similarity as a fundamental tool in similarity-based text analysis [23].
The application of weighting techniques, such as TFIDF, to measure text similarity has also been widely investigated. One study utilized TFIDF to assess the similarity of thesis titles, illustrating how weighted vector representations improve the accuracy of similarity detection. However, this approach was limited to a single type of weighting and did not account for potential synergy with embedding techniques, a gap that this study aimed to address. Other studies have indicated that embedding techniques, such as Word2Vec and FastText, can capture syntactic and semantic attributes with varying levels of accuracy depending on text complexity, highlighting the importance of embedding selection in deep semantic evaluations [24].
Several studies have attempted to combine embedding and weighting, albeit in limited contexts. For example, one study applied keyword extraction techniques based on text similarity indices in the domain of electric double-layer capacitors, although its primary focus was not on perception-based assessments [25]. Another study compared stemming algorithms and similarity methods for searching Qur’anic translations, which was methodologically relevant but did not offer an integrated embedding–weighting approach [26]. In the medical domain, the BioWordVec approach, which integrates word- and graph-based embeddings, has been employed to capture semantic relationships within clinical terminology [27]. Moreover, transformer-based neural network models have been used to produce semantically rich sentence representations, which have proven effective in complex text similarity analyses [25,26].
Performance evaluations involving cosine similarity have been conducted in various studies, including the development of mobile-based public aspiration systems. Quantitative evaluation methods such as RMSE and MAE were used to assess the accuracy of anomaly detection based on textual descriptions. These studies provided a foundation for the evaluation framework employed in this study. In addition, correlation metrics, such as Pearson and Spearman, have been increasingly used to assess the strength of the association between expert manual assessments and automated model evaluations. The application of cosine similarity has also been extended across domains, such as content-based filtering in tourism recommendation systems, demonstrating the flexibility of this approach in various text-based applications [28].
Overall, this literature review indicates that, although cosine similarity has been extensively applied in text similarity measurements, systematic studies comparing combinations of embedding and weighting techniques remain limited. Therefore, this study aims to address the need for innovation in developing automated text-based evaluation methods, particularly in the context of educational assessment systems [29].
This study makes theoretical and empirical contributions to the literature. Theoretically, this study expands our understanding of how the integration of embedding and weighting techniques influences the results of perception-based text evaluations. Empirically, this study proposed an evaluation framework applicable across various contexts of text-based assessment in academic environments and practical implementations. This approach is expected to enrich the scholarly discourse on text similarity and provide a valuable reference for future studies in this field.
Recent studies have extended transformer-based models such as BERT, RoBERTa, and GPT through fine-tuning and hybrid integration for semantic similarity tasks [1,29]. These works highlight the increasing relevance of contextualized embeddings in domain-specific applications, especially in educational NLP [30]. In contrast, this study contributes a comparative analysis that bridges distributional and contextual models under a unified hybrid weighting framework, with a specific focus on Indonesian evaluative text.
Beyond a descriptive synthesis of prior studies, this research extends the comparative analysis by examining existing frameworks that evaluate different approaches to text similarity measurement. Several recent comparative works have investigated the performance of transformer-based, distributional, and hybrid models [29,31]. However, these studies rarely provide a systematic examination of how word embedding and word weighting techniques can interact synergistically, particularly within the context of perception-based or short evaluative texts in the Indonesian language. Therefore, this study contributes by designing and evaluating a hybrid framework that integrates both embedding and weighting strategies into a unified text similarity measurement model.

3. Materials and Methods

3.1. Pre-Processing

In the effort to measure textual similarity between LED texts and expert evaluation perceptions, the primary challenge lies in the diversity of natural language structures, writing styles, and the clarity of narratives from each individual. Therefore, the pre-processing stage is crucial to ensure that the textual data exhibit linguistic quality and consistency prior to being represented as vectors for similarity measurements using cosine similarity [31].
The data used in this study consisted of open-ended assessment texts provided by assessors and domain experts that were collected through systematic documentation and manual labeling processes. This labeling process produced pairs of texts categorized into similar and dissimilar groups. All data were anonymized and curated to preserve validity and confidentiality in accordance with the research ethics principles [26].
The dataset used in this study consists of Indonesian evaluation texts collected from institutional self-assessment and perception reports. Following approval from the originating institution, the dataset has been anonymized and made publicly accessible for research purposes to enhance reproducibility and transparency.
The dataset is available at GitHub (GitHub, Inc., San Francisco, CA, USA), containing 260 text pairs annotated by human experts on a 0–4 similarity scale, where 0 indicates “not similar” and 4 indicates “highly similar.”
Pre-processing was performed sequentially to simplify textual structures, eliminate linguistic noise, and generate more meaningful representations in vector space. The pre-processing steps are as follows:
a. Case Folding.
All characters in the text were converted to lowercase. This normalization was necessary to eliminate artificial differences between semantically identical words that differed only in capitalization [32,33].
b. Punctuation Removal.
All punctuation marks, such as periods, commas, question marks, and exclamation points, were removed. Although punctuation serves syntactic functions in language, it often adds no informational value in the context of token-based semantic analysis.
c. Number Removal.
All numerical characters, including integers and fractions, were removed. In the context of word meaning processing and sentence structure, numbers generally do not contribute significantly to semantic interpretation [32].
d. Tokenization.
Texts were broken down into small linguistic units called tokens, in the form of words or phrases. Tokenization enables the segmentation of paragraphs into sentences and sentences into words, allowing for structured analysis of linguistic units [33].
e. Stop Word Removal.
Common words such as “the”, “on”, “to”, and “in” were removed because of their limited semantic contribution. This process aims to simplify text representation and reduce noise during the feature-extraction stage [34].
f. Lemmatization.
All words were reduced to their base or dictionary forms to consolidate their morphological variants and preserve their core semantic meanings. Lemmatization allows the system to recognize semantically identical words written in different inflectional forms [34].
These pre-processing steps were designed to enhance the linguistic homogeneity of the data and reduce semantic ambiguity, which plays a critical role in improving the accuracy of inter-text similarity computations. The systematic implementation of pre-processing renders unstructured raw data more suitable for analysis in machine-learning-based systems or vector representations. This aligns with previous findings indicating that pre-processing contributes to improved model performance in text classification, similarity measurement, and sentiment analysis tasks [34,35,36,37].
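To make these steps concrete, the following is a minimal sketch of the pre-processing pipeline (steps a–f) in Python. It uses NLTK, consistent with the Natural Language Toolkit named in Section 3.4; the English stop-word list and WordNet lemmatizer are illustrative placeholders, since the study processes Indonesian text with resources not specified here.

```python
# Minimal sketch of the pre-processing pipeline (steps a-f); the stop-word
# list and lemmatizer are illustrative -- for Indonesian text a library such
# as Sastrawi would typically replace them.
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words("english"))  # placeholder; swap for Indonesian

def preprocess(text: str) -> list[str]:
    text = text.lower()                                                # (a) case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # (b) punctuation removal
    text = re.sub(r"\d+", "", text)                                    # (c) number removal
    tokens = word_tokenize(text)                                       # (d) tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]                # (e) stop-word removal
    return [LEMMATIZER.lemmatize(t) for t in tokens]                   # (f) lemmatization

print(preprocess("The 2 models were evaluated on short, subjective texts."))
```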

3.2. Corpus Construction

To prevent potential data leakage during the weighting process, the corpus was divided into distinct training and testing subsets before calculating any inverse document frequency (IDF) or BM25 statistics. The IDF and BM25 parameters were computed exclusively on the training subset to ensure that test pairs did not influence the global term weighting. This approach follows standard practices in information-retrieval research, maintaining statistical independence between training and evaluation data and thus ensuring the validity of performance comparisons [38,39,40].
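As an illustration of this leakage-safe protocol, the sketch below fits IDF and BM25 statistics on a training subset only, using scikit-learn and the rank_bm25 package referenced in Section 3.3; the toy corpus and query are hypothetical.

```python
# Illustrative sketch of leakage-safe weighting: IDF and BM25 statistics are
# fitted on the training subset only, then applied to test texts.
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["institutional self evaluation text", "perception report narrative"]
test_docs = ["short evaluative answer"]

# TF-IDF: fit on training data only, then transform the test split.
tfidf = TfidfVectorizer(sublinear_tf=True)
tfidf.fit(train_docs)                    # IDF derived from the training subset alone
test_vectors = tfidf.transform(test_docs)

# BM25: index built exclusively from tokenized training documents.
bm25 = BM25Okapi([d.split() for d in train_docs], k1=1.2, b=0.75)
scores = bm25.get_scores("evaluation text".split())  # query against training index
print(scores)
```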

3.3. Data Transformation

Data transformation in this study was systematically designed not only to generate hybrid representations but also to establish reproducible baselines. Raw texts were first converted into vector formats using both standalone lexical models (TFIDF-only, BM25-only) and classical distributional embeddings (Word2Vec and FastText without weighting) to provide fair points of comparison [36]. Word2Vec was retained for its proven capability in capturing syntactic and semantic proximity through contextual co-occurrence, as supported by previous evaluations reporting correlations above 0.80 with human judgment [40,41,42]. These baseline representations ensured that performance gains observed in hybrid models could be attributed to the integration of weighting mechanisms rather than confounding architectural variability.
FastText, an extension of Word2Vec, adopts a sub-word n-gram approach to generate word vectors. This technique enhances semantic representation capabilities, particularly for out-of-vocabulary (OOV) words, by constructing representations based on morphological similarities between word fragments [43,44]. This makes FastText particularly effective for processing languages with complex morphology. In addition to distributional models, we employed transformer-based architectures using specific checkpoints to ensure reproducibility. For contextual encoding, we used BERT via the bert-base-multilingual-cased model, which applies masked language modeling and next-sentence prediction to capture bidirectional semantic dependencies [45,46,47]. For autoregressive generation-based embeddings, we used GPT through the gpt2-medium checkpoint, trained to predict sequential tokens and generate rich sentence-level contextual representations [48,49]. Both models were used in static, pre-trained form to provide consistent baselines across embedding families.
In this study, transformer-based encoders were used in static mode to establish a consistent comparative baseline with classical embeddings. Specifically, sentence representations were derived using bert-base-multilingual-cased and gpt2-medium, with no domain-specific fine-tuning applied. Each model produced contextual embeddings at the token level, from which sentence vectors were computed using mean pooling over the final hidden layer. This configuration isolates representational capacity without the confounding effects of supervised adaptation, as further discussed in Section 5.
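A sketch of this static encoding step is shown below, using the HuggingFace Transformers API with the bert-base-multilingual-cased checkpoint named above; masked mean pooling over the final hidden layer is applied as described, and the example sentence is hypothetical.

```python
# Sketch of the static sentence-encoding step: mean pooling over the final
# hidden layer of a frozen pre-trained checkpoint, with no fine-tuning.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()  # static mode: no gradient updates

def encode(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pooled vector

vec = encode("Teks evaluasi institusi.")
print(vec.shape)  # torch.Size([1, 768])
```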
To construct hybrid representations, token-level weighting was applied prior to sentence aggregation. Given a token embedding $e_i$ and its associated weight $w_i$ from TFIDF, BM25, or POS weighting, we computed a transformed vector using Equation (1):
$$\tilde{e}_i = w_i \, e_i \qquad (1)$$
TFIDF weights were obtained using scikit-learn’s implementation, while BM25 was computed with RankBM25 (k1 = 1.2, b = 0.75) using statistics derived exclusively from the training corpus to avoid data leakage. Part-of-speech (POS) weights were assigned using Stanza v1.6 (Indonesian UD), with class coefficients set as NOUN = 1.5, VERB = 1.3, ADJ = 1.2, OTHER = 1.0, reflecting the semantic contribution of each syntactic role.
Final sentence embeddings were generated via weighted mean pooling, followed by L2 normalization to ensure comparability between models of different dimensionalities. This transformation pipeline enhanced the semantic robustness of the vector space and ensured compatibility with downstream cosine similarity and RMSE evaluation, thereby enabling consistent hybridization across embedding families.
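The following NumPy sketch illustrates this hybridization step: token embeddings are scaled by their weights as in Equation (1), mean-pooled, and L2-normalized; the embedding matrix and weight values are hypothetical placeholders.

```python
# Minimal sketch of the hybridization step: e~_i = w_i * e_i, weighted mean
# pooling, then L2 normalization of the sentence vector.
import numpy as np

def hybrid_sentence_vector(token_embeddings: np.ndarray,
                           token_weights: np.ndarray) -> np.ndarray:
    """token_embeddings: (n_tokens, dim); token_weights: (n_tokens,) from
    TFIDF, BM25, or POS weighting."""
    weighted = token_embeddings * token_weights[:, None]  # Equation (1)
    pooled = weighted.sum(axis=0) / token_weights.sum()   # weighted mean pooling
    return pooled / np.linalg.norm(pooled)                # L2 normalization

emb = np.random.default_rng(0).normal(size=(4, 300))  # four 300-d token vectors
w = np.array([1.5, 1.0, 1.3, 1.2])                    # e.g., POS class coefficients
print(hybrid_sentence_vector(emb, w).shape)           # (300,)
```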
The final output of the transformation pipeline consists of standardized sentence embeddings, normalized to ensure compatibility across models of different dimensionalities and evaluated using both cosine similarity and error-based metrics (RMSE, MAE). By explicitly contrasting lexical baselines with hybrid configurations, this study provides a controlled experimental matrix that isolates the contribution of weighting and contextual encoding. This structured design enables transparent reproducibility and supports the validity of subsequent comparative analyses presented in Section 4.

3.4. Building an Embedding and Word Weighting Model

a. Cosine Similarity
The Cosine Similarity model was employed to compute the degree of similarity between pairs of texts based on the data transformed through the hybrid integration of word embedding and weighting techniques. A combination of various embedding, weighting, and Cosine Similarity approaches was selected to strengthen the contextual features extracted from the text [41,42]. The fundamental formula for Cosine Similarity is expressed as Equation (2):
$$\cos\theta = \frac{A \cdot B}{\|A\| \cdot \|B\|} \qquad (2)$$
where A·B denotes the dot product of vectors A and B, and ‖A‖ and ‖B‖ represent the norms (magnitudes) of each vector. The cosine value represents the angle (θ) between two vectors. Cosine Similarity values range from 0 to 1, where values closer to 1 indicate greater directional similarity between the two vectors, while values near 0 indicate significant differences [50,51].
The primary advantage of this method is its ability to capture vector orientation without considering the vector magnitude, making it ideal for representing sparse, high-dimensional textual data [3]. Cosine Similarity was generally used in conjunction with text representation techniques such as TFIDF or word embeddings to improve accuracy in semantic similarity measurements between documents or sentences [44].
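As a minimal illustration, Equation (2) reduces to a normalized dot product; the vectors below are hypothetical.

```python
# Cosine similarity as in Equation (2): the dot product of two vectors
# divided by the product of their norms.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])
print(round(cosine_similarity(a, b), 4))
```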
To generate optimal vector representations, word-embedding models are trained on large text corpora to accurately capture the semantic relationships between words in multidimensional space [46,52,53]. Each word was represented as a vector in that space, and semantic relations between words were then computed using metrics such as cosine similarity.
Embedding representations were generated using the Word2Vec and FastText algorithms, which adopt two main architectures: Continuous Bag of Words (CBOW) and skip-gram [43,44]. As illustrated in Figure 1, CBOW predicts a target word from its surrounding context, whereas skip-gram predicts the context from a target word. The dimensionality of the embeddings can be adjusted to enhance semantic accuracy depending on the training parameters [43,45,46,47,48]. Additionally, transformer-based models such as BERT and GPT were employed to generate complex and contextualized vector representations [47]. In this study, both BERT and GPT were used as static, pre-trained encoders without domain-specific fine-tuning, serving as baselines for evaluating contextual representation quality under consistent experimental settings. To enhance semantic diversity and comparability, their pooled sentence embeddings were integrated at the representation level with Word2Vec vectors through concatenation and normalization. This integration was designed to combine the contextual richness of BERT and GPT with Word2Vec’s distributional stability while maintaining methodological consistency across all embedding–weighting configurations [48].
In the CBOW architecture, the arrows toward the SUM box indicate that context word embeddings are aggregated to predict a target word. In contrast, in the Skip-gram architecture, the branching arrows show that a single target word embedding predicts multiple surrounding context words without requiring aggregation.
Following the embedding process, additional features were extracted using three weighting methods: no weighting (N-W), part-of-speech weighting (POS-W), and inverse document frequency weighting (IDF-W) [49,54].
The no-weighting approach converts each word in a sentence into a vector without applying weights related to word frequency or importance in the corpus [55,56]. Sentence representations are then formed by summing or averaging word vectors [51]. While this method is simple, it lacks the ability to capture complex semantic nuances, as all words are treated equally. Therefore, N-weighting is often used as a baseline in preliminary experiments [52], as shown in Equations (3) and (4):
$$V_A = \sum_{i=1}^{m} \mathrm{word\_vector}(W_{Ai}) \qquad (3)$$
$$V_B = \sum_{j=1}^{n} \mathrm{word\_vector}(W_{Bj}) \qquad (4)$$
The inverse document frequency (IDF) weighting model assigned higher weights to words that appeared less frequently across documents [57,58]. This technique improved semantic representation by adjusting embedding vectors according to the actual informational value of each word [59,60], as shown in Equation (5):
$$\mathrm{idf}(w_t) = \log \frac{N}{\mathrm{df}(w_t)} \qquad (5)$$
Meanwhile, the part-of-speech weighting (POS-W) model assigns weights based on the grammatical category of words such as nouns, verbs, and adjectives [57,61]. This approach is relevant in NLP because grammatical structures influence sentence semantics [54]. For instance, nouns and verbs are often more informative and thus receive higher weights than conjunctions or articles [3]. The implementation of POS-W has been shown to enhance the semantic quality of sentence representations in similarity measurements [56]. This is shown in Equations (6) and (7):
$$S_A = \mathrm{POSw}_{A1}, \mathrm{POSw}_{A2}, \mathrm{POSw}_{A3}, \ldots, \mathrm{POSw}_{Am} \qquad (6)$$
$$S_B = \mathrm{POSw}_{B1}, \mathrm{POSw}_{B2}, \mathrm{POSw}_{B3}, \ldots, \mathrm{POSw}_{Bn} \qquad (7)$$
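A sketch of this POS-weighting scheme is given below, using Stanza’s Indonesian pipeline and the class coefficients stated in Section 3.3 (NOUN = 1.5, VERB = 1.3, ADJ = 1.2, OTHER = 1.0); the example sentence is hypothetical and the model must be downloaded on first use.

```python
# Sketch of POS weighting (Equations (6)-(7)) with Stanza's Indonesian UD
# pipeline; each token receives a weight based on its universal POS tag.
import stanza

stanza.download("id", verbose=False)  # Indonesian UD model (first run only)
nlp = stanza.Pipeline("id", processors="tokenize,pos", verbose=False)

POS_WEIGHTS = {"NOUN": 1.5, "VERB": 1.3, "ADJ": 1.2}  # OTHER defaults to 1.0

def pos_weights(text: str) -> list[tuple[str, float]]:
    doc = nlp(text)
    return [(word.text, POS_WEIGHTS.get(word.upos, 1.0))
            for sent in doc.sentences for word in sent.words]

print(pos_weights("Universitas meningkatkan kualitas layanan akademik."))
```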
b. Proposed Method
This study adopts a quantitative experimental approach using a comparative framework to evaluate hybrid embedding–weighting configurations [37,62]. To ensure reproducibility and fair comparison, all models were implemented under controlled experimental settings with fixed random seeds and uniform preprocessing protocols. The primary objective was to assess the effectiveness of distinct combinations of word weighting (TFIDF, BM25, POS, and N-weighting) and embedding families (Word2Vec, FastText, BERT, and GPT) in measuring perception-based text similarity [21,33].
The integration of embedding with weighting techniques is believed to generate more meaningful vector representations, particularly in semantic tasks such as perception assessment [33]. Cosine similarity was used as the primary metric for measuring similarity between text vectors, considering its extensive application in prior semantic studies [24,28]. The experimental approach followed best practices in NLP research, including clearly defined variables, operational validity, and a systematic controlled experimental design.
To support the measurement of text similarity in perception assessment, this study proposes a hybrid embedding with weighting and cosine similarity (HyEWCos) framework. This framework was designed to explore optimal combinations of embedding techniques and word-weighting schemes to improve the accuracy of semantic similarity measurements. The HyEWCos model integrates two types of word-representation approaches: predictive-based embedding and statistical-based weighting.
It is important to note that both BERT and GPT models were used in their pre-trained forms without domain-specific fine-tuning. This design choice was made to maintain comparability across all embedding techniques within a unified experimental framework. However, we acknowledge that the absence of fine-tuning likely constrained their ability to fully capture domain-specific nuances of educational evaluation texts.
The embedding techniques used included Word2Vec in two variants: Continuous Bag of Words (CBOW) and skip-gram, as well as FastText, which extended Word2Vec by handling word morphology [58]. In addition to distributional embeddings, transformer-based encoders were employed to provide contextualized representations. Specifically, BERT was implemented using the bert-base-multilingual-cased checkpoint (12 layers, 768 hidden units), while GPT was represented through the gpt2-medium model (24 layers, 345 M parameters). These models were adopted in their static, pre-trained configurations to ensure a reproducible baseline, allowing comparison of representational capacity without the confounding influence of fine-tuning.
Each embedding method was combined with four word-weighting schemes—TF-IDF, BM25, POS-weighting, and N-weighting—to capture both statistical salience and syntactic contribution. For reproducibility, TFIDF weights were generated using scikit-learn with sublinear term frequency; BM25 weights were computed using RankBM25 (k1 = 1.2, b = 0.75); and POS weights were assigned using Stanza (Indonesian UD), with coefficients NOUN = 1.5, VERB = 1.3, ADJ = 1.2, OTHER = 1.0. These weights were applied at the token level prior to mean pooling, producing hybrid vectors that preserve both semantic context and lexical importance [37,63].
For classical embedding algorithms (Word2Vec, FastText), the CBOW and Skip-gram architectures were implemented as standard predictive training schemes [60]. In contrast, for transformer-based models (BERT, GPT), these labels were retained only for structural consistency in the experimental matrix—they do not imply retraining the transformer models under CBOW/Skip-gram objectives. Both transformers were employed as static sentence encoders using fixed checkpoints (bert-base-multilingual-cased and gpt2-medium) and processed via mean pooling over the final hidden layer. To ensure reproducibility, all transformer embeddings were generated using HuggingFace Transformers (v4.x) with fixed random seeds, default tokenizers, and no vocabulary modification, preserving their native subword handling (WordPiece for BERT and BPE for GPT). This configuration isolates representational capacity across model families while preserving inherent architectural distinctions [10,11]. In total, 32 hybrid models were proposed and are summarized in Table 1.
This study proposed a framework consisting of 32 perception-based text-similarity models evaluated across three sequential phases, as illustrated in Figure 2. In the first phase, raw institutional evaluation texts were collected and manually segmented into 48 semantic pairs, each assigned a unique identifier. These texts were stored in a structured tabular format (CSV) with fields including text_A, text_B, and expert_score to ensure traceability. Preprocessing was conducted using Python 3.12.12 with Pandas and the Natural Language Toolkit 3.9.1, involving case normalization, removal of non-linguistic characters, and Unicode standard cleaning to eliminate noise. To prevent data leakage, the corpus was split into training and evaluation subsets before computing TF-IDF and BM25 statistics, and the final cleaned dataset was stored as a reproducible digital corpus for downstream embedding and weighting experiments [22,64,65,66].
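As an illustration of this first phase, the sketch below loads text pairs with the fields named above (text_A, text_B, expert_score) and performs the leakage-preventing split; the inline data, file name, and split ratio are assumptions for demonstration.

```python
# Hypothetical sketch of corpus loading and the leakage-safe train/eval split.
import pandas as pd
from sklearn.model_selection import train_test_split

# In practice: pairs = pd.read_csv("evaluation_pairs.csv")
pairs = pd.DataFrame({
    "text_A": ["teks evaluasi satu", "teks evaluasi dua",
               "teks evaluasi tiga", "teks evaluasi empat"],
    "text_B": ["teks pembanding satu", "teks pembanding dua",
               "teks pembanding tiga", "teks pembanding empat"],
    "expert_score": [3, 1, 4, 2],
})

# TF-IDF and BM25 statistics are later fitted on train_pairs only, so that
# evaluation texts cannot influence the global term weighting (Section 3.2).
train_pairs, eval_pairs = train_test_split(pairs, test_size=0.25, random_state=42)
print(len(train_pairs), len(eval_pairs))
```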
In the second phase, word vectorization was conducted using pre-trained distributional models implemented in Gensim, specifically Word2Vec (300 dimensions, CBOW with window size = 5, min_count = 2) and FastText (skip-gram with subword n-grams, 300 dimensions). For reproducibility, all models were initialized with a fixed random seed and trained exclusively on the training corpus to prevent data leakage [67,68,69].
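The following Gensim sketch mirrors this vectorization setup (Word2Vec CBOW and FastText skip-gram, 300 dimensions, window 5, min_count 2, fixed seed); the tiny training corpus is a hypothetical placeholder, and workers=1 is used because Gensim’s seeding is only reproducible in single-worker mode.

```python
# Sketch of the distributional training setup: Word2Vec (CBOW) and FastText
# (skip-gram with subword n-grams), trained on the training corpus only.
from gensim.models import FastText, Word2Vec

train_corpus = [["evaluasi", "mutu", "internal"],
                ["evaluasi", "mutu", "internal"],
                ["laporan", "kinerja", "dosen"],
                ["laporan", "kinerja", "dosen"]]  # toy corpus; words meet min_count

w2v = Word2Vec(sentences=train_corpus, vector_size=300, window=5,
               min_count=2, sg=0, seed=42, workers=1)          # sg=0 -> CBOW
ft = FastText(sentences=train_corpus, vector_size=300, window=5, min_count=2,
              sg=1, min_n=3, max_n=6, seed=42, workers=1)      # sg=1 -> skip-gram

print(w2v.wv.vector_size, ft.wv.vector_size)
```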
Following vector generation, additional lexical features were derived using three weighting strategies: No-weighting (N-W), TFIDF weighting (via Scikit-learn), and part-of-speech weighting (via Stanza v1.6, Indonesian UD). Each transformed embedding was stored as a serialized NumPy vector for consistent reuse across experiments [65,70,71].
In the final hybridization stage, embeddings and weights were combined through weighted mean pooling with L2 normalization, yielding 16 controlled model variants. These hybrid vectors were used in the HyEWCos framework to compute cosine similarity and RMSE-based error, enabling systematic comparison of embedding–weighting interactions across perception-based text assessments [63,64].
This approach conceptually follows the principles of an information retrieval system, emphasizing efficiency and accuracy in detecting textual similarity across documents [4]. Therefore, the proposed framework was expected to make a significant contribution to the development of evaluation systems based on natural language processing, particularly in the context of perception-based text assessments.

3.5. Model Evaluation

The evaluation dataset used in this study consisted of 48 pairs of short, perception-based assessment texts collected from institutional self-evaluation documents in the higher education accreditation domain. All responses were written in Indonesian, with lengths ranging from 18 to 65 words (median: 32 words), capturing authentic variations in narrative style, subjectivity, and semantic expression. These text pairs were curated to represent different levels of semantic overlap, including low, moderate, and high similarity cases.
The ground truth similarity scores were established through a structured annotation process conducted by three domain experts—two specializing in educational assessment and one in computational linguistics. Each expert independently rated the semantic similarity of text pairs using a five-point ordinal scale (1 = completely dissimilar; 5 = semantically equivalent). Prior to annotation, all experts were provided with a guideline document defining evaluation criteria such as thematic alignment, paraphrase recognition, and permissible lexical divergence.
To ensure annotation reliability, we calculated weighted Cohen’s Kappa (κ = 0.83) and Intraclass Correlation Coefficient, ICC(3, k) = 0.87, both indicating strong agreement. Any cases with rating discrepancies greater than two scale points were resolved through moderated consensus discussion, ensuring that the final score reflected a shared semantic interpretation rather than simple averaging. This protocol aligns with established practices in human-judgment-based text similarity benchmarking [61]. Following the establishment of expert reference scores, model predictions were evaluated using Root Mean Square Error (RMSE) to penalize extreme deviations and Mean Absolute Error (MAE) to measure average absolute deviation. The formulas used to compute these metrics are presented in Equations (8) and (9).
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - Y_i\right)^2} \qquad (8)$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_i - Y_i\right| \qquad (9)$$
n denotes the total number of evaluated text pairs; Xᵢ represents the similarity score predicted by the model for the i-th item, and Yᵢ refers to the expert-assigned reference score. The term (Xᵢ − Yᵢ) captures the residual error between the model output and human judgment. To obtain a robust performance estimate, RMSE and MAE were first computed for each evaluation pair (D1–D2, D1–D3, and D1–D4) and then averaged across all pairs [17,18]. Furthermore, to balance the sensitivity to large deviations (captured by RMSE) and average error magnitude (captured by MAE), a composite score was calculated using a weighted linear combination. The combined evaluation formula is as follows in Equation (10):
$$\mathrm{Error\ Score} = w_1 \cdot \mathrm{RMSE} + w_2 \cdot \mathrm{MAE} \qquad (10)$$
The weight parameters satisfy $w_1 + w_2 = 1$, allowing flexible emphasis between the two metrics (e.g., 0.6/0.4 or 0.4/0.6). This composite metric provides a more stable evaluation of model performance than relying on a single error measure alone [65]. The Pearson correlation coefficient was used to measure the strength of the linear relationship between the model-predicted and expert-assigned scores. Its value ranges from −1 to +1. The general formula for the Pearson coefficient is as follows in Equation (11):
$$r = \frac{n\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)}{\sqrt{\left[n\sum X_i^2 - \left(\sum X_i\right)^2\right]\left[n\sum Y_i^2 - \left(\sum Y_i\right)^2\right]}} \qquad (11)$$
The variables $X_i$ and $Y_i$ represent the model and expert scores, respectively, while $r$ denotes the Pearson correlation coefficient. Each pair (D1–D2, D1–D3, D1–D4) was evaluated to compute the individual correlation coefficients, which were then averaged to provide an overall picture of the linear relationship [3], as shown in Equation (12).
$$\bar{r} = \frac{r_1 + r_2 + r_3}{3} \qquad (12)$$
Subsequently, Spearman’s rank correlation was used to measure the monotonic relationship between two variables based on the ranking order of the scores. This approach is relevant when the relationship between variables is nonlinear but still maintains a consistent directional trend. The Spearman formula is as follows (as shown in Equation (13)):
$$\rho = 1 - \frac{6\sum d_i^2}{n\left(n^2 - 1\right)} \qquad (13)$$
where $d_i$ represents the difference in rank between the model score and the expert score, and $n$ denotes the number of items/data points. The standard deviation (STD) was used to measure the degree of dispersion of the predictions relative to the reference scores. This metric reflects the stability of a model’s performance in producing consistent prediction scores, as shown in Equation (14).
$$\sigma = \sqrt{\frac{\sum \left(x_i - \bar{x}\right)^2}{n}} \qquad (14)$$
In the context of model performance evaluation, the standard deviation was used to identify the degree of deviation between the predictions and reference values, thereby providing insight into the stability of the model’s performance. To obtain a more comprehensive evaluation of model performance, the three metrics were combined into a final score using a weighted approach. The final scoring formula used in this study is as follows in Equation (15):
$$\mathrm{Final\ Score} = (w_1 \times |r|) + (w_2 \times |\rho|) + (w_3 \times \mathrm{STD\ score}) \qquad (15)$$
where $w_1$, $w_2$, and $w_3$ represent the weights; $r$ denotes the average Pearson correlation; $\rho$ indicates the average Spearman correlation; and the STD score is defined in Equation (16):
$$\mathrm{STD\ score} = 1 - \frac{\sigma}{\sigma_{\max}} \qquad (16)$$
The weights (0.4 for Pearson correlation, 0.4 for Spearman correlation, and 0.2 for standard deviation) were selected based on empirical balancing between correlation-based accuracy and consistency stability. Similar proportional weighting has been adopted in multi-criteria text similarity evaluations where correlation metrics are prioritized for semantic agreement while variance measures ensure stability across test sets [30]. Sensitivity testing revealed that minor changes (±0.05) in the weights did not alter the relative ranking of the models, confirming that the chosen configuration provides a stable and interpretable composite score.
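A consolidated sketch of the evaluation pipeline in Equations (8)–(16) is given below, using NumPy and SciPy; the score arrays, σ_max value, and the 0.6/0.4 error weighting are illustrative choices consistent with the options stated above.

```python
# Sketch of the evaluation metrics: RMSE/MAE and their weighted combination,
# Pearson and Spearman correlations, and the final composite score with the
# stated weights (0.4, 0.4, 0.2). Inputs are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def error_score(x, y, w1=0.6, w2=0.4):
    x, y = np.asarray(x, float), np.asarray(y, float)
    rmse = np.sqrt(np.mean((x - y) ** 2))       # Equation (8)
    mae = np.mean(np.abs(x - y))                # Equation (9)
    return w1 * rmse + w2 * mae                 # Equation (10), w1 + w2 = 1

def final_score(x, y, sigma_max, w=(0.4, 0.4, 0.2)):
    x, y = np.asarray(x, float), np.asarray(y, float)
    r, _ = pearsonr(x, y)                       # Equation (11)
    rho, _ = spearmanr(x, y)                    # Equation (13)
    sigma = np.std(x)                           # Equation (14): dispersion of predictions
    std_score = 1 - sigma / sigma_max           # Equation (16)
    return w[0] * abs(r) + w[1] * abs(rho) + w[2] * std_score  # Equation (15)

model = [3.2, 2.8, 3.9, 3.1]
expert = [3.0, 2.0, 4.0, 3.0]
print(round(error_score(model, expert), 4))
print(round(final_score(model, expert, sigma_max=1.0), 3))
```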

3.6. Evaluation Result Analysis

The evaluation results were then analyzed to identify the combination of embedding and weighting methods that yielded the highest accuracy in measuring perception. The analysis was conducted quantitatively by examining the correlation values and significance of the test results, as well as qualitatively by assessing the characteristics of each method. Particular attention has been paid to the relationship between semantic representation and syntactic structure within the hybrid model [27]. An in-depth analysis was also conducted to determine the extent to which embedding and weighting techniques influence text similarity outcomes [20,59].

4. Results

4.1. Model Performance and Consistency Evaluation

Based on the research data presented in Table 2, the performance of the proposed text similarity measurement models was influenced by the type of embedding (Word2Vec, FastText, BERT, GPT), training architecture (Continuous Bag of Words (CBOW) or skip-gram (SG)), and the weighting methods applied (TFIDF, N-weighting, and POS-weighting). In this study, 32 models were proposed, several of which demonstrated superior performance. The combinations of GPT with TFIDF and either the Skip-Gram or CBOW architecture yielded the highest average text similarity scores, with values of 3.9815 for GPT + CBOW + TFIDF and 3.9812 for GPT + SG + TFIDF. Both combinations also exhibited very low score dispersion, with a standard deviation of 0.0133, reflecting high consistency in the similarity assessment across document pairs (P1, P2, P3, and P4).
Furthermore, the BERT-based models also showed good and stable performance, although they did not surpass the results of the GPT-based models. The best-performing BERT configuration, BERT + SG + TFIDF, achieved an average similarity score of 3.5 with the lowest standard deviation of 0.3, indicating good stability in document-level assessments. Meanwhile, FastText (FT)-based models produced average similarity scores ranging from 3.30 to 3.52, but with relatively higher standard deviations, between 0.6 and 0.8. In contrast, Word2Vec (W2V)-based models recorded the lowest similarity scores, ranging from 2.84 to 3.08, with an average standard deviation of 0.8, indicating greater inconsistency in evaluating text similarity.
These findings indicate that GPT and BERT configurations achieved relatively high internal consistency (i.e., low dispersion of predicted scores), particularly when combined with TFIDF weighting. However, their absolute agreement with human judgements remained only moderate, with correlation values generally between 0.30 and 0.40, suggesting that these models may systematically overestimate similarity in certain semantic contexts. By contrast, FastText-based configurations demonstrated greater variability across document pairs, while Word2Vec showed difficulty in capturing subtle thematic nuances, particularly in cases involving paraphrasing or implicit negation.
Rather than identifying a single dominant approach, these results emphasize the inherent trade-offs between stability and semantic sensitivity across embedding–weighting combinations. To address this, 95% confidence intervals and pairwise statistical tests are reported to clarify the significance of performance differences. Moreover, an error-focused analysis (Section 4.2 and Section 4.3) examines representative false positive and false negative cases, providing insight into the specific failure modes associated with each configuration. Accordingly, GPT with TFIDF is not presented as universally superior, but rather as one of the more consistent configurations in terms of score stability, while still exhibiting limitations in absolute alignment with human perception.

4.2. Evaluation Based on Combined Correlation and Standard Deviation Scores

Table 3 presents the comparative performance of 12 selected models out of the 32 proposed configurations, combining various embedding techniques, training architectures, and word-weighting schemes for perception-based text similarity measurement. Model performance was assessed using the Pearson and Spearman correlation coefficients with 95% confidence intervals, alongside standard deviation (STD) to capture score stability across text pairs.
To ensure a more rigorous comparison, we additionally conducted pairwise statistical tests (paired t-test and Wilcoxon signed-rank test) across identical document pairs and adjusted p-values using the Holm method. The final composite score was derived by integrating correlation strength and stability, but is interpreted with caution due to the moderate effect sizes observed, rather than absolute superiority of any single configuration.
The dimensionality of the embedding vectors differed across models due to each model’s inherent configuration and prior empirical standards. Word2Vec and FastText were implemented with a 150-dimensional vector size, as commonly adopted in medium-scale semantic similarity studies [64], while transformer-based embeddings (BERT and GPT) produced 128-dimensional representations through mean pooling of contextual layers. The focus of this work was on benchmarking representative hybrid configurations rather than hyperparameter optimization; therefore, an ablation study on vector dimensionality was not conducted. This methodological clarification ensures the consistency of model comparison without introducing bias from arbitrary parameter selection.
The FT + SG + BM25 model achieved the highest ranking, with an average final score of 0.568. This score was derived from an average Pearson correlation of 0.749, a Spearman correlation of 0.622, and a relatively stable error value (STD = 0.903). Meanwhile, the second-best performance was obtained by four models—GPT + CBOW + N, GPT + SG + TFIDF, GPT + CBOW + TFIDF, and BERT + CBOW + TFIDF—each scoring between 0.524 and 0.529. These results indicate that the combination of FastText (FT), the Skip-Gram (SG) architecture, and BM25 weighting produces the most stable and consistent semantic representations. All five models demonstrated that FastText with the Skip-Gram architecture offers superior performance stability compared to other embedding methods used in this experiment.
Conversely, the W2V-based model combined with the CBOW architecture and POS weighting demonstrated significantly lower performance, with an average final score of 0.148 and Pearson and Spearman correlation values of 0.375 and 0.468, respectively. These findings indicate that the semantic representations produced by the W2V + CBOW + POS model under this configuration failed to capture the relevant meanings required for accurate text similarity assessment.
Interestingly, the BERT + SG + TFIDF based model demonstrated more competitive performance compared to the W2V + CBOW + POS model. Although the Pearson and Spearman correlation values were relatively low (approximately 0.152 and 0.475, respectively), the model exhibited a very small standard deviation (1.947). This indicates that BERT was able to surpass the performance of the W2V + CBOW + POS configuration, although it still could not match the best performance achieved by the FT + SG + BM25 model.
Overall, the identified performance patterns indicated that models based on FastText with Skip-Gram architecture and BM25 offered the most effective approach for measuring semantic text similarity. In contrast, W2V + CBOW + POS representations proved to be less effective in the configurations tested. These findings underscore the importance of selecting appropriate combinations of embedding techniques, training architectures, and weighting schemes aligned with the characteristics of the dataset and goals of perception-based text similarity analysis.
The analysis of the experimental results reveals that model performance differences are closely related to the linguistic representation and weighting interaction. Word2Vec combined with CBOW and TFIDF achieved the lowest MAE and RMSE because TFIDF effectively captures lexical importance in short evaluative sentences, while CBOW provides stable contextual averaging that reduces semantic noise. Conversely, FastText with Skip-gram and BM25-weighting produced stronger correlations with expert judgments, as subword embeddings and syntactic weighting improved semantic alignment in morphologically rich Indonesian text.
Transformer-based models (BERT, GPT) exhibited higher stability and score consistency but lower correlation, largely due to the absence of fine-tuning on domain-specific data, which limits contextual adaptation in perception-based similarity tasks. Compared to state-of-the-art benchmarks such as SimCSE, which reports Spearman ≈ 0.816 on standard STS tasks, and SimCSE++ [68,72,73], which improves contrastive embeddings via advanced regularization, the correlations achieved in this study (Pearson ≈ 0.33–0.40) are significantly lower. This difference may be due to domain mismatch (short evaluative texts in Indonesian), the absence of fine-tuning on domain data, and inherent variability in human-assigned expert scores.

4.3. Evaluation of Models Using RMSE, MAE, and Correlation Metrics

Based on the data in Table 4, model performance was assessed using root mean square error (RMSE), mean absolute error (MAE), and correlation metrics, supplemented by 95% confidence intervals derived through bootstrap resampling to reflect the uncertainty of each estimate. While W2V + CBOW + TFIDF yielded the lowest combined error (RMSE and MAE ≈ 0.99) and moderate correlations (Pearson = 0.524; Spearman = 0.543), these values indicate partial rather than strong agreement with expert judgments.
To determine whether performance differences were statistically meaningful, we additionally conducted paired significance tests (paired t-test and Wilcoxon signed-rank test) across identical text pairs. The results show that improvements over FastText-based models (e.g., FT + CBOW + TFIDF) were not consistently significant after Holm correction, suggesting that model ranking should be interpreted with caution [74,75,76].
Furthermore, residual analysis and calibration diagnostics reveal systematic underestimation for low-similarity pairs and overestimation at the upper range, indicating the need for post hoc score calibration rather than claiming outright superiority of any configuration.
The second- and third-best-performing models were those employing FastText (FT) with the CBOW architecture combined with the TFIDF and BM25 weighting schemes, respectively. Both recorded total error values of 1.731. Although their error levels were slightly higher than those of the best-performing W2V model, they achieved higher Spearman correlation scores (0.334 and 0.524, respectively).
This suggests that although their numeric similarity estimates were less precise, the ordinal perception of similarity among documents was better preserved. Therefore, the FastText + CBOW approach remains a viable option, especially in contexts where maintaining ordinal relationships among texts is essential.
In contrast, a notable decline in performance was observed for models employing part-of-speech (POS) weighting and N-weighting schemes. Models such as W2V + CBOW + POS and W2V + CBOW + N recorded error values of 1.1815 and 1.2217, respectively, and significantly reduced correlation scores. This indicates that weighting based on grammatical categories or specific word frequencies was less effective in capturing the semantic representation of vector-based text similarity. This ineffectiveness can be attributed to a mismatch between the applied linguistic weights and the actual semantic context of the documents being analyzed.
The lowest performance scores were recorded by models based on Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT). Although these models are widely recognized for their ability to generate complex semantic representations, they fail to deliver optimal results in this particular experimental context. This underperformance could be due to a misalignment between the architecture or configuration of the models and specific characteristics of the perception-based data used in the study.
Overall, the resulting performance patterns suggested that classical embedding techniques such as W2V and FT, when combined with statistical weighting schemes, such as TFIDF and BM25, remained the most consistent and accurate approach for measuring text similarity. In contrast, linguistic-based weighting schemes and large generative models do not yield satisfactory outcomes. These findings emphasize the importance of selecting representation and weighting strategies that align with the structural and semantic characteristics of domain-specific data under investigation.

4.4. Comparison of Assessment Text Similarity Scores

Based on Table 5, which compares model-predicted similarity scores with expert-assigned perception scores, we observe that different embedding–weighting configurations exhibit systematic bias patterns. For instance, W2V + SG remains closely aligned with expert judgments on the first pair (Expert = 3; Model = 2.95–3.45), indicating partial semantic alignment. However, BERT + CBOW consistently overestimates similarity for mid-range pairs (Expert = 2; Model ≈ 3.18), while GPT + SG slightly overpredicts upper-range scores (Expert = 3; Model ≈ 3.94–3.96).
To better understand these discrepancies, we incorporate residual analysis (Model − Expert) and calibration diagnostics across similarity bins. These analyses reveal that transformer-based models tend to compress the score range, producing higher predictions even when human raters assign low-to-moderate similarity. This behavior suggests a lack of calibrated alignment rather than superior semantic understanding.
Therefore, instead of interpreting these differences as evidence of model superiority, we emphasize the need for post hoc calibration (e.g., isotonic or logistic scaling) to improve consistency between cosine-based predictions and human perceptual scales. Detailed reliability plots and residual distributions are provided in Section 4.5 to contextualize these model behaviors within perception-based evaluation.
In contrast, for the third pair (D20-1 and D20-2), which was rated 2 by the expert, the BERT model with the CBOW architecture generated relatively high similarity scores of approximately 3.18. This value exceeded the expert-assigned score, indicating a tendency of BERT to overestimate semantic similarity when compared to manual human assessment.
For the fourth pair (D20-1 and D20-4), assigned a score of 3 by the expert, the GPT model with the Skip-gram architecture (GPT + SG) yielded similarity scores ranging from 3.94 to 3.96. These values were very close to the expert’s score, albeit slightly higher, demonstrating the effectiveness of GPT in capturing semantic relationships and textual similarities, particularly in policy texts related to institutional development.
In general, the patterns observed in the table indicate that GPT- and BERT-based embedding models tend to produce higher similarity scores, in some cases approaching or even exceeding expert judgments. In contrast, the FastText models produced lower similarity scores, consistent with document pairs judged less relevant. Variations across weighting techniques, such as TFIDF, POS-weighting, and N-weighting, contributed relatively consistently to the final similarity scores. However, POS-based weighting appeared slightly more sensitive in capturing semantic meaning, especially in texts with complex thematic issues.
These findings reinforce the importance of selecting appropriate embedding and weighting strategies to produce accurate, reliable, and contextually relevant measurements of text similarity in academic analysis.

4.5. Evaluation Results of Method Combinations in Text Similarity Measurement

Figure 3a presents the prediction error levels measured using the root mean square error (RMSE). While the W2V + CBOW + TFIDF configuration achieved the lowest RMSE (1.083), its margin over FT + CBOW + TFIDF and FT + CBOW + BM25 (1.110 and 1.118, respectively) was modest. Figure 3b shows a similar trend in mean absolute error (MAE), where W2V-based models again led by a small numerical margin. These findings indicate that classical distributional embeddings combined with TFIDF or BM25 can approximate expert judgments with reasonable accuracy.
To assess whether these differences reflect meaningful performance gaps rather than random variation, paired statistical tests (paired t-test and Wilcoxon’s signed-rank test) were conducted across all text pairs. The tests revealed that several performance differences between the top three configurations did not reach statistical significance after the Holm correction, suggesting that multiple models fall within overlapping error ranges rather than forming a strict hierarchy.
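The sketch below shows one way these paired tests and the Holm step-down correction can be computed, assuming per-pair absolute-error arrays for each configuration; the simulated arrays are illustrative only.

```python
# Sketch: paired t-test and Wilcoxon signed-rank test between configurations,
# with a Holm correction across the collected comparisons. Error arrays are
# simulated placeholders for the per-pair absolute errors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_pairs = 30
err = {
    "W2V+CBOW+TFIDF": rng.normal(0.99, 0.20, n_pairs),
    "FT+CBOW+TFIDF":  rng.normal(1.04, 0.20, n_pairs),
    "FT+CBOW+BM25":   rng.normal(1.05, 0.20, n_pairs),
}

base = err["W2V+CBOW+TFIDF"]
names, pvals = [], []
for name in ("FT+CBOW+TFIDF", "FT+CBOW+BM25"):
    names += [f"{name} (paired t)", f"{name} (Wilcoxon)"]
    pvals += [stats.ttest_rel(base, err[name]).pvalue,
              stats.wilcoxon(base, err[name]).pvalue]

# Holm step-down: sort p-values ascending, multiply the i-th by (m - i),
# enforce monotonicity, and cap at 1.
pvals = np.asarray(pvals)
order = np.argsort(pvals)
m = len(pvals)
adjusted = np.minimum(np.maximum.accumulate((m - np.arange(m)) * pvals[order]), 1.0)
for rank, idx in enumerate(order):
    print(f"{names[idx]}: raw p = {pvals[idx]:.4f}, Holm-adjusted p = {adjusted[rank]:.4f}")
```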
Furthermore, error-based rankings alone do not fully capture perceptual alignment. A complementary calibration and reliability analysis (conducted within this section) revealed that transformer-based models, particularly GPT variants, exhibit systematic bias, tending to overestimate high-similarity cases and underestimate low-similarity pairs. These calibration effects were visualized through residual plots and binned reliability diagrams, demonstrating that low error does not guarantee alignment with human scoring scales.
Therefore, RMSE and MAE should be interpreted primarily as indicators of numerical fit, while calibration quality, confidence intervals, and residual behavior are necessary to evaluate true perceptual validity. This integrated interpretation supports cautious model selection rather than definitive claims of model superiority.
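A binned reliability check of the kind referenced above can be computed directly, as sketched below: text pairs are grouped by expert rating, and the mean model score and mean residual are reported per bin, so that compression of the score range shows up as positive residuals in the low bins. The arrays are illustrative placeholders.

```python
# Sketch: binned reliability diagnostic comparing mean model predictions
# against expert rating levels. Arrays are illustrative placeholders.
import numpy as np

expert = np.array([1, 1, 2, 2, 3, 3, 3, 4, 4], dtype=float)
model  = np.array([1.4, 1.3, 3.2, 3.1, 3.9, 3.4, 3.0, 3.9, 4.0])

for level in np.unique(expert):
    mask = expert == level
    mean_pred = model[mask].mean()
    mean_resid = (model[mask] - level).mean()
    print(f"expert = {level:.0f}: mean model score = {mean_pred:.2f}, "
          f"mean residual = {mean_resid:+.2f}")
```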
The performance of the models was further examined by integrating RMSE and MAE into a unified ranking metric (Figure 4). Although W2V + CBOW + TFIDF achieved the lowest aggregate error, the observed differences with FT + CBOW + TFIDF and FT + CBOW + BM25 were numerically small and did not consistently reach statistical significance under paired comparison tests across all text pairs.
Rather than indicating a single dominant model, these results suggest that multiple configurations exhibit comparable performance within overlapping error margins. Furthermore, error-based rankings do not fully account for calibration quality: as the reliability analysis in Sections 4.4 and 4.5 revealed, transformer-based outputs show systematic biases in the high- and low-similarity regions.
Therefore, RMSE/MAE outcomes should be interpreted as indicators of numerical fit, while calibration diagnostics and confidence intervals are necessary to assess perceptual alignment. This integrated interpretation underscores the need for cautious model selection rather than conclusive claims of superiority.

5. Discussion

5.1. Proposed Model Performance in Text Similarity Measurement

Compared with previous studies, such as those in [37,77,78], this study stands out in terms of its methodological approach. It employs a case-based reasoning approach combined with cosine similarity to underline the role of weight configuration, examine the utility of Word2Vec in keyword extraction, and evaluate the capabilities of advanced transformer-based embeddings such as GPT and BERT. The experimental results revealed that the combination GPT + CBOW + TFIDF produced the highest and most stable similarity scores, with a standard deviation as low as 0.0133. These findings indicate that GPT was highly consistent in scoring the similarity between perception-based documents, although this stability did not coincide with the lowest prediction error reported in Section 4.5.
Furthermore, this study confirmed and expanded upon previous findings that highlighted cosine similarity as a robust normalization method and reliable evaluation framework across various representation scales [20,75]. The results demonstrated that cosine similarity becomes considerably more effective when combined with contextual embeddings and appropriate weighting techniques, such as TFIDF, yielding text similarity measurements that are both accurate and stable.
The scientific implications of these results are twofold. Transformer-based embeddings (GPT and BERT) combined with statistical weighting (TFIDF) produced the most stable similarity scores, while classical embeddings such as Word2Vec and FastText, paired with the same statistical weighting schemes, remained superior in minimizing error and correlating with expert judgment. The findings also reinforced the conclusions of [7] regarding the effectiveness of hybrid approaches in feature selection and anomaly detection while extending their applicability to perception-based text assessment contexts.
From a practical standpoint, the proposed model can be implemented in a wide range of real-world applications, including automated scoring systems in educational assessments, customer-opinion analysis in recommendation systems, and semantic-based plagiarism detection. These findings align with the view that cosine similarity must be integrated with advanced approaches to overcome its limitations as a standalone evaluation metric [76]. This study provided compelling evidence that such integration, particularly with contextual embeddings, could significantly enhance both the accuracy and stability of the model.
Overall, this research not only reaffirms the importance of selecting appropriate combinations of embedding and weighting techniques for perception-based text similarity measurement, but also opens new avenues for developing more accurate, stable, and semantically responsive automated text evaluation systems.

5.2. Analysis of Hybrid Embedding and Weighting Model Performance

Previous studies have explored the effectiveness of text-similarity measurement models in both general and domain-specific applications, including educational contexts. The optimization of linguistic features, such as vocabulary and sentence structure, plays a crucial role in composition scoring systems based on text similarity [19]. Additionally, the contextual adaptation of character- and vocabulary-based approaches has been emphasized as vital in the educational domain [76]. Another study demonstrated that hybrid approaches utilizing BERT and Siamese Bi-LSTM were more effective in capturing semantic meanings, particularly in subjective feedback data [73]. However, the effectiveness of these advanced semantic models still requires further validation, especially in expert evaluations, which tend to be concise, idiomatic, and rich in expressive variations.
The findings of this study help to narrow the existing gap by providing comparative evidence rather than definitive conclusions. Based on the correlation results summarized in Table 3, the FastText model with Skip-gram architecture and POS weighting (FT + SG + POS) achieved the highest observed final score (0.432), with a Pearson correlation of 0.364, a Spearman correlation of 0.553, and a standard deviation of 0.677. These outcomes suggest that this FastText configuration tended to produce relatively stable representations within this dataset; however, its advantage should be interpreted cautiously, as performance may vary under different textual domains or evaluation criteria.
Contrary to findings in the prior literature that often regarded BERT as superior in capturing complex semantic meanings, the BERT-based models demonstrated poor performance in the context of this study. The combination of BERT + SG + POS resulted in Pearson and Spearman correlations close to zero or even negative [37,38]. This phenomenon highlights the limitations of BERT in handling short texts with limited context. These findings were consistent with results from semantic similarity measurements on short texts, which suggested the necessity of more specific representations tailored to the nature of opinion-based input [72,73].
The primary strength of this study lies in its systematic and comprehensive experimental approach to evaluating combinations of embedding models (Word2Vec, FastText, BERT, and GPT), architectures (CBOW and Skip-gram), and weighting techniques (TFIDF, BM25, POS-weighting, and N-weighting). By employing three core statistical metrics, namely Pearson correlation, Spearman correlation, and standard deviation, this study provides a holistic view of model stability and performance consistency, which have often been overlooked in prior research.
In terms of scientific advancement, these findings reinforce the argument that numerical representation and appropriate similarity measurement methods significantly influence the accuracy of textual evaluations [77]. The results also supported recommendations advocating for the selective integration of deep learning-based and classical linguistic techniques adapted to the specific characteristics of the task at hand [41].
Thus, the main contribution of this study was the identification of the most effective model combinations for capturing perceptual nuances in short, subjective texts, along with an emphasis on the importance of contextual empirical validation of current NLP models. This study opens new directions for the development of more intelligent, adaptive, and inclusive automated assessment systems capable of handling the diversity of natural language expressions.

5.3. Correlation Relationship with Error Values (RMSE and MAE)

In the field of text similarity measurement research, integrating weighting schemes and embedding strategies into a unified evaluation framework poses several complex methodological challenges. Previous studies have highlighted the lack of standardized evaluation protocols as a major barrier to comparing the effectiveness of different approaches [72]. This condition has slowed the development of a comprehensive framework for evaluating text-similarity models. This study contributes significantly by proposing an integrative approach that combines statistical distribution-based weighting schemes (TFIDF, BM25) with classical embedding techniques (Word2Vec, FastText) within a cosine similarity-based evaluation framework. The evaluation was empirically supported by absolute error metrics (MAE and RMSE) and by linear and rank correlation measures (Pearson and Spearman).
In contrast to previous studies that examined embeddings or weighting methods in isolation [19,45], this study demonstrated that the combination of Word2Vec-CBOW and TFIDF achieved the lowest numerical errors (RMSE of 1.0827 and MAE of 0.8909, for a composite mean error of 0.9868) while maintaining moderately high Pearson and Spearman correlations (0.524 and 0.543, respectively; Table 4). However, the analysis also revealed that performance superiority varied depending on the evaluation dimension. Specifically, FastText with Skip-gram and POS weighting (FT + SG + POS) achieved the highest correlation with expert judgments, whereas GPT-based models demonstrated the smallest standard deviation (σ = 0.0133), indicating stronger stability. These findings suggest that while Word2Vec + CBOW + TFIDF excels in minimizing prediction error, other hybrid configurations offer complementary advantages: correlation accuracy in the case of FastText and consistency in the case of GPT. This reinforces the importance of adopting a multi-criteria evaluation approach for text similarity tasks. Furthermore, the findings highlight that classical vector models such as Word2Vec, when strengthened by statistical weighting schemes (e.g., TFIDF), remain effective for minimizing numerical prediction errors, whereas transformer-based models such as BERT and GPT demonstrate competitive performance in maintaining contextual stability, particularly when domain fine-tuning is not applied [18,20].
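For reference, the sketch below reproduces the multi-criteria computation underlying Table 4, namely RMSE, MAE, their average (the composite mean error), and the Pearson and Spearman correlations; the model and expert arrays are hypothetical examples.

```python
# Sketch: the error and correlation metrics used in the evaluation framework.
# Score arrays are hypothetical stand-ins for per-pair model and expert scores.
import numpy as np
from scipy.stats import pearsonr, spearmanr

model = np.array([2.96, 3.18, 1.36, 3.95, 2.84, 3.53])
expert = np.array([3.0, 2.0, 1.0, 3.0, 3.0, 4.0])

rmse = float(np.sqrt(np.mean((model - expert) ** 2)))
mae = float(np.mean(np.abs(model - expert)))
cme = (rmse + mae) / 2  # composite mean error: the average of RMSE and MAE
print(f"RMSE = {rmse:.4f}  MAE = {mae:.4f}  CME = {cme:.4f}")
print(f"Pearson = {pearsonr(model, expert)[0]:.3f}  "
      f"Spearman = {spearmanr(model, expert)[0]:.3f}")
```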
From a methodological standpoint, the approach used in this study stands out for its systematic experimental design and reliance on measurable quantitative evaluation metrics. Unlike previous approaches, which did not integrate numerical evaluation with ordinal correlation analysis [78], this study offers a comprehensive evaluation framework. The proposed evaluation enabled a performance assessment in terms of both numerical accuracy (absolute error) and consistency in perceptual ranking. This approach addresses the need for a holistic evaluative model that can be applied in various contexts.
The practical implications of these findings are highly relevant across a range of applications, including automated text-based scoring systems, perception evaluations in surveys, and recommendation systems that rely on semantic similarity. The combination of embedding and weighting strategies demonstrated in this study could significantly enhance the system performance in detecting document similarity, especially within short and subjective text contexts [20,24,75,79].
The primary contribution of this study lies in its empirical evidence that the synergy between embedding techniques and word weighting schemes was not only theoretically feasible but also produced significant improvements in the accuracy of text similarity measurements. These findings reinforce the urgency of developing evaluation frameworks that holistically integrate both distributional and semantic perspectives [38]. Thus, this study not only addresses theoretical gaps in the previous literature but also provides a strong practical foundation for the future development of intelligent text-based systems [65,75].
The public release of the annotated dataset contributes to the transparency and replicability of this study. By enabling other researchers to evaluate and extend the proposed hybrid embedding–weighting framework, this resource supports further benchmarking of Indonesian text similarity models, especially in educational and perception-based contexts.

6. Conclusions

The primary objective of this study is to provide a comprehensive understanding of the effectiveness of combining embedding and weighting techniques in measuring perception-based text similarity using the Cosine Similarity approach. The main findings demonstrated that different hybrid configurations excelled across distinct evaluation metrics. The combination of Word2Vec with the CBOW architecture and TFIDF weighting achieved the lowest error, with a composite mean of RMSE and MAE of 0.9868, suggesting stronger numerical alignment in this experimental context.
Meanwhile, FastText with Skip-gram and POS weighting (FT + SG + POS) attained the highest correlation with expert assessments, and GPT + CBOW + TFIDF displayed the greatest score consistency across samples. Therefore, the overall interpretation of performance must be viewed through a multi-dimensional lens rather than a single metric. Word2Vec + CBOW + TFIDF appeared favorable for error minimization, FastText + SG + POS for semantic correlation, and GPT + CBOW + TFIDF for prediction stability. This comprehensive perspective aligns the empirical results across all metrics and highlights the complementary strengths of the hybrid models evaluated.
Rather than asserting novelty, this study offers an evaluative perspective that integrates and compares a variety of embedding and weighting techniques which have often been examined in isolation. The study emphasized that the effectiveness of text similarity measurement is not solely determined by the complexity of embedding models, such as BERT or GPT, but is also influenced by the alignment between data characteristics, domain context, and the weighting mechanisms employed. In doing so, the study helps extend the scope of related research by highlighting the importance of synergy between distributional and semantic dimensions in text representation.
The main contribution of this study is the development of a systematic experimental framework that can be utilized by both researchers and practitioners to holistically evaluate the performance of embedding and weighting technique combinations. The implications of this approach are highly relevant for the development of text-based systems in fields such as education, public service, and social perception analysis, where the semantic interpretation of perception-based text holds strategic value. Additionally, the results offer an empirical basis supporting the argument that distributional-based text representation strategies remain advantageous when properly contextualized.
The findings indicate that model effectiveness depends not only on embedding complexity but also on how well the weighting scheme complements linguistic characteristics. Word2Vec + CBOW + TFIDF performed best in minimizing error, while FastText + Skip-gram + POS-weighting achieved stronger semantic alignment. These differences reflect trade-offs between contextual stability, syntactic sensitivity, and computational efficiency, a balance that is crucial for practical NLP applications in low-resource educational domains.
The proposed HyEWCos framework can serve as a reference model for developing automated essay grading, perception analysis, and NLP applications for other low-resource languages. The framework’s structured evaluation methodology enables reproducible, interpretable, and scalable text assessment systems.
Although the study achieved meaningful findings, it is important to note a key methodological limitation: transformer-based models such as BERT and GPT were used without domain-specific fine-tuning. This decision ensured fair benchmarking and focused comparison among embedding–weighting combinations, but it also likely limited the contextual precision of transformer embeddings. Future research should address this limitation by incorporating fine-tuning and adaptive ensemble techniques that dynamically integrate the strengths of multiple embedding and weighting models to achieve more domain-sensitive and semantically robust performance.

Author Contributions

Conceptualization, H.H. and T.T.; methodology, E.S. and A.F.; software, B.H.; validation, H.H., T.T. and E.S.; formal analysis, H.H.; investigation, T.T.; resources, A.F.; data curation, A.F.; writing—original draft preparation, T.T.; writing—review and editing, H.H. and T.T.; visualization, B.H.; supervision, E.S.; project administration, B.H.; funding acquisition, T.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Postdoctoral Funding Project from the Faculty of Computer Science, Buana Perjuangan Karawang University (Grant No. 535/R/KU/2025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. All implementation scripts, preprocessing routines, and experimental configurations used in developing the HyEWCos framework are also available for research replication. The complete codebase, including data preprocessing, hybrid embedding–weighting integration, and evaluation modules, is hosted in a public GitHub repository: https://github.com/tukino68/Giri-Research (accessed on 14 September 2025). If repository access is temporarily restricted, the corresponding author can provide the implementation upon reasonable academic request. This ensures the reproducibility and transparency of the reported experiments in line with open research standards.

Acknowledgments

We thank the reviewers and editors for their valuable time and insightful comments during the peer review of our study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HyEWCos: Hybrid Embedding with Weighting and Cosine Similarity
W2V: Word2Vec
FT: FastText
BERT: Bidirectional Encoder Representations from Transformers
GPT: Generative Pre-Trained Transformer
TFIDF-w: Term Frequency–Inverse Document Frequency Weighting
POS-w: Part-of-Speech Weighting
N-w: No Weighting
CBOW: Continuous Bag of Words
SG: Skip-Gram
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
STD: Standard Deviation
P1: Text Pair D1 with Text D2
P2: Text Pair D1 with Text D3
P3: Text Pair D1 with Text D4
ST: Text Similarity Score
ED: Embedding
WT: Weighting
TA_STD: Total Average Standard Deviation
TA_SM: Total Average Similarity

References

  1. Rasool, A.; Aslam, S.; Hussain, N.; Imtiaz, S.; Riaz, W. nBERT: Harnessing NLP for Emotion Recognition in Psychotherapy to Transform Mental Health Care. Information 2025, 16, 301. [Google Scholar] [CrossRef]
  2. Makhmudov, F.; Kultimuratov, A.; Cho, Y.I. Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures. Appl. Sci. 2024, 14, 4199. [Google Scholar] [CrossRef]
  3. Lestandy, M.; Abdurrahim. Effect of Word2Vec Weighting with CNN-BiLSTM Model on Emotion Classification. J. Nas. Pendidik. Tek. Inform. 2023, 12, 99–107. [Google Scholar] [CrossRef]
  4. Malik, R.A.A.; Sibaroni, Y. Multi-aspect Sentiment Analysis of Tiktok Application Usage Using FasText Feature Expansion and CNN Method. J. Comput. Syst. Inform. 2022, 3, 277–285. [Google Scholar] [CrossRef]
  5. Lokkondra, C.Y.; Ramegowda, D.; Thimmaiah, G.M.; Bassappa Vijaya, A.P.; Shivananjappa, M.H. ETDR: An Exploratory View of Text Detection and Recognition in Images and Videos. Rev. d’Intelligence Artif. 2021, 35, 383–393. [Google Scholar] [CrossRef]
  6. Tiwari, D.; Nagpal, B.; Bhati, B.S.; Mishra, A.; Kumar, M. A Systematic Review of Social Network Sentiment Analysis with Comparative Study of Ensemble-Based Techniques; Springer: Dordrecht, The Netherlands, 2023; Volume 56. [Google Scholar] [CrossRef]
  7. Shahbandegan, A.; Mago, V.; Alaref, A.; van der Pol, C.B.; Savage, D.W. Developing a machine learning model to predict patient need for computed tomography imaging in the emergency department. PLoS ONE 2022, 17, e0278229. [Google Scholar] [CrossRef]
  8. Subba, B.; Kumari, S. A heterogeneous stacking ensemble based sentiment analysis framework using multiple word embeddings. Comput. Intell. 2022, 38, 530–559. [Google Scholar] [CrossRef]
  9. Allahim, A.; Cherif, A. Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation. Appl. Sci. 2024, 14, 11104. [Google Scholar] [CrossRef]
  10. Sikic, L.; Kurdija, A.S.; Vladimir, K.; Silic, M. Graph Neural Network for Source Code Defect Prediction. IEEE Access 2022, 10, 10402–10415. [Google Scholar] [CrossRef]
  11. Ali, A.; Taqa, A. Analytical Study of Traditional and Intelligent Textual Plagiarism Detection Approaches. J. Educ. Sci. 2022, 31, 8–25. [Google Scholar] [CrossRef]
  12. Hussain, Z.; Mata, R.; Wulff, D.U. Novel embeddings improve the prediction of risk perception. EPJ Data Sci. 2024, 13, 38. [Google Scholar] [CrossRef]
  13. Li, Z.; Tomar, Y.; Passonneau, R.J. A Semantic Feature-Wise Transformation Relation Network for Automatic Short Answer Grading. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6030–6040. [Google Scholar] [CrossRef]
  14. Yang, S.; Huang, G.; Ofoghi, B.; Yearwood, J. Short text similarity measurement using context-aware weighted biterms. Concurr. Comput. Pract. Exp. 2020, 34, e5765. [Google Scholar] [CrossRef]
  15. Rosnelly, R.; Hartama, D.; Sadikin, M.; Lubis, C.P.; Simanjuntak, M.S.; Kosasi, S. The Similarity of Essay Examination Results using Preprocessing Text Mining with Cosine Similarity and Nazief-Adriani Algorithms. Turkish J. Comput. Math. Educ. 2021, 12, 1415–1422. [Google Scholar] [CrossRef]
  16. Ramadhani, S.; Hariyadi, M.A.; Crysdian, C. The Evaluation of Computer Science Curriculum for High School Education Based on Similarity Analysis. Int. J. Adv. Data Inf. Syst. 2023, 4, 201–213. [Google Scholar] [CrossRef]
  17. Priyatno, A.M.; Prasetya, M.R.A.; Cholidhazia, P.; Sari, R.K. Comparison of Similarity Methods on New Student Admission Chatbots Using Retrieval-Based Concepts. J. Eng. Sci. Appl. 2024, 1, 32–40. [Google Scholar] [CrossRef]
  18. Wang, L.; Luo, J.; Deng, S.; Guo, X. RoCS: Knowledge Graph Embedding Based on Joint Cosine Similarity. Electronics 2024, 13, 147. [Google Scholar] [CrossRef]
  19. Wang, J.; Dong, Y. Measurement of text similarity: A survey. Information 2020, 11, 421. [Google Scholar] [CrossRef]
  20. Chawla, S.; Kaur, R.; Aggarwal, P. Text classification framework for short text based on TFIDF-FastText. Multimed. Tools Appl. 2023, 82, 40167–40180. [Google Scholar] [CrossRef]
  21. Deng, C.; Lai, G.; Deng, H. Improving word vector model with part-of-speech and dependency grammar information. CAAI Trans. Intell. Technol. 2020, 5, 260–267. [Google Scholar] [CrossRef]
  22. Nugroho, F.A.; Septian, F.; Pungkastyo, D.A.; Riyanto, J. Penerapan Algoritma Cosine Similarity untuk Deteksi Kesamaan Konten pada Sistem Informasi Penelitian dan Pengabdian Kepada Masyarakat. J. Inform. Univ. Pamulang 2021, 5, 529. [Google Scholar] [CrossRef]
  23. Febriyanti, N.; Rini, D.P.; Arsalan, O. Text Similarity Detection Between Documents Using Case Based Reasoning Method with Cosine Similarity Measure (Case Study SIMNG LPPM Universitas Sriwijaya). Sriwij. J. Inform. Appl. 2022, 3, 36–45. [Google Scholar] [CrossRef]
  24. Pertiwi, A.; Azhari, A.; Mulyana, S. Fast2Vec, a modified model of FastText that enhances semantic analysis in topic evolution. PeerJ Comput. Sci. 2025, 11, e2862. [Google Scholar] [CrossRef]
  25. Sarwar, T.B.; Noor, N.M.; Miah, M.S.U. Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding. PeerJ Comput. Sci. 2022, 8, e1024. [Google Scholar] [CrossRef]
  26. Suzanti, I.O.; Jauhari, A. Comparison of Stemming and Similarity Algorithms in Indonesian Translated Al-Qur’an Text Search. J. Ilm. Kursor 2022, 11, 91. [Google Scholar] [CrossRef]
  27. Mao, Y.; Fung, K.W. Use of word and graph embedding to measure semantic relatedness between unified medical language system concepts. J. Am. Med. Inform. Assoc. 2020, 27, 1538–1546. [Google Scholar] [CrossRef]
  28. Sovina, M.; Yusfrizal, Y.; Harahap, F.A.; Lazuly, I. Application for Recommending Tourist Attractions on The Island of Java with Content Based Filtering Using Cosine Similarity. J. Artif. Intell. Eng. Appl. 2024, 3, 565–569. [Google Scholar] [CrossRef]
  29. Mai, G.; Janowicz, K.; Prasad, S.; Shi, M.; Cai, L.; Zhu, R.; Regalia, B.; Lao, N. Semantically-Enriched Search Engine for Geoportals: A Case Study with ArcGIS Online. Agil. GIScience Ser. 2020, 1, 13. [Google Scholar] [CrossRef]
  30. Thapa, M.; Kapoor, P.; Kaushal, S.; Sharma, I. A Review of Contextualized Word Embeddings and Pre-Trained Language Models, with a Focus on GPT and BERT. In Proceedings of the 1st International Conference on Cognitive & Cloud Computing, Jaipur, India, 1–2 August 2024; pp. 205–214. [Google Scholar] [CrossRef]
  31. HaCohen-Kerner, Y.; Miller, D.; Yigal, Y. The influence of preprocessing on text classification using a bag-of-words representation. PLoS ONE 2020, 15, e0232525. [Google Scholar] [CrossRef]
  32. Kowsari, K.; Meimandi, K.J.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150. [Google Scholar] [CrossRef]
  33. Camacho-Collados, J.; Pilehvar, M.T. On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 40–46. [Google Scholar] [CrossRef]
  34. Trieu, H.L.; Miwa, M.; Ananiadou, S. BioVAE: A pre-trained latent variable language model for biomedical text mining. Bioinformatics 2022, 38, 872–874. [Google Scholar] [CrossRef] [PubMed]
  35. Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv 2021, arXiv:2104.08663. [Google Scholar] [CrossRef]
  36. Jalilifard, A.; Caridá, V.F.; Mansano, A.F.; Cristo, R.S.; da Fonseca, F.P.C. Semantic Sensitive TF-IDF to Determine Word Relevance in Documents. In Advances in Computing and Network Communications; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; pp. 327–337. [Google Scholar] [CrossRef]
  37. Babić, K.; Guerra, F.; Martinčić-Ipšić, S.; Meštrović, A. A comparison of approaches for measuring the semantic similarity of short texts based on word embeddings. J. Inf. Organ. Sci. 2020, 44, 231–246. [Google Scholar] [CrossRef]
  38. Umer, M.; Imtiaz, Z.; Ahmad, M.; Nappi, M.; Medaglia, C.; Choi, G.S.; Mehmood, A. Impact of convolutional neural network and FastText embedding on text classification. Multimed. Tools Appl. 2023, 82, 5569–5585. [Google Scholar] [CrossRef]
  39. Galal, O.; Abdel-Gawad, A.H.; Farouk, M. Rethinking of BERT sentence embedding for text classification. Neural Comput. Appl. 2024, 36, 20245–20258. [Google Scholar] [CrossRef]
  40. Weng, M.H.; Wu, S.; Dyer, M. Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods. Appl. Sci. 2022, 12, 11220. [Google Scholar] [CrossRef]
  41. Singh, R.; Singh, S. Text Similarity Measures in News Articles by Vector Space Model Using NLP. J. Inst. Eng. Ser. B 2021, 102, 329–338. [Google Scholar] [CrossRef]
  42. Harwood, T.V.; Treen, D.G.C.; Wang, M.; de Jong, W.; Northen, T.R.; Bowen, B.P. BLINK enables ultrafast tandem mass spectrometry cosine similarity scoring. Sci. Rep. 2023, 13, 13462. [Google Scholar] [CrossRef]
  43. Al-Tarawneh, M.A.B.; Al-irr, O.; Al-Maaitah, K.S.; Kanj, H.; Aly, W.H.F. Enhancing Fake News Detection with Word Embedding: A Machine Learning and Deep Learning Approach. Computers 2024, 13, 239. [Google Scholar] [CrossRef]
  44. Zhou, Y.; Li, C.; Huang, G.; Guo, Q.; Li, H.; Wei, X. A Short-Text Similarity Model Combining Semantic and Syntactic Information. Electronics 2023, 12, 3126. [Google Scholar] [CrossRef]
  45. Lezama-Sánchez, A.L.; Vidal, M.T.; Reyes-Ortiz, J.A. An Approach Based on Semantic Relationship Embeddings for Text Classification. Mathematics 2022, 10, 4161. [Google Scholar] [CrossRef]
  46. Szymański, J.; Operlejn, M.; Weichbroth, P. Enhancing Word Embeddings for Improved Semantic Alignment. Appl. Sci. 2024, 14, 11519. [Google Scholar] [CrossRef]
  47. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar] [CrossRef]
  48. Patil, A.; Han, K.; Jadon, A. Comparative Analysis of Text Embedding Models for Bug Report Semantic Similarity. In Proceedings of the 2024 11th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 21–22 March 2024; pp. 262–267. [Google Scholar] [CrossRef]
  49. Xiao, L.; Li, Q.; Ma, Q.; Shen, J.; Yang, Y.; Li, D. Text classification algorithm of tourist attractions subcategories with modified TF-IDF and Word2Vec. PLoS ONE 2024, 19, e0305095. [Google Scholar] [CrossRef]
  50. Colla, D.; Mensa, E.; Radicioni, D.P. Novel metrics for computing semantic similarity with sense embeddings. Knowl.-Based Syst. 2020, 206, 106346. [Google Scholar] [CrossRef]
  51. Gani, M.O.; Ayyasamy, R.K.; Alhashmi, S.M.; Sangodiah, A.; Fui, Y.T. ETFPOS-IDF: A Novel Term Weighting Scheme for Examination Question Classification Based on Bloom’s Taxonomy. IEEE Access 2022, 10, 132777–132785. [Google Scholar] [CrossRef]
  52. Rani, R.; Lobiyal, D.K. A weighted word embedding based approach for extractive text summarization. Expert Syst. Appl. 2021, 186, 115867. [Google Scholar] [CrossRef]
  53. Qiu, Z.; Huang, G.; Qin, X.; Wang, Y.; Wang, J.; Zhou, Y. A Hybrid Semantic Representation Method Based on Fusion Conceptual Knowledge and Weighted Word Embeddings for English Texts. Information 2024, 15, 708. [Google Scholar] [CrossRef]
  54. Gong, P.; Liu, J.; Xie, Y.; Liu, M.; Zhang, X. Enhancing context representations with part-of-speech information and neighboring signals for question classification. Complex Intell. Syst. 2023, 9, 6191–6209. [Google Scholar] [CrossRef]
  55. Beno, J.; Silen, A.; Yanti, M. The Structure of Health Factors among Community-dwelling Elderly People. Braz. Dent. J. 2022, 33, 1–12. [Google Scholar]
  56. Wang, H.; Yu, D. Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. Proc. Annu. Meet. Assoc. Comput. Linguist. 2023, 2, 563–570. [Google Scholar] [CrossRef]
  57. Balkus, S.V.; Yan, D. Improving short text classification with augmented data using GPT-3. Nat. Lang. Eng. 2024, 30, 943–972. [Google Scholar] [CrossRef]
  58. Felix, E.A.; Lee, S.P. Systematic literature review of preprocessing techniques for imbalanced data. IET Softw. 2019, 13, 479–496. [Google Scholar] [CrossRef]
  59. Karaca, M.F. Effects of preprocessing on text classification in balanced and imbalanced datasets. KSII Trans. Internet Inf. Syst. 2024, 18, 591–609. [Google Scholar] [CrossRef]
  60. Wang, Y.; Zhang, B.; Liu, W.; Cai, J.; Zhang, H. STMAP: A novel semantic text matching model augmented with embedding perturbations. Inf. Process. Manag. 2024, 61, 103576. [Google Scholar] [CrossRef]
  61. Siegert, I.; Böck, R.; Wendemuth, A. Inter-rater reliability for emotion annotation in human–computer interaction: Comparison and methodological improvements. J. Multimodal User Interfaces 2014, 8, 17–28. [Google Scholar] [CrossRef]
  62. Yaman, N. A corpus-based analysis of conversational features in bahasa Inggris textbooks for junior high schools in Indonesia. Bachelor’s Thesis, Universitas Ahmad Dahlan, Yogyakarta, Indonesia, 2023; pp. 120–130. [Google Scholar]
  63. Li, X.; Li, J. AoE—Angle-Optimized Embeddings for Semantic Textual Similarity; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024. [Google Scholar] [CrossRef]
  64. Dos Santos, F.J.; Coelho, A.L.V. Eliciting correlated weights for multi-criteria group decision making with generalized canonical correlation analysis. Symmetry 2020, 12, 1612. [Google Scholar] [CrossRef]
  65. Das, M.; Kamalanathan, S.; Alphonse, P. A Comparative Study on TF-IDF feature weighting method and its analysis using unstructured dataset. In Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Systems, Kharkiv, Ukraine, 22–23 April 2021; Volume 2870, pp. 98–107. [Google Scholar]
  66. Marwah, D.; Beel, J. Term-Recency for TF-IDF, BM25 and USE Term Weighting. In Proceedings of the 8th International Workshop on Mining Scientific Publications; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 36–41. Available online: https://www.aclweb.org/anthology/2020.wosp-1.5 (accessed on 1 July 2025).
  67. Zhang, K.; Liu, Y.; Mei, F.; Sun, G.; Jin, J. IBGJO: Improved Binary Golden Jackal Optimization with Chaotic Tent Map and Cosine Similarity for Feature Selection. Entropy 2023, 25, 1128. [Google Scholar] [CrossRef] [PubMed]
  68. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910. [Google Scholar] [CrossRef]
  69. Roudbaraki, S.T. Benchmarking Synonym Extraction Methods in Domain-Specific Contexts. Politecnico di Torino. 2025. Available online: http://webthesis.biblio.polito.it/id/eprint/36445 (accessed on 1 July 2025).
  70. Xu, J.; Shao, W.; Chen, L.; Liu, L. SimCSE++: Improving Contrastive Learning for Sentence Embeddings from Two Perspectives. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 12028–12040. [Google Scholar] [CrossRef]
  71. Rep, I.; Dukić, J.; Šnajder, J. Are ELECTRA’s Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity. In Proceedings of the EMNLP 2024–2024 Conference on Empirical Methods in Natural Language Processing Finding EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 9159–9169. [Google Scholar] [CrossRef]
  72. Iqbal, M.A.; Sharif, O.; Hoque, M.M.; Sarkar, I.H. Word Embedding based Textual Semantic Similarity Measure in Bengali. Procedia Comput. Sci. 2021, 193, 92–101. [Google Scholar] [CrossRef]
  73. Viji, D.; Revathy, S. A hybrid approach of Weighted Fine-Tuned BERT extraction with deep Siamese Bi-LSTM model for semantic text similarity identification. Multimed. Tools Appl. 2022, 81, 6131–6157. [Google Scholar] [CrossRef]
  74. Prakoso, D.W.; Abdi, A.; Amrit, C. Short text similarity measurement methods: A review. Soft Comput. 2021, 25, 4699–4723. [Google Scholar] [CrossRef]
  75. Hameed, N.H.; Alimi, A.M.; Sadiq, A.T. Short Text Semantic Similarity Measurement Approach Based on Semantic Network. Baghdad Sci. J. 2022, 19, 1581–1591. [Google Scholar] [CrossRef]
  76. Xin, Y. Development of English Composition Correction and Scoring System Based on Text Similarity Algorithm. J. Electr. Syst. 2024, 20, 501–508. [Google Scholar] [CrossRef]
  77. Dasgupta, J.; Mishra, P.K.; Karuppasamy, S.; Mahajan, A.D. A Survey of Numerous Text Similarity Approach. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 2023, 3307, 184–194. [Google Scholar] [CrossRef]
  78. Wehnert, S.; Dureja, S.; Kutty, L.; Sudhi, V.; De Luca, E.W. Applying BERT Embeddings to Predict Legal Textual Entailment. Rev. Socionetw. Strateg. 2022, 16, 197–219. [Google Scholar] [CrossRef]
  79. Hamza, A.; En-Nahnahi, N.; El Mahdaouy, A.; El Alaoui Ouatik, S. Embedding arabic questions by feature-level fusion of word representations for questions classification: It is worth doing. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 6583–6594. [Google Scholar] [CrossRef]
Figure 1. CBOW vs. Skip-gram Architectures.
Figure 2. The proposed hybrid embedding and weighting method with cosine similarity.
Figure 3. Prediction error rate ((a) RMSE; (b) MAE).
Figure 4. Prediction error rate (RMSE + MAE).
Table 1. Proposed hybrid models for similarity measurement.

Embedding | TFIDF-Weighting | BM25-Weighting | POS-Weighting | N-Weighting
W2V + CBOW | W2V + CBOW + TFIDF | W2V + CBOW + BM25 | W2V + CBOW + POS | W2V + CBOW + N
W2V + Skip-gram | W2V + SG + TFIDF | W2V + SG + BM25 | W2V + SG + POS | W2V + SG + N
FastText + CBOW | FT + CBOW + TFIDF | FT + CBOW + BM25 | FT + CBOW + POS | FT + CBOW + N
FastText + Skip-gram | FT + SG + TFIDF | FT + SG + BM25 | FT + SG + POS | FT + SG + N
BERT + CBOW | BERT + CBOW + TFIDF | BERT + CBOW + BM25 | BERT + CBOW + POS | BERT + CBOW + N
BERT + Skip-gram | BERT + SG + TFIDF | BERT + SG + BM25 | BERT + SG + POS | BERT + SG + N
GPT + CBOW | GPT + CBOW + TFIDF | GPT + CBOW + BM25 | GPT + CBOW + POS | GPT + CBOW + N
GPT + Skip-gram | GPT + SG + TFIDF | GPT + SG + BM25 | GPT + SG + POS | GPT + SG + N
Table 2. Model performance in measuring textual similarity.

Models | Sim P1 | Sim P2 | Sim P3 | STD P1 | STD P2 | STD P3 | TA_SM | TA_STD
W2V + SG + N | 3.079776 | 3.132397 | 3.038225 | 0.840000 | 0.780000 | 0.760000 | 3.083466 | 0.793333
W2V + SG + POS | 3.049827 | 3.107077 | 3.047112 | 0.830000 | 0.770000 | 0.770000 | 3.068006 | 0.790000
W2V + SG + TFIDF | 2.854184 | 2.866958 | 2.802054 | 0.830000 | 0.960000 | 0.830000 | 2.841065 | 0.873333
FT + SG + N | 3.532000 | 3.485506 | 3.502811 | 0.590000 | 0.670000 | 0.620000 | 3.520506 | 0.626667
FT + SG + POS | 3.529990 | 3.478788 | 3.449800 | 0.630000 | 0.720000 | 0.680000 | 3.486193 | 0.676667
FT + SG + TFIDF | 3.297830 | 3.315046 | 3.293138 | 0.770000 | 0.900000 | 0.860000 | 3.302005 | 0.843333
BERT + CBOW + TFIDF | 3.619748 | 3.591358 | 3.596959 | 0.396800 | 0.449300 | 0.846100 | 3.602688 | 0.564067
BERT + SG + TFIDF | 3.495678 | 3.522178 | 3.524813 | 0.360000 | 0.300000 | 0.260000 | 3.514223 | 0.306667
BERT + SG + N | 3.495734 | 3.522248 | 3.498465 | 0.360000 | 0.300000 | 0.260000 | 3.505482 | 0.306667
GPT + CBOW + TFIDF | 3.986171 | 3.974885 | 3.983351 | 0.010000 | 0.020000 | 0.010000 | 3.981469 | 0.013333
GPT + SG + TFIDF | 3.986703 | 3.973653 | 3.983351 | 0.010000 | 0.020000 | 0.010000 | 3.981236 | 0.013333
GPT + CBOW + BM25 | 3.990000 | 3.970000 | 3.980000 | 0.010000 | 0.020000 | 0.010000 | 3.980000 | 0.013333
Table 3. Correlation evaluation summary.

Embedding | Weighting | Vector Size | Avg. Pearson Correlation | Avg. Spearman Correlation | Avg. Standard Deviation | Final Score
W2V + SG | TFIDF | 128 | 0.594 | 0.540 | 0.847 | 0.484
W2V + SG | POS | 128 | 0.469 | 0.586 | 0.873 | 0.447
W2V + CBOW | BM25 | 128 | 0.510 | 0.527 | 0.933 | 0.428
W2V + CBOW | POS | 150 | 0.375 | 0.468 | 1.947 | 0.148
FT + SG | POS | 128 | 0.592 | 0.573 | 0.817 | 0.502
FT + SG | N | 128 | 0.365 | 0.607 | 0.843 | 0.420
FT + SG | BM25 | 128 | 0.749 | 0.622 | 0.903 | 0.568
FT + CBOW | BM25 | 150 | 0.524 | 0.547 | 0.777 | 0.473
FT + CBOW | TFIDF | 150 | 0.334 | 0.597 | 0.677 | 0.437
BERT + SG | POS | 128 | 0.397 | 0.466 | 0.303 | 0.478
BERT + SG | TFIDF | 128 | 0.152 | 0.475 | 1.447 | 0.145
BERT + CBOW | POS | 128 | 0.397 | 0.466 | 0.307 | 0.479
BERT + CBOW | TFIDF | 150 | 0.397 | 0.346 | 0.077 | 0.524
GPT + CBOW | TFIDF | 128 | 0.354 | 0.466 | 0.013 | 0.526
GPT + CBOW | N | 128 | 0.354 | 0.474 | 0.013 | 0.529
GPT + SG | TFIDF | 128 | 0.354 | 0.474 | 0.013 | 0.529
GPT + SG | POS | 150 | 0.293 | 0.346 | 0.033 | 0.449
Table 4. Comparison of model performance evaluation.

Model | RMSE | MAE | Composite Mean Error (CME) | Pearson Corr | Spearman Corr
W2V + CBOW + TFIDF | 1.0827056 | 0.8909355 | 0.9868206 | 0.5244000 | 0.5430000
FT + CBOW + TFIDF | 1.7811809 | 1.6800599 | 1.7306204 | 0.3344000 | 0.5972000
FT + CBOW + BM25 | 1.7812067 | 1.6800867 | 1.7306467 | 0.5235000 | 0.5468000
W2V + CBOW + POS | 1.2822023 | 1.0808654 | 1.1815338 | 0.3750000 | 0.4684000
FT + CBOW + POS | 1.7812112 | 1.6800940 | 1.7306526 | 0.3329000 | 0.5886000
W2V + CBOW + N | 1.3214898 | 1.1219350 | 1.2217124 | 0.4365000 | 0.7800000
BERT + SG + TFIDF | 1.7812067 | 1.6800867 | 1.7306467 | 0.1516000 | 0.4750000
BERT + SG + POS | 1.7811809 | 1.6800599 | 1.7306204 | 0.3968000 | 0.4662000
BERT + SG + N | 2.1807533 | 2.1169872 | 2.1488703 | 0.4515000 | 0.3460000
GPT + SG + TFIDF | 2.2071628 | 2.1427744 | 2.1749686 | 0.3544000 | 0.4662000
Table 5. Result of text perception assessment measurement.

Text-1 | Text-2 | ED | ST (Expert) | ST (TFIDF) | ST (POS) | ST (N)
D1-1. aspek politik uts sendiri memiliki kedekatan yang cukup erat dengan dunia | D1-3. upps telah mengungkapkan isu yang terkait dengan kondisi lingkungan | W2V + SG | 3 | 2.9586 | 3.455 | 3.394
(EN: D1-1. The political aspect of uts itself has a fairly close relationship with the world of; D1-3. The study program has expressed issues related to environmental conditions)
D9-1. Surat keputusan yayasan pendidikan anak kukang tentang statuta | D9-2. kebijakan terkait pengembangan kerjasama tidak ada penjelasan | FT + SG | 1 | 1.3602 | 1.318 | 1.366
(EN: D9-1. The decree of the Anak Kukang Educational Foundation regarding the statute; D9-2. There is no explanation regarding policies related to partnership development)
D20-1. dokumen formal kebijakan standar pendidikan meliputi | D20-2. ada 15 sk tentang pelaksanaan pendidikan yang tidak dapat diases karena | BERT + CBOW | 2 | 3.1808 | 3.181 | 3.181
(EN: D20-1. The formal document on education standard policy includes decrees; D20-2. There are 15 decrees on the implementation of education that cannot be assessed because)
D20-1. dokumen formal kebijakan standar pendidikan meliputi | D20-4. ada sk rektor tentang pembantukan tim penyusun rencana induk | GPT + SG | 3 | 3.9497 | 3.963 | 3.950
(EN: D20-1. The formal document on education standard policy includes decrees; D20-4. There is a rector's decree regarding the formation of a master plan drafting team)