Article

LLM-Guided Weighted Contrastive Learning with Topic-Aware Masking for Efficient Domain Adaptation: A Case Study on Pulp-Era Science Fiction

Department of English Language and Literature, Kyungpook National University, Daegu 41566, Republic of Korea
Electronics 2025, 14(17), 3351; https://doi.org/10.3390/electronics14173351
Submission received: 31 July 2025 / Revised: 19 August 2025 / Accepted: 20 August 2025 / Published: 22 August 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Domain adaptation of pre-trained language models remains challenging, especially for specialized text collections that include distinct vocabularies and unique semantic structures. Existing contrastive learning methods frequently rely on generic masking techniques and coarse-grained similarity measures, which limit their ability to capture fine-grained, domain-specific linguistic nuances. This paper proposes an enhanced domain adaptation framework by integrating weighted contrastive learning guided by large language model (LLM) feedback and a novel topic-aware masking strategy. Specifically, topic modeling is utilized to systematically identify semantically crucial domain-specific terms, enabling the creation of meaningful contrastive pairs through three targeted masking strategies: single-keyword, multiple-keyword, and partial-keyword masking. Each masked sentence undergoes LLM-guided reconstruction, accompanied by graduated similarity assessments that serve as continuous, fine-grained supervision signals. Experiments conducted on an early 20th-century science fiction corpus demonstrate that the proposed approach consistently outperforms existing baselines, such as SimCSE and DiffCSE, across multiple linguistic probing tasks within the newly introduced SF-ProbeEval benchmark. Furthermore, the proposed method achieves these performance improvements with significantly reduced computational requirements, highlighting its practical applicability for efficient and interpretable adaptation of language models to specialized domains.
Keywords: domain adaptation; contrastive learning; large language models; efficient AI; model optimization; neural networks; computational efficiency
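The pipeline described in the abstract combines two pieces: topic-aware masking (three strategies over topic-model keywords) and a contrastive objective whose per-pair loss is scaled by a graduated LLM similarity score. The sketch below illustrates both ideas; it is a minimal, hypothetical rendering of the abstract's description, not the paper's implementation. All function names are assumptions, the partial-keyword strategy is approximated at the character level, and the weighted loss is a weighted InfoNCE over pre-computed sentence embeddings.

```python
import numpy as np

def topic_aware_variants(tokens, keywords, mask_token="[MASK]"):
    """Produce the three masked variants named in the abstract:
    single-keyword, multiple-keyword, and partial-keyword masking.
    `keywords` is a set of topic-model terms (lowercased)."""
    kw_positions = [i for i, t in enumerate(tokens) if t.lower() in keywords]
    if not kw_positions:
        return []
    variants = []
    # 1) single-keyword masking: mask one keyword occurrence
    single = tokens.copy()
    single[kw_positions[0]] = mask_token
    variants.append(single)
    # 2) multiple-keyword masking: mask every keyword occurrence
    variants.append([mask_token if i in kw_positions else t
                     for i, t in enumerate(tokens)])
    # 3) partial-keyword masking: mask only part of one keyword
    # (approximated here by masking the second half of its characters)
    partial = tokens.copy()
    kw = partial[kw_positions[0]]
    partial[kw_positions[0]] = kw[: len(kw) // 2] + mask_token
    variants.append(partial)
    return variants

def weighted_contrastive_loss(anchors, positives, weights, temperature=0.05):
    """Weighted InfoNCE: each positive pair's negative log-likelihood is
    scaled by an LLM-assigned similarity weight in (0, 1], so that pairs
    the LLM judges closer contribute a finer-grained supervision signal."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (N, N) cosine logits
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_probs)                  # -log p(positive | anchor)
    return float((weights * per_pair).sum() / weights.sum())
```

As a usage sketch, masking `"the rocket crossed the void"` with keywords `{"rocket", "void"}` yields one variant with `rocket` masked, one with both keywords masked, and one with `roc[MASK]`; the loss then weights each reconstructed pair by its LLM similarity score instead of treating all positives as equally similar.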

Share and Cite

MDPI and ACS Style

Kang, S. LLM-Guided Weighted Contrastive Learning with Topic-Aware Masking for Efficient Domain Adaptation: A Case Study on Pulp-Era Science Fiction. Electronics 2025, 14, 3351. https://doi.org/10.3390/electronics14173351

AMA Style

Kang S. LLM-Guided Weighted Contrastive Learning with Topic-Aware Masking for Efficient Domain Adaptation: A Case Study on Pulp-Era Science Fiction. Electronics. 2025; 14(17):3351. https://doi.org/10.3390/electronics14173351

Chicago/Turabian Style

Kang, Sujin. 2025. "LLM-Guided Weighted Contrastive Learning with Topic-Aware Masking for Efficient Domain Adaptation: A Case Study on Pulp-Era Science Fiction." Electronics 14, no. 17: 3351. https://doi.org/10.3390/electronics14173351

APA Style

Kang, S. (2025). LLM-Guided Weighted Contrastive Learning with Topic-Aware Masking for Efficient Domain Adaptation: A Case Study on Pulp-Era Science Fiction. Electronics, 14(17), 3351. https://doi.org/10.3390/electronics14173351

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.